LoadSliceCore
Qualities
Through the main queue
All instructions except Loads, the address part of Stores, and the simple AGIs
Through the bypass queue
The simple address-generating instructions (AGIs)
Address part of Stores
Loads
It has a scoreboard, so instructions can complete out of order
It is not fully OoO, which makes it more energy-efficient
The main and the bypass pipelines are both in-order pipeline paths
Based on a stall-on-use in-order core, extended with a bypass pipeline path
Stall-on-use
Main queue
Bypass Queue
Loads
The address-part of Stores
The simple address-generating instructions (AGIs)
All of the instructions that are not bypassed
The data part of the Store instructions is executed through the main queue
This honors the load/store (memory ordering) dependencies; see the steering sketch below
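A minimal Python sketch of the queue steering, assuming a per-instruction record with load/store-address flags and an IST-hit bit (the Inst type and its field names are illustrative, not taken from the paper):

```python
# Minimal sketch of dispatch steering between the two in-order queues.
# The Inst type and its fields are assumed names for illustration only.
from dataclasses import dataclass

@dataclass
class Inst:
    pc: int
    is_load: bool = False
    is_store_addr: bool = False   # address-generating part of a store
    is_store_data: bool = False   # data part of a store
    ist_hit: bool = False         # tagged as part of an address-generating slice

def steer(inst: Inst) -> str:
    # Loads, the address part of stores, and tagged AGIs take the bypass queue.
    if inst.is_load or inst.is_store_addr or inst.ist_hit:
        return "bypass"
    # Everything else, including the data part of stores, stays in the
    # main queue, which preserves load/store ordering dependencies.
    return "main"

# Example: a load and a plain ALU instruction.
assert steer(Inst(pc=0x40, is_load=True)) == "bypass"
assert steer(Inst(pc=0x44)) == "main"
```

Only the steering decision is modeled here; both queues themselves still issue in order.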
The iterative AGI detection algorithm
Some proposed blocks
IST (Instruction Slice Table)
A tag-only cache without data fields; it just produces a hit/miss bit for instruction addresses
RDT (Register Dependence Table)
A table with an entry for each register
Each entry holds the address of the last instruction that wrote to that register
Each entry also caches the IST hit bit for that instruction address
Force-bypass all Loads
Force-bypass the address part of Stores
Use the algorithm to tag the address-generating instructions
If an instruction is tagged (IST hit), send it to the bypass queue and add its producers (looked up in the RDT) to the IST; see the sketch below
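A rough software model of this iterative tagging, assuming each dynamic instruction exposes its PC, source registers, and destination register (function and variable names are mine, not the paper's):

```python
# Rough software model of the iterative AGI tagging.
# ist: set of instruction addresses known to belong to an address slice.
# rdt: for each register, the PC of the last instruction that wrote it.
ist: set[int] = set()
rdt: dict[int, int] = {}

def dispatch(pc: int, srcs: list[int], dst: int | None,
             is_load: bool = False, is_store_addr: bool = False) -> str:
    bypass = is_load or is_store_addr or pc in ist
    if bypass:
        # Tag the producers of this instruction's sources so that on the
        # next loop iteration they are bypassed as well; the backward
        # slice grows by one level per iteration.
        for reg in srcs:
            if reg in rdt:
                ist.add(rdt[reg])
    if dst is not None:
        rdt[dst] = pc             # remember the last writer of each register
    return "bypass" if bypass else "main"

# Two iterations of a tiny loop body: add -> shift -> load.
for _ in range(2):
    dispatch(0x10, srcs=[1], dst=2)                 # add  r2 <- r1
    dispatch(0x14, srcs=[2], dst=3)                 # shl  r3 <- r2
    dispatch(0x18, srcs=[3], dst=4, is_load=True)   # load r4 <- [r3]
print(sorted(hex(pc) for pc in ist))                # slice learned so far
```

Each execution of the loop body tags one more level of the backward slice, so after a few iterations the full address-generating slice of each load and store is in the IST.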
❓ Queue vs. issuing instructions: two queues, but one issue block?
Weaknesses
Unrolled Loops
Worse energy efficiency compared to simple in-order cores
Indirect memory accesses (where a load's address depends on the result of another load)
Performance
Without Constraints
In-order < LoadSliceCore < OoO
With area and power constraints, better than both in-order and OoO cores
Terms
Stall-on-use: stall only when an instruction tries to use the result of an outstanding miss (see the toy comparison below)
Stall-on-miss: stall as soon as a load misses in the cache
Backward slices: the chain of older instructions that produce a given instruction's inputs
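A toy comparison of the two stall policies, with made-up latencies and a minimal instruction format (nothing here is taken from the paper):

```python
# Toy comparison of stall-on-miss vs stall-on-use for a single load miss.
# Latencies and the instruction format are invented for illustration.
MISS_LATENCY = 100   # cycles until the missing load's data returns

def stall_on_miss(prog):
    """Stall the whole pipeline as soon as a load misses."""
    cycle = 0
    for op, _regs in prog:
        cycle += MISS_LATENCY if op == "load_miss" else 1
    return cycle

def stall_on_use(prog):
    """Stall only when an instruction reads a register that is still pending."""
    cycle, ready = 0, {}          # reg -> cycle when its value becomes available
    for op, regs in prog:
        if op == "load_miss":
            ready[regs[0]] = cycle + MISS_LATENCY   # regs[0] is the destination
        else:
            for r in regs:                          # wait for pending sources
                cycle = max(cycle, ready.get(r, 0))
        cycle += 1
    return cycle

# Independent work between the miss and its first use is hidden under
# stall-on-use but not under stall-on-miss.
prog = [("load_miss", ["r1"]), ("add", ["r2"]), ("add", ["r3"]), ("use", ["r1"])]
print(stall_on_miss(prog), stall_on_use(prog))   # prints 103 101
```

The more independent instructions sit between the miss and the first use of its result, the larger the gap between the two policies.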
Iterative tagging of the instructions might be hard to implement in a single pipeline stage with 0.2 ns access times for a 2 GHz target (the RDT and IST can have multiple ports)
Register renaming to avoid false (WAR/WAW) dependencies
Speculation