LoadSliceCore

Qualities

Through the main queue

All instructions except Loads, the address part of Stores, and simple AGIs

Through the bypass queue

The simple address-generating instructions (AGIs)

Address part of Stores

Loads

It has a scoreboard, so instructions can complete out of order (OoO)

It is not fully OoO, so it is more energy efficient than a full OoO core

The main and the bypass pipelines are

In-order pipeline paths

With a bypass pipeline path

Based on the stall-on-use core

Stall-on-use

Main queue

Bypass Queue

Loads

The address-part of Stores

The simple address-generating instructions (AGIs)

All of the instructions that are not bypassed

The data part of Store instructions
is executed using the main queue

This honors the load/store dependencies
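The queue split above can be sketched as a small steering function. This is only an illustration of the rule in these notes, not the actual hardware logic; the names `steer`, `main_q`, `bypass_q` and the `tagged` flag (meaning "detected as a simple AGI") are assumptions made for the example.

```python
from collections import deque

def steer(kind, tagged=False):
    """Return (queue, part) pairs for one instruction.

    kind:   "load", "store", or "other" (hypothetical labels).
    tagged: True if the instruction was detected as a simple AGI.
    """
    if kind == "load":
        return [("bypass", "load")]
    if kind == "store":
        # A store is split: address part bypasses, data part stays in order.
        return [("bypass", "store-addr"), ("main", "store-data")]
    if tagged:
        return [("bypass", "agi")]
    return [("main", "exec")]

main_q, bypass_q = deque(), deque()
program = [("other", True), ("load", False), ("other", False), ("store", False)]
for kind, tagged in program:
    for queue, part in steer(kind, tagged):
        (bypass_q if queue == "bypass" else main_q).append(part)

print(list(bypass_q))  # ['agi', 'load', 'store-addr']
print(list(main_q))    # ['exec', 'store-data']
```

Each queue is still drained in order; only the choice of queue lets address generation and loads run ahead of stalled main-queue instructions.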

The iterative AGI detection
algorithm

Some proposed blocks

IST (Instruction Slice Table)

A cache without any data fields (tags only),
just produces a hit/miss bit for an instruction address

RDT (Reg Dependence Table)

A table with one line per register;
each line holds the address of the last instruction
that wrote to that register.

Each line also caches the IST hit bit of that instruction

Force-bypass all the Loads

Use the algorithm to tag the instructions

Force-bypass the addr-part of the Stores

If an instruction is tagged, send it to the bypass queue
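A minimal sketch of how the IST and RDT could cooperate to tag address-generating instructions one dependency level per loop iteration. It assumes an unbounded IST modelled as a Python set and a simplified instruction record; `Instr`, `SliceTagger`, and `dispatch` are hypothetical names, and the exact lookup/update order in real hardware may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    pc: int
    kind: str                          # "load", "store", or "alu"
    dst: Optional[str] = None
    addr_srcs: Tuple[str, ...] = ()    # registers that form a memory address
    data_srcs: Tuple[str, ...] = ()    # all other source registers

class SliceTagger:
    def __init__(self):
        self.ist = set()   # tag-only: membership is the hit bit (capacity ignored)
        self.rdt = {}      # register -> (pc of last writer, its IST bit at that time)

    def dispatch(self, ins):
        """Process one instruction; return True if it (or, for a store,
        its address part) goes to the bypass queue."""
        hit = ins.pc in self.ist
        if ins.kind in ("load", "store"):
            roots = ins.addr_srcs                  # only the address operands
        elif hit:
            roots = ins.addr_srcs + ins.data_srcs  # a tagged AGI: all its sources
        else:
            roots = ()
        for reg in roots:
            producer = self.rdt.get(reg)
            if producer is not None and not producer[1]:
                self.ist.add(producer[0])          # tag the producer for next time
        if ins.dst is not None:
            self.rdt[ins.dst] = (ins.pc, hit)
        return hit or ins.kind in ("load", "store")

# Hypothetical loop body: a three-instruction address chain feeding a load.
loop = [
    Instr(0x10, "alu", dst="r1", data_srcs=("r1",)),
    Instr(0x14, "alu", dst="r2", data_srcs=("r1",)),
    Instr(0x18, "alu", dst="r3", data_srcs=("r2", "rbase")),
    Instr(0x1C, "load", dst="r4", addr_srcs=("r3",)),
    Instr(0x20, "alu", dst="r5", data_srcs=("r5", "r4")),
]
tagger = SliceTagger()
history = []
for _ in range(4):
    bypassed = [tagger.dispatch(ins) for ins in loop]
    history.append(sorted(tagger.ist))
```

In this run the IST grows by one producer per iteration (0x18, then 0x14, then 0x10), so the whole address chain is bypassed from the fourth iteration on, while the consumer of the loaded value at 0x20 stays in the main queue.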

Queue vs. Issuing Instructions
(Two queues, but one issue block?)

Weaknesses

Unrolled Loops

Worse energy efficiency
compared to simple
in-order cores

Indirect Mem Accesses

Performance

Without Constraints

In-order < LoadSliceCore < OoO

With some area
and power constraints

Better than both in-order and OoO

Terms

Stall-on-use

Stall-on-miss

Backward Slices

Iterative tagging of the instructions
might be hard to fit into a single
pipeline stage with 0.2 ns access times
for a 2 GHz target (0.5 ns cycle).
(RDT and IST can have multiple ports)

Register renaming to avoid false dependencies

Speculation