COMP25212 System Architecture
Introduction
Moore's Law
Smaller transistors mean more and faster transistors
This means more complexity
Some avenues are restricted by physical limits such as the speed of light and power dissipation
It is limited by cost and hence slowing down
The three-box model comprises the processor, memory and I/O, with buses running between them.
Address space(s)
Storage is a hierarchy
Registers
Memory - layered in cache levels
I/O space
Accelerators
Secondary storage
Latency and Bandwidth
Latency - delay from asking until receiving (measured in time)
When reading data you might have to wait for register updates, and memory/disks to locate and fetch data
Writing is less critical and can be deferred whilst continuing
Bandwidth - rate of delivery (measured in bytes per time)
Processor registers allow two or three read operations and one or two write operations per cycle
Memory provides single access per cycle
Disks are really slow in comparison
Memory
You can prioritise higher speed and a smaller memory, or higher capacity and a slower memory
Amdahl's Second Law
A balanced computer should have 1MIPS processing per 1MB memory per 1Mb/s I/O
Input and Output
Mostly interfaced as memory mapped I/O with devices in main address space
Devices not as 'simple as memory'
I/O fairly slow
Economics
Want the maximum 'value' (computations) at minimum 'cost' (capital cost, energy); this depends on software behaviour
Compromise
Programmable machines do different tasks
Hardware is typically fixed
Architectures designed for particular behaviour
Interconnection
Bus
the logical bundle of signals between processors and memory
Interconnection
The devices in a system architecture are connected by buses: shared or switched
Single bus: simple and easy to extend, but becomes a bottleneck
Crossbar: provides any-to-any connections without collisions; offers parallelism but scales badly
Network on Chip
Often a rectangular grid pattern, which suits a silicon layout
Bytes and Words
Byte usually smallest addressable unit
Word size varies - usually the natural data size in an architecture, increasingly 32 bits
Width of buses may be narrower
Alignment
Misaligned operations will be slower, or malfunction, so we must keep data at appropriate address boundaries
Code Alignment
enforced on some architectures, e.g. RISC
Some ISAs have variable-length operations, making alignment infeasible; fetched words are divided up after fetch, and NOPs can be inserted to allow loops to run more efficiently.
Endianness
Little/big endian: least/most significant at the lowest address
Little endian is more common at the moment
Von Neumann and other flavours
Most processors are Von Neumann at the software level but not always at the architectural level
Harvard architectures have separate address spaces
Commonly a Harvard interface to the processor, Von Neumann for the programmer
Memory has a 'soft' partition
System bus organisation
Hierarchy of memory and buses
Expensive, high-bandwidth (and often wide) fast buses
Slower I/O systems: ‘rarely’ used -> low performance
Bus-carried information
Operation - NOP, data write/read, instruction fetch
Address - may include byte selection
Data - both directions, maybe separate channels
Privilege - space access
Ready - after waiting for memory
Abort - sometimes the transaction is refused
Caches
Principles
Speed up a small subset of accesses assuming the address space is used unevenly over time
We rely on typical behaviour and locality
Spatial locality - using memory that is close together
Temporal locality - used now means it might be used again soon
Caches exploit this, putting things in use in fast memory
The working set - set of things that are being used here and now
Hit and Miss
To estimate average memory system performance: Tave = Thit + Pmiss x Tmiss
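As a rough illustration of the formula, a minimal C sketch; the hit time, miss probability and miss penalty are invented figures, not course data.

```c
#include <stdio.h>

/* Average memory access time: Tave = Thit + Pmiss * Tmiss.
   The figures below are illustrative assumptions, not measured values. */
int main(void) {
    double t_hit  = 1.0;    /* hit time in cycles     */
    double p_miss = 0.05;   /* miss probability (5%)  */
    double t_miss = 100.0;  /* miss penalty in cycles */

    double t_ave = t_hit + p_miss * t_miss;
    printf("Tave = %.1f cycles\n", t_ave);  /* 1 + 0.05*100 = 6 cycles */
    return 0;
}
```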
Technology dictates
Big memories are slow
Small memories are fast
SRAM is good for random access and fuss-free
SDRAM is good for big memories with burst access but is difficult to use
Operations
A cache comprises two parts
tag field, which notes which locations are currently cached
data field, which holds a copy of the data at the tagged location
Address is checked in tags and if recognised, corresponding data is read
If address is not recognised, address sent to memory and there is a memory read
Fully Associative Cache
An N-entry cache can cache 'N' items without restriction; this is fully associative
This is also expensive as the entire table needs searching
A Content Addressable Memory tells where content is stored; if the address is found, the data is read
Direct-mapped Cache
Each item has a predetermined place where it may be cached
An incoming address only needs one tag comparison
Data can be looked up speculatively, in parallel, rather than waiting for comparisons to resolve
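A minimal C sketch of how a direct-mapped cache might split an address into tag, index and line offset; the line size and number of lines are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters only: 32-byte lines, 256 lines (an 8 KiB direct-mapped cache). */
#define OFFSET_BITS 5                              /* 2^5 = 32-byte line */
#define INDEX_BITS  8                              /* 2^8 = 256 lines    */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

int main(void) {
    uint32_t addr   = 0x0001ABCDu;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);    /* byte within the line       */
    uint32_t index  = (addr >> OFFSET_BITS) & INDEX_MASK;  /* selects exactly one line   */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* compared with the stored tag */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```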
Cache misses
Compulsory misses - data was not previously used
Capacity misses - cache is full and more space needed
Conflict misses - architecture limits tag flexibility, cannot occur in fully associative
Every miss imposes a performance penalty
Fully associative: slower hits but fewer misses
Direct mapped: faster hits but more misses
Set-associative caches
Two direct mapped caches in parallel
each item can be cached in one of two places, reducing conflict, increasing cost, decreasing hit speed
Extensible to more ways (2, 4, 8-way caches); extending enough would make it fully associative
Cache lines
Spatial locality exploitation - store several consecutive words with a single tag, reducing tag overhead
Typical line size (memory cache): {4, 8} words
Choice
Architecture is a compromise
Size, Expected use, Associativity, Power
Cache Replacement Policy
For DM, there is no choice
Round Robin - easy to implement, can produce surprises
LRU - hard to implement fast in hardware; good results
Random - easy to implement: reasonably effective
Cache writes
As cache is a copy, writing changes the data
You can modify the cache and memory (write through) or just the cache (leaving outdated memory)
Write through
Write to cache and memory; slow, and an address may be written repeatedly, wasting energy
Means the cached copy can be discarded at any time; simple
Copy back (write back)
Write to cache only, faster and lower power
Memory becomes out of date and need to copy the cache data back on eviction, efficient but complicated
Copy back - flushing
Sometimes it is important the real memory is up to date
A cache flush is needed when something critical changes, which is expensive to write back
Write misses
Not uncommon - can store something before loading it
You can load the cache line first then modify the stored content or write through
Former is used because spatial locality suggests caching
Write Buffers
Write Buffer
Bandwidth can be increased by wider buses
Latency is trickier, but you don't always need to wait: loads depend on returned data, stores don't
Write operations may be 'fire and forget' - processor may proceed
Read operations may overtake queued writes as long as we forward any data yet unwritten
Reads impose latency, writes do not
FIFO queue for operations with forwarding as needed; fine for most memory (not I/O), and ordering must be enforced in some situations, e.g. before large memory loads
Ordering operations
Disallow buffering - pages can be marked to force writes to complete before processor can continue: a stall
Memory barrier (fence) - all preceding operations must complete before any succeeding operations start
Victim cache
Exploit write buffer after writes despatched, keep forwarding values after writing
Alleviates 'unfortunate' choice of rejected line
Cache coherency
Processor might re-write its own code and multiprocessors might interfere with data
Atomic operations
With multiple processors sharing a bus we need atomic 'Read-Modify-Write' operations and similar transactions
The bus must support this through more signals and wires
Cache Prefetch
Used in predictable, regular access patterns
Simulate a miss without wanting the data, using a software 'hint' instruction or hardware prediction (smart pattern-recognising logic)
Cache location
Cache entries are tagged by address
Virtual addressed cache
fast, but invalidated on a process context switch, hence expensive
used for L1 cache
Physical address cache
Longer access latency, but may persist over context switching
Memory Management
Virtual memory
Each process has a private address space, which are translated to the machine's physical addresses defined in a page table
Translation simplified by assuming locality
Multi-level page tables
Each access is looked up in a table in memory, which points at a table in memory, which is looked up to find a table in memory, etc.
Translations can be cached when first made
This cache is the Translation Look-aside Buffer
Table Walking
The TLB holds the page translations and permissions
A miss stalls the processor, reads the page tables in physical memory, caches the translation and then restarts the L1 cache access, i.e. table walking
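A hedged sketch of a two-level table walk in C, assuming a 32-bit virtual address, 4 KiB pages and a 10/10/12-bit split; real page-table entry formats and permission checks are architecture-specific and omitted here.

```c
#include <stdint.h>

/* Hypothetical two-level walk: 10-bit first-level index, 10-bit second-level
   index, 12-bit page offset.  In this simplified model a level-1 entry holds
   the address of the level-2 table directly; real formats differ. */
typedef struct { uint32_t entries[1024]; } page_table_t;

uint32_t translate(page_table_t *l1_table, uint32_t vaddr) {
    uint32_t l1_idx = (vaddr >> 22) & 0x3FF;   /* top 10 bits   */
    uint32_t l2_idx = (vaddr >> 12) & 0x3FF;   /* next 10 bits  */
    uint32_t offset =  vaddr        & 0xFFF;   /* 12-bit offset */

    /* First lookup: find the level-2 table; second lookup: find the frame. */
    page_table_t *l2_table = (page_table_t *)(uintptr_t)l1_table->entries[l1_idx];
    uint32_t frame_base = l2_table->entries[l2_idx] & ~0xFFFu;

    return frame_base | offset;                /* physical address */
}
```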
Memory Management Unit (MMU)
Protects memory from unauthorised operations
checks operation direction, privilege and address; a rejection causes an exception
Also translates virtual addresses to physical addresses
Also checks whether an access is allowed and grants permission for the memory cycle
Multiple levels of cache
We can have virtual and physical tags
Unifications
Harvard at L1 with multiprocessor sharing at L2
Larger and slower as hierarchy descended
Main Memory
RAM
Any random address will get the same treatment as any other (constant time)
Normally compared with 'Serial access' such as magnetic tape
Static RAM
Also randomly accessible, compact, low power, volatile, extensively used
RAM Layout Implications
Bigger memories need longer wires, so are slower and more power hungry
RAM Layout
Constructed as close-to-square blocks rather than long, thin arrays
The aspect ratio is adjusted by making core words wide and multiplexing
SRAM applications
Working RAM in small processor systems
Cache memories in big processor systems
Buffer memories in interfaces
Temporary storage in hardware etc.
Multi-port RAM
SRAM macrocells with dual or more buses capable of simultaneous access are feasible
Cost is high, and often not worthwhile
Content Addressable Memory (CAM)
Also known as associative memory
Unlike RAM, data goes in and an address comes out (almost)
Not part of the main memory hierarchy; there is no guarantee of finding the data, and it could be found in several places: prioritise/encode
Dynamic RAM
Not completely randomly accessible
Compact with moderate power needs
Volatile and needs to be continuously refreshed
SDRAM is DRAM
The latency is much less if an access uses an already open row, so we exploit this with burst access, reading from consecutive addresses
It is appropriate for much code behaviour and a match for filling cache lines
Structure
Rows and columns are separately addressed
Access
Activate a row, read or write a column of that row, maybe another column etc, and write the active row back to the array
Timing
Each action takes time
Synchronous DRAM
Modern way to interface RAM using a wrapper to make timing/interfacing easier and providing a high bandwidth
Still has long latency and scheduling issues
Banks
Multiple DRAM arrays within an SDRAM chip
They operate independently, allowing multiple operations to interleave
There may also be ranks, i.e. chips/modules
Timing
Reflects underlying DRAM activity with added clock sequences
Controller
Scheduling and interleaving requests from many sources
It translates processor or bus operations into appropriate command sequences
Programmed with the RAM's timing characteristics and clock speed
Addressing
With true RAM it is irrelevant which address bit is used where, but with SDRAM it matters
Some lower bits address the column, mapping bursts onto consecutive addresses
Using low-order address bits for the bank allows the banks to interleave
Using higher-order bits for the bank makes sense in a multicore system
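A small C sketch of one possible address mapping, with low-order bits for the column and the next bits for the bank so that consecutive bursts interleave across banks; the field widths are assumptions, not those of any particular SDRAM part.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative mapping only: 10 column bits, 2 bank bits, remaining bits = row.
   Placing the bank bits just above the column bits lets consecutive bursts
   fall into different banks and interleave. */
int main(void) {
    uint32_t paddr = 0x12345678u;
    uint32_t col  =  paddr        & 0x3FF;   /* low 10 bits: column   */
    uint32_t bank = (paddr >> 10) & 0x3;     /* next 2 bits: bank     */
    uint32_t row  =  paddr >> 12;            /* remaining bits: row   */
    printf("row=%u bank=%u col=%u\n", row, bank, col);
    return 0;
}
```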
Line closure
When reading bursts you can close the row ready for the next operation, or leave it open and hope the next access will be in the same row
This has less latency if your assumption is correct but much higher latency otherwise
Address alignment
Data bursts will be address aligned
Words are discarded when we don't need the whole line
Cache fetches will be naturally aligned
Miscellanea
SDRAM must also perform refresh, activating each row and precharging it every few milliseconds
It must do interface timing calibration, the controller or physical interface has to synchronise returned data with internal delays, impacting performance randomly
Signal Timing
PCB-level interface can impose latency, frequency is guaranteed while latency is uncertain and the controller resynchronises incoming data so that skew stays small
Memory Reliability
Memory failures are caused by high-energy particles passing through the silicon, leaving an ionisation track; the resulting temporary short circuits cause bits to flip
Soft errors
Bits can be corrupted by radiation
You can employ error correction codes (ECC)
Parity
Cheap and cheerful error check
Whether the number of 1 bits is odd or even, agreed in advance
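A minimal C sketch of computing an even-parity bit over a byte by XOR folding:

```c
#include <stdint.h>

/* Even parity over a byte: the stored parity bit makes the total number of
   1 bits even.  Parity generation is just XOR folding. */
uint8_t even_parity(uint8_t data) {
    data ^= data >> 4;
    data ^= data >> 2;
    data ^= data >> 1;
    return data & 1u;   /* 1 if the data has an odd number of 1 bits */
}
```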
ECC
Triple Modular Redundancy
Write each value three times, take the most common
Hamming Codes
Less overhead and used for memory protection
Hamming Codes
Group data bits in a different pattern, adding check bits to force known parity in each group
Assuming a single-bit error, the groups with the wrong parity identify the error position
Parity generation and checking is largely XOR
Overhead grows with log2 of the word size
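As an illustration, a C sketch of the classic Hamming(7,4) code; the bit-position convention used here is one common choice, not necessarily the one used in the course.

```c
#include <stdint.h>

/* Hamming(7,4): data bits d3 d2 d1 d0 plus check bits p1, p2, p4 at codeword
   positions 1, 2 and 4.  Each check bit forces even parity over its group;
   for a single-bit error the failing groups add up to the error position. */
uint8_t hamming74_encode(uint8_t d) {          /* d = 4 data bits */
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;                 /* covers positions 3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;                 /* covers positions 3,6,7 */
    uint8_t p4 = d1 ^ d2 ^ d3;                 /* covers positions 5,6,7 */
    /* codeword bit positions 1..7: p1 p2 d0 p4 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

uint8_t hamming74_syndrome(uint8_t cw) {       /* 0 = no error, else error position */
    uint8_t s1 = (cw ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;          /* positions 1,3,5,7 */
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;   /* positions 2,3,6,7 */
    uint8_t s4 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;   /* positions 4,5,6,7 */
    return s1 | (s2 << 1) | (s4 << 2);
}
```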
Secondary Storage
Secondary Storage
Main uses
Filing system
Page store
Outside the main address spaces
Various technologies
Magnetic disk, optical disk, SSD
Disks
Magnetic disks usually in stacks, use both sides, read using an arm with magnetic read/write heads moving radially
Bits are encoded by polarisation of domains
Data storage is organised in blocks (up to 20TB)
Latency is the time to move the heads to the right track (depending on the start and end points), plus a rotational delay of on average half a rotation
Bandwidth depends on the disk rotation speed and areal density and is much reduced by fragmented placement
Disk performance
Quantified as the latency plus the file size divided by the data rate
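A tiny C sketch of that calculation, with invented latency, file size and transfer rate:

```c
#include <stdio.h>

/* Rough transfer time for one request: latency + size / data rate.
   The numbers are illustrative, not from any specific drive. */
int main(void) {
    double latency_s  = 0.008;              /* 8 ms seek + rotation        */
    double size_bytes = 4.0 * 1024 * 1024;  /* 4 MiB file                  */
    double rate_bps   = 200e6;              /* 200 MB/s sustained transfer */
    printf("time = %.3f s\n", latency_s + size_bytes / rate_bps);
    return 0;
}
```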
Fault tolerance
Technology is not 100% reliable
CRC was used for error detection, but more recently ECC is used to accept and correct faulty bits
When that is insufficient, bad sectors are remapped so some parts of the surface are not used
RAID (Redundant Array of Inexpensive Disks)
Improved reliability can be achieved by spending more on higher-quality parts, or by using redundancy for fault tolerance with cheaper parts; RAID is aimed at the latter
RAID 0
Striping data; improved bandwidth with parallel access, but not redundant, with failure more likely and critical
RAID 1
Mirror same data on many disks, higher bandwidth for simultaneous reads
RAID 2 - stripe bits with added Hamming code
RAID 3 - byte stripes: parity
RAID 4 onwards - similar schemes
Nested RAID - hybrid systems
RAID 01 is a RAID 1 array of RAID 0 systems
Interconnection
Typically connected via Serial AT Attachment
Bus interface - carried serially
Disk caches
Disk buffer can save pages and act as a write buffer
SSDs
Permanent, non-volatile semiconductor store, typically flash memory
Advantages
Robust and quiet
Compact
Lower latency
Disadvantages
Long term data retention (probably) worse
Limited number of write operations (wear out)
An SSD may be multiple devices and will have a controller to deal with interface protocols
It could contain volatile RAM for buffering or caching purposes.
Flash Memory
Can only write bits in one sense, and writes are slow
Can only erase whole blocks, also slow
Write operations cause wear; wear levelling spreads activity evenly across the device. Access is random, so fragmentation is not an issue
Principles
Increasing bandwidth/throughput
Parallelism adds more units: busses, lanes, disks, superscalar
The cost is more wires, pins, space
Interleaving
In a large FIFO buffer each SRAM block is single ported, but looks dual ported if addresses are interleaved at word level
Caching
Keeping a copy of a subset of data in a fast local store; works well for typical behaviour
Applications include computer main memory, page translation, virtual memory pages etc
Speculation
Instruction prefetch relies on most instructions being sequential and is improved by branch prediction
Typically associated with pipelining
You can be lazy - don't start something until it's known to be needed; efficient in effort but long latency
You can be eager - speculate on what may be needed in future which reduces latency but wastes effort
Synchronisation
Important to give the appearance of keeping everything in order
Important in keeping cache coherency, e.g. when updating one's own instructions
Need a reliable means of synchronising parallel threads
Error detection/correction
Memory may use parity or block ECC
Disk blocks and network packets use CRC (Cyclic Redundancy Check)
Data streams use codings which correct burst errors
Profiling
We optimise common features
ISAs include common and cheap operations, excluding expensive ones unless there is a powerful incentive
Profiling is aided by embedded hardware counters and can reveal aspects of system behaviour, e.g. instructions executed, cache misses etc.
Benchmarks
Can profile architecture for a specific application
Code is profiled using agreed 'representative' benchmarks
Suites of typical programs are maintained by the Standard Performance Evaluation Corporation, which allow comparisons of processor/memory/compiler combinations
Coprocessors
Mechanisms for extending an instruction set
Expensive in hardware and controlled by the usual instruction stream
DMA Controller
Programmed by the processor, then moves data 'by bus' autonomously
Processor caches reduce bus contention, interrupts on completion
Controller works in physical address space and is used for I/O and memory-memory sometimes
Out-of-order Processors
Recap
Pipelining - better utilisation of the hardware by allowing different parts of different instructions to be processed at the same time
Control hazard - the next instruction to fetch is not known until a branch is resolved, disrupting sequential fetch
Data hazard - data dependency between instructions that could result in outdated data being used; mitigated by extra forwarding paths in the datapath, NOPs or reordering instructions
Superscalar processor - a processor able to execute more than one instruction in parallel, requiring extra execution lanes and an increased number of ports to memory and the register bank. Scalability limits are hardware complexity, which increases with each extra lane, and applications may not exhibit enough parallelism
Functional unit - a hardware component of a processor which can perform a specific operation
Structural hazard - when an instruction cannot be issued because all suitable functional units are busy
Reordering instructions
Compiler optimisation
Compilers can reorder, but hardware may still be needed to avoid issuing conflicts
If we could rely on the compiler we could get rid of expensive checking logic - the principle of Very Long Instruction Word (VLIW)
Compiler limitations
Legacy binaries - optimum code tied to a particular hardware configuration, 'code bloat' in VLIW
Out of order processors
An instruction buffer needs to be added to store all issued instructions
A dynamic scheduler is in charge of sending non-conflicting instructions to execute
Memory and register accesses need to be delayed until all older instructions are finished
Out of Order Execution
This requires instruction dispatching and scheduling, and memory and register accesses to be deferred
Modern Processor Architecture
Modern pipelines have many execution flows, and only some are pipeline-able (shown by buffers)
Structural Hazards
Some functional units may not be pipelined, meaning only one instruction can use them at once
If all suitable functional units for executing an instruction are busy, then it cannot be executed
Out-of-order Processors
Out of order execution
the original order is not preserved, and instructions are executed as input data becomes available
pipeline stalls due to conflicted instructions are avoided by processing instructions which are able to run immediately
Conflicted instructions
Cache misses - long wait before finishing execution
Structural hazard - required functional unit not available
Data hazard - dependencies between instructions
More complex data dependencies
True dependency - read after write
Anti-dependency - write after read
Output dependency - write after write
Dynamic Scheduling
Allow instructions behind a stall to proceed -> instructions executing in parallel
This overcomes the limitations of in-order pipelined execution by allowing out-of-order instruction execution
Benefits
Accelerates the execution of programs
More efficient design - increases the utilisation of processor resources
Limitations
More complex design
Expensive in terms of area and power
Non-precise interrupts
Scoreboarding
The scoreboard is a centralized hardware mechanism, instructions are executed when their operands are available and there are no hazards
Hardware constructs the dependency graph for a window of instructions, and the scoreboard provides the information necessary for the parts of the processor to work together
This divides the instruction decode stage into issue and read operands, and uses in-order issue with out-of-order execution and commit
Stages of a Scoreboard Pipeline
Issue - decode instructions and check for structural and WAW hazards
Read operands - wait until no data hazards, then read operands
Execution - operate on operands
Write result - finish execution and write results
Information within the scoreboard
Instruction status, functional unit status, register result status
Limitations of Scoreboard
The amount of parallelism available among the instructions
Centralized structures are not too scalable
Scales with the number of scoreboard entries and the number and types of functional units
Dealing with dependencies
RAW - stall conflicted instruction in the FU
WAW - stall the whole pipeline
WAR - stall conflicted instruction in the Write Result stage
Tomasulo
Algorithm
Logic for OOO execution is decentralised
Reservation stations in the functional units keep instruction information and rename registers
A common data bus (CDB) broadcasts data and results to the different devices
Distributed control allows for larger window of instructions, more flexible dynamic scheduling
Structural hazards stall the pipeline
Reservation stations track operands and buffer them when available, reducing pressure on registers and the impact of RAW hazards
Execute an instruction when its operands are available; WAW and WAR dependencies are resolved through register renaming
Register renaming can be done by compiler, but Tomasulo does it in hardware
Common Data Bus
Data and source come from the bus
64 bits of data + 4 bits of functional unit address; FUs broadcast their results, reservation stations take the operand if it matches the FU they are waiting on, and the register bank takes the operand if it matches the FU writing the result
Stages of Tomasulo Algorithm
Issue - get an instruction from the FP Op Queue
Execute - operate on operands
Write result - finish execution
Reservation Station Components
No information about instructions needed
Information in the Reservation Station
Operation to perform
Value of source operands
Reservation stations producing source registers
Busy
Register result status - indicates which functional unit will write each register, if one exists
Advantages
Distributed hazard detection logic
Avoids stalling due to WAW and WAR hazards
Disadvantages
Complexity of hardware
Performance limited by the Common Data Bus
Scoreboard vs Tomasulo
Window size: <= 5 vs <= 14
Structural hazard: stall pipeline vs no issue
WAR dependency: stall completion vs renaming avoids
WAW dependency: stall pipeline vs renaming avoids
Results forwarding: write/read registers vs broadcast from FU
Control structure: central scoreboard vs distributed reservation stations
Multi-threading
Increasing processor performance
Minimise memory access impact (caches); increase clock frequency (pipelining, with branch prediction and forwarding)
Increasing parallelism
This is limited by the programs as some areas exhibit great parallelism and others are sequential
We can find independent instructions in a different process
Hardware multithreading allows several threads to share a single processor
Software Multithreading ie Multitasking
A modern OS supports several processes running concurrently by interleaving/scheduling them
OS Thread Switching - a context switch is done every few ms, so processes appear to run in parallel
We use hardware multithreading because loading/saving this state in software is quite slow
Process Control Block (PCB)
Stores information about the state of 'alive' processes handled by the OS
Process ID, Process State, PC, Stack Pointer, General Registers, Memory Management Info etc
Hardware Multithreading
Exploits thread-level parallelism
Allows multiple threads to share a single processor, more efficient and fast thread switching
Requires replicating the HW that stores the independent state of each thread - Registers, Program Counter, TLB
Virtual memory can be used to share memory among threads
This needs multiple PCs and register banks and extra address translation capabilities
Decisions
presents each hardware thread as a virtual processor
requires multiprocessor support from the OS
Needs to replicate registers and share caches, but this can cause thrashing issues
Coarse-grained Multithreading
Issues instructions from a single thread, operating like a simple pipeline
Switches on an expensive operation or after a quantum of execution
Good to compensate for infrequent, but expensive pipeline disruption
Minimal pipeline changes - need to abort all instructions in the shadow of a D-cache miss -> overhead, and resume the instruction stream to recover
Short stalls are not solved, and a fast thread-switching mechanism is required
Fine-grained Multithreading
Overlap in time the execution of several threads, using round robin among all the 'ready' hardware threads
Requires instantaneous thread switching but helps alleviate fine grained dependencies
An i-cache miss is overcome transparently
A d-cache miss has the thread marked as not ready
In an OOO processor, we may continue issuing instructions from both threads
Utilisation of pipeline resources is increased; each thread perceives it is being executed more slowly, but overall performance is better
Simultaneous Multithreading
Exploits ILP and thread level parallelism at the same time
We schedule as many ready instructions as possible, making reading and writing more complex
Design Considerations
fetch and prioritisation policies
allocation policies
how to measure performance
where to fetch from
Issues with in-order processors
Asymmetric pipeline stall - A pipeline may be stalled without need
Overtaking - can't allow instructions to overtake their predecessors if stalled
Cache misses
Most existing implementations are for OoO, register-renamed architectures
Has significant hardware overhead
Advanced uses of multithreading
Speculative execution
Spawn two paths when reaching a conditional branch, and kill the wrong path once we know the correct one
Memory prefetching
Compile applications into two threads, with one being for scouting ahead to fetch memory in advance
Slipstreaming
Compile sequential applications into two threads, with one running the critical path and a slipstream to run ahead and pass results
Transient Fault Detection
When reaching a block of code that requires reliable computation, we can replicate it, compare the results to ensure the result is correct, or re-do it
Multi-core
One of the solutions to the limitations in single-thread performance and power has been to integrate more cores
Improving single-core performance at the architecture level
Caches minimise memory access impact
Pipelines and superscalar increase the amount of parallelism supported by the processor
OoO processing and multithreading increase the amount of independent instructions, with other methods minimising the impact of hazards
Slow down of single processor architecture
Power density is increasing, so cooling is an issue
Limited parallelism within applications
Memory does not get faster at the same rate as processors
Architectural innovation hitting design complexity problems
Smaller transistors have less predictable characteristics
The Power Wall
Power dissipation depends on clock rate, capacitive load and voltage: P = C_L x V^2 x F
Decreasing voltage reduces dynamic power consumption but increases the static power leakage from transistors
We have reached the practical power limit for cooling commodity microprocessors
The ILP Wall
The implicit parallelism between instructions in a thread is limited in many applications
Hardware imposes limitations - number of instructions that can be fetched per cycle, instruction window size
This reduces the effectiveness of all mechanisms
Amdahl's Law
Estimates a parallel system's maximum performance based on the available parallelism of an application - originally meant to discourage parallelism
speedup = (s + p) / (s + p/N) = 1 / (s + p/N), where s is the serial fraction of the code, p = 1 - s is the parallel fraction and N is the number of processors (see the sketch below)
Reformulated to show that the fraction of serial code is normally constant while the fraction of parallel code depends on the size of the input data; for more parallelism, increase your dataset
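A small C sketch of Amdahl's Law with an assumed 10% serial fraction, showing how the speedup saturates as processors are added.

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (s + (1 - s)/N) for serial fraction s and
   N processors.  The 10% serial fraction below is just an example. */
double amdahl_speedup(double serial_fraction, int n_processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors);
}

int main(void) {
    for (int n = 1; n <= 1024; n *= 4)
        printf("N=%4d  speedup=%.2f\n", n, amdahl_speedup(0.10, n));
    /* With s = 0.10 the speedup saturates towards 1/s = 10. */
    return 0;
}
```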
The Memory Wall
CPU speed was increasing by over 50% per year, while memory speed only increased by 10%
Fetching data limits processor utilisation
Solution is replication
Put multiple cores on a single chip, used in parallel. Simpler to design than a more complex processor
How to put them together
Have independent processor/memory pairs
A small number of cores can be used for separate tasks, but for a single task to be improved we need parallel programming with shared memory, message passing, independent threads etc
Problems
We do not know how to engineer extensible memory systems or write general-purpose parallel programs
Applications on Multi-cores
Independent processes usually do not share data as they have separate virtual memory spaces, but communication is explicit through I/O
Threads have issues as we have to ensure memory coherency - changes seen everywhere - and consistency - correct ordering of memory accesses
Memory Coherence
To ensure this the core may either write through to main memory or copy back only when cache line is rejected
We must make sure that every cache with a copy of the data is updated which is expensive
Memory Consistency
The model presented to the programmer of when changes are seen
Multiple copies are consistent if a read operation returns the same value from all copies and a write operation updates all copies before any other operation takes place
Sequential Consistency
Memory operations appear to execute one at a time, all thread see write operations in the same order and operations within a single thread appear to execute in the order described
This is what programmers expect, but is not the most stringent memory model
The compiler has to insert special instructions to maintain program semantics
Fence - all memory accesses before the fence need to complete before the ones after
Barrier - all threads need to reach the barrier before any of them can continue execution
Lock - only one thread can proceed to the atomic section
Synchronisation
Locks are a synchronisation mechanism that others are based on, as they ensure expected behaviour
ISA support for synchronisation
Atomic compare-and-swap instructions (see the sketch after this list)
A single instruction - cannot interleave reading and writing
Load-linked and store-conditional instructions - hardware locks the cache line and checks whether any instruction has modified data in the cache line before storing
Transactional Memory - execute assuming no conflicts and check afterwards
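As a hedged illustration of how compare-and-swap supports synchronisation, a minimal C11 spinlock sketch using <stdatomic.h>; real locks add backoff, fairness and OS integration.

```c
#include <stdatomic.h>

/* Minimal spinlock built on an atomic compare-and-swap: only one thread can
   change the flag from 0 to 1, so only one thread enters the critical section. */
typedef struct { atomic_int locked; } spinlock_t;

static void lock(spinlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;            /* CAS failed: the flag was already 1, retry */
    }
}

static void unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0); /* release: the write becomes visible to other cores */
}
```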
Coherence systems
Information required includes where the other copies of data are, what to do when a value changes and how to deal with shared data
We share this data through
snooping protocols - all caches are aware of everything happening in the system
directory protocols - cache line information is stored in a separate subsystem and changes are updated there
Snooping protocols
Each cache snoops for activity concerned with its cached addresses, which requires a broadcast communications structure
Write update - a core wanting to write grabs a bus cycle and broadcasts address and new data as it updates its own copy, snooping caches update their copy
Write invalidate - a core wanting to write grabs a bus cycle and sends a write-invalidate message containing the address; any subsequent shared read in other cores misses and fetches the value from the updated cache or main memory. This uses significantly less bandwidth
Implementation issues
Knowing that a cached value is not shared can avoid sending messages, improving latency, lowering bandwidth consumption and decreasing contention for bus usage
The invalidate description assumes write-through, as with copy-back a read could fetch a stale value
MESI Protocol
Multicore invalidate protocol that minimises bus usage and implements copy-back, writing back to memory when a dirty cache line is displaced (a simplified local-write sketch follows the state list)
Cache line is in one of 4 states
Modified - only cached copy and dirty
Exclusive - coherent with main and is the only cached copy
Shared - coherent with main but is not the only cached copy
Invalid - out of date
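A simplified C sketch of how a cache might react to a write by its own core under MESI; bus-side transitions (responses to other cores' traffic) are deliberately omitted.

```c
/* Simplified MESI reaction to a write by the local core.  A real protocol also
   handles bus messages from other cores; this only shows the local-write case. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

mesi_state_t on_local_write(mesi_state_t s) {
    switch (s) {
    case MODIFIED:  return MODIFIED;   /* already dirty, nothing to announce      */
    case EXCLUSIVE: return MODIFIED;   /* only copy: silently becomes dirty       */
    case SHARED:    /* broadcast an invalidate to other caches, then modify       */
                    return MODIFIED;
    case INVALID:   /* read-with-intent-to-modify (a miss), then modify           */
    default:        return MODIFIED;
    }
}
```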
MOESI Protocol
Avoids the need to copy back to main memory when shared
Modified - only cached copy and dirty
Owner - cache line is modified and different from main memory, there can be cached copies in S state
Exclusive - coherent with main memory and the only cached copy
Shared - coherent with main memory, but copies exist in other caches or there is a copy in O state
Invalid - invalid data
Beyond Multicore
Snoopy vs Directory
Global view vs local view + directory
Minimal info vs extra info for the directory and remote shared lines
Centralized communication vs parallel communication
Poor scalability vs better scalability
False sharing
pathological behaviour when two unrelated variables are stored in the same cache line; if they are written often by two different cores, they will generate a lot of invalidate/update traffic
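A small C sketch of the layout issue, assuming 64-byte cache lines (an assumption; line size varies by machine):

```c
#include <stdint.h>

/* Two per-core counters: in the packed struct they share one 64-byte cache
   line, so writes from different cores keep invalidating each other's copy.
   Padding the first counter to a full line removes the false sharing. */
struct counters_bad {
    uint64_t core0_count;                  /* same cache line ...           */
    uint64_t core1_count;                  /* ... as this one               */
};

struct counters_good {
    uint64_t core0_count;
    char     pad[64 - sizeof(uint64_t)];   /* push next field to a new line */
    uint64_t core1_count;
};
```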
The need for networks
All multicore systems must contain means for cores to communicate with memory and each other
Shared-memory applications
They must ensure consistency and coherency
Distributed-memory applications
Have independent processor/store pairs, so no coherence is provided at the processor level, saving chip area
Communication/synchronisation is introduced explicitly in the code through message passing
Evaluating networks
This is done through bandwidth, latency, congestion, fault tolerance, area and power dissipation
Important features of Network on Chips
Topology - how cores and networking elements are connected together
Routing - how traffic moves through the topology
Switching - how traffic moves from one component to the next
Topology
Bus - common wire interconnection, single usage point at a time, controlled by clock, often split transaction; the main scalability issue is throughput
Crossbar - connects N inputs to N outputs, can achieve any-to-any in parallel; area and power scale quadratically with the number of nodes, hence not scalable
Tree - variable bandwidth and latency and reliability is an issue (There is also a Fat Tree)
Ring - simple but low bandwidth, variable latency
Mesh/Grid - reasonable bandwidth, variable latency, convenient for the physical layout of large systems
Routing
Length of routes - minimal routing takes the shortest path but is likely to be blocked, while non-minimal routing allows packets to be diverted, but there is a risk of livelock
Oblivious routing - unaware of the network state; paths may be deterministic or non-deterministic; simpler, deadlock-free router, but prone to contention
Adaptive routing - aware of the network state, so higher performance, but router instrumentation is required and it is deadlock-prone
Switching
Packet switching - data is split into small packets with info to identify the data
Store and forward switching - a packet is not forwarded until all its phits arrive at each intermediate node; has failure detection, but low performance
Cut-through/Wormhole switching - a packet can be forwarded as soon as the head arrives at an intermediate node, which gives better performance, but fault detection is only possible at the destination
Clusters, Datacentres and Super Computers
Have lots of CPUs on lots of motherboards
High performance computing - run one large task as quickly as possible
High throughput computing - run as many tasks per unit of time as possible
Big Data Analytics - analyse and extract patterns from large, complex data sets
Building a Cluster, Supercomputer or Datacentre
Lots of computers, optimised for cooling and power efficiency, with high redundancy for fault tolerance; join lots of compute racks, add a network, power distribution, cooling and dedicated storage
Single Instruction Multiple Data
Vector processing - perform the same operation over many data at the same time, reducing instruction memory pressure
Performance limiting issues are conditionals, loop dependencies, non-sequential data, memory indirection
General Purpose computing using GPUs (GPGPU)
GPU architecture is based on arrays of cores executing the same instruction over different data
Limitations are
Moving data is very expensive and needs to be done to start execution
The memory accessed by all the cores within an array needs to be consecutive
All the cores execute the same instruction
Pipelining
Types of instructions
Memory instructions - move data between memory and registers
Processing operations - perform calculations with values in registers
Control flow instructions - take decisions, repeat operations
Fetch-Execute
Fetch
Get instructions from memory
Decode instruction and select registers
Execute
Perform operation or calculate address
Access an operand in data memory
Write result to a register
Cycles of operation
Most logic circuits are driven by a clock
In its simplest form, one instruction would take one clock cycle
This means that, with one instruction per clock, each stage is only working 1/5th of the time (assuming five stages)
To optimise this we add buffers between stages so that each stage can start working on something else, increasing the clock frequency by up to 5x
Limits to pipeline scalability
Higher frequency -> more power
More stages means more hardware, a more complex design, more difficulty splitting the work into uniform chunks, and the loading time of the registers limits the cycle period
A longer datapath means a higher probability of hazards occurring and worse penalties when they happen
The Control Transfer Problem
As instructions are normally fetched sequentially, a branch is only recognised when we decode it in the second stage, by which time we have already fetched the next instruction, leaving us with a bubble in the pipeline
For conditional branches this is worse because we cannot determine its outcome until the third stage, so we have two bubbles
This can be avoided by reading registers during the decode stage
Deeper Pipelines
Bubbles due to branches are called Control Hazards
We can get around this by using branch prediction, as in most programs many branch instructions are executed many times
We can take note of its address and the target and use this the next time the branch is fetched
We can go to longer pipelines
less logic per pipeline stage
each step takes less time
clock frequency increases
But
greater penalty for hazards
more likely to have conflicting instructions
more complex control
Hence we must trade off between frequency, power and area
Branch Target Buffer
A small cache that we use to store branch addresses and their target addresses
For unconditional branches it is always correct, but for conditional branches it depends on the probability of repeating the target
Other Branch Prediction
BTB is simple to understand but expensive to implement
Prediction accuracy depends on more history and context
Data Hazards
If there is a data dependency between two instructions, and the data is not yet ready when the second instruction is ready to execute, this is a data hazard.
We can deal with this by detecting dependencies in hardware and holding instructions in the decode stage until the data is ready, i.e. a pipeline stall
This causes bubbles and wasted cycles again
The compiler can also try to reorder instructions (if there is something useful to do) or insert NOPs
Forwarding
When data is ready after the execute stage, it can be forwarded to the execute stage of the instruction requiring the data
This makes the control very complex
Instruction Level Parallelism
When we can execute some parts of the program together as long as there are few dependencies
Real programs often have lots of independent instructions
We can exploit this by fetching multiple instructions per cycle, decoding them, requiring multiple ALUs for execution, using common registers and accessing a common data cache
With a dual issue pipeline, two instructions can be executed at the same time, possibly doubling the execution rate
Access to registers and cache will be doubled so we may need them to be dual ported as well
However, to get this performance we need independent instructions: a 'dispatch unit' in the fetch stage uses hardware to examine instruction dependencies and only issues two in parallel if they are independent (see the sketch below)
Data dependencies limit our opportunities for exploiting parallelism, but by examining these and reordering, we can execute pairs in parallel
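A hedged C sketch of the kind of check a dispatch unit might make, using a toy instruction format (not a real ISA):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy instruction: one destination register and two source registers.
   The encoding and the check below are illustrative only. */
typedef struct { uint8_t rd, rs1, rs2; } instr_t;

/* Dual issue is allowed only if the second instruction neither reads the
   first one's result (RAW) nor writes the same destination (WAW). */
bool can_dual_issue(instr_t first, instr_t second) {
    bool raw = (second.rs1 == first.rd) || (second.rs2 == first.rd);
    bool waw = (second.rd  == first.rd);
    return !raw && !waw;
}
```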
Limits of ILP
Modern processors are up to 4-way superscalar, but rarely achieve 4x speed
There is a lot of hardware complexity and limited amounts of ILP in real programs due to assumption of serial execution