COMP25212 System Architecture
Introduction
Moore's Law
Smaller transistors mean more and faster transistors
This means more complexity
Some avenues are restricted by physical limits such as the speed of light and power dissipation
It is limited by cost and hence slowing down
The three-box model comprises the processor, memory and I/O, with buses running between them.
Address space(s)
Storage is a hierarchy
Registers
Memory - layered in cache levels
I/O space
Accelerators
Secondary storage
Latency and Bandwidth
Latency - delay from asking until receiving (measured in time)
When reading data you might have to wait for register updates, and memory/disks to locate and fetch data
Writing is less critical and can be deferred whilst continuing
Bandwidth - rate of delivery (measured in bytes per time)
Processor registers allow two or three read operations and one or two write operations per cycle
Memory provides single access per cycle
Disks are really slow in comparison
Memory
You can prioritise higher speed and a smaller memory, or higher capacity and a slower memory
Amdahl's Second Law
A balanced computer should have 1MIPS processing per 1MB memory per 1Mb/s I/O
Input and Output
Mostly interfaced as memory mapped I/O with devices in main address space
Devices not as 'simple as memory'
I/O fairly slow
Economics
Want the maximum 'value' (computations) at minimum 'cost' (capital cost, energy); this depends on software behaviour
Compromise
Programmable machines do different tasks
Hardware is typically fixed
Architectures designed for particular behaviour
Interconnection
Bus
the logical bundle of signals between processors and memory
Interconnection
The devices in a system architecture are connected by buses: shared or switched
Single bus: simple and easy to extend, but becomes a bottleneck
Crossbar: provides any-to-any connections without collisions; offers parallelism but scales badly
Network on Chip
Often a rectangular grid pattern, which suits a silicon layout
Bytes and Words
Byte usually smallest addressable unit
Word size varies - usually the natural data size in an architecture, increasingly 32 bits
Width of buses may be narrower
Alignment
Misaligned operations will be slower, or malfunction, so we must keep data at appropriate address boundaries
Code Alignment
enforced on some architectures, e.g. RISC
Some ISAs have variable-length operations, making alignment infeasible; fetched words are divided up after fetch, and NOPs can be inserted to allow loops to run more efficiently.
Endianness
Little/big endian: least/most significant at the lowest address
Little endian is more common at the moment
Von Neumann and other flavours
Most processors are Von Neumann at the software level but not always at the architectural level
Harvard architectures have separate address spaces
Commonly a Harvard interface to the processor, Von Neumann for the programmer
Memory has a 'soft' partition
System bus organisation
Hierarchy of memory and buses
Expensive, high-bandwidth (and often wide) fast buses
Slower I/O systems: ‘rarely’ used -> low performance
Bus-carried information
Operation - NOP, data write/read, instruction fetch
Address - may include byte selection
Data - both directions, maybe separate channels
Privilege - space access
Ready - after waiting for memory
Abort - sometimes the transaction is refused
Caches
Principles
Speed up a small subset of accesses assuming the address space is used unevenly over time
We rely on typical behaviour and locality
Spatial locality - using memory that is close together
Temporal locality - used now means it might be used again soon
Caches exploit this, putting things in use in fast memory
The working set - set of things that are being used here and now
Hit and Miss
To estimate average memory system performance: Tave = Thit + Pmiss x Tmiss
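As a rough illustration of the formula, a minimal C sketch; the hit time, miss probability and miss penalty are invented figures, not course data.

```c
#include <stdio.h>

/* Average memory access time: Tave = Thit + Pmiss * Tmiss.
   The figures below are illustrative assumptions, not measured values. */
int main(void) {
    double t_hit  = 1.0;    /* hit time in cycles     */
    double p_miss = 0.05;   /* miss probability (5%)  */
    double t_miss = 100.0;  /* miss penalty in cycles */

    double t_ave = t_hit + p_miss * t_miss;
    printf("Tave = %.1f cycles\n", t_ave);  /* 1 + 0.05*100 = 6 cycles */
    return 0;
}
```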
Technology dictates
Big memories are slow
Small memories are fast
SRAM is good for random access and fuss-free
SDRAM is good for big memories with burst access but is difficult to use
Operations
A cache comprises two parts
tag field, which notes which locations are currently cached
data field, which holds a copy of the data at the tagged location
Address is checked in tags and if recognised, corresponding data is read
If address is not recognised, address sent to memory and there is a memory read
Fully Associative Cache
An N-entry cache can cache 'N' items without restriction; this is fully associative
This is also expensive as the entire table needs searching
A Content Addressable Memory tells where content is stored; if the address is found, the data is read
Direct-mapped Cache
Each item has a predetermined place where it may be cached
An incoming address only needs one tag comparison
Data can be looked up speculatively, in parallel, rather than waiting for comparisons to resolve
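A minimal C sketch of how a direct-mapped cache might split an address into tag, index and line offset; the line size and number of lines are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters only: 32-byte lines, 256 lines (an 8 KiB direct-mapped cache). */
#define OFFSET_BITS 5                              /* 2^5 = 32-byte line */
#define INDEX_BITS  8                              /* 2^8 = 256 lines    */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

int main(void) {
    uint32_t addr   = 0x0001ABCDu;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);    /* byte within the line       */
    uint32_t index  = (addr >> OFFSET_BITS) & INDEX_MASK;  /* selects exactly one line   */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* compared with the stored tag */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```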
Cache misses
Compulsory misses - data was not previously used
Capacity misses - cache is full and more space needed
Conflict misses - architecture limits tag flexibility, cannot occur in fully associative
Every miss imposes a performance penalty
Fully associative: slower hits but fewer misses
Direct mapped: faster hits but more misses
Set-associative caches
Two direct mapped caches in parallel
each item can be cached in one of two places, reducing conflict, increasing cost, decreasing hit speed
Extensible to more ways (2, 4, 8-way caches); extending enough would make it fully associative
Cache lines
Spatial locality exploitation - store several consecutive words with a single tag, reducing tag overhead
Typical line size (memory cache): {4, 8} words
Choice
Architecture is a compromise
Size, Expected use, Associativity, Power
Cache Replacement Policy
For DM, there is no choice
Round Robin - easy to implement, can produce surprises
LRU - hard to implement fast in hardware; good results
Random - easy to implement: reasonably effective
Cache writes
As cache is a copy, writing changes the data
You can modify the cache and memory (write through) or just the cache (leaving outdated memory)
Write through
Write to cache and memory; slow, and an address may be written repeatedly, wasting energy
Means the cached copy can be discarded at any time; simple
Copy back (write back)
Write to cache only, faster and lower power
Memory becomes out of date and need to copy the cache data back on eviction, efficient but complicated
Copy back - flushing
Sometimes it is important the real memory is up to date
A cache flush is needed when something critical changes, which is expensive to write back
Write misses
Not uncommon - can store something before loading it
You can load the cache line first then modify the stored content or write through
Former is used because spatial locality suggests caching
Write Buffers
Write Buffer
Bandwidth can be increased by wider buses
Latency is trickier, but you don't always need to wait: loads depend on returned data, stores don't
Write operations may be 'fire and forget' - processor may proceed
Read operations may overtake queued writes as long as we forward any data yet unwritten
Reads impose latency, writes do not
FIFO queue for operations with forwarding as needed; fine for most memory (not I/O), and ordering must be enforced in some situations, e.g. before large memory loads
Ordering operations
Disallow buffering - pages can be marked to force writes to complete before processor can continue: a stall
Memory barrier (fence) - all preceding operations must complete before any succeeding operations start
Victim cache
Exploit write buffer after writes despatched, keep forwarding values after writing
Alleviates 'unfortunate' choice of rejected line
Cache coherency
Processor might re-write its own code and multiprocessors might interfere with data
Atomic operations
With multiple processors sharing a bus we need atomic 'Read-Modify-Write' operations and similar transactions
The bus must support this through more signals and wires
Cache Prefetch
Used in predictable, regular access patterns
Simulate a miss without wanting the data, using a software 'hint' instruction or hardware prediction (smart pattern-recognising logic)
Cache location
Cache entries are tagged by address
Virtual addressed cache
fast, but invalidated on a process context switch, hence expensive
used for L1 cache
Physical address cache
Longer access latency, but may persist over context switching
Memory Management
Virtual memory
Each process has a private address space, which are translated to the machine's physical addresses defined in a page table
Translation simplified by assuming locality
Multi-level page tables
Each access is looked up in a table in memory, which points at a table in memory, which is looked up to find a table in memory, etc.
Translations can be cached when first made
This cache is the Translation Look-aside Buffer
Table Walking
The TLB holds the page translations and permissions
A miss stalls the processor, reads the page tables in physical memory, caches the translation and then restarts the L1 cache access, i.e. table walking
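A hedged sketch of a two-level table walk in C, assuming a 32-bit virtual address, 4 KiB pages and a 10/10/12-bit split; real page-table entry formats and permission checks are architecture-specific and omitted here.

```c
#include <stdint.h>

/* Hypothetical two-level walk: 10-bit first-level index, 10-bit second-level
   index, 12-bit page offset.  In this simplified model a level-1 entry holds
   the address of the level-2 table directly; real formats differ. */
typedef struct { uint32_t entries[1024]; } page_table_t;

uint32_t translate(page_table_t *l1_table, uint32_t vaddr) {
    uint32_t l1_idx = (vaddr >> 22) & 0x3FF;   /* top 10 bits   */
    uint32_t l2_idx = (vaddr >> 12) & 0x3FF;   /* next 10 bits  */
    uint32_t offset =  vaddr        & 0xFFF;   /* 12-bit offset */

    /* First lookup: find the level-2 table; second lookup: find the frame. */
    page_table_t *l2_table = (page_table_t *)(uintptr_t)l1_table->entries[l1_idx];
    uint32_t frame_base = l2_table->entries[l2_idx] & ~0xFFFu;

    return frame_base | offset;                /* physical address */
}
```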
Memory Management Unit (MMU)
Protects memory from unauthorised operations
checks operation direction, privilege and address; a rejection causes an exception
Also translates virtual addresses to physical addresses
Also checks whether an access is allowed and grants permission for the memory cycle
Multiple levels of cache
We can have virtual and physical tags
Unifications
Harvard at L1 with multiprocessor sharing at L2
Larger and slower as hierarchy descended
Main Memory
RAM
Any random address will get the same treatment as any other (constant time)
Normally compared with 'Serial access' such as magnetic tape
Static RAM
Also randomly accessible, compact, low power, volatile, extensively used
RAM Layout Implications
Bigger memories need longer wires, so are slower and more power hungry
RAM Layout
Constructed as close-to-square blocks rather than long, thin arrays
The aspect ratio is adjusted by making core words wide and multiplexing
SRAM applications
Working RAM in small processor systems
Cache memories in big processor systems
Buffer memories in interfaces
Temporary storage in hardware etc.
Multi-port RAM
SRAM macrocells with dual or more buses capable of simultaneous access are feasible
Cost is high, and often not worthwhile
Content Addressable Memory (CAM)
Also known as associative memory
Unlike RAM, data goes in and an address comes out (almost)
Not part of the main memory hierarchy; there is no guarantee of finding the data, and it could be found in several places: prioritise/encode
Dynamic RAM
Not completely randomly accessible
Compact with moderate power needs
Volatile and needs to be continuously refreshed
SDRAM is DRAM
The latency is much less if an access uses an already open row, so we exploit this with burst access, reading from consecutive addresses
It is appropriate for much code behaviour and a match for filling cache lines
Structure
Rows and columns are separately addressed
Access
Activate a row, read or write a column of that row, maybe another column etc, and write the active row back to the array
Timing
Each action takes time
Synchronous DRAM
Modern way to interface RAM using a wrapper to make timing/interfacing easier and providing a high bandwidth
Still has long latency and scheduling issues
Banks
Multiple DRAM arrays within an SDRAM chip
They operate independently, allowing multiple operations to interleave
There may also be ranks, i.e. chips/modules
Timing
Reflects underlying DRAM activity with added clock sequences
Controller
Scheduling and interleaving requests from many sources
It translates processor or bus operations into appropriate command sequences
Programmed with the RAM's timing characteristics and clock speed
Addressing
With true RAM it is irrelevant which address bit is used where, but with SDRAM it matters
Some lower bits address the column, mapping bursts onto consecutive addresses
Using low-order address bits for the bank allows the banks to interleave
Using higher-order bits for the bank makes sense in a multicore system
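A small C sketch of one possible address mapping, with low-order bits for the column and the next bits for the bank so that consecutive bursts interleave across banks; the field widths are assumptions, not those of any particular SDRAM part.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative mapping only: 10 column bits, 2 bank bits, remaining bits = row.
   Placing the bank bits just above the column bits lets consecutive bursts
   fall into different banks and interleave. */
int main(void) {
    uint32_t paddr = 0x12345678u;
    uint32_t col  =  paddr        & 0x3FF;   /* low 10 bits: column   */
    uint32_t bank = (paddr >> 10) & 0x3;     /* next 2 bits: bank     */
    uint32_t row  =  paddr >> 12;            /* remaining bits: row   */
    printf("row=%u bank=%u col=%u\n", row, bank, col);
    return 0;
}
```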
Line closure
When reading bursts you can close the row ready for the next operation, or leave it open and hope the next access will be in the same row
This has less latency if your assumption is correct but much higher latency otherwise
Address alignment
Data bursts will be address aligned
Words are discarded when we don't need the whole line
Cache fetches will be naturally aligned
Miscellanea
SDRAM must also perform refresh, activating each row and precharging it every few milliseconds
It must do interface timing calibration, the controller or physical interface has to synchronise returned data with internal delays, impacting performance randomly
Signal Timing
PCB-level interface can impose latency, frequency is guaranteed while latency is uncertain and the controller resynchronises incoming data so that skew stays small
Memory Reliability
Memory failures are caused by high-energy particles passing through the silicon, leaving an ionisation track; the resulting temporary short circuits cause bits to flip
Soft errors
Bits can be corrupted by radiation
You can employ error correction codes (ECC)
Parity
Cheap and cheerful error check
Whether the number of 1 bits is odd or even, agreed in advance
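A minimal C sketch of computing an even-parity bit over a byte by XOR folding:

```c
#include <stdint.h>

/* Even parity over a byte: the stored parity bit makes the total number of
   1 bits even.  Parity generation is just XOR folding. */
uint8_t even_parity(uint8_t data) {
    data ^= data >> 4;
    data ^= data >> 2;
    data ^= data >> 1;
    return data & 1u;   /* 1 if the data has an odd number of 1 bits */
}
```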
ECC
Triple Modular Redundancy
Write each value three times, take the most common
Hamming Codes
Less overhead and used for memory protection
Hamming Codes
Group data bits in a different pattern, adding check bits to force known parity in each group
Assuming a single-bit error, the groups with the wrong parity identify the error position
Parity generation and checking is largely XOR
Overhead grows with log2 of the word size
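As an illustration, a C sketch of the classic Hamming(7,4) code; the bit-position convention used here is one common choice, not necessarily the one used in the course.

```c
#include <stdint.h>

/* Hamming(7,4): data bits d3 d2 d1 d0 plus check bits p1, p2, p4 at codeword
   positions 1, 2 and 4.  Each check bit forces even parity over its group;
   for a single-bit error the failing groups add up to the error position. */
uint8_t hamming74_encode(uint8_t d) {          /* d = 4 data bits */
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;                 /* covers positions 3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;                 /* covers positions 3,6,7 */
    uint8_t p4 = d1 ^ d2 ^ d3;                 /* covers positions 5,6,7 */
    /* codeword bit positions 1..7: p1 p2 d0 p4 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

uint8_t hamming74_syndrome(uint8_t cw) {       /* 0 = no error, else error position */
    uint8_t s1 = (cw ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;          /* positions 1,3,5,7 */
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;   /* positions 2,3,6,7 */
    uint8_t s4 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;   /* positions 4,5,6,7 */
    return s1 | (s2 << 1) | (s4 << 2);
}
```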
Secondary Storage
Secondary Storage
Main uses
Filing system
Page store
Outside the main address spaces
Various technologies
Magnetic disk, optical disk, SSD
Disks
Magnetic disks usually in stacks, use both sides, read using an arm with magnetic read/write heads moving radially
Bits are encoded by polarisation of domains
Data storage is organised in blocks (up to 20TB)
Latency is the time to move the heads to the right track (depending on the start and end points), plus a rotational delay of on average half a rotation
Bandwidth depends on the disk rotation speed and areal density and is much reduced by fragmented placement
Disk performance
Quantified as the latency plus the file size divided by the data rate
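A tiny C sketch of that calculation, with invented latency, file size and transfer rate:

```c
#include <stdio.h>

/* Rough transfer time for one request: latency + size / data rate.
   The numbers are illustrative, not from any specific drive. */
int main(void) {
    double latency_s  = 0.008;              /* 8 ms seek + rotation        */
    double size_bytes = 4.0 * 1024 * 1024;  /* 4 MiB file                  */
    double rate_bps   = 200e6;              /* 200 MB/s sustained transfer */
    printf("time = %.3f s\n", latency_s + size_bytes / rate_bps);
    return 0;
}
```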
Fault tolerance
Technology is not 100% reliable
CRC was used for error detection, but more recently ECC is used to accept and correct faulty bits
When that is insufficient, bad sectors are remapped so some parts of the surface are not used
RAID (Redundant Array of Inexpensive Disks)
Improved reliability can be achieved by spending more on higher-quality parts, or by using redundancy for fault tolerance with cheaper parts; RAID is aimed at the latter
RAID 0
Striping data; improved bandwidth with parallel access, but not redundant, with failure more likely and critical
RAID 1
Mirror same data on many disks, higher bandwidth for simultaneous reads
RAID 2 - stripe bits with added Hamming code
RAID 3 - byte stripes: parity
RAID 4 onwards - similar schemes
Nested RAID - hybrid systems
RAID 01 is a RAID 1 array of RAID 0 systems
Interconnection
Typically connected via Serial AT Attachment
Bus interface - carried serially
Disk caches
Disk buffer can save pages and act as a write buffer
SSDs
Permanent, non-volatile semiconductor store, typically flash memory
Advantages
Robust and quiet
Compact
Lower latency
Disadvantages
Long term data retention (probably) worse
Limited number of write operations (wear out)
An SSD may be multiple devices and will have a controller to deal with interface protocols
It could contain volatile RAM for buffering or caching purposes.
Flash Memory
Can only write bits in one sense, and writes are slow
Can only erase whole blocks, also slow
Write operations cause wear; wear levelling spreads activity evenly across the device. Access is random, so fragmentation is not an issue
Principles
Increasing bandwidth/throughput
Parallelism adds more units: busses, lanes, disks, superscalar
The cost is more wires, pins, space
Interleaving
In a large FIFO buffer each SRAM block is single ported, but looks dual ported if addresses are interleaved at word level
Caching
Keeping a copy of a subset of data in a fast local store; works well for typical behaviour
Applications include computer main memory, page translation, virtual memory pages etc
Speculation
Instruction prefetch relies on most instructions being sequential and is improved by branch prediction
Typically associated with pipelining
You can be lazy - don't start something until it's known to be needed; efficient in effort but long latency
You can be eager - speculate on what may be needed in future which reduces latency but wastes effort
Synchronisation
Important to give the appearance of keeping everything in order
Important in keeping cache coherency, e.g. when updating one's own instructions
Need a reliable means of synchronising parallel threads
Error detection/correction
Memory may use parity or block ECC
Disk blocks and network packets use CRC (Cyclic Redundancy Check)
Data streams use codings which correct burst errors
Profiling
We optimise common features
ISAs include common and cheap operations, excluding expensive ones unless there is a powerful incentive
Profiling is aided by embedded hardware counters and can reveal aspects of system behaviour, e.g. instructions executed, cache misses etc.
Benchmarks
Can profile architecture for a specific application
Code is profiled using agreed 'representative' benchmarks
Suites of typical programs are maintained by the Standard Performance Evaluation Corporation, which allow comparisons of processor/memory/compiler combinations
Coprocessors
Mechanisms for extending an instruction set
Expensive in hardware and controlled by the usual instruction stream
DMA Controller
Programmed by the processor, then moves data 'by bus' autonomously
Processor caches reduce bus contention, interrupts on completion
Controller works in physical address space and is used for I/O and memory-memory sometimes
Out-of-order Processors
Recap
Pipelining - better utilisation of the hardware by allowing different parts of different instructions to be processed at the same time
Control hazard - the next instruction to fetch is not known until a branch is resolved, disrupting sequential fetch
Data hazard - data dependency between instructions that could result in outdated data being used; mitigated by extra forwarding paths in the datapath, NOPs or reordering instructions
Superscalar processor - a processor able to execute more than one instruction in parallel, requiring extra execution lanes and an increased number of ports to memory and the register bank. Scalability limits are hardware complexity, which increases with each extra lane, and applications may not exhibit enough parallelism
Functional unit - a hardware component of a processor which can perform a specific operation
Structural hazard - when an instruction cannot be issued because all suitable functional units are busy
Reordering instructions
Compiler optimisation
Compilers can reorder, but hardware may still be needed to avoid issuing conflicts
If we could rely on the compiler we could get rid of expensive checking logic - the principle of Very Long Instruction Word (VLIW)
Compiler limitations
Legacy binaries - optimum code tied to a particular hardware configuration, 'code bloat' in VLIW
Out of order processors
An instruction buffer needs to be added to store all issued instructions
A dynamic scheduler is in charge of sending non-conflicting instructions to execute
Memory and register accesses need to be delayed until all older instructions are finished
Out of Order Execution
This requires instruction dispatching and scheduling, and memory and register accesses to be deferred
Modern Processor Architecture
Modern pipelines have many execution flows, and only some are pipeline-able (shown by buffers)
Structural Hazards
Some functional units may not be pipelined, meaning only one instruction can use them at once
If all suitable functional units for executing an instruction are busy, then it cannot be executed
Out-of-order Processors
Out of order execution
the original order is not preserved, and instructions are executed as input data becomes available
pipeline stalls due to conflicted instructions are avoided by processing instructions which are able to run immediately
Conflicted instructions
Cache misses - long wait before finishing execution
Structural hazard - required functional unit not available
Data hazard - dependencies between instructions
More complex data dependencies
True dependency - read after write
Anti-dependency - write after read
Output dependency - write after write
Dynamic Scheduling
Allow instructions behind a stall to proceed -> instructions executing in parallel
This overcomes the limitations of in-order pipelined execution by allowing out-of-order instruction execution
Benefits
Accelerates the execution of programs
More efficient design - increases the utilisation of processor resources
Limitations
More complex design
Expensive in terms of area and power
Non-precise interrupts
Scoreboarding
The scoreboard is a centralized hardware mechanism, instructions are executed when their operands are available and there are no hazards
Hardware constructs the dependency graph for a window of instructions, and the scoreboard provides the information necessary for the parts of the processor to work together
This divides the instruction decode stage into issue and read operands, and uses in-order issue with out-of-order execution and commit
Stages of a Scoreboard Pipeline
Issue - decode instructions and check for structural and WAW hazards
Read operands - wait until no data hazards, then read operands
Execution - operate on operands
Write result - finish execution and write results
Information within the scoreboard
Instruction status, functional unit status, register result status
Limitations of Scoreboard
The amount of parallelism available among the instructions
Centralized structures are not too scalable
Scales with the number of scoreboard entries and the number and types of functional units
Dealing with dependencies
RAW - stall conflicted instruction in the FU
WAW - stall the whole pipeline
WAR - stall conflicted instruction in the Write Result stage
Tomasulo
Algorithm
Logic for OOO execution is decentralised
Reservation stations in the functional units keep instruction information and rename registers
A common data bus (CDB) broadcasts data and results to the different devices
Distributed control allows for larger window of instructions, more flexible dynamic scheduling
Structural hazards stall the pipeline
Reservation stations track operands and buffer them when available, reducing pressure on registers and the impact of RAW hazards
Execute an instruction when its operands are available; WAW and WAR dependencies are resolved through register renaming
Register renaming can be done by compiler, but Tomasulo does it in hardware
Common Data Bus
Data and source come from the bus
64 bits of data + 4 bits of functional unit address; FUs broadcast their results, reservation stations take the operand if it matches the FU they are waiting on, and the register bank takes the operand if it matches the FU writing the result
Stages of Tomasulo Algorithm
Issue - get an instruction from the FP Op Queue
Execute - operate on operands
Write result - finish execution
Reservation Station Components
No information about instructions needed
Information in the Reservation Station
Operation to perform
Value of source operands
Reservation stations producing source registers
Busy
Register result status - indicates which functional unit will write each register, if one exists
Advantages
Distributed hazard detection logic
Avoids stalling due to WAW and WAR hazards
Disadvantages
Complexity of hardware
Performance limited by the Common Data Bus
Scoreboard vs Tomasulo
Window size: <= 5 vs <= 14
Structural hazard: stall pipeline vs no issue
WAR dependency: stall completion vs renaming avoids
WAW dependency: stall pipeline vs renaming avoids
Results forwarding: write/read registers vs broadcast from FU
Control structure: central scoreboard vs distributed reservation stations
Multi-threading
Increasing processor performance
Minimise memory access impact (caches); increase clock frequency (pipelining, with branch prediction and forwarding)
Increasing parallelism
This is limited by the programs as some areas exhibit great parallelism and others are sequential
We can find independent instructions in a different process
Hardware multithreading allows several threads to share a single processor
Software Multithreading ie Multitasking
A modern OS supports several processes running concurrently by interleaving/scheduling them
OS Thread Switching - a context switch is done every few ms, so processes appear to run in parallel
We use hardware multithreading because loading/saving this state in software is quite slow
Process Control Block (PCB)
Stores information about the state of 'alive' processes handled by the OS
Process ID, Process State, PC, Stack Pointer, General Registers, Memory Management Info etc
Hardware Multithreading
Exploits thread-level parallelism
Allows multiple threads to share a single processor, more efficient and fast thread switching
Requires replicating the HW that stores the independent state of each thread - Registers, Program Counter, TLB
Virtual memory can be used to share memory among threads
This needs multiple PCs and register banks and extra address translation capabilities
Decisions
presents each hardware thread as a virtual processor
requires multiprocessor support from the OS
Needs to replicate registers and share caches, but this can cause thrashing issues
Coarse-grained Multithreading
Issues instructions from a single thread, operating like a simple pipeline
Switches on an expensive operation or after a quantum of execution
Good to compensate for infrequent, but expensive pipeline disruption
Minimal pipeline changes - need to abort all instructions in the shadow of a D-cache miss -> overhead, and resume the instruction stream to recover
Short stalls are not solved, and a fast thread-switching mechanism is required
Fine-grained Multithreading
Overlap in time the execution of several threads, using round robin among all the 'ready' hardware threads
Requires instantaneous thread switching but helps alleviate fine grained dependencies
An i-cache miss is overcome transparently
A d-cache miss has the thread marked as not ready
In an OOO processor, we may continue issuing instructions from both threads
Utilisation of pipeline resources is increased; each thread perceives it is being executed more slowly, but overall performance is better
Simultaneous Multithreading
Exploits ILP and thread level parallelism at the same time
We schedule as many ready instructions as possible, making reading and writing more complex
Design Considerations
fetch and prioritisation policies
allocation policies
how to measure performance
where to fetch from
Issues with in-order processors
Asymmetric pipeline stall - A pipeline may be stalled without need
Overtaking - can't allow instructions to overtake their predecessors if stalled
Cache misses
Most existing implementations are for OoO, register-renamed architectures
Has significant hardware overhead
Advanced uses of multithreading
Speculative execution
Spawn two paths when reaching a conditional branch, and kill the wrong path once we know the correct one
Memory prefetching
Compile applications into two threads, with one being for scouting ahead to fetch memory in advance
Slipstreaming
Compile sequential applications into two threads, with one running the critical path and a slipstream to run ahead and pass results
Transient Fault Detection
When reaching a block of code that requires reliable computation, we can replicate it, compare the results to ensure the result is correct, or re-do it
Multi-core
One of the solutions to the limitations in single-thread performance and power has been to integrate more cores
Improving single-core performance at the architecture level
Caches minimise memory access impact
Pipelines and superscalar increase the amount of parallelism supported by the processor
OoO processing and multithreading increase the amount of independent instructions, with other methods minimising the impact of hazards
Slow down of single processor architecture
Power density is increasing, so cooling is an issue
Limited parallelism within applications
Memory does not get faster at the same rate as processors
Architectural innovation hitting design complexity problems
Smaller transistors have less predictable characteristics
The Power Wall
Power dissipation depends on clock rate, capacitive load and voltage: P = C_L x V^2 x F
Decreasing voltage reduces dynamic power consumption but increases the static power leakage from transistors
We have reached the practical power limit for cooling commodity microprocessors
The ILP Wall
The implicit parallelism between instructions in a thread is limited in many applications
Hardware imposes limitations - number of instructions that can be fetched per cycle, instruction window size
This reduces the effectiveness of all mechanisms
Amdahl's Law
Estimates a parallel system's maximum performance based on the available parallelism of an application - originally meant to discourage parallelism
speedup = (s + p) / (s + p/N) = 1 / (s + p/N), where s is the serial fraction of the code, p = 1 - s is the parallel fraction and N is the number of processors (see the sketch below)
Reformulated to show that the fraction of serial code is normally constant while the fraction of parallel code depends on the size of the input data; for more parallelism, increase your dataset
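A small C sketch of Amdahl's Law with an assumed 10% serial fraction, showing how the speedup saturates as processors are added.

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (s + (1 - s)/N) for serial fraction s and
   N processors.  The 10% serial fraction below is just an example. */
double amdahl_speedup(double serial_fraction, int n_processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors);
}

int main(void) {
    for (int n = 1; n <= 1024; n *= 4)
        printf("N=%4d  speedup=%.2f\n", n, amdahl_speedup(0.10, n));
    /* With s = 0.10 the speedup saturates towards 1/s = 10. */
    return 0;
}
```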
The Memory Wall
CPU speed was increasing by over 50% per year, while memory speed only increased by 10%
Fetching data limits processor utilisation
Solution is replication
Put multiple cores on a single chip, used in parallel. Simpler to design than a more complex processor
How to put them together
Have independent processor/memory pairs
A small number of cores can be used for separate tasks, but for a single task to be improved we need parallel programming with shared memory, message passing, independent threads etc
Problems
We do not know how to engineer extensible memory systems or write general-purpose parallel programs
Applications on Multi-cores
Independent processes usually do not share data as they have separate virtual memory spaces, but communication is explicit through I/O
Threads have issues as we have to ensure memory coherency - changes seen everywhere - and consistency - correct ordering of memory accesses
Memory Coherence
To ensure this the core may either write through to main memory or copy back only when cache line is rejected
We must make sure that every cache with a copy of the data is updated which is expensive
Memory Consistency
The model presented to the programmer of when changes are seen
Multiple copies are consistent if a read operation returns the same value from all copies and a write operation updates all copies before any other operation takes place
Sequential Consistency
Memory operations appear to execute one at a time, all thread see write operations in the same order and operations within a single thread appear to execute in the order described
This is what programmers expect, but is not the most stringent memory model
The compiler has to insert special instructions to maintain program semantics
Fence - all memory accesses before the fence need to complete before the ones after
Barrier - all threads need to reach the barrier before any of them can continue execution
Lock - only one thread can proceed to the atomic section
Synchronisation
Locks are a synchronisation mechanism that others are based on, as they ensure expected behaviour
ISA support for synchronisation
Atomic compare-and-swap instructions (see the sketch after this list)
A single instruction - cannot interleave reading and writing
Load-linked and store-conditional instructions - hardware locks the cache line and checks whether any instruction has modified data in the cache line before storing
Transactional Memory - execute assuming no conflicts and check afterwards
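As a hedged illustration of how compare-and-swap supports synchronisation, a minimal C11 spinlock sketch using <stdatomic.h>; real locks add backoff, fairness and OS integration.

```c
#include <stdatomic.h>

/* Minimal spinlock built on an atomic compare-and-swap: only one thread can
   change the flag from 0 to 1, so only one thread enters the critical section. */
typedef struct { atomic_int locked; } spinlock_t;

static void lock(spinlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;            /* CAS failed: the flag was already 1, retry */
    }
}

static void unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0); /* release: the write becomes visible to other cores */
}
```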
Coherence systems
Information required includes where the other copies of data are, what to do when a value changes and how to deal with shared data
We share this data through
snooping protocols - all caches are aware of everything happening in the system
directory protocols - cache line information is stored in a separate subsystem and changes are updated there
Snooping protocols
Each cache snoops for activity concerned with its cached addresses, which requires a broadcast communications structure
Write update - a core wanting to write grabs a bus cycle and broadcasts address and new data as it updates its own copy, snooping caches update their copy
Write invalidate - a core wanting to write grabs a bus cycle and sends a write-invalidate message containing the address; any subsequent shared read in other cores misses and fetches the value from the updated cache or main memory. This uses significantly less bandwidth
Implementation issues
Knowing that a cached value is not shared can avoid sending messages, improving latency, lowering bandwidth consumption and decreasing contention for bus usage
The invalidate description assumes write-through, as with copy-back a read could fetch a stale value
MESI Protocol
Multicore invalidate protocol that minimises bus usage and implements copy-back, writing back to memory when a dirty cache line is displaced (a simplified local-write sketch follows the state list)
Cache line is in one of 4 states
Modified - only cached copy and dirty
Exclusive - coherent with main and is the only cached copy
Shared - coherent with main but is not the only cached copy
Invalid - out of date
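A simplified C sketch of how a cache might react to a write by its own core under MESI; bus-side transitions (responses to other cores' traffic) are deliberately omitted.

```c
/* Simplified MESI reaction to a write by the local core.  A real protocol also
   handles bus messages from other cores; this only shows the local-write case. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

mesi_state_t on_local_write(mesi_state_t s) {
    switch (s) {
    case MODIFIED:  return MODIFIED;   /* already dirty, nothing to announce      */
    case EXCLUSIVE: return MODIFIED;   /* only copy: silently becomes dirty       */
    case SHARED:    /* broadcast an invalidate to other caches, then modify       */
                    return MODIFIED;
    case INVALID:   /* read-with-intent-to-modify (a miss), then modify           */
    default:        return MODIFIED;
    }
}
```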
MOESI Protocol
Avoids the need to copy back to main memory when shared
Modified - only cached copy and dirty
Owner - cache line is modified and different from main memory, there can be cached copies in S state
Exclusive - coherent with main memory and the only cached copy
Shared - coherent with main memory, but copies exist in other caches or there is a copy in O state
Invalid - invalid data
Beyond Multicore
Snoopy vs Directory
Global view vs local view + directory
Minimal info vs extra info for the directory and remote shared lines
Centralized communication vs parallel communication
Poor scalability vs better scalability
False sharing
pathological behaviour when two unrelated variables are stored in the same cache line; if they are written often by two different cores, they will generate a lot of invalidate/update traffic
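A small C sketch of the layout issue, assuming 64-byte cache lines (an assumption; line size varies by machine):

```c
#include <stdint.h>

/* Two per-core counters: in the packed struct they share one 64-byte cache
   line, so writes from different cores keep invalidating each other's copy.
   Padding the first counter to a full line removes the false sharing. */
struct counters_bad {
    uint64_t core0_count;                  /* same cache line ...           */
    uint64_t core1_count;                  /* ... as this one               */
};

struct counters_good {
    uint64_t core0_count;
    char     pad[64 - sizeof(uint64_t)];   /* push next field to a new line */
    uint64_t core1_count;
};
```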
The need for networks
All multicore systems must contain means for cores to communicate with memory and each other
Shared-memory applications
They must ensure consistency and coherency
Distributed-memory applications
Have independent processor/store pairs, so no coherence is provided at the processor level, saving chip area
Communication/synchronisation is introduced explicitly in the code through message passing
Evaluating networks
This is done through bandwidth, latency, congestion, fault tolerance, area and power dissipation
Important features of Network on Chips
Topology - how cores and networking elements are connected together
Routing - how traffic moves through the topology
Switching - how traffic moves from one component to the next
Topology
Bus - common wire interconnection, single usage point at a time, controlled by clock, often split transaction; the main scalability issue is throughput
Crossbar - connects N inputs to N outputs, can achieve any-to-any in parallel; area and power scale quadratically with the number of nodes, hence not scalable
Tree - variable bandwidth and latency and reliability is an issue (There is also a Fat Tree)
Ring - simple but low bandwidth, variable latency
Mesh/Grid - reasonable bandwidth, variable latency, convenient for the physical layout of large systems
Routing
Length of routes - minimal routing takes the shortest path but is likely to be blocked, while non-minimal routing allows packets to be diverted, but there is a risk of livelock
Oblivious routing - unaware of the network state; paths may be deterministic or non-deterministic; simpler, deadlock-free router, but prone to contention
Adaptive routing - aware of the network state, so higher performance, but router instrumentation is required and it is deadlock-prone
Switching
Packet switching - data is split into small packets with info to identify the data
Store and forward switching - a packet is not forwarded until all its phits arrive at each intermediate node; has failure detection, but low performance
Cut-through/Wormhole switching - a packet can be forwarded as soon as the head arrives at an intermediate node, which gives better performance, but fault detection is only possible at the destination
Clusters, Datacentres and Super Computers
Have lots of CPUs on lots of motherboards
High performance computing - run one large task as quickly as possible
High throughput computing - run as many tasks per unit of time as possible
Big Data Analytics - analyse and extract patterns from large, complex data sets
Building a Cluster, Supercomputer or Datacentre
Lots of computers, optimised for cooling and power efficiency, with high redundancy for fault tolerance; join lots of compute racks, add a network, power distribution, cooling and dedicated storage
Single Instruction Multiple Data
Vector processing - perform the same operation over many data at the same time, reducing instruction memory pressure
Performance limiting issues are conditionals, loop dependencies, non-sequential data, memory indirection
General Purpose computing using GPUs (GPGPU)
GPU architecture is based on arrays of cores executing the same instruction over different data
Limitations are
Moving data is very expensive and needs to be done to start execution
The memory accessed by all the cores within an array needs to be consecutive
All the cores execute the same instruction
Pipelining
Types of instructions
Memory instructions - move data between memory and registers
Processing operations - perform calculations with values in registers
Control flow instructions - take decisions, repeat operations
Fetch-Execute
Fetch
Get instructions from memory
Decode instruction and select registers
Execute
Perform operation or calculate address
Access an operand in data memory
Write result to a register
Cycles of operation
Most logic circuits are driven by a clock
In its simplest form, one instruction would take one clock cycle
This means that, with one instruction per clock, each stage is only working 1/5th of the time (assuming five stages)
To optimise this we add buffers between stages so that each stage can start working on something else, increasing the clock frequency by up to 5x
Limits to pipeline scalability
Higher frequency -> more power
More stages means more hardware, a more complex design, more difficulty splitting the work into uniform chunks, and the loading time of the registers limits the cycle period
A longer datapath means a higher probability of hazards occurring and worse penalties when they happen
The Control Transfer Problem
As instructions are normally fetched sequentially, a branch is only recognised when we decode it in the second stage, by which time we have already fetched the next instruction, leaving us with a bubble in the pipeline
For conditional branches this is worse because we cannot determine its outcome until the third stage, so we have two bubbles
This can be avoided by reading registers during the decode stage
Deeper Pipelines
Bubbles due to branches are called Control Hazards
We can get around this by using branch prediction, as in most programs many branch instructions are executed many times
We can take note of its address and the target and use this the next time the branch is fetched
We can go to longer pipelines
less logic per pipeline stage
each step takes less time
clock frequency increases
But
greater penalty for hazards
more likely to have conflicting instructions
more complex control
Hence we must trade off between frequency, power and area
Branch Target Buffer
A small cache that we use to store branch addresses and their target addresses
For unconditional branches it is always correct, but for conditional branches it depends on the probability of repeating the target
Other Branch Prediction
BTB is simple to understand but expensive to implement
Prediction accuracy depends on more history and context
Data Hazards
If there is a data dependency between two instructions, and the data is not yet ready when the second instruction is ready to execute, this is a data hazard.
We can deal with this by detecting dependencies in hardware and holding instructions in the decode stage until the data is ready, i.e. a pipeline stall
This causes bubbles and wasted cycles again
The compiler can also try to reorder instructions (if there is something useful to do) or insert NOPs
Forwarding
When data is ready after the execute stage, it can be forwarded to the execute stage of the instruction requiring the data
This makes the control very complex
Instruction Level Parallelism
When we can execute some parts of the program together as long as there are few dependencies
Real programs often have lots of independent instructions
We can exploit this by fetching multiple instructions per cycle, decoding them, requiring multiple ALUs for execution, using common registers and accessing a common data cache
With a dual issue pipeline, two instructions can be executed at the same time, possibly doubling the execution rate
Access to registers and cache will be doubled so we may need them to be dual ported as well
However, to get this performance we need independent instructions: a 'dispatch unit' in the fetch stage uses hardware to examine instruction dependencies and only issues two in parallel if they are independent (see the sketch below)
Data dependencies limit our opportunities for exploiting parallelism, but by examining these and reordering, we can execute pairs in parallel
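A hedged C sketch of the kind of check a dispatch unit might make, using a toy instruction format (not a real ISA):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy instruction: one destination register and two source registers.
   The encoding and the check below are illustrative only. */
typedef struct { uint8_t rd, rs1, rs2; } instr_t;

/* Dual issue is allowed only if the second instruction neither reads the
   first one's result (RAW) nor writes the same destination (WAW). */
bool can_dual_issue(instr_t first, instr_t second) {
    bool raw = (second.rs1 == first.rd) || (second.rs2 == first.rd);
    bool waw = (second.rd  == first.rd);
    return !raw && !waw;
}
```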
Limits of ILP
Modern processors are up to 4-way superscalar, but rarely achieve 4x speed
There is a lot of hardware complexity and limited amounts of ILP in real programs due to assumption of serial execution