Distributed System
( shared-nothing systems )
characteristics of shared-nothing systems
bunch of machines connected by a network
each machine has its own memory and disk
cannot access other machines' memory or disk
dominant approach for building internet services
cheap because it does not require special hardware
can make use of commoditized cloud computing services
can achieve high reliability
Cloud Computing and Supercomputing
how to build large-scale computing systems
HPC ( high performance computing )
supercomputer with thousands of CPUs
computationally intensive computing task
weather forecasting
molecular dynamics
cloud computing
multi-tenant datacenters
commodity computers connected with an IP network ( often Ethernet )
elastic/on-demand resource allocation
metered billing
traditional enterprise datacenters
handling faults
supercomputer
more like a single-node computer than a distributed system
typically checkpoints the state of its computation to durable storage
if one node faults
simply stop the entire cluster workload
after the faulty node is repaired, restart from the last checkpoint
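The checkpoint/restart strategy above can be sketched in a few lines of Python. This is a toy illustration, not a real HPC scheduler: the file name, state shape, and `run` loop are all hypothetical, and the "computation" is just a running sum.

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical checkpoint file

def save_checkpoint(state):
    # write atomically: dump to a temp file, then rename,
    # so a crash mid-write never corrupts the last checkpoint
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # resume from the last checkpoint if one exists
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}  # fresh start

def run(steps=10, checkpoint_every=3):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]   # the "computation"
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)        # a restart resumes from here
    return state["total"]
```

If the process is killed and restarted, `run` picks up from the last saved `step` rather than from zero, which is exactly the supercomputer recovery model: stop everything, repair, restart from the checkpoint.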
systems for implementing internet services
in this book, we will focus on these systems
differences from supercomputers
internet-related applications are
online
need
to serve users with low latency at any time
not acceptable to make the service unavailable
the difference between nodes in supercomputers and cloud services
supercomputers are built from specialized hardware
each node is quite reliable
nodes communicate through shared memory
remote direct memory access (RDMA)
nodes in cloud service built from commodity machines
equivalent performance at lower cost
have higher failure rates
topologies in large datacenter networks and supercomputers
large datacenter
topologies based on IP and Ethernet
arranged in Clos topologies
provide high bisection bandwidth
bisection bandwidth: the bandwidth available between two halves of the network
supercomputer
specialized network topologies
( multi-dimensional meshes and toruses )
better performance for HPC workloads with known communication patterns
failure rates in big systems
the bigger the system, the more likely one of its components is broken
broken things get fixed and new things break
reasonable to assume that something is always broken
in a large system, if we simply give up whenever something is broken
time spent recovering from faults > time spent doing useful work
if the system can tolerate failed nodes,
then keep working
very useful for operations and maintenance
example: if a request to virtual machine A fails, just kill A and send the request to a new virtual machine B
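The idea of tolerating failed nodes can be sketched as a client that retries across replicas. This is a simplified illustration under assumed names (`NodeDown`, `make_node`, `fault_tolerant_request` are all hypothetical); real systems would add timeouts, backoff, and health checks.

```python
class NodeDown(Exception):
    """Raised when a node cannot serve the request."""

# hypothetical replicas: each either answers or is down
def make_node(name, healthy):
    def handle(request):
        if not healthy:
            raise NodeDown(name)
        return f"{name} handled {request}"
    return handle

def fault_tolerant_request(request, nodes):
    # try each replica in turn; the system as a whole keeps
    # working as long as at least one node is up
    for node in nodes:
        try:
            return node(request)
        except NodeDown:
            continue  # this node failed; try the next one
    raise RuntimeError("all replicas are down")

nodes = [make_node("A", healthy=False),  # A has failed
         make_node("B", healthy=True)]
print(fault_tolerant_request("GET /user/42", nodes))
# the request succeeds via node B even though node A is down
```

This is the operational benefit mentioned above: because a failed node can simply be skipped or replaced, a machine can be taken out for repair without making the whole service unavailable.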
communication in distributed systems vs supercomputers
distributed systems
keep data geographically close to user to reduce access latency
communication over internet
slow and unreliable compared to local network
supercomputers
all of the nodes are close together
making distributed systems work
accept the possibility of partial failure
build fault-tolerance mechanisms
build a
reliable system
from
unreliable components
Distributed System
Chapter 8: The Trouble with Distributed Systems
what challenges of distributed systems are we up against?
Chapter 9: Consistency and Consensus
how to build systems that do their job, in spite of everything going wrong
Faults and Partial Failures
Single computer ( individual computer )
hardware working correctly
always produces the same result ( deterministic )
hardware problem
example
memory corruption
loose connector
consequence
total system failure
kernel panic
blue screen of death
failure to start up
when an internal fault occurs
preferable to crash completely
rather than returning a wrong result
wrong results are difficult and confusing to deal with
Several computer, connected by a network
( Distributed System )
remarkably wide range of things can go wrong
PDU ( power distribution unit ) failures
switch failures
accidental power cycle of a whole rack
whole-DC backbone failures
whole-DC power failures
a hypoglycemic driver crashing into the datacenter
some parts broken
some parts working fine
partial failure
nondeterministic
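The nondeterminism of partial failure can be sketched with a simulated remote call. In this hypothetical example (`flaky_remote_write` and its failure probabilities are invented for illustration), a write may succeed, fail outright, or, worst of all, be applied while the acknowledgment is lost, so the caller cannot tell whether it happened.

```python
import random

random.seed(7)  # fixed seed so the demo is reproducible

# a toy remote call in a distributed system: it may succeed,
# fail, or time out without the caller learning whether the
# operation actually took effect
def flaky_remote_write(key, value, store):
    outcome = random.random()
    if outcome < 0.6:
        store[key] = value
        return "ok"                                 # succeeded, and we know it
    elif outcome < 0.8:
        raise ConnectionError("node unreachable")   # failed, write lost
    else:
        store[key] = value                          # write was applied...
        raise TimeoutError("no reply")              # ...but the ack was lost

store = {}
results = []
for i in range(5):
    try:
        results.append(flaky_remote_write(f"k{i}", i, store))
    except (ConnectionError, TimeoutError) as e:
        results.append(type(e).__name__)
```

After a `TimeoutError` the key is actually in `store` even though the client saw an error: the same observable outcome (an exception) corresponds to different true states of the system, which is what makes partial failures nondeterministic and hard to reason about.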