Distributed Systems (Basics (A program (is the code you write.), A process…

- - - - reliability has improved enormously since the 1980s
      - Today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures
    - - Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%).
    - - heisenbug
        
        seems to disappear or change its behavior when it is observed or investigated
        
        heisenbugs tend to be more prevalent in distributed systems than in local systems
      - bohrbug
        
        does not disappear or change its behavior when it is investigated
        
        typically manifests itself reliably under a well-defined set of conditions
    - - Halting failures
        
        A component simply stops
        
        There is no way to detect the failure except by timeout: the component either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests.
        
        Your computer freezing is a halting failure.
      - Fail-stop
        
        A halting failure with some kind of notification to other components
        
        A network file server telling its clients it is about to go down is a fail-stop.
      - Omission failures
        
        Failure to send or receive messages, primarily due to a lack of buffering space
        
        which causes a message to be discarded with no notification to either the sender or the receiver.
        
        This can happen when routers become overloaded.
      - Network failures
        
        A network link breaks.
      - Network partition failure
        
        A network fragments into two or more disjoint sub-networks
        
        within which messages can be sent, but between which messages are lost.
        
        This can occur due to a network failure.
      - Timing failures
        
        A temporal property of the system is violated
        
        For example, clocks on different computers that are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
      - Byzantine failures
        
        This captures several types of faulty behavior, including data corruption or loss and failures caused by malicious programs.
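
The difference between a halting failure and a fail-stop in the list above can be made concrete with a small heartbeat tracker. This is only a sketch under stated assumptions: the class `FailureDetector`, its methods, the component names and the 3-second timeout are all hypothetical, and time is passed in explicitly so the example stays deterministic. A silent component is reported as "suspected" rather than "failed", because a timeout alone cannot tell a crashed component from a slow one.

```python
class FailureDetector:
    """Hypothetical heartbeat tracker: suspects components that go silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}     # component name -> time of its last heartbeat
        self.announced = set()  # components that announced a shutdown (fail-stop)

    def heartbeat(self, name, now):
        # An "I'm alive" message arrived; a restarted component is tracked again.
        self.last_seen[name] = now
        self.announced.discard(name)

    def announce_shutdown(self, name):
        # Fail-stop: the component notifies others before it goes down.
        self.announced.add(name)

    def status(self, name, now):
        if name in self.announced:
            return "stopped"
        last = self.last_seen.get(name)
        # Halting failure: the only evidence is silence longer than the timeout.
        if last is None or now - last > self.timeout_s:
            return "suspected"
        return "alive"


fd = FailureDetector(timeout_s=3.0)
fd.heartbeat("server-a", now=0.0)
fd.heartbeat("server-b", now=0.0)
fd.announce_shutdown("server-b")

print(fd.status("server-a", now=2.0))  # alive
print(fd.status("server-a", now=5.0))  # suspected
print(fd.status("server-b", now=5.0))  # stopped
```

Choosing the timeout is a trade-off: a short value detects halting failures quickly but misreports delayed messages as failures, which is how a timing failure can look like a halting failure to an observer.
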
- - - - each component is continually open to interaction with other components
    - - system can easily be altered to accommodate changes in the number of users, resources and computing entities
- - - - Fault-Tolerant
        
        It can recover from component failures without performing incorrect actions.
      - Highly Available
        
        It can restore operations, permitting it to resume providing services even when some components have failed.
      - Recoverable
        
        Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
      - Consistent
        
        The system can coordinate actions by multiple components, often in the presence of concurrency and failure.
        
        This underlies the ability of a distributed system to act like a non-distributed system.
      - Scalable
        
        It can operate correctly even as some aspect of the system is scaled to a larger size.
      - Predictable Performance
        
        The ability to provide desired responsiveness in a timely manner.
      - Secure
        
        The system authenticates access to data and services