Availability (Design checklist for availability (Mapping among…

- - - - Log the fault
      - Notify appropriate people or system
      - Disable the source of events causing the fault
      - Fix or mask the fault/failure
      - Operate in degraded mode
  - - - Ensure the coordination model can detect omission, crash, incorrect timing, or incorrect response
      - Ensure the coordination model enables logging of the fault, notification of appropriate entities, disabling of the source of the events causing the fault, fixing or masking the fault, or operating in a degraded mode
      - Ensure that the coordination model supports the replacement of the artefacts used (processors, communications channels, persistent storage, and processes).
- - - - Omission -> A component fails to respond to an input
      - Crash -> A component repeatedly suffers omission faults
      - Timing -> A component responds but the response is early or late
      - Response -> A component responds with an incorrect value
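The four fault classes above can be told apart from what a monitor observes about a single interaction. The following is a minimal sketch, not a definitive implementation: the names `Fault`, `classify`, and `is_crash` are hypothetical, and it assumes the monitor knows the expected value and the acceptable time window, which real systems often do not.

```python
from enum import Enum


class Fault(Enum):
    OMISSION = "omission"
    CRASH = "crash"
    TIMING = "timing"
    RESPONSE = "response"


def classify(response, expected, elapsed, earliest, latest):
    """Classify one observed interaction; returns None when no fault is seen.

    Assumption: `response` is None when the component never answered.
    """
    if response is None:
        return Fault.OMISSION  # no response to the input at all
    if elapsed < earliest or elapsed > latest:
        return Fault.TIMING  # responded, but too early or too late
    if response != expected:
        return Fault.RESPONSE  # responded on time with an incorrect value
    return None


def is_crash(observations, threshold=3):
    """Treat `threshold` consecutive omissions as a crash (threshold is an assumption)."""
    recent = observations[-threshold:]
    return len(recent) == threshold and all(f is Fault.OMISSION for f in recent)


# Usage: three unanswered requests in a row are reported as a crash.
history = [classify(None, 42, None, 0.0, 1.0) for _ in range(3)]
crashed = is_crash(history)
```

Checking for omission first matters here: with no response there is no elapsed time or value to inspect.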
- - - - Reintroduction is where a failed component is reintroduced after it has been corrected.
      - 2 tactics
        
        Shadow tactic
        
        The shadow tactic refers to operating a previously failed or in-service upgraded component in a "shadow mode" for a pre-defined duration of time prior to reverting the component back to an active role. During this duration its behavior can be monitored for correctness and it can repopulate its state incrementally.
        
        State resynchronization
        
        A tactic that synchronizes the state of two or more machines or services after a repair.
    - - Software upgrade
        
        This is another preparation-and-repair tactic whose goal is to achieve in-service upgrades to executable code images in a non-service-affecting manner. This may be realized as a function patch or a class patch.
      - Retry
        
        This tactic assumes that the fault that caused a failure is transient and retrying the operation may lead to success. This tactic is commonly used in network management.
      - Rollback
        
        This tactic allows the system to revert to a previous known good state.
      - Ignore faulty behaviour
        
        This tactic calls for ignoring messages sent from a particular source when we determine that those messages are spurious. For example, we would like to ignore the messages of an external component launching a denial-of-service attack by establishing Access Control List filters, which allow the system administrator to control which routing packets are permitted or denied in a network.
      - Exception handling
        
        Once an exception has been detected, the system must handle it in some way. The mechanism employed for exception handling depends largely on the programming environment employed.
      - Degradation tactic
        
        This tactic maintains the most critical system functions in the presence of component failures by dropping less critical functions.
      - Passive redundancy (warm spare)
        
        This refers to a configuration where only the active members of the protection group process input traffic; one of their duties is to provide the redundant spare(s) with periodic state updates.
      - Reconfiguration
        
        A tactic that attempts to recover from component failures by reassigning responsibilities to the (potentially restricted) resources that are still operational, while maintaining as much functionality as possible.
      - Active redundancy (Hot spare)
        
        This refers to a configuration where all of the nodes (active or redundant spare) in a protection group receive and process identical inputs in parallel, allowing the redundant spare(s) to maintain synchronous state with the active node(s). Because the redundant spare possesses an identical state to the active processor, it can take over from a failed component in a matter of milliseconds.
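The retry tactic listed above can be sketched as a small helper. This is an illustration under stated assumptions, not a prescribed implementation: `TransientError`, `retry`, and `flaky` are hypothetical names, and the sketch assumes the caller can tell transient faults apart from permanent ones.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults assumed to be transient."""


def retry(operation, attempts=3, delay_seconds=0.0):
    """Call `operation` until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # give up: the fault may not be transient after all
            time.sleep(delay_seconds)  # wait before trying again


# Usage: an operation that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = retry(flaky, attempts=5)
```

Bounding the number of attempts keeps a fault that is not actually transient from turning into an endless loop; other exception types propagate immediately so they can be handled by a different tactic.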
  - - - Design the component to handle more cases (faults) as part of its normal operation
    - - Take corrective action when conditions are detected that are predictive of likely future faults
    - - This tactic refers to temporarily placing a system component in an out-of-service state for the purpose of mitigating potential system failures.
  - - - A tactic used to detect an incorrect sequence of events, mainly in distributed message-passing systems.
    - - A tactic that checks the validity or reasonableness of specific operations or outputs of a component
    - - A fault detection tactic that employs a periodic message exchange between a system monitor and the process being monitored
      - Sent at regular intervals -> if the receiving point does not receive a heartbeat within a set time, the machine that should have sent the heartbeat is assumed to have failed
    - - A tactic that refers to the detection of a system condition that alters the normal flow of execution
        
        System exceptions will vary according to the processor hardware architecture employed and include faults such as divide by zero, bus and address faults, illegal program instructions, and so forth.
        
        Timeout is a tactic that raises an exception when a component detects that it or another component has failed to meet its timing constraints. For example, a component awaiting a response from another component can raise an exception if the wait time exceeds a certain value.
    - - A component that monitors the state of health of various other parts of the system (processor, memory, I/O)
      - Can detect failure or congestion in the network or other resources
    - - A tactic where components run procedures to test themselves for correct operation
    - - An asynchronous request/response message pair exchanged between nodes, used to determine the reachability of a node through the associated network path.
      - The echo is used to determine whether the pinged component is alive
      - Requires a time threshold to be set -> how long to wait for the echo before considering the pinged component to have failed
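The heartbeat tactic described above, like ping/echo, comes down to comparing the time since the last sign of life against a threshold. A minimal sketch follows, assuming a hypothetical `HeartbeatMonitor` class and explicit timestamps in seconds so the behavior is deterministic; a running system would more likely take them from a monotonic clock such as `time.monotonic()`.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per component and flags overdue ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # component name -> time of its last heartbeat

    def record_heartbeat(self, component, now):
        self.last_seen[component] = now

    def failed_components(self, now):
        """Return components whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Usage: "cache" stops sending heartbeats, "db" keeps sending them.
monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("db", now=0.0)
monitor.record_heartbeat("cache", now=0.0)
monitor.record_heartbeat("db", now=4.0)
suspects = monitor.failed_components(now=6.0)
```

The threshold is a trade-off: too short and a slow but healthy component is declared failed, too long and a real failure goes unnoticed for longer.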
- - - - Normal performance
      - Operational degradation
      - Functional failure
      - Unintended functions
      - Inadvertent function
    - - Catastrophic
      - Hazardous
      - Major
      - Minor
      - No effect
  - - - Graphic aid -> helps identify all sequential and parallel sequences of contributing faults, which may be hardware failures, human errors, or software errors.
      - If the top event has occurred, then one or more contributing failures have occurred, so the graphic aid can be used to track down the failures and repair them