Availability
Design checklist for availability
Allocation of responsibilities
Determine the system responsibilities that need to be highly available, and ensure there are responsibilities to do the following:
Log the fault
Notify appropriate people or systems
Disable the source of events causing the fault
Fix or mask the fault/failure
Operate in degraded mode
Determine the system responsibilities that need to be highly available. Ensure that the following are carried out (among others):
Ensure the coordination model can detect omission, crash, incorrect timing, or incorrect response
Ensure the coordination model enables the logging of the fault, notification of appropriate entities, disabling of the source of the events causing the fault, fixing or masking the fault, or operating in a degraded mode
Ensure that the coordination model supports the replacement of the artefacts used (processors, communication channels, persistent storage, and processes).
Determine which portions of the system need to be highly available.
Mapping among architectural elements
Determine which artefacts (processors, communication channels, persistent storage, or processes) may produce a fault: omission, crash, incorrect timing, or incorrect response
Ensure that the mapping of architectural elements is flexible enough to permit recovery from the fault
Determine which critical resources are necessary to continue operating in the presence of a fault: omission, crash, incorrect timing, or incorrect response
Ensure that sufficient resources remain in the event of a fault to log the fault, notify appropriate personnel, and disable the source of events causing the fault
Binding time decision
Determine how and when architectural elements are bound.
Choice of technology
Determine the available technologies that can (help) detect faults, recover from faults, or reintroduce failed components.
Availability general scenario
This specifies the resource that is required to be highly available, such as a processor, communication channel, or storage.
The state of the system when the fault or failure occurs. It also affects the desired system response, such as whether to shut the system down completely or shut down only certain functions
A fault of one of the following classes occurs:
Omission -> A component fails to respond to an input
Crash -> The component repeatedly suffers omission faults
Timing -> A component responds but the response is early or late
Response -> A component responds with an incorrect value
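The four fault classes above can be told apart mechanically from what a caller observes. The following is a minimal sketch, assuming the caller knows the expected value and an acceptable time window; the names `Fault`, `classify`, and `is_crash`, and the threshold of three consecutive omissions, are illustrative assumptions rather than part of the notes.

```python
from enum import Enum
from typing import List, Optional


class Fault(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    RESPONSE = "response"


def classify(expected: object, actual: Optional[object],
             elapsed: Optional[float], earliest: float, latest: float) -> Fault:
    """Classify one observation of a component's reply.

    actual/elapsed are None when no reply arrived at all.
    """
    if actual is None or elapsed is None:
        return Fault.OMISSION          # no response to the input
    if not earliest <= elapsed <= latest:
        return Fault.TIMING            # responded, but too early or too late
    if actual != expected:
        return Fault.RESPONSE          # responded on time with a wrong value
    return Fault.NONE


def is_crash(history: List[Fault], threshold: int = 3) -> bool:
    """Treat `threshold` consecutive omissions as a crash (assumed policy)."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(f is Fault.OMISSION for f in recent)
```

Checking timing before value is a design choice of this sketch: a late reply is reported as a timing fault even if its value is also wrong.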
The reaction to a system fault.
Source of stimulus
Internal/external origins of faults or failure
E.g. availability percentage, time to detect the fault, time to repair the fault, time interval during which the system must be available
Is a quality attribute
Meaning: Readiness of software to carry out a task
Builds on top of reliability by adding the notion of recovery
Recovery --> recover a failed system to normal state
Minimize the service outage time by mitigating faults
A failure is caused by a fault
Faults can be prevented or tolerated
Availability of a computer system is usually expressed as an SLA --> which specifies the guaranteed availability level and the penalties incurred if the SLA is violated
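An availability level in an SLA becomes easier to reason about once it is turned into numbers. Steady-state availability is commonly computed as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. The sketch below applies that formula and converts an availability level into permitted downtime; the function names and the 365-day year are assumptions made for illustration.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_hours(availability: float,
                           period_hours: float = 365 * 24) -> float:
    """Downtime permitted over a period at a given availability level."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    a = steady_state_availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {allowed_downtime_hours(a):.2f} hours")
```

Under these assumptions, shortening repair time raises availability just as effectively as making failures rarer, which is why recovery is part of the definition.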
Tactics for availability
Reintroduction is where a failed component is reintroduced after it has been corrected.
The shadow tactic refers to operating a previously failed or in-service upgraded component in a "shadow mode" for a pre-defined duration of time prior to reverting the component back to an active role. During this duration its behavior can be monitored for correctness and it can repopulate its state incrementally.
A tactic to synchronize the state of two or more machines/services after the repair.
This is another preparation-and-repair tactic whose goal is to achieve in-service upgrades to executable code images in a non-service-affecting manner. This may be realized as a function patch or a class patch.
This tactic assumes that the fault that caused a failure is transient and retrying the operation may lead to success. This tactic is commonly used in network management.
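A minimal sketch of the retry tactic, assuming the caller can distinguish transient faults from permanent ones. `TransientError`, `retry`, and the default of three attempts are hypothetical names and values chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are assumed to be transient."""


def retry(operation: Callable[[], T], max_attempts: int = 3,
          delay_seconds: float = 0.0) -> T:
    """Call `operation` until it succeeds or the attempts are used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                      # give up: the fault was not transient
            time.sleep(delay_seconds)      # wait before trying again
    raise AssertionError("unreachable")
```

Bounding the number of attempts matters: if the fault turns out to be permanent, an unbounded retry loop only delays the failure report.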
This tactic allows the system to revert to a previous known state
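Reverting to a previous known state requires that the state was saved beforehand. The sketch below keeps a single checkpoint in memory; the class `CheckpointedState` and its method names are assumptions for illustration, and a real system would typically persist checkpoints outside the failing component.

```python
import copy
from typing import Any


class CheckpointedState:
    """Holds a working state plus one saved copy to roll back to."""

    def __init__(self, initial: Any) -> None:
        self.state = copy.deepcopy(initial)
        self._checkpoint = copy.deepcopy(initial)

    def checkpoint(self) -> None:
        """Record the current state as the known state to return to."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard changes made since the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

Deep copies are used so that later changes to the working state cannot silently alter the saved checkpoint.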
Ignore faulty behaviour
This tactic calls for ignoring messages sent from a particular source when we determine that those messages are false. For example, we would like to ignore the messages of an external component launching a denial-of-service attack by establishing Access Control List filters, which allow the system administrator to control which routing packets are permitted or denied in a network.
Once an exception has been detected, the system must handle it in some manner. The mechanism employed for exception handling depends largely on the programming environment employed.
Maintains the most critical system functions in the presence of component failures by dropping less critical functions.
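Dropping less critical functions can be sketched as a selection problem: given the capacity that survives a failure, keep functions in order of criticality. The `SystemFunction` type, the numeric criticality and cost fields, and the greedy policy are all assumptions made for this example, not something the notes prescribe.

```python
from typing import List, NamedTuple


class SystemFunction(NamedTuple):
    name: str
    criticality: int   # lower number = more critical (assumed convention)
    cost: int          # resource units the function needs (assumed unit)


def functions_to_keep(functions: List[SystemFunction],
                      available_capacity: int) -> List[str]:
    """Keep functions in order of criticality while capacity remains."""
    kept = []
    remaining = available_capacity
    for fn in sorted(functions, key=lambda f: f.criticality):
        if fn.cost <= remaining:
            kept.append(fn.name)
            remaining -= fn.cost
    return kept


if __name__ == "__main__":
    catalogue = [
        SystemFunction("recommendations", criticality=3, cost=5),
        SystemFunction("billing", criticality=1, cost=4),
        SystemFunction("search", criticality=2, cost=3),
    ]
    # After a failure leaves 7 of 12 units, the least critical function is dropped.
    print(functions_to_keep(catalogue, available_capacity=7))
```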
Passive redundancy (warm spare)
This refers to a configuration where only the active members of the protection group process input traffic; one of their duties is to provide the redundant spare(s) with periodic state updates.
A tactic that attempts to recover from component failures by reassigning responsibilities to the (potentially restricted) resources which are still operational, while maintaining as much functionality as possible.
Active redundancy (hot spare)
This refers to a configuration where all of the nodes (active or redundant spare) in a protection group receive and process identical inputs in parallel, allowing the redundant spare(s) to maintain synchronous state with the active node(s). Because the redundant spare possesses an identical state to the active processor, it can take over from a failed component in a matter of milliseconds.
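The difference between the warm spare and the hot spare shows up at failover time: a warm spare resumes from its last periodic state update, while a hot spare has processed every input itself. The sketch below contrasts the two with a toy component that sums its inputs; `Replica`, `WarmSpareGroup`, `HotSpareGroup`, and the `sync_every` parameter are hypothetical names introduced for illustration.

```python
class Replica:
    """Toy stateful component: keeps a running total of its inputs."""

    def __init__(self) -> None:
        self.total = 0

    def process(self, value: int) -> None:
        self.total += value


class WarmSpareGroup:
    """Passive redundancy: only the active node processes input."""

    def __init__(self, sync_every: int) -> None:
        self.active = Replica()
        self.spare = Replica()
        self.sync_every = sync_every
        self._since_sync = 0

    def handle(self, value: int) -> None:
        self.active.process(value)
        self._since_sync += 1
        if self._since_sync >= self.sync_every:
            self.spare.total = self.active.total   # periodic state update
            self._since_sync = 0

    def fail_over(self) -> Replica:
        # The spare takes over with whatever state it last received.
        self.active, self.spare = self.spare, Replica()
        self._since_sync = 0
        return self.active


class HotSpareGroup:
    """Active redundancy: every node processes identical input in parallel."""

    def __init__(self) -> None:
        self.active = Replica()
        self.spare = Replica()

    def handle(self, value: int) -> None:
        self.active.process(value)
        self.spare.process(value)

    def fail_over(self) -> Replica:
        # The spare already holds identical state; a replacement spare would
        # still need its state resynchronized (not shown here).
        self.active, self.spare = self.spare, Replica()
        return self.active
```

With `sync_every=2` and three inputs, the warm spare takes over without the third input, whereas the hot spare has seen all three. That gap is the price paid for the lower processing load of passive redundancy.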
Increase competence set tactic
Design the component to handle more cases (faults) as part of its normal operation
Take corrective action when conditions are detected that are predictive of likely future faults
Removal from service
This tactic refers to temporarily placing a system component in an out-of-service state for the purpose of mitigating potential system failures.
Fault detection tactic
A tactic used to detect incorrect sequences of events, mainly in distributed message-passing systems.
A tactic to check the validity or reasonableness of specific operations or outputs of a component
A fault detection tactic that employs a periodic message exchange between a system monitor and the process being monitored
Sent at regular intervals -> if the receiving point does not receive a heartbeat within a set time, the machine that should have sent it is assumed to have failed
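The heartbeat rule above reduces to remembering when each node was last heard from and comparing that against a limit. In this minimal sketch the `HeartbeatMonitor` class is a hypothetical name, and the current time is passed in explicitly so the logic can be exercised without real clocks or networking.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports the silent ones."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str, now: float) -> None:
        """Called whenever a heartbeat message from `node` arrives."""
        self._last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

In a deployment the caller would likely supply `time.monotonic()` as `now`, since wall-clock adjustments could otherwise produce false failure reports.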
A tactic that refers to the detection of a system condition which alters the normal flow of execution
System exceptions will vary according to the processor hardware architecture employed and include faults such as divide by zero, bus and address faults, illegal program instructions, and so forth.
Timeout is a tactic that raises an exception when a component detects that it or another component has failed to meet its timing constraints. For example, a component awaiting a response from another component can raise an exception if the wait time exceeds a certain value.
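One way to sketch the timeout tactic in Python is to run the awaited component in a worker thread and bound the wait with `Future.result(timeout=...)` from the standard `concurrent.futures` module. `ComponentTimeout`, `call_with_timeout`, and `slow_component` are hypothetical names; the last one merely stands in for a component that may respond late.

```python
import concurrent.futures
import time


class ComponentTimeout(Exception):
    """Raised when the awaited component misses its timing constraint."""


def slow_component(delay_seconds: float) -> str:
    """Stand-in for a component that takes `delay_seconds` to answer."""
    time.sleep(delay_seconds)
    return "done"


def call_with_timeout(func, timeout_seconds: float, *args):
    """Return func(*args), or raise ComponentTimeout if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError as exc:
        raise ComponentTimeout(
            f"no response within {timeout_seconds} s") from exc
    finally:
        # Do not block on the worker; it keeps running in the background.
        executor.shutdown(wait=False)
```

The sketch only detects the missed deadline. It does not stop the late component, so a real system would pair it with a recovery tactic.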
A component that is used to monitor the state of health of various other parts of the system (processor, memory, I/O)
Can detect failure or congestion in the network or other resources
A tactic in which components run procedures to test themselves for correct operation
An asynchronous request/response message pair exchanged between nodes, used to determine the reachability of a node through the associated network path.
The echo is used to determine whether the pinged component is alive
Requires a time threshold to be set -> it tells how long to wait for the echo before considering the pinged component to have failed
How to plan for failure
1) Hazard analysis
Characterize the hazards that can occur during the operation of a system according to their severity
5 factors are examined in the light of software hazard (Safeware Engineering Corporation)
5 failure condition levels
2) Fault tree analysis
Specify a state of the system that negatively impacts safety or reliability
Analyze the system context and operation to find the ways the undesired state could occur
Graphic aid -> helps identify all sequential and parallel sequences of contributing faults that lead to the undesired state; contributing faults may be hardware failures, human errors, or software errors.
If the top event has occurred, then one or more contributing failures have occurred, so the graphic aid can be used to track down the failures and repair them
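A fault tree is conventionally drawn with AND and OR gates joining contributing faults up to the top event, so it can be evaluated and searched with a short recursive function. The sketch below assumes that gate model; the `Event` type, the function names, and the example tree (a service outage caused by power loss or by both disks failing) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    name: str
    gate: str = "BASIC"                       # "BASIC", "AND", or "OR"
    children: List["Event"] = field(default_factory=list)


def occurs(event: Event, active_faults: Set[str]) -> bool:
    """Does this event occur, given the basic faults currently present?"""
    if event.gate == "BASIC":
        return event.name in active_faults
    results = [occurs(child, active_faults) for child in event.children]
    if event.gate == "AND":
        return all(results)
    if event.gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {event.gate}")


def contributing_faults(event: Event, active_faults: Set[str]) -> List[str]:
    """Basic faults that actually lead to `event`; empty if it did not occur."""
    if not occurs(event, active_faults):
        return []
    if event.gate == "BASIC":
        return [event.name]
    found: List[str] = []
    for child in event.children:
        found.extend(contributing_faults(child, active_faults))
    return found


# Hypothetical tree: outage = power loss OR (primary disk AND backup disk fail)
TOP = Event("service outage", "OR", [
    Event("power loss"),
    Event("storage lost", "AND", [
        Event("primary disk failure"),
        Event("backup disk failure"),
    ]),
])
```

Calling `contributing_faults(TOP, ...)` with the faults observed in the field walks the tree from the top event down, which mirrors how the graphic aid is used to track down what needs repair.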