Reliability and Fault Tolerance
Aims
To understand the factors which affect the reliability of a system and introduce how software design faults can be tolerated
Safety and Dependability
Safety
freedom from those conditions that can cause death, injury, occupational illness, damage to (or loss of) equipment (or property), or environmental harm
Reliability
a measure of the success with which a system conforms to some authoritative specification of its behaviour
Dependability
Attributes
Reliable
Safe
Confidential
Integral
Maintainable
Available
Means
Fault Prevention
Fault Tolerance
Fault Removal
Fault Forecasting
Impairments
Faults
Errors
Failures
Reliability, failure and faults
When the behaviour of a system deviates from that which is specified for it, this is called a
failure
Failures result from unexpected problems internal to the system that eventually manifest themselves in the system's external behaviour and these problems are called
errors
and their mechanical or algorithmic cause are termed
faults
Fault > Error > Failure
Fault Types
Transient fault
Fault that starts at a particular time, remains in the system for some period and then disappears
E.g. communications systems
Permanent faults
Faults that remain in the system until they are repaired
e.g. broken wire or a software design error
Intermittent faults
Transient faults that occur from time to time
e.g. a hardware component that is heat sensitive, it works for a time, stops working, cools down and then starts to work again.
Failure modes
value domain
Constraint error
Value error
Timing domain
Early
Omission
Fail silent
Fail stop
Fail controlled
Late
Arbitrary
Fail uncontrolled
Fault prevention and fault tolerance
Fault prevention
fault avoidance
Attempts to limit the introduction of faults during system construction
fault removal
Procedures for finding and removing the causes of errors
E.g.
design reviews
program verification
code inspections
system testing
Fault Tolerance
Levels of Fault Tolerance
Graceful Degradation (fail soft)
The system continues to operate in the presence of errors, accepting a partial degradation of functionality or performance during recovery or repair
Fail Safe
The system maintains its integrity while accepting a temporary halt in its operation
Full Fault Tolerance
System continues to operate in the presence of faults, albeit for a limited period, with no significant loss of functionality or performance
Redundancy (protective redundancy)
Aims
minimise redundancy while maximising reliability, subject to the cost and size constraints of the system
Advisable to separate out the fault-tolerant components from the rest of the system
Hardware Fault Tolerance
Static (masking) redundancy
redundant components are used inside a system to hide the effects of faults
e.g.
Triple Modular Redundancy (TMR)
3 identical subcomponents and majority voting circuits;
the outputs are compared and if one differs from the other two, that output is masked out
N-Modular Redundancy (NMR)
To mask faults from more than one component
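The majority voting used by TMR and NMR can be sketched in software. This is a minimal illustration only: the function name `majority_vote` is an assumption made here, and real TMR voters are hardware circuits rather than Python functions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has a strict majority, i.e. the
    fault cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: fault cannot be masked")
    return value

# TMR: three replicas, the second one is faulty and is masked out.
tmr_result = majority_vote([42, 17, 42])

# NMR with N = 5: two faulty replicas are still outvoted.
nmr_result = majority_vote([7, 7, 3, 9, 7])

print(tmr_result, nmr_result)
```

With three replicas a single faulty output is outvoted; with five, two faulty outputs are still masked, which is why NMR is used when more than one component may fail.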
Dynamic redundancy
Redundancy supplied inside a component which indicates that the output is in error
Provides an error detection facility
recovery must be provided by another component
e.g.
Communications checksums
memory parity bits
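A parity bit shows how dynamic redundancy detects an error without recovering from it. The sketch below assumes even parity over a byte string; the helper names `parity_bit` and `is_consistent` are hypothetical.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the number of set bits in data is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2

def is_consistent(data: bytes, stored_parity: int) -> bool:
    """Error detection only: reports a mismatch, cannot say which bit is wrong."""
    return parity_bit(data) == stored_parity

word = b"\x5a"                       # 0101 1010, four set bits
stored = parity_bit(word)            # parity stored alongside the word

corrupted = bytes([word[0] ^ 0x08])  # a single bit flips in memory

print(is_consistent(word, stored))       # word is intact
print(is_consistent(corrupted, stored))  # single-bit error is detected
```

The check only signals that the output is in error. Recovery (for example re-reading or retransmitting) has to come from another component, and two flipped bits would go undetected by a single parity bit.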
Software Fault Tolerance
Used for detecting design errors
Static
N-Version programming
depends on
initial specification
independence of effort
Adequate budget
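N-version programming can be sketched as a driver that runs N independently written versions and votes on their results. Everything below is illustrative: the three versions, the seeded fault in `version_c`, and the `driver` function are assumptions, and a real driver would typically run the versions concurrently rather than in sequence.

```python
from collections import Counter

def version_a(n):
    """Closed-form sum of 1..n."""
    return n * (n + 1) // 2

def version_b(n):
    """Iterative sum of 1..n."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def version_c(n):
    """Deliberately faulty version: off-by-one, sums only 1..n-1."""
    return sum(range(1, n))

def driver(versions, *args):
    """Invoke every version, collect the votes and return the majority result."""
    votes = [version(*args) for version in versions]
    value, count = Counter(votes).most_common(1)[0]
    if count * 2 <= len(votes):
        raise RuntimeError("versions disagree: no majority result")
    return value

result = driver([version_a, version_b, version_c], 10)
print(result)
```

The design fault in `version_c` is masked because the other two versions agree. If all versions shared the same misreading of the specification, the vote would be unanimous and wrong, which is why the scheme depends on the initial specification and on independence of effort.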
Dynamic Redundancy
error detection
no fault tolerance scheme can be utilised until the associated error is detected
type of error detection
Environmental detection
Application detection
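The two kinds of error detection can be contrasted in a short sketch. The function names and the temperature limits are hypothetical; the point is only where the check is performed.

```python
def scale_reading(raw, divisor):
    # Environmental detection: the execution environment (here the Python
    # runtime) detects the error and signals it as ZeroDivisionError.
    return raw / divisor

def check_temperature(celsius):
    # Application detection: the program itself checks the value against
    # a plausible range (the limits are assumed for illustration).
    if not -50.0 <= celsius <= 150.0:
        raise ValueError(f"implausible temperature: {celsius}")
    return celsius

detected = []
try:
    scale_reading(10, 0)
except ZeroDivisionError:
    detected.append("environment")

try:
    check_temperature(1000.0)
except ValueError:
    detected.append("application")

print(detected)
```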
damage confinement and assessment
to what extent has the system been corrupted?
error recovery
techniques should aim to transform the corrupted system into a state from which it can continue its normal operation (perhaps with degraded functionality)
2 approaches
forward error recovery
continues from an erroneous state by making selective corrections to the system state
backward error recovery
BER relies on restoring the system to a previous safe state and executing an alternative section of the program
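Backward error recovery is commonly structured as a recovery block: establish a recovery point, run the primary module, apply an acceptance test, and on failure restore the state and try an alternative. The sketch below assumes the state is a dictionary; `recovery_block`, `primary`, `secondary` and `acceptable` are hypothetical names, and the fault in `primary` is seeded on purpose.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn until one passes the acceptance test."""
    recovery_point = copy.deepcopy(state)      # establish recovery point
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state                   # recovery point is discarded
        except Exception:
            pass                               # treat an exception as a failed test
        state.clear()                          # restore the previous safe state
        state.update(copy.deepcopy(recovery_point))
    raise RuntimeError("all alternates failed the acceptance test")

def primary(state):
    # Deliberately faulty: sorts the data but loses the last element.
    state["data"] = sorted(state["data"])[:-1]

def secondary(state):
    state["data"] = sorted(state["data"])

def acceptable(state):
    data = state["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == state["expected_len"]

state = {"data": [3, 1, 2], "expected_len": 3}
result = recovery_block(state, [primary, secondary], acceptable)
print(result["data"])
```

The acceptance test is deliberately cheaper than recomputing the answer: it checks ordering and length, not that the output is a permutation of the input. That trade-off is typical, and it means some erroneous results can still pass.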
fault treatment and continued service
an error is a symptom of a fault;
although the damage is repaired, the fault may still exist
Fault Treatment
2 stage
fault location
system repair
Both approaches attempt to produce systems which have well-defined failure modes
N-Version Programming vs Recovery Blocks
Static (NV) vs dynamic redundancy (RB)
Design overheads
both require alternative algorithms; NV requires a driver, RB requires an acceptance test
Runtime overheads
NV requires N * resources, RB requires establishing recovery points
Diversity of design
both susceptible to errors in requirements
Error detection
vote comparison (NV) versus acceptance test (RB)
Atomicity
NV votes before it outputs to the environment; RB must be structured to output only after an acceptance test has passed