The Five Nines Concept
Measures to Improve Availability
Disaster Recovery
Incident Response
High Availability
The Five Nines
Five Nines mean that systems and services are available 99.999% of the time.
It also means that planned and unplanned downtime combined is less than 5.26 minutes per year
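The 5.26-minute figure follows directly from the percentage. A minimal sketch of the arithmetic, assuming an average year of 365.25 days (the function and constant names are illustrative, not from the original notes):

```python
# Average year length in minutes, counting leap years (an assumption of this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction

# Compare three, four, and five nines.
for availability in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(availability)
    print(f"{availability}% available -> {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why five nines is so much harder to reach than three.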
High Availability refers to a system or component that is continuously operational for a given length of time.
To help ensure high availability
Design for reliability
Detect failures as they occur
Eliminate single points of failure
Environments That Require Five Nines
Healthcare facilities require high availability to provide around-the-clock care for patients
The Public Safety industry includes agencies that provide security and services to a community, state, or nation
The Finance Industry needs to maintain high availability for continuous trading, compliance, and customer trust
The Retail Industry depends on efficient supply chains and the delivery of products to customers.
Disruption can be devastating, especially during peak demand times such as holidays
Threats to Availability
There are many different types of threats to high availability. They range from the failure of a mission-critical application to a severe storm such as a hurricane or tornado.
Threats can also include catastrophic events such as a terrorist attack, building bombing, or building fire.
Designing a High Availability System
High availability incorporates three major principles to achieve the goal of uninterrupted access to data and services
System Resiliency
Fault Tolerance
Elimination or reduction of single points of failure
Asset Management
An org needs to know what hardware and software assets it has in order to protect them.
Asset Management includes a complete inventory of hardware and software.
This means that the org needs to know all of the components that can be subject to security risks
including
Every network device's OS
Every software application
Every Hardware network device
All firmware
Every OS
All language runtime environments
Every hardware system
All individual libraries
Asset Classification
Asset Standardization
Threat Identification
Risk Analysis
Mitigation
Defense In Depth
Assigns all resources of an org into a group based on common characteristics.
An Org should apply an asset classification system to documents, data records, data files, and disks
As part of an IT asset management system, an Org specifies the acceptable IT assets that meet its objectives
The United States Computer Emergency Readiness Team (US-CERT) and the U.S. Department of Homeland Security sponsor a dictionary of Common Vulnerabilities and Exposures (CVE).
The CVE identification contains a standard identifier number with a brief description, and references to related vulnerability reports and advisories.
Risk analysis is the process of analyzing the dangers posed by natural and human-caused events to the assets of an org.
A user performs an asset identification to help determine which assets to protect
Mitigation involves reducing the severity of the loss or the likelihood of the loss occurring.
Many technical controls mitigate risk including authentication systems, file permissions, and firewalls
Defense in depth will not provide an impenetrable cyber shield, but it will help an org minimize risk by keeping it one step ahead of cybercriminals.
In the Media Industry, the news cycle now runs around the clock, 24/7/365
Threat Categories
Sabotage
Hardware Failures
Software Attacks
Software Errors
Theft
Human Error
Utility Interruption
Natural Disasters
Steps to ID and Classify Assets
Step 1: Asset ID categories
Step 2: Asset Accountability
Step 3: Classification Schema Criteria
Step 4: Classification Schema Implementation
Info assets
ID the owner for all info assets
Confidentiality
Adopt a uniform way of identifying info to ensure uniform protection
Software assets
Physical assets
Services
ID the owner for all application software
Value
Time
Access Rights
Each CVE ID includes
A brief description of the security vulnerability
Any important references
The CVE ID #
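The three parts of a CVE entry listed above map naturally onto a small record type. A minimal sketch, assuming the public CVE ID syntax of "CVE", a four-digit year, and a sequence of four or more digits; the class name, field names, and the sample entry are hypothetical placeholders, not real CVE data:

```python
import re
from dataclasses import dataclass, field

# CVE IDs look like CVE-<4-digit year>-<4 or more digits>.
CVE_ID_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

@dataclass
class CveEntry:
    cve_id: str                      # the CVE ID number
    description: str                 # brief description of the vulnerability
    references: list = field(default_factory=list)  # related reports and advisories

    def __post_init__(self) -> None:
        # Reject identifiers that do not follow the CVE ID syntax.
        if not CVE_ID_PATTERN.fullmatch(self.cve_id):
            raise ValueError(f"not a valid CVE ID: {self.cve_id!r}")

# Hypothetical placeholder entry, for illustration only.
entry = CveEntry(
    cve_id="CVE-2099-0001",
    description="Placeholder description of a vulnerability",
    references=["placeholder vendor advisory"],
)
print(entry.cve_id, "-", entry.description)
```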
ID vulnerabilities and threats
Quantify the probability and impact of the identified threats
ID assets and their value
Balance the impact of the threat against the cost of the countermeasure
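One common way to quantify a threat is to multiply its estimated probability by its estimated impact, then compare the reduction a countermeasure buys against what the countermeasure costs. A minimal sketch under that assumption; the names and all figures are hypothetical and only illustrate the balancing step:

```python
from dataclasses import dataclass

@dataclass
class Threat:
    name: str
    yearly_probability: float  # estimated chance of occurring in a year (0 to 1)
    impact_cost: float         # estimated loss if the threat occurs

    @property
    def expected_yearly_loss(self) -> float:
        # Probability times impact gives the expected loss per year.
        return self.yearly_probability * self.impact_cost

def countermeasure_is_worthwhile(threat: Threat,
                                 residual_probability: float,
                                 countermeasure_yearly_cost: float) -> bool:
    """True if the reduction in expected loss exceeds the countermeasure cost."""
    reduced_loss = residual_probability * threat.impact_cost
    savings = threat.expected_yearly_loss - reduced_loss
    return savings > countermeasure_yearly_cost

# Hypothetical numbers for illustration.
threat = Threat("file server disk failure", yearly_probability=0.2, impact_cost=50_000)
print(threat.expected_yearly_loss)
print(countermeasure_is_worthwhile(threat, residual_probability=0.02,
                                   countermeasure_yearly_cost=3_000))
```

A countermeasure that costs more than the loss it prevents fails this test, which is the case where accepting or transferring the risk may be the better choice.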
Mitigation Strategies
Reduce the risk by designing a new business process with adequate built-in risk control and containment measures from the start
Avoiding risk altogether would include measures such as physically disconnecting from the internet
Accept the risk, periodically reassess accepted risks in ongoing processes as a normal feature of business operations, and modify mitigation measures.
Transfer risks to an external agency (for example, through a service level agreement or an insurance company)
To make sure data and info remain available, an org must create different layers of protection
Simplicity
Layering
Obscuring
Limiting
Diversity
Provides the most comprehensive protection.
If cybercriminals penetrate one layer, they still have to contend with several more layers, with each layer being more complicated than the previous
Layering is creating a barrier of multiple defenses that coordinate together to prevent attacks
Limiting access to data and info reduces the possibility of a threat.
An organization should implement the principle of least privilege
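Least privilege can be pictured as a deny-by-default lookup: each role gets only the actions it needs, and anything not explicitly granted is refused. A minimal sketch; the role and action names are hypothetical:

```python
# Hypothetical roles, each granted only the actions it needs.
ROLE_PERMISSIONS = {
    "help_desk": {"read_tickets", "reset_password"},
    "db_admin": {"read_db", "write_db"},
}

def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and ungranted actions are refused."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("help_desk", "reset_password"))
print(is_allowed("help_desk", "write_db"))
```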
Diversity refers to changing the controls and procedures at different layers.
Breaching one layer does not compromise the whole system
An organization may use different encryption algorithms or authentication systems to protect data in different states
Obscuring info can also protect data and info.
An org should not reveal any info that cybercriminals can use to figure out what version of the OS a server is running or the type of equipment it uses
Complexity does not necessarily guarantee security.
If a process or technology is too complex, misconfigurations or failures to comply can result.
Simplicity can actually improve availability
Redundancy
Single Points of Failure
Must be identified and addressed
Can be a specific piece of hardware, a process, a specific piece of data, or even an essential utility
Single points of failure are the weak links in the chain that can cause disruption of the org's operations
Generally, the solution to a single point of failure is to modify the critical operation so that it does not rely on a single element
The Org can also build redundant components into the critical operations to take over the process should one of these points fail
[N+1]
Ensures system availability in the event of a component failure
Redundant array of independent disks (RAID)
For example, a car has four tires (N) and a spare tire in the trunk in case of a flat (+1)
RAID combines multiple physical hard drives into a single logical unit to provide data redundancy and improve performance
RAID takes data that is normally stored on a single disk and spreads it out among several drives. If any single disk is lost, the user can recover data from the other disks where the data also resides
RAID can also increase the speed of data retrieval
Using multiple drives makes retrieving requested data faster, instead of relying on just one disk to do the work
A RAID solution can be either hardware-based or software-based.
RAID types
Parity - Detects data errors
Striping - Writes data across multiple drives
Mirroring - Stores duplicate data on a second drive
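The three techniques can be shown on a few bytes of data. A minimal sketch, assuming XOR parity of the kind used in RAID 5-style layouts, where the parity block both flags a mismatch and lets a single lost block be rebuilt; block size, variable names, and the sample data are arbitrary choices for illustration:

```python
from functools import reduce

def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Striping: split the data into equal-sized blocks, one per data drive.
data = b"AVAILABILITY"
block_size = 4
stripes = [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Parity: one extra block computed from all the data blocks.
parity = xor_blocks(stripes)

# Simulate losing the second drive, then rebuild it from the survivors plus parity.
lost_index = 1
survivors = [block for i, block in enumerate(stripes) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == stripes[lost_index]

# Mirroring: a second drive simply holds a duplicate of every block.
mirror = list(stripes)
print(stripes, parity.hex(), rebuilt)
```

The rebuild works because XOR-ing a value in twice cancels it out, so XOR-ing the parity with the surviving blocks leaves only the missing block.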
Spanning Tree
A network protocol that provides for redundancy:
STP ensures that redundant physical links are loop-free.
It ensures that there is only one logical path between all destinations on the network
STP intentionally blocks redundant paths that could cause a loop
The basic function of STP is to prevent loops on a network when switches interconnect via multiple paths
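The effect of STP can be illustrated with a graph: keep one path from a root switch to every other switch and block whatever links remain. This is a simplified sketch using a breadth-first search; real STP elects the root and chooses paths using bridge IDs and path costs, which are not modeled here. Switch names and the function name are hypothetical:

```python
from collections import deque

def loop_free_links(links, root):
    """Return (forwarding, blocked) link sets for a simplified spanning tree."""
    # Build an adjacency list from the physical links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    # Breadth-first search from the root keeps the first link that reaches each switch.
    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbors.get(switch, [])):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Any physical link not kept in the tree would form a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Three switches wired in a triangle: one link must be blocked.
links = [("S1", "S2"), ("S2", "S3"), ("S3", "S1")]
forwarding, blocked = loop_free_links(links, root="S1")
print("forwarding:", [sorted(link) for link in forwarding])
print("blocked:", [sorted(link) for link in blocked])
```

The blocked link stays physically connected, so it remains available as a redundant path if one of the forwarding links fails.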
Router Redundancy
The default gateway is typically the router that provides devices access to the rest of the network or to the Internet.
If there is only one router serving as the default gateway, it is a single point of failure.
Router redundancy involves installing an additional standby router
The ability of a network to dynamically recover from the failure of a router acting as a default gateway is known as first-hop redundancy
Router Redundancy Options
Hot Standby Router Protocol (HSRP)
Virtual Router Redundancy Protocol (VRRP)
HSRP provides high network availability by providing first-hop routing redundancy
A VRRP router runs the VRRP protocol in conjunction with one or more other routers attached to a LAN.
In a VRRP configuration, the elected router is the virtual router master, and the other routers act as backups, in case the virtual router master fails
Gateway Load Balancing Protocol (GLBP)
GLBP protects data traffic from a failed router or circuit, like HSRP and VRRP, while also allowing load balancing (also called load sharing) between a group of redundant routers
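The master and backup roles described for VRRP can be sketched as a simple election. This assumes VRRP-style rules, where the live router with the highest priority becomes master and a tie goes to the higher IP address; advertisements and timers are not modeled, and the router names, priorities, and addresses are hypothetical:

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

@dataclass
class Router:
    name: str
    priority: int
    address: IPv4Address
    alive: bool = True

def elect_master(routers):
    """Pick the live router with the highest priority; ties go to the higher address."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None  # no gateway is available at all
    return max(candidates, key=lambda r: (r.priority, r.address))

routers = [
    Router("R1", priority=120, address=IPv4Address("10.0.0.2")),
    Router("R2", priority=100, address=IPv4Address("10.0.0.3")),
]
print("master:", elect_master(routers).name)

# Simulate the master failing: the backup takes over as default gateway.
routers[0].alive = False
print("master after failure:", elect_master(routers).name)
```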
Location Redundancy
An Org may need to consider location redundancy depending on its needs.
Three forms of location redundancy
Synchronous
Asynchronous
Point-in-time replication
Syncs both locations in real time
Not synchronized in real time but close to it
Updates the backup data location periodically
System Resilience
Resiliency defines the methods and configurations used to make a system or network tolerant of failure.
Routing protocols provide resiliency.
Requires high bandwidth
Locations must be close together to reduce latency
Requires less bandwidth
Sites can be further apart because latency is less of an issue
The most bandwidth-conservative option because it does not require a constant connection
Resilient design is more than just adding redundancy. It is critical to understand the business needs of the organization, and then incorporate redundancy to create a resilient network.
Application Resilience
The application's ability to react to problems in one of its components while still functioning
Many Orgs balance out the cost of resiliency of application infrastructure with the cost of losing customers or business due to application failure
Application high availability is complex and costly
Availability Solutions
Cluster Architecture
Backup and Restore
Fault Tolerant Hardware
A system designed by building multiples of all critical components into the same computer
A group of servers that act like a single system
Copying files for the purpose of being able to restore them if data loss occurs
IOS Resilience
The Internetwork Operating System (IOS) for Cisco routers and switches includes a resilient configuration feature
Allows faster recovery
Maintains a secure working copy of the router IOS image file and a copy of the running config file
Response Phases
Response Technologies
Detection and Analysis
Containment, Eradication, and Recovery
Preparation
Post-Incident Follow-Up
Planning for potential incidents
Discovering the incident
Efforts to immediately contain or eradicate the threat and begin recovery efforts
Investigate the cause of the incident and ask questions to better understand the nature of the threat
NetFlow and IPFIX
Intrusion Prevention Systems (IPSs)
Intrusion Detection Systems (IDSs)
Advanced Threat Intelligence
Network Admission Control (NAC)
NAC allows network access for authorized users with compliant systems.
A compliant system meets all of the policy requirements of the org
IDSs monitor the traffic on a network
An IPS operates in inline mode
NetFlow is a Cisco IOS technology that provides statistics on packets flowing through a Cisco router or multilayer switch.
Advanced threat intelligence can help an org detect attacks during one of the stages of a cyberattack (and sometimes before, with the right info)
IDSs are passive
An IPS can detect and immediately address a network problem
The Internet Engineering Task Force (IETF) used Cisco's NetFlow Version 9 as the basis for IP Flow Information Export (IPFIX)
Orgs should have a response plan and a Computer Security Incident Response Team (CSIRT) to manage the response
CSIRT is responsible for
Ensuring its members know about the plan
Testing the plan
Maintaining the incident response plan
Getting management's approval of the plan
Orgs can have the best detection systems; however, if admins do not review the logs and monitor alerts, those systems are useless
includes
Alerts and notifications
Monitoring and follow-up
Incident analysis helps to identify the source, extent, impact, and details of a data breach
May require additional downtime for systems
Questions include
What preventive measures need strengthening?
How can it improve system monitoring?
How can it minimize downtime during the containment, eradication, and recovery phases?
What actions will prevent the incident from reoccurring?
How can management minimize the impact to the business?
NAC evaluates an incoming device against the policies of the network
NAC can quarantine the systems that do not comply and manages the remediation of noncompliant systems
Common NAC system checks include
Operating system patches and updates
Complex password enforcement
Updated Virus Detection
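The NAC checks listed above amount to evaluating a device against a policy and quarantining it if anything fails. A minimal sketch of that decision; the class, field, and function names are hypothetical and do not correspond to any particular NAC product:

```python
from dataclasses import dataclass

@dataclass
class Device:
    hostname: str
    os_patched: bool                 # operating system patches and updates applied
    complex_password_enforced: bool  # complex password policy in force
    virus_definitions_current: bool  # virus detection is up to date

def failed_checks(device: Device) -> list:
    """Return the names of the policy checks the device does not meet."""
    checks = {
        "os patches": device.os_patched,
        "complex passwords": device.complex_password_enforced,
        "virus detection": device.virus_definitions_current,
    }
    return [name for name, passed in checks.items() if not passed]

def admission_decision(device: Device) -> str:
    """Allow compliant devices; quarantine the rest for remediation."""
    return "quarantine" if failed_checks(device) else "allow"

laptop = Device("laptop-01", os_patched=True,
                complex_password_enforced=True, virus_definitions_current=False)
print(admission_decision(laptop), failed_checks(laptop))
```

Returning the list of failed checks, not just a yes or no, gives the remediation step something concrete to fix before the device is re-evaluated.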
An IDS analyzes copies of traffic rather than the actual forwarded packets
Working offline, it compares the captured traffic stream with known malicious signatures
Although the IDS is physically positioned in the network, traffic must be mirrored in order to reach it
Network traffic does not pass through the IDS unless it is mirrored
Does not negatively affect the packet flow of the forwarded traffic
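The passive behavior described above can be sketched as follows: the forwarding path hands the IDS a copy of each packet, the IDS compares the copy against known signatures, and the original packet is forwarded regardless of the result. The signature names and byte patterns are simplified, hypothetical examples, not rules from any real IDS:

```python
# Hypothetical signatures: a name mapped to a byte pattern to look for.
SIGNATURES = {
    "example-sql-injection": b"' OR '1'='1",
    "example-path-traversal": b"../../",
}

def inspect_copy(packet_copy: bytes) -> list:
    """Return the names of all signatures found in the mirrored copy."""
    return [name for name, pattern in SIGNATURES.items() if pattern in packet_copy]

def handle_packet(packet: bytes, alerts: list) -> bytes:
    """Mirror the packet to the IDS, record any alert, and forward the original."""
    mirrored = bytes(packet)          # the IDS only ever sees a copy
    matches = inspect_copy(mirrored)
    if matches:
        alerts.append(matches)        # passive: raise an alert, do not block
    return packet                     # forwarding is unaffected by the inspection

alerts = []
forwarded = handle_packet(b"GET /../../etc/passwd HTTP/1.1", alerts)
print(forwarded, alerts)
```

An inline IPS would differ at the last step: it sits in the forwarding path, so it could drop the packet instead of returning it.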
IPFIX is a standard for exporting router-based information about network traffic
IPFIX analysis benefits
Troubleshoots network failures quickly and precisely
Analyzes network flows for capacity planning
Secures the network against internal and external threats
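As an illustration of the capacity-planning benefit, exported flow records can be totaled per source to find the heaviest talkers. A minimal sketch, assuming the records have already been collected and decoded into dictionaries; the simplified field names ("src", "dst", "bytes") and the sample values are hypothetical:

```python
from collections import Counter

# Hypothetical, already-decoded flow records.
flow_records = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 1200},
    {"src": "10.0.0.7", "dst": "10.0.1.9", "bytes": 300},
    {"src": "10.0.0.5", "dst": "10.0.2.4", "bytes": 4500},
]

def bytes_by_source(records) -> Counter:
    """Total the bytes sent by each source address across all flows."""
    totals = Counter()
    for record in records:
        totals[record["src"]] += record["bytes"]
    return totals

totals = bytes_by_source(flow_records)
# List sources from heaviest to lightest.
for source, total in totals.most_common():
    print(source, total)
```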
Security Alerts
Components of a Cyberattack
Account lockouts
All database events
Asset creation and deletion
Configuration modification to systems
Delivery
Infrastructure
Victim
Motivation
Actor
Identity
Location
Forensics
Mechanism
Exploit
Malware
Domains
Operations
Servers
Monetary
Espionage
Politics
Role
Connections
Identity
Disaster Recovery Planning
Types of Disasters
Natural Disasters
Human-caused Disasters
It is critical to keep an organization functioning when a disaster occurs.
A disaster includes any natural or human-caused event that damages assets or property and impairs the ability for the organization to continue operating
Meteorological disasters include hurricanes, tornadoes, snowstorms, lightning and hail
Health disasters include widespread illnesses, quarantines, and pandemics
Miscellaneous disasters include fires, floods, solar storms, and avalanches
Geological disasters include earthquakes, landslides, volcanoes, and tsunamis
Social-political events include vandalism, blockades, protests, sabotage, terrorism, and war
Materials events include hazardous spills and fires
Labor events include strikes, walkouts, and slowdowns
Utilities disruptions include power failures, communication outages, fuel shortages, and radioactive fallout
Disaster Recovery Plan (DRP)
An org puts its DRP into action while the disaster is ongoing, and employees are scrambling to ensure critical systems are online
A DRP includes
Who is responsible for this process?
What does the individual need to perform the process?
Where does the individual perform this process?
What is the process?
Why is the process critical?
Implementing DR Controls
Disaster recovery controls minimize the effects of a disaster to ensure that resources and business processes can resume operations
There are three types of Controls
Preventive Controls
Detective Controls
Corrective Controls
Keeping data backed up
Keeping data backups off-site
Using surge protectors
Installing Generators
Using up-to-date antivirus software
Installing server and network monitoring software
Keeping critical documents in the disaster recovery plan
Business Continuity Planning
Need for Business Continuity
Business Continuity Considerations
Business Continuity Best Practice
Business continuity is one of the most important concepts in computer security. Even though companies do whatever they can to prevent disasters and loss of data, it is impossible to predict every scenario.
It is important for companies to have plans in place that ensure business continuity regardless of what may occur
Business continuity controls are more than just backing up data and providing redundant hardware
Considerations should include
Establishing alternate communications channels
Providing power
Identifying all dependencies for applications and processes
Documenting Configurations
Understanding how to carry out automated tasks manually
- Write a policy that provides guidance to develop the business continuity plan and assigns roles to carry out the tasks
- Identify critical systems and processes, and prioritize them based on necessity
- Identify vulnerabilities, threats, and calculate risks
- Identify and implement controls and countermeasures to reduce risk
- Devise methods to bring back critical systems quickly
- Write procedures to keep the organization functioning when in a chaotic state
- Test the plan
- Update the plan regularly
A business continuity plan is a broader plan than a DRP
Getting the right people to the right places