The Five Nines Concept

Measures to Improve Availability

Disaster Recovery

Incident Response

High Availability

The Five Nines

Five Nines means that systems and services are available 99.999% of the time.
It also means that planned and unplanned downtime combined is less than 5.26 minutes per year.
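
The 5.26-minute figure follows directly from the availability percentage. The sketch below is only an illustration of that arithmetic; the function name is made up for this example, and it assumes an average year of 365.25 days.

```python
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three, four, and five nines.
for availability in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(availability)
    print(f"{availability}% -> {minutes:.2f} minutes per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why five nines is so much harder to reach than three.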

High Availability refers to a system or component that is continuously operational for a given length of time.

To help ensure high availability

Design for reliability

Detect failures as they occur

Eliminate single points of failure

Environments That Require Five Nines

Healthcare facilities require high availability to provide around-the-clock care for patients

The Public Safety industry includes agencies that provide security and services to a community, state, or nation

The Finance Industry needs to maintain high availability for continuous trading, compliance, and customer trust

The Retail Industry depends on efficient supply chains and the delivery of products to customers.
Disruption can be devastating, especially during peak demand times such as holidays.

Threats to Availability

There are many different types of threats to high availability. They range from the failure of a mission-critical application to a severe storm such as a hurricane or tornado.

Threats can also include catastrophic events such as a terrorist attack, a building bombing, or a building fire.

Designing a High Availability System

High availability incorporates three major principles to achieve the goal of uninterrupted access to data and services

System Resiliency

Fault Tolerance

Elimination or reduction of single points of failure

Asset Management

An Org needs to know what hardware and software assets it has in order to protect them.

Asset Management includes a complete inventory of hardware and software.
This means that the Org needs to know all of the components that can be subject to security risks

including

Every network device's OS

Every software application

Every hardware network device

All firmware

Every OS

All language runtime environments

Every hardware system

All individual libraries

Asset Classification

Asset Standardization

Threat Identification

Risk Analysis

Mitigation

Defense In Depth

Asset classification assigns all resources of an Org to groups based on common characteristics.

An Org should apply an asset classification system to documents, data records, data files, and disks

As part of an IT asset management system, an Org specifies the acceptable IT assets that meet its objectives

The United States Computer Emergency Readiness Team (US-CERT) and the U.S. Department of Homeland Security sponsor a dictionary of Common Vulnerabilities and Exposures (CVE).
A CVE identification contains a standard identifier number, a brief description, and references to related vulnerability reports and advisories.

Risk analysis is the process of analyzing the dangers posed by natural and human-caused events to the assets of an Org.

A user performs an asset ID to help determine which assets to protect

Mitigation involves reducing the severity of the loss or the likelihood of the loss occurring.

Many technical controls mitigate risk including authentication systems, file permissions, and firewalls

Defense in depth will not provide an impenetrable cyber shield, but it will help an Org minimize risk by keeping it one step ahead of cybercriminals.

In the Media Industry, the news cycle now runs around the clock, 24/7/365

Threat Categories

Sabotage

Hardware Failures

Software Attacks

Software Errors

Theft

Human Error

Utility Interruption

Natural Disasters

Steps to ID and Classify Assets

Step 1: Asset ID categories

Step 2: Asset Accountability

Step 3: Classification Schema Criteria

Step 4: Classification Schema Implementation

Info assets

ID the owner for all info assets

Confidentiality

Adopt a uniform way of identifying info to ensure uniform protection

Software assets

Physical assets

Services

ID the owner for all application software

Value

Time

Access Rights

Each CVE ID includes

A brief description of the security vulnerability

Any important references

The CVE ID number
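
The three parts of a CVE entry listed above map naturally onto a small record type. The following is a minimal sketch, not an official CVE schema: the class name, field names, and sample values are placeholders invented for illustration. The only external fact it relies on is the public CVE ID format, `CVE-` followed by a four-digit year and a sequence number of four or more digits.

```python
import re
from dataclasses import dataclass, field

# CVE IDs look like CVE-<4-digit year>-<4 or more digits>.
CVE_ID_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")


@dataclass
class CveEntry:
    """Hypothetical record holding the three parts of a CVE entry."""
    cve_id: str
    description: str
    references: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject identifiers that do not follow the CVE ID format.
        if not CVE_ID_PATTERN.match(self.cve_id):
            raise ValueError(f"not a valid CVE ID: {self.cve_id!r}")


entry = CveEntry(
    cve_id="CVE-2099-0001",  # placeholder, not a real CVE entry
    description="Placeholder description of a vulnerability.",
    references=["https://example.com/advisory"],  # placeholder URL
)
print(entry.cve_id, "-", entry.description)
```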

ID vulnerabilities and threats

Quantify the probability and impact of the identified threats

ID assets and their value

Balance the impact of the threat against the cost of the countermeasure
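
One common way to quantify probability and impact, and to weigh a threat against the cost of a countermeasure, is the annualized loss expectancy (ALE): the cost of a single occurrence multiplied by the expected number of occurrences per year. The notes above do not name this formula, so treat the sketch below as one possible approach. All names and figures in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    single_loss_expectancy: float     # estimated cost of one occurrence
    annual_rate_of_occurrence: float  # expected occurrences per year

    @property
    def annualized_loss_expectancy(self) -> float:
        """ALE = single loss expectancy x annual rate of occurrence."""
        return self.single_loss_expectancy * self.annual_rate_of_occurrence


def countermeasure_is_worthwhile(
    threat: Threat, ale_after: float, annual_cost: float
) -> bool:
    """True when the yearly loss avoided exceeds the yearly countermeasure cost."""
    savings = threat.annualized_loss_expectancy - ale_after
    return savings > annual_cost


# Hypothetical figures: a 20,000 loss expected once every two years.
disk_failure = Threat("server disk failure", 20_000, 0.5)
print(disk_failure.annualized_loss_expectancy)
print(countermeasure_is_worthwhile(disk_failure, ale_after=2_000, annual_cost=3_000))
```

A countermeasure that costs more per year than the loss it prevents fails the balance test in the last step above.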

Mitigation Strategies

Reduce the risk by designing a new business process with adequate built-in risk control and containment measures from the start

Avoid the risk altogether, which would include measures such as physically disconnecting from the Internet

Accept the risk, periodically reassess accepted risks in ongoing processes as a normal feature of business operations, and modify mitigation measures as needed

Transfer the risk to an external party (through a service level agreement or an insurance company)

To make sure data and info remain available, an Org must create different layers of protection

Simplicity

Layering

Obscuring

Limiting

Diversity

Provides the most comprehensive protection.

If cybercriminals penetrate one layer, they still have to contend with several more layers, each more complicated than the previous one

Layering is creating a barrier of multiple defenses that coordinate together to prevent attacks

Limiting access to data and info reduces the possibility of a threat.
An organization should implement the principle of least privilege

Diversity refers to changing the controls and procedures at different layers.

Breaching one layer does not compromise the whole system

An organization may use different encryption algorithms or authentication systems to protect data in different states

Obscuring info can also protect data and info.

An Org should not reveal any info that cybercriminals can use to figure out what version of the OS a server is running or the type of equipment it uses

Complexity does not necessarily guarantee security.
If a process or technology is too complex, misconfigurations or failures to comply can result.

Simplicity can actually improve availability

Redundancy

Single Points of Failure

Must be identified and addressed

Can be a specific piece of hardware, a process, a specific piece of data, or even an essential utility

Single points of failure are the weak links in the chain that can cause disruption of the Org's operations

Generally, the solution to a single point of failure is to modify the critical operation so that it does not rely on a single element

The Org can also build redundant components into the critical operations to take over the process should one of these points fail

[N+1]

Ensures system availability in the event of a component failure

Redundant array of independent disks (RAID)

Example: a car has four tires (N) and a spare tire in the trunk in case of a flat (+1)

RAID combines multiple physical hard drives into a single logical unit to provide data redundancy and improve performance

RAID takes data that is normally stored on a single disk and spreads it out among several drives. If any single disk is lost, the user can recover data from the other disks where the data also resides

RAID can also increase the speed of data recovery

Using multiple drives makes retrieving requested data faster, instead of relying on just one disk to do the work

A RAID solution can be either hardware-based or software-based.

RAID types

Parity - Detects data errors

Striping - Writes data across multiple drives

Mirroring - Stores duplicate data on a second drive
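
The three RAID techniques above can be imitated on plain byte strings. This is a toy sketch, not how a RAID controller or any particular RAID level is implemented; the function names and block size are chosen only for this example. It also shows a property the list does not spell out: XOR parity can rebuild a single lost block from the surviving blocks.

```python
from functools import reduce


def stripe(data: bytes, drives: int, block_size: int = 4) -> list[list[bytes]]:
    """Striping: deal fixed-size blocks across the drives in round-robin order."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    layout: list[list[bytes]] = [[] for _ in range(drives)]
    for index, block in enumerate(blocks):
        layout[index % drives].append(block)
    return layout


def mirror(data: bytes) -> tuple[bytes, bytes]:
    """Mirroring: keep an identical copy for a second drive."""
    return data, bytes(data)


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Parity: XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Pretend the drive holding the second block failed, then rebuild it
# from the surviving data blocks plus the parity block.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(rebuilt == data_blocks[1])
```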

Spanning Tree
A network protocol that provides for redundancy:

STP ensures that redundant physical links are loop-free.
It ensures that there is only one logical path between all destinations on the network

STP intentionally blocks redundant paths that could cause a loop

The basic function of STP is to prevent loops on a network when switches interconnect via multiple paths
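
The loop-prevention idea can be sketched as a graph problem: keep just enough links to reach every switch from a root, and block the rest. The code below is a deliberate simplification and not the real protocol. Actual STP elects the root bridge by lowest bridge ID, compares path costs, and blocks individual ports; here the root is passed in by hand, a breadth-first search picks the forwarding links, and all names are illustrative.

```python
from collections import deque


def block_redundant_links(switches, links, root):
    """Return (forwarding, blocked) link sets so that forwarding links form no loop."""
    neighbors = {switch: [] for switch in switches}
    for a, b in links:
        neighbors[a].append(b)
        neighbors[b].append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for peer in sorted(neighbors[current]):
            if peer not in visited:
                # First time this switch is reached: keep the link that got us here.
                visited.add(peer)
                forwarding.add(frozenset((current, peer)))
                queue.append(peer)

    # Every link that was not needed to reach a switch is redundant.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Three switches wired in a triangle: one link must be blocked.
forwarding, blocked = block_redundant_links(
    ["S1", "S2", "S3"],
    [("S1", "S2"), ("S2", "S3"), ("S1", "S3")],
    root="S1",
)
print(sorted(sorted(link) for link in blocked))
```

If a forwarding link later fails, rerunning the computation without it brings a previously blocked link back into service, which is where the redundancy comes from.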

Router Redundancy

The default gateway is typically the router that provides devices access to the rest of the network or to the Internet.
If there is only one router serving as the default gateway, it is a single point of failure.

Router redundancy involves choosing to install an additional standby router

The ability of a network to dynamically recover from the failure of a router acting as a default gateway is known as first-hop redundancy

Router Redundancy Options

Hot Standby Router Protocol (HSRP)

Virtual Router Redundancy Protocol (VRRP)

HSRP provides high network availability by providing first-hop routing redundancy

A VRRP router runs the VRRP protocol in conjunction with one or more other routers attached to a LAN.

In a VRRP configuration, the elected router is the virtual router master, and the other routers act as backups, in case the virtual router master fails
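
The master/backup relationship can be illustrated with a small election function. This is a simplified model rather than a protocol implementation: it assumes the router with the highest priority becomes the virtual router master and that ties go to the higher IP address, and it ignores advertisements, timers, and preemption settings. The class, field names, and addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass
class VrrpRouter:
    name: str
    priority: int
    address: IPv4Address
    alive: bool = True


def elect_master(routers: list[VrrpRouter]) -> Optional[VrrpRouter]:
    """Pick the live router with the highest priority (ties: higher address)."""
    candidates = [router for router in routers if router.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda router: (router.priority, router.address))


# Addresses come from a documentation range and are placeholders.
r1 = VrrpRouter("R1", priority=120, address=IPv4Address("192.0.2.1"))
r2 = VrrpRouter("R2", priority=100, address=IPv4Address("192.0.2.2"))

print(elect_master([r1, r2]).name)  # R1 has the higher priority

r1.alive = False                    # the master fails
print(elect_master([r1, r2]).name)  # the backup takes over
```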

Gateway Load Balancing Protocol (GLBP)

GLBP protects data traffic from a failed router or circuit, like HSRP and VRRP, while also allowing load balancing (also called load sharing) between a group of redundant routers

Location Redundancy

An Org may need to consider location redundancy depending on its needs.

Three forms of location redundancy

Synchronous

Asynchronous

Point-in-time replication

Syncs both locations in real time

Not synchronized in real time but close to it

Updates the backup data location periodically

System Resilience

Resiliency defines the methods and configurations used to make a system or network tolerant of failure.
Routing protocols provide resiliency.

Requires High bandwidth

Locations must be close together to reduce latency

Requires less bandwidth

Sites can be further apart because latency is less of an issue

Most bandwidth-conservative option because it does not require a constant connection
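
The difference between synchronous, asynchronous, and point-in-time replication comes down to when the backup location receives a change. The toy model below uses two in-memory dictionaries as the two locations; the class and method names are invented for this sketch and do not correspond to any replication product.

```python
class ReplicatedStore:
    """Toy model of a primary site and a backup site."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.backup: dict = {}
        self.pending: list = []  # changes not yet shipped to the backup site

    def write_synchronous(self, key, value) -> None:
        """Both sites are updated before the write is acknowledged."""
        self.primary[key] = value
        self.backup[key] = value

    def write_asynchronous(self, key, value) -> None:
        """The primary acknowledges at once; the backup catches up shortly after."""
        self.primary[key] = value
        self.pending.append((key, value))

    def ship_pending(self) -> None:
        """Background step that drains queued changes to the backup site."""
        for key, value in self.pending:
            self.backup[key] = value
        self.pending.clear()

    def write_local(self, key, value) -> None:
        """Point-in-time mode: writes touch only the primary."""
        self.primary[key] = value

    def point_in_time_copy(self) -> None:
        """Periodic job that replaces the backup with a snapshot of the primary."""
        self.backup = dict(self.primary)


store = ReplicatedStore()
store.write_synchronous("order-1", "paid")
store.write_asynchronous("order-2", "paid")
print("order-2" in store.backup)  # not there until the queue is shipped
store.ship_pending()
print("order-2" in store.backup)
```

The synchronous path waits on the remote site for every write, which is why it needs high bandwidth and nearby locations, while the periodic copy needs a connection only when the job runs.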

Resilient design is more than just adding redundancy. It is critical to understand the business needs of the organization, and then incorporate redundancy to create a resilient network.

Application Resilience

The application's ability to react to problems in one of its components while still functioning

Many Orgs balance out the cost of resiliency of application infrastructure with the cost of losing customers or business due to application failure

Application high availability is complex and costly

Availability Solutions

Cluster Architecture

Backup and Restore

Fault Tolerant Hardware

A system designed by building multiples of all critical components into the same computer

A group of servers that act like a single system

Copying files for the purpose of being able to restore them if data loss occurs

IOS Resilience

The Internetwork Operating System (IOS) for Cisco routers and switches includes a resilient configuration feature

Allows Faster Recovery

Maintains a secure working copy of the router IOS image file and a copy of the running config file

Response Phases

Response Technologies

Preparation

Detection and Analysis

Containment, Eradication, and Recovery

Post-Incident Follow-Up

Planning for potential incidents

Discovering the incident

Efforts to immediately contain or eradicate the threat and begin recovery efforts

Investigate the cause of the incident and ask questions to better understand the nature of the threat

NetFlow and IPFIX

Intrusion Prevention Systems (IPSs)

Intrusion Detection Systems (IDSs)

Advanced Threat Intelligence

Network Admission Control (NAC)

NAC allows network access for authorized users with compliant systems.
A compliant system meets all of the policy requirements of the Org

IDSs monitor the traffic on a network

An IPS operates in inline mode

NetFlow is a Cisco IOS technology that provides statistics on packets flowing through a Cisco router or multilayer switch.

Advanced threat intelligence can help an Org detect attacks during one of the stages of a cyberattack (and sometimes before, with the right info)

IDSs are passive

An IPS can detect and immediately address a network problem

The Internet Engineering Task Force (IETF) used Cisco's NetFlow Version 9 as the basis for IP Flow Information Export (IPFIX)

Orgs should have a response plan and a Computer Security Incident Response Team (CSIRT) to manage the response

CSIRT is responsible for

Ensuring its members know about the plan

Testing the plan

Maintaining the incident response plan

Getting management's approval of the plan

Orgs can have the best detection systems; however, if admins do not review the logs and monitor alerts, those systems are useless

Detection includes

Alerts and notifications

Monitoring and follow-up

Incident analysis helps to identify the source, extent, impact, and details of a data breach

May require additional downtime for systems

Questions include

What preventive measures need strengthening?

How can the Org improve system monitoring?

How can the Org minimize downtime during the containment, eradication, and recovery phases?

What actions will prevent the incident from reoccurring?

How can management minimize the impact to the business?

NAC evaluates an incoming device against the policies of the network

NAC can quarantine the systems that do not comply and manage the remediation of noncompliant systems

Common NAC system checks include

Operating system patches and updates

Complex password enforcement

Updated virus detection
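
The evaluate-then-quarantine behavior can be sketched as a posture check against the three items listed above. This is an illustration only; real NAC products gather this information through their own agents and policies, and every name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DevicePosture:
    """Hypothetical summary of what a NAC check learned about a device."""
    os_patched: bool
    complex_password_enforced: bool
    antivirus_updated: bool


def admission_decision(device: DevicePosture) -> tuple[str, list[str]]:
    """Return ("allow" or "quarantine", list of failed checks)."""
    failures = []
    if not device.os_patched:
        failures.append("operating system patches and updates")
    if not device.complex_password_enforced:
        failures.append("complex password enforcement")
    if not device.antivirus_updated:
        failures.append("updated virus detection")
    decision = "allow" if not failures else "quarantine"
    return decision, failures


laptop = DevicePosture(
    os_patched=True, complex_password_enforced=True, antivirus_updated=False
)
decision, failed_checks = admission_decision(laptop)
print(decision, failed_checks)
```

Returning the list of failed checks alongside the decision gives the remediation step something concrete to fix before the device is evaluated again.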

An IDS analyzes copies of the traffic rather than the actual forwarded packets

Working offline, it compares the captured traffic stream with known malicious signatures

An IDS is physically positioned in the network so that traffic must be mirrored in order to reach it

Network traffic does not pass through the IDS unless it is mirrored

Does not negatively affect the packet flow of the forwarded traffic
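
The passive, copy-based behavior described above can be shown in a few lines. In this sketch the signatures are made-up byte patterns and the function names are illustrative; a real IDS uses far richer rules. The point is that the original packet is forwarded unchanged while only a mirrored copy is inspected.

```python
# Hypothetical signatures for illustration only, not real IDS rules.
KNOWN_SIGNATURES = {
    "example-sql-injection": b"' OR 1=1",
    "example-path-traversal": b"../../",
}


def inspect_copy(packet_copy: bytes) -> list[str]:
    """Compare a mirrored copy of the traffic with the known signatures."""
    return [
        name
        for name, pattern in KNOWN_SIGNATURES.items()
        if pattern in packet_copy
    ]


def forward_and_mirror(packet: bytes, alerts: list[str]) -> bytes:
    """Forward the original untouched; the IDS only ever sees a copy."""
    mirrored = bytes(packet)
    alerts.extend(inspect_copy(mirrored))
    return packet  # forwarding does not depend on the inspection result


alerts: list[str] = []
delivered = forward_and_mirror(b"GET /search?q=' OR 1=1", alerts)
print(alerts)
```

Because the packet is returned regardless of the match, this models detection only. Blocking the packet would require an inline device, which is the role of an IPS.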

IPFIX is a standard for exporting router-based information about network traffic

IPFIX analysis benefits

Troubleshoots network failures quickly and precisely

Analyzes network flows for capacity planning

Secures the network against internal and external threats
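
The kind of per-flow statistics that NetFlow and IPFIX export can be approximated by grouping packets that share the same endpoints and protocol. The sketch below is not the IPFIX record format or any Cisco API. It assumes a simplified five-field flow key, and the addresses are placeholders from a documentation range.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packets into flows and count packets and bytes per flow.

    Each packet is a tuple:
    (src_ip, dst_ip, src_port, dst_port, protocol, size_in_bytes)
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, src_port, dst_port, protocol, size in packets:
        key = (src, dst, src_port, dst_port, protocol)  # simplified flow key
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


packets = [
    ("192.0.2.10", "192.0.2.20", 50000, 443, "tcp", 1500),
    ("192.0.2.10", "192.0.2.20", 50000, 443, "tcp", 500),
    ("192.0.2.11", "192.0.2.20", 50001, 53, "udp", 80),
]

for key, stats in aggregate_flows(packets).items():
    print(key, stats)
```

Totals like these feed the capacity planning and troubleshooting benefits listed above, since they show which hosts and services account for most of the traffic.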

Security Alerts

Components of a Cyberattack

Account lockouts

All database events

Asset creation and deletion

Configuration modification to systems

Delivery

Infrastructure

Victim

Motivation

Actor

Identity

Location

Forensics

Mechanism

Exploit

Malware

Domains

Operations

Servers

Monetary

Espionage

Politics

Role

Connections

Identity

Disaster Recovery Planning

Types of Disasters

Natural Disasters

Human-caused Disasters

It is critical to keep an organization functioning when a disaster occurs.
A disaster includes any natural or human-caused event that damages assets or property and impairs the ability of the organization to continue operating

Meteorological disasters include hurricanes, tornadoes, snowstorms, lightning and hail

Health disasters include widespread illnesses, quarantines, and pandemics

Miscellaneous disasters include fires, floods, solar storms, and avalanches

Geological disasters include earthquakes, landslides, volcanoes, and tsunamis

Social-political events include vandalism, blockades, protests, sabotage, terrorism, and war

Materials events include hazardous spills and fires

Labor events include strikes, walkouts, and slowdowns

Utilities disruptions include power failures, communication outages, fuel shortages, and radioactive fallout

Disaster Recovery Plan (DRP)

An Org puts its DRP into action while the disaster is ongoing and employees are scrambling to ensure critical systems are online

A DRP includes

Who is responsible for this process?

What does the individual need to perform the process?

Where does the individual perform this process?

What is the process?

Why is the process critical?

Implementing DR Controls

Disaster recovery controls minimize the effects of a disaster to ensure that resources and business processes can resume operations

There are three types of controls

Preventive Controls

Detective Controls

Corrective Controls

Keeping data backed up

Keeping data backups off-site

Using surge protectors

Installing generators

Using up-to-date antivirus software

Installing server and network monitoring software

Keeping critical documents in the disaster recovery plan

Business Continuity Planning

Need for Business Continuity

Business Continuity Considerations

Business Continuity Best Practice

Business continuity is one of the most important concepts in computer security. Even though companies do whatever they can to prevent disasters and loss of data, it is impossible to predict every scenario.

It is important for companies to have plans in place that ensure business continuity regardless of what may occur

Business continuity controls are more than just backing up data and providing redundant hardware

Considerations should include

Establishing alternate communications channels

Providing power

Identifying all dependencies for applications and processes

Documenting configurations

Understanding how to carry out automated tasks manually

  1. Write a policy that provides guidance to develop the business continuity plan and assigns roles to carry out the tasks
  2. Identify critical systems and processes, and prioritize them based on necessity
  3. Identify vulnerabilities and threats, and calculate risks
  4. Identify and implement controls and countermeasures to reduce risk
  5. Devise methods to bring back critical systems quickly
  6. Write procedures to keep the organization functioning when in a chaotic state
  7. Test the plan
  8. Update the plan regularly

A business continuity plan is a broader plan than a DRP

Getting the right people to the right places