Monitoring, Logging and Remediation
CloudWatch
CloudWatch Alarms
Key Concepts:
- Alarms: you can generate alarm from any metric, including estimated charges on your AWS bill (you need to enable billing alerts)
- Thresholds: static and anomaly detection, used to trigger alarms and actions to be taken if an alarm state is reached
- CloudWatch can be used to monitor your service quotas / limits and notify you if you are about to reach the limit (for a subset of services)
- AWS Health (the API behind the Personal Health Dashboard) can send events to EventBridge when your account's health changes; EventBridge can trigger a CloudWatch Alarm, which can trigger an Action (send an SNS notification, send a message to SQS, or invoke a Lambda function)
AWS Health -> EventBridge -> CW Alarm -> SNS/SQS/Lambda
CloudWatch alarm actions are one of the EventBridge built-in targets for AWS Health
- Composite alarms determine their states by monitoring the states of other alarms. You can use composite alarms to reduce alarm noise
- Use Case:
- An alarm sends SNS notification or executes an Auto Scaling policy if CPU utilization exceeds 90% on your EC2 for more than 5 minutes
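A minimal AWS CLI sketch of such an alarm (the instance ID, account ID and SNS topic ARN below are placeholders):
  aws cloudwatch put-metric-alarm \
    --alarm-name ec2-high-cpu \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Average --period 300 \
    --evaluation-periods 1 --datapoints-to-alarm 1 \
    --threshold 90 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts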
Creating Quota/Limits Alarm:
- Use CloudWatch alarms to be notified automatically whenever a specified quota reaches a percentage of the maximum or reaches the maximum. Once you have created an alarm, use the CloudWatch console to configure notifications
- Alarm threshold: from 50% to 100% of the applied quota value
- Alarm name: required
Creating CloudWatch Alarm:
- Metric
- Statistic - avg, sum, max, min, percentile, trimmed mean, ... more here
- Period - when evaluating the alarm, each period is aggregated into one data point
- Conditions
- Threshold type:
- Static: greater, greater/equal, lower/equal, lower
- Anomaly detection:
- Outside of the band, Greater than the band, Lower than the band
- Anomaly detection threshold: based on a standard deviation. Higher number means thicker band, lower number means thinner band
- Datapoints to alarm: (N out of M) define the number (N) of datapoints (periods) within the evaluation period (last M periods) that must be breaching to cause the alarm to go to ALARM state (more here)
- Missing data treatment:
- notBreaching – Missing data points are treated as "good" and within the threshold
- breaching – Missing data points are treated as "bad" and breaching the threshold
- ignore – The current alarm state is maintained
- missing – If all data points in the alarm evaluation range are missing, the alarm transitions to INSUFFICIENT_DATA.
- Notification
- Alarm state trigger (Define the alarm state that will trigger this action):
- In alarm: the metric or expression is outside of the defined threshold
- OK: the metric or expression is within the defined threshold
- Insufficient data: the alarm has just started or not enough data is available
- Send a notification to an SNS topic (new, existing or in another account); you can have multiple notifications
- Auto Scaling action: choose the Resource type EC2 ASG (simple or step scaling policy) or ECS Service, you can have multiple actions
- EC2 action: Stop, Terminate and Reboot; Recover is available only for certain EC2 instance types
- Ticket action:
- Systems Manager action:
- Create OpsItem: this will create an OpsItem within OpsCenter with the specified severity and category
- Create incident: this will start an incident using the response plan as a template
- Default EC2 host-level metrics: CPU, network, disk and status check
- Metrics are retained for up to 15 months (older data points are aggregated to lower resolution over time)
- You can retrieve data for an EC2 instance or ELB even after it has been terminated (within the retention period)
- By default EC2 sends metric data to CloudWatch in 5-minute intervals
- For an additional charge, you can enable detailed monitoring that sends metrics at 1-minute intervals
- For custom metrics, the default resolution is 1 minute and you can configure high-resolution metrics at 1-second intervals
- Dashboards are cross-Region (a single dashboard can display metrics from multiple Regions)
CloudWatch Agent
CloudWatch Agent CLI basic operations:
- Install: sudo yum install amazon-cloudwatch-agent
- If you store the configuration file locally: /opt/aws/amazon-cloudwatch-agent/bin/config.json
- If you store the configuration file in SSM Parameter Store answer yes when prompted in the wizard
- Create configuration file: sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
- Start:
- sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
- sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c ssm:configuration-parameter-store-name
- Stop: sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop
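- Check status (useful to verify the agent is running; same ctl script as above): sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status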
The unified CloudWatch agent enables you to do the following:
- Collect internal system-level metrics from Amazon EC2 instances across operating systems. The metrics can include in-guest metrics, in addition to the metrics for EC2 instances. The additional metrics that can be collected are listed here
- Collect system-level metrics from on-premises servers. These can include servers in a hybrid environment as well as servers not managed by AWS
- Custom metrics:
- You can define your own metrics using the AWS CLI or API (more here)
- Retrieve custom metrics from your applications or services using the StatsD (Linux/Windows) and collectd (Linux) protocols
- Collect logs from Amazon EC2 instances and on-premises servers, running either Linux or Windows Server
Key Facts:
- The default namespace for metrics collected by the CloudWatch agent is CWAgent, although you can specify a different namespace when you configure the agent
- Metrics collected by the CloudWatch agent are billed as custom metrics. For more information about CloudWatch metrics pricing, see here
- Supported on x86-64 and ARM64 (Amazon Linux 2, Ubuntu 18.04, 20.04, RHEL 7.6, SLES 15)
- Your Amazon EC2 instances must have outbound internet access to send data to CloudWatch or CloudWatch Logs (you must allow list the CloudWatch and CloudWatch Logs public endpoints for the appropriate Regions) or you can setup a CloudWatch / CloudWatch Log VPC endpoint powered by PrivateLink
- If you're using SSM to install the agent or Parameter Store to store your configuration file, you must allow list the SSM endpoints for the appropriate Regions
- You can send metrics and logs to a different AWS Account
Agent Installation:
- You can download and install the CloudWatch agent manually using the command line, or you can integrate it with SSM. In any case:
- Create IAM roles or users that enable the agent to collect metrics from the server and optionally to integrate with AWS Systems Manager
- Policies to use with the configuration file: CloudWatchAgentServerPolicy and AmazonSSMManagedInstanceCore (to install and configure with SSM)
- Policies to use to store the configuration in Parameter Store: CloudWatchAgentAdminPolicy and AmazonSSMManagedInstanceCore (to install and configure with SSM). The permissions for writing to Parameter Store provide broad access, so this role shouldn't be attached to all your servers and only administrators should use it. After you create the agent configuration file and copy it to Parameter Store, detach this role from the instance and use CloudWatchAgentServerRole instead.
- Download the agent package
- Modify the CloudWatch agent configuration file and specify the metrics that you want to collect
- Install and start the agent on your servers. As you install the agent on an EC2 instance, you attach the IAM role that you created in step 1. As you install the agent on an on-premises server, you specify a named profile that contains the credentials of the IAM user that you created in step 1
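A minimal config.json sketch for the "modify the configuration file" step above, assuming you only want memory and disk usage metrics plus one Apache log file (the file path and log group name are illustrative):
  {
    "metrics": {
      "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
      "metrics_collected": {
        "mem": { "measurement": ["mem_used_percent"] },
        "disk": { "measurement": ["used_percent"], "resources": ["/"] }
      }
    },
    "logs": {
      "logs_collected": {
        "files": {
          "collect_list": [
            {
              "file_path": "/var/log/httpd/access_log",
              "log_group_name": "apache-access",
              "log_stream_name": "{instance_id}"
            }
          ]
        }
      }
    }
  }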
Use SSM to install and configure CW agent:
- Download the CW agent using SSM:
- Run Command > Command document = AWS-ConfigureAWSPackage > Target (your selection) > Action= Install > Name = AmazonCloudWatchAgent > Version = latest
- Start the CW agent using SSM:
- Run Command > Command document = AmazonCloudWatch-ManageAgent > Target (your selection) > Action= Configure > Optional Configuration Source = ssm > Optional Configuration Location = name of the agent configuration file that you created and saved to SSM Parameter Store > Optional Restart list = yes
Common scenarios with the CloudWatch agent:
- Run the CloudWatch agent as a different user
- Add custom dimensions to metrics collected by the CloudWatch agent
- Use multiple CloudWatch agent configuration files
- Aggregate or roll up metrics collected by the CloudWatch agent
- Collect high-resolution metrics with the CloudWatch agent
Dimension
- Is a name/value pair that is part of the identity of a metric
- Whenever you add a unique name/value pair to one of your metrics, you are creating a new variation of that metric
- For example, many Amazon EC2 metrics publish InstanceId as a dimension name, and the actual instance ID as the value for that dimension
- You can assign up to 30 dimensions to a metric.
With the CloudWatch Agent:
- In custom metrics CLI:
- aws cloudwatch put-metric-data --dimensions MyName1=MyValue1,MyName2=MyValue2 (a complete example follows this block)
- aws cloudwatch get-metric-statistics --dimensions Name=MyName1,Value=MyValue1 Name=MyName2,Value=MyValue2
- In CloudWatch agent configuration JSON file:
- append_dimensions with only the following options:
"ImageID":"${aws:ImageId}", "InstanceId":"${aws:InstanceId}", "InstanceType":"${aws:InstanceType}", "AutoScalingGroupName":"${aws:AutoScalingGroupName}"
Aggregate Metrics
- You can aggregate statistics for your EC2 instances that have detailed monitoring enabled
- Instances that use basic monitoring are not included
- You can aggregate the metrics for AWS resources across multiple resources
- Metrics are completely separate between Regions, but you can use metric math to aggregate similar metrics across Regions
Metric Math
- Query multiple CloudWatch metrics and use math expressions to create new time series based on these metrics
- Lets you aggregate and transform metrics from multiple accounts and Regions
- Visualize the resulting time series on the CloudWatch console and add them to dashboards
- Anomaly detection on metric math lets you create anomaly detection alarms on single metrics and on the outputs of metric math expressions
- Example: divide the Lambda Errors metric by the Lambda Invocations metric to get an error rate
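A hedged sketch of that error-rate expression using GetMetricData (the function name and time range are illustrative):
  aws cloudwatch get-metric-data \
    --start-time 2024-01-01T00:00:00Z --end-time 2024-01-02T00:00:00Z \
    --metric-data-queries '[
      {"Id":"errors","MetricStat":{"Metric":{"Namespace":"AWS/Lambda","MetricName":"Errors","Dimensions":[{"Name":"FunctionName","Value":"my-function"}]},"Period":300,"Stat":"Sum"},"ReturnData":false},
      {"Id":"invocations","MetricStat":{"Metric":{"Namespace":"AWS/Lambda","MetricName":"Invocations","Dimensions":[{"Name":"FunctionName","Value":"my-function"}]},"Period":300,"Stat":"Sum"},"ReturnData":false},
      {"Id":"errorRate","Expression":"100 * errors / invocations","Label":"Error rate (%)"}
    ]'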
CloudWatch Logs
- Centralizes logs for applications (e.g. Apache logs), system logs (e.g. EC2) and AWS services (e.g. Route53, CloudTrail, ...)
- View, Search, Filter. Search based on error code and messages (e.g. 404 status in Apache logs)
- Notifications. Receive a notification whenever the rate of errors exceeds a threshold you specify
- Monitor Log Files - Monitor and troubleshoot your app using existing system and app log files
- Customize for your application - Monitor your logs in near real-time for specific phrases, values or patterns. To do this you need to use the CloudWatch Agent
Terminology:
- Log Event - Event message and timestamp
- Log Streams - Sequence of log events from the same source, e.g. an Apache log from a specific host. Must belong to a Log Group
- Log Group - Group Log Streams together, centrally manage retention, monitoring and access control settings. No limit on the number of log streams in a log group
Example: two EC2 instances running Apache; each instance sends its log events to its own log stream. These log streams belong to the same log group so that access control, retention, etc. can be managed centrally
Retention Settings:
- By default logs are kept indefinitely
- You can set the retention period from 1 day to 10 years
- Expired log events are automatically deleted
- Retention settings can be applied to an entire Log Group
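For example, retention can also be set from the CLI (the log group name is a placeholder):
  aws logs put-retention-policy --log-group-name my-app-logs --retention-in-days 30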
Metric Filter:
- Monitor events in a Log Group as they are sent to CloudWatch Logs. You can monitor and count specific terms or extract values from log events and associate the results with a metric. Filter for Warning, Errors, HTTP status codes, etc.
- Create Metric Filter at the Log Group level for any Log Stream
- Create Filter Pattern: when a metric filter matches a term, it increments the metric's count. For example, you can create a metric filter that counts the number of times the word ERROR occurs in your log events.
- Metrics details:
- Metric Namespaces: let you group similar metrics
- Metric Name: identifies this metric, and must be unique within the namespace
- Metric Value: is the value published to the metric name when a Filter Pattern match occurs. Valid metric values are: floating point number (1, 99.9, etc.), numeric field identifiers ($1, $2, etc.), or named field identifiers (e.g. $requestSize for delimited filter pattern or $.status for JSON-based filter pattern - dollar ($) or dollar dot ($.) followed by alphanumeric and/or underscore (_) characters).
- Default Value (optional): is published to the metric when the pattern does not match. If you leave this blank, no value is published when there is no match
- Unit (optional)
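Tying the fields above together, a minimal sketch of an ERROR-counting metric filter (log group and namespace names are hypothetical):
  aws logs put-metric-filter \
    --log-group-name my-app-logs \
    --filter-name ErrorCount \
    --filter-pattern 'ERROR' \
    --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0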
CloudWatch Logs Insights
- Interactive query and analysis for data stored in CloudWatch Logs
- Query and filter logs directly
- Generate visualization e.g. bar graph, line graph or pie chart
Run queries to:
- filter logs (with basic bar chart on top of results)
- create visualization charts (line, stacked area, bar and pie)
- export results in various formats (Markdown, CSV, JSON, XLSX), either by copying to the clipboard or downloading a table
- add your result to a dashboard
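A sample query, run from the CLI, that counts ERROR lines per 5-minute bin (the log group name is a placeholder; queries are asynchronous, so results are fetched separately):
  aws logs start-query \
    --log-group-name my-app-logs \
    --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | stats count() as errorCount by bin(5m)'
  aws logs get-query-results --query-id <query-id-returned-above>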
Examples:
- Lambda
- View latency statistics for 5-minute intervals
- Determine the amount of overprovisioned memory
- Find the most expensive requests
- VPC Flow Logs
- CloudTrail
- Common queries
- 25 most recently added log events
- Number of exceptions logged every 5 minutes
- Route53
- AWS AppSync
- NAT Gateway
CloudTrail
- Records user activity (AWS API) in your AWS Accounts
- Logs both Console and AWS CLI activity, but not SSH or RDP sessions
- Enabled by default
- Supports almost every AWS service; unsupported services (listed here) mostly lack public APIs
- Logs: who, when, what, where, source IP, parameters and response
- By default Event History keeps log for 90 days
- Organization trail logs all events for all AWS accounts in an organization (must be created in the management account)
- The Enable for all accounts in my organization option is available only from the management account or a delegated administrator account
- Choose a bucket belonging to any account, but the bucket policy must grant CloudTrail permission to write to it
Use Case:
- Incident Investigation: after-the-fact investigation
- Security Analysis: near-real-time security analysis of user activity
- Compliance: can be used to help you meet industry, regulatory compliance and audit requirements
Keeping logs longer than 90 days:
- Create a trail: when you create a trail in the console, logs are saved indefinitely to an S3 bucket
- Secure by Default: Encrypted using Server Side Encryption. Log integrity validation means logs are digitally signed, so you can detect if a log was changed or deleted
- All Regions: a trail created in the console applies to all Regions by default (recommended); use the AWS CLI if you need to log events in a single Region
Near Real-Time:
- After making an API call, it can take up to 15 minutes for the call to appear in CloudTrail
- CloudTrail publishes logs to S3 approximately every 5 minutes
- Overall it can take between 15 to 20 minutes for a call to appear in the logs
Creating a Trail:
- Storage location: New or Existing S3 bucket
- Log file encryption is Enabled by default with SSE-KMS, and you need to provide a New or Existing KMS key. If disabled, SSE-S3 is used (with SSE-KMS the log files are encrypted, but the digest files are still encrypted with SSE-S3)
- Log file validation is Enabled by default
- SNS notification is Not Enabled by default; when Enabled, it requires a New or Existing SNS topic. This provides a notification for every log file delivery, not for every event
- CloudWatch Logs: monitor your trail logs and get notified when specific activity occurs; Not Enabled by default, when Enabled it requires a New or Existing Log Group
- CloudTrail supports sending data, CloudTrail Insights, and management events to CloudWatch Logs
- You can then define metric filters and alerts
- Events:
- Management events: capture management operations performed on your AWS resources (options: Read, Write, Exclude AWS KMS events, Exclude Amazon RDS Data API events)
- Data events: log the resource operations performed on or within a resource (data plane operations)
- DynamoDB: PutItem, DeleteItem, and UpdateItem on Table
- S3: GetObject, DeleteObject, and PutObject on buckets and objects in buckets
- AWS Lambda function execution activity (the Invoke API)
- Insights events: identify unusual write activity, errors, or user behavior in your account (API call rate, API error rate)
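A hedged CLI sketch of creating a multi-Region trail with log file validation (trail and bucket names are placeholders; the bucket policy must already grant CloudTrail write access):
  aws cloudtrail create-trail \
    --name my-org-trail \
    --s3-bucket-name my-cloudtrail-logs-bucket \
    --is-multi-region-trail \
    --enable-log-file-validation
  aws cloudtrail start-logging --name my-org-trail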
AWS Config
Dashboards:
- Resource Inventory: for AWS and non-AWS resources
- Compliance Status:
- Rules (compliant and non-compliant)
- Resources (compliant and non-compliant)
- Noncompliant rules by noncompliant resource count
Example Use Cases:
- EC2 must not have public IP, discovers noncompliant instances, perform automatic remediation (e.g. stop the non-compliant instance)
- Configuration Monitoring: continuous monitoring (you can also trigger rule re-evaluation manually)
- Desired State: the configuration that resources are evaluated against
- Notification: sends an event to EventBridge (default) and can notify via SNS (e.g. as a remediation action) if a resource deviates from the desired state
- Automatic Remediation: triggers an action (via SSM) that you define against noncompliant resources
- Change History: stored in an S3 bucket that is created for you
- Integrated with:
- IAM
- EC2
- EBS
- ELB
- CloudFormation
- CloudFront
- CloudTrail
- KMS
- RDS
- S3
- Security Groups
- SNS
- VPC
Terminology:
- Rule: the desired configuration for a specific resource
- Managed Rules: 180 AWS-provided managed rules for pre-defined common best practices (a CLI sketch of enabling one follows this list). Examples:
- s3-bucket-public-read-prohibited
- desired-instance-type
- cloud-trail-encryption-enabled
- ec2-ebs-encryption-by-default
- required-tags
- Conformance Pack: a set of rules and remediation actions that can be deployed and managed as one. Examples:
- Operational Best Practices for S3
- Operational Best Practices for EC2
- Operational Best Practices for IAM
- Operational Best Practices for PCI DSS
- Operational Best Practices for AWS Well-Architected Framework Security Pillar
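A minimal sketch of enabling the required-tags managed rule from the CLI (the tag key used as an input parameter is illustrative):
  aws configservice put-config-rule --config-rule '{
    "ConfigRuleName": "required-tags",
    "Source": { "Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS" },
    "InputParameters": "{\"tag1Key\": \"Environment\"}"
  }'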
Configure Rule:
- Evaluation Mode: determines when resources will be evaluated
- Proactive evaluation: pre-provisioning
- Detective evaluation: post-provisioning
- Both
- Trigger type:
- When configuration changes: evaluates resources when a matching resource changes
- Periodic: evaluates resources at a frequency that you choose (e.g. every 24 hours)
- Scope of changes:
- All changes
- Resources (Resource type, identifier)
- Tags
- Parameters: (key and value) define attributes for which your resources are evaluated; for example, a required tag or S3 bucket
- General settings (Recorder)
- Resource types to record
- Record all current and future resources supported in this region (optionally include global resources)
- Record all current and future resource types with exclusions
- Record specific resource types
- Data retention period
- Retain AWS Config data for 7 years (2557 days)
- Set a custom retention period for configuration items recorded (30 days - 7 years)
- AWS Config role
- Use an existing AWS Config service-linked role
- Choose a role from your account
- Delivery method
- Amazon S3 bucket (New, Existing, other Account)
- Amazon SNS topic (New, Existing, other Account) - Stream configuration changes and notifications to an Amazon SNS topic
- CloudWatch Events - AWS Config sends detailed information about the configuration changes and notifications to Amazon CloudWatch Events
Remediation with SSM
- Remediation actions are implemented as SSM Automation documents (runbooks) that AWS Config executes against noncompliant resources
- You can use AWS-owned Automation documents or your own custom documents
Other Remediation Action (under the hood still use SSM)
- Notification publishing to SNS topic
- Delete unused resources (e.g. EBS, Elastic IP, SG, ...)
- Enable Encryption on a S3 buckets
- Disable Public Access for a Security Group
Remediation Action
- Select rule > Action > Manage remediation
- Select remediation method
- Automatic remediation: the remediation action gets triggered automatically when the resources in scope become noncompliant
- Manual remediation: manually choose to remediate the noncompliant resources
- Remediation action details: the execution of remediation actions is achieved using SSM
- Rate Limits: specify the percentage of resources against which SSM documents are executed at a time and also the percentage of failed SSM executions for which the entire batch is marked as failed
- Parameter: each parameter has either a static value or a dynamic value. If you choose a parameter from the Resource ID dropdown list, the RESOURCE_ID value is passed to the selected parameter. You can enter values for all the other keys. If you do not choose a parameter from the Resource ID dropdown list, you can enter values for each key. Example: The ARN of the role that allows Automation to perform the actions on your behalf
- Resource ID parameter: pass the resource ID of noncompliant resources to a remediation action by choosing a parameter that is dependent on the resource type. The parameters available in the dropdown list depend on the selected remediation action
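Putting the remediation settings above into a CLI call, a hedged sketch (the SSM document, role ARN and parameter names are assumptions based on the AWS-owned AWS-DisableS3BucketPublicReadWrite runbook):
  aws configservice put-remediation-configurations --remediation-configurations '[{
    "ConfigRuleName": "s3-bucket-public-read-prohibited",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-DisableS3BucketPublicReadWrite",
    "Automatic": true,
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
      "AutomationAssumeRole": { "StaticValue": { "Values": ["arn:aws:iam::123456789012:role/ConfigRemediationRole"] } },
      "S3BucketName": { "ResourceValue": { "Value": "RESOURCE_ID" } }
    }
  }]'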
Aggregator
An aggregator is an AWS Config resource type that collects AWS Config configuration and compliance data from the following:
- Multiple accounts and multiple regions
- Single account and multiple regions
- An organization in AWS Organizations and all the accounts in that organization which have AWS Config enabled
- Use an aggregator to view the resource configuration and compliance data recorded in AWS Config
- Aggregators provide a read-only view into the source accounts and regions that the aggregator is authorized to view
- Aggregators do not provide mutating access into the source account or region. For example, this means that you cannot deploy rules through an aggregator or pull snapshot files from the source account or region through an aggregator
Multi-account and Multi-region data aggregation in AWS Config allows you to aggregate AWS Config configuration and compliance data from multiple accounts and regions into a single account. Useful for central IT administrators to monitor compliance for multiple AWS accounts in the enterprise
EventBridge
Scheduled Event: used to run a Rule on a schedule (e.g. reboot an instance at the same time every month)
CloudWatch Event is now EventBridge. EventBridge is the preferred way to manage your events. CloudWatch Events and EventBridge are the same underlying service and API, but EventBridge provides more features. Changes you make in either CloudWatch or EventBridge will appear in each console
Event-driven architecture (an event = a change in state). Various AWS services send events to EventBridge; Rules match these events and route them to Targets, which take Actions (e.g. shut down an EC2 instance that is marked noncompliant, trigger a Lambda function in response to an event, or send an SNS notification when a certain event is found in CloudTrail)
Use Case (not good examples):
- AWS Config detects EC2 with unencrypted EBS volume. An event is generated and sent to EventBridge, which triggers a Rule that invokes an action to send you an email using SNS
- CloudWatch detects an EC2 with 99% CPU Utilization, an event is generated and sent to EventBridge, which triggers a rule that invokes an action to send you an email using SNS
Key Concepts
EventBridge Rule
- A rule matches incoming events and routes them to targets for processing
Rule detail
- Event Bus (default vs custom): receives events from resources that emit events: 1) AWS services in your account or in other accounts, 2) SaaS partner services and applications, and 3) your own custom applications. When an event bus receives an event, EventBridge checks whether the event matches the conditions of the rules associated with that event bus
- Rule type
- Rule with an event pattern
- Schedule
- Cron expression: a fine-grained schedule that runs at a specific time, such as 8:00 a.m. PST on the first Monday of every month
- Rate expression: a schedule that runs at a regular rate (Minutes, Hours, Day), such as every 10 minutes
- Event source
- AWS events or EventBridge partner events - Events sent from AWS services or EventBridge partners
- Other - custom events or events sent from more than one source, e.g. events from AWS services and partners
- All events - all events sent to your account
- Creation method - Event Pattern
- Use schema - use an Amazon EventBridge schema to generate the event pattern
- Select schema from Schema registry
- Enter schema
- Use pattern form - use a template provided by EventBridge to create an event pattern
- Event source - AWS service or EventBridge partner as source
- AWS service - the name of the AWS service as the event source
- Event type - the type of events as the source of the matching pattern
- Custom pattern (JSON editor) - write a pattern in JSON (see the CLI sketch after this list)
- Target(s)
- Target types
- EventBridge event bus
- EventBridge API destination - HTTP endpoints that you can invoke
- AWS service
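A CLI sketch of the rule detail above: a rule with an event pattern plus an SNS target (rule name and topic ARN are placeholders):
  aws events put-rule \
    --name ec2-stopped \
    --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"]}}'
  aws events put-targets \
    --rule ec2-stopped \
    --targets 'Id=1,Arn=arn:aws:sns:us-east-1:123456789012:ops-alerts'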
EventBridge Pipes
- Pipes provide point-to-point integrations between an event source and a target, with optional filtering and enrichment
- Sources: receive events from a variety of sources, including DynamoDB, Kinesis, and SQS
- Filtering: define an event pattern to filter the events that are sent through the pipe
- Enrichment: transform your event or pull additional data into it using Lambda, Step Functions, or an API
- Target: send your event to an AWS service, an event bus, or an API destination
EventBridge Schedule
- A schedule invokes a target one-time or at regular intervals defined by a cron or rate expression
- A new EventBridge scheduling functionality that provides one-time and recurring scheduling functionality independent of Event buses and rules. You can create a schedule to invoke targets such as a Lambda function
Schedule pattern:
- One-time schedule
- Recurring schedule (Cron vs Rate such as Scheduled Rules)
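A hedged sketch using the EventBridge Scheduler CLI (the target Lambda ARN and role ARN are placeholders; the role must allow Scheduler to invoke the target):
  aws scheduler create-schedule \
    --name nightly-report \
    --schedule-expression 'cron(0 2 * * ? *)' \
    --flexible-time-window '{"Mode": "OFF"}' \
    --target '{"Arn": "arn:aws:lambda:us-east-1:123456789012:function:nightly-report", "RoleArn": "arn:aws:iam::123456789012:role/SchedulerInvokeRole"}'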
Rules:
- A single rule can route to up to 5 targets, all of which are processed in parallel
- A rule can customize an event before it is sent to the target, by passing only certain parts or by overwriting it with a constant
Targets:
- There are over 15 AWS services available as event targets including: Lambda, SQS, SNS, Kinesis Streams, Kinesis Data Firehose, SSM, Step Function
- To deliver event data to a target, EventBridge needs permission to access the target resource
EventBridge Archive and Replay Events
- create an archive of events so that you can easily replay them at a later time
- determine which events are sent to the archive by specifying an event pattern
- retention period
Fail to send events to Target:
- Retry: by default, when an event is not delivered to the target, EventBridge retries sending it for up to 24 hours and up to 185 times
- Dead-letter queue: to avoid losing events after they fail to be delivered to a target, configure a dead-letter queue and send all failed events to it
AWS Health Dashboard
- Service health - view the current and historical status of all AWS services
- Open and recent issues - view open and recently resolved issues affecting AWS services
- Service history - any AWS service issue in the last 12 months
- (Personal) Your account health - important events affecting your AWS resources and account
- Open and recent issues
- Scheduled changes - upcoming events and ongoing events from the past seven days that might affect your AWS infrastructure, such as scheduled maintenance activities
- Other notifications - ongoing events from the past seven days that might affect your AWS account, such as certificate rotations, billing notifications, and security vulnerabilities
- Event log
Organization Health - use the AWS Health Dashboard to get a centralized view for health events in your AWS organization
Integrations:
- Amazon EventBridge
- AWS Health Aware (customize AWS Health Alerts for Organizational and Personal AWS Accounts)
Service to Review:
- CloudWatch
- CloudTrail
- SSM [Session Logging?]
- EventBridge
- Config
- Health Dashboard