Containers - Coggle Diagram
Containers
ECS
Cluster
Infrastructure
- Fargate Only (Common workloads)
- Fargate and Managed Instances (Advanced workloads)
- ECS manages patching and scaling
- Configurable instance types
- Key settings:
- Instance profile: IAM Role required to access AWS services (ECS, ECR, CloudWatch Logs, SSM)
- Infrastructure role: IAM Role to manage the Amazon ECS Managed Instances lifecycle (launch/terminate instance, apply patches, etc.)
- Instance Selection:
- ECS Default (Recommended): ECS chooses instance types based on the ECS Task Definition and ECS Service requirements
- Use custom: you specify instance attributes (vCPU, RAM, etc.) or exact instance types
- Fargate and Self-managed instances
- you have full control over the instances
- you patch and scale instances
- Create / Use existing ASG (with on-demand or spot instances)
- EC2 instance role
Scaling
ASG
- You manage standard ASG (not container aware)
- EC2 only
ECS Cluster Capacity Provider
- Controls underlying ASG
- Uses the specialized CapacityProviderReservation metric to make scaling decisions
- Prevents the ASG from terminating an instance that is still running active ECS tasks during a scale-in event
- Types:
- FARGATE
- FARGATE_SPOT (AWS can terminate your tasks with a 2-minute warning if they need the capacity back)
- ASG
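The CapacityProviderReservation behavior above can be sketched in a few lines. This is a simplified model of the published formula (required instances / running instances * 100), not the exact ECS implementation:

```python
import math

def capacity_provider_reservation(required_instances: int, running_instances: int) -> float:
    """CapacityProviderReservation = required / running * 100.
    Above 100: not enough instances, scale out.
    Below 100: idle capacity, scale in."""
    if running_instances == 0:
        # Per the documented edge case, the metric reports 200 when
        # tasks need capacity but no instances are running.
        return 200.0 if required_instances > 0 else 100.0
    return required_instances / running_instances * 100

def desired_instances(required_instances: int, target_capacity: int = 100) -> int:
    # Target tracking drives the metric toward targetCapacity:
    # 100 = pack instances fully, 80 = keep ~20% spare headroom.
    return math.ceil(required_instances * 100 / target_capacity)
```

For example, if 12 instances are required to place all tasks but only 10 are running, the metric reads 120 and target tracking (at a target of 100) grows the ASG to 12.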
Storage
- Host-Centric (legacy):
- attach EBS volume / mount EFS to EC2 instance, and then use "Bind Mounts" in your task to point to a specific folder on that host
- Problem:
- Your task became "locked" to that specific EBS/instance. If the instance died, the task couldn't move easily
- If you had 10 different EFS filesystems for 10 different apps, your EC2 hosts became cluttered with mount points
- Task-Centric:
- You define the volume inside the Task Definition
- When ECS schedules the task, it handles the "heavy lifting" of attaching and mounting the storage to whichever EC2 instance happens to be running the task
Monitoring
CloudWatch Agent
CW Agent Deployment Models
- Sidecar (only option for Fargate Launch)
- Fargate/EC2
- Every task has its own agent container. Best for task-level metrics and Fargate workloads
- Daemon (best for EC2 Launch)
- EC2 Only
- One agent per EC2 instance. It collects metrics from all tasks on that host.
- Most cost-effective for large EC2 clusters.
Sidecar:
- IAM Role:
- Task Role: Needs CloudWatchAgentServerPolicy to allow the agent to send metrics and logs while running
- Task Execution Role: Needs CloudWatchAgentServerPolicy and ssm:GetParameters to pull agent config from SSM Parameter Store
- Store agent config in SSM Parameter Store
- Update the Task Definition
- Add the CloudWatch agent as a second container in your Task Definition
Daemon:
- IAM Role:
- Task Role: Attach the managed policy CloudWatchAgentServerPolicy. This allows the agent to push data to CloudWatch
- Task Execution Role: Needs CloudWatchAgentServerPolicy and ssm:GetParameters to pull agent config from SSM Parameter Store
- Store agent config in SSM Parameter Store
- Create the "Daemon" Task Definition
- Deploy as an ECS Service with "Daemon" Strategy (instead of Replica)
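The Daemon deployment step above maps to a create_service call with the DAEMON scheduling strategy. A minimal hedged sketch of the boto3 parameters (cluster, service, and task definition names are hypothetical placeholders):

```python
# Parameters for ecs.create_service(**daemon_service); with the
# DAEMON strategy ECS runs exactly one agent task per container
# instance, so no desiredCount is specified.
daemon_service = {
    "cluster": "my-cluster",                 # hypothetical cluster name
    "serviceName": "cwagent-daemon",         # hypothetical service name
    "taskDefinition": "cwagent-daemon-task", # the Daemon task definition
    "schedulingStrategy": "DAEMON",          # one task per EC2 instance
    "launchType": "EC2",                     # DAEMON is EC2-only
}
```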
Task
Task Definition - key configurations:
- Launch Type (Fargate, Managed Instances, self-managed EC2 instances); you can select multiple
- OS/Architecture
- Network Mode:
- awsvpc:
- Provides the Task with an ENI, you need to specify VPC, Subnet, SG
- Required for Fargate
- bridge:
- Uses Docker's built-in virtual network
- The bridge is an internal network namespace that allows containers connected to the same bridge to communicate.
- The bridge provides isolation from containers that aren't connected to the same bridge network
- You use static or dynamic port mappings to map container ports to host ports
- host:
- Task bypass Docker's built-in virtual network
- Maps container ports directly to the ENI of EC2 instance
- Can't run multiple instances of the same task on a single EC2 instance when port mapping is enabled
- none:
- Task has no external network connectivity
- default:
- Windows only
- Uses Docker's built-in virtual network mode on Windows
- Task size (CPU and Memory)
- Task role: IAM Role used by Task to access AWS Services
- Task execution role: IAM Role used by the ECS Agent (on EC2) or Fargate to make API's call on your behalf
- Task placement constraints
- Container details:
- Image URI
- Port Mapping:
- awsvpc: Container Port, Protocol (TCP/UDP), App Protocol (HTTP, GRPC, ...)
- bridge: Host Port, Container Port, Protocol, App Protocol
- host: Container Port, Protocol (TCP/UDP), App Protocol
- Environment variables:
- plain text, SSM Parameter Store, Secrets Manager
- Add from file - environment variables in bulk from environment file in S3
- Logging
- awslogs (default): sends container logs to CloudWatch Logs group /ecs/<task-definition-name>
- awsfirelens (modern choice)
- Route to: S3, Kinesis, Firehose, OpenSearch and custom destinations
- Provides JSON transformation, filtering, route to multiple destination, buffering
- Need to add a Fluent Bit (recommended) or Fluentd container in your Task definition
- With Fluent Bit (recommended) / Fluentd, the ECS Agent automatically generates a configuration file and "links" your app and log-router containers
- sidecar that intercepts your container's stdout and stderr
- splunk
- Third-Party & Legacy Drivers (EC2 Only):
- syslog / journald / gelf / fluentd / json-file
- Storage
- Ephemeral storage:
- Fargate and EC2 launch type
- Fargate provides 20GiB (default) up to 200GiB
- Data Volume
- Configure at task definition creation
- Bind Mount: backed by host storage (e.g. the instance's EBS volume); with Fargate you cannot configure the Host Path; allows sharing among a Task's containers
- EFS
- Configure at deployment
- Configure 1 EBS Volume when creating or updating a service
Bind Mount
- Bind Mount is a way to map a specific directory from the "host" into your container
- EC2 Launch:
- maps a path in EC2 to a path in container
- persist on EC2 after Task end
- Use cases: shared storage among containers, sending logs, fast local caching
- Fargate Launch:
- Temporary shared "scratch pad" that exists only for the life of the task
- Sharing files between containers in the same task
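The Task Definition settings above come together in the RegisterTaskDefinition payload. A minimal hedged sketch for a Fargate task (family, image URI, role ARNs, and log group are hypothetical placeholders):

```python
import json

# Illustrative parameters for ecs.register_task_definition(**task_definition).
task_definition = {
    "family": "web-app",                     # hypothetical family name
    "requiresCompatibilities": ["FARGATE"],  # launch type
    "networkMode": "awsvpc",                 # required for Fargate
    "cpu": "256",                            # task size: 0.25 vCPU
    "memory": "512",                         # task size: 512 MiB
    "taskRoleArn": "arn:aws:iam::123456789012:role/app-task-role",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecs-exec-role",
    "containerDefinitions": [{
        "name": "web",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
        # awsvpc mode: only the container port is mapped
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "logConfiguration": {
            "logDriver": "awslogs",  # ship stdout/stderr to CloudWatch Logs
            "options": {
                "awslogs-group": "/ecs/web-app",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "web",
            },
        },
    }],
}

print(json.dumps(task_definition, indent=2))
```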
Troubleshooting
Task Issues
You run the task and the task displays a PENDING status and then disappears
- Application or configuration errors
- Check ECS service events: ECS stops and replaces a task when its containers have:
- Stopped running
- Failed too many load balancer health checks
Task is stuck in the PENDING state
- Container agent cannot download the Docker image from ECR or Docker Hub
- No Route to ECR: If your task is in a private subnet, it needs a NAT Gateway or ECR VPC Endpoints to pull images
- Security Groups: Ensure your task's security group allows outbound HTTPS (Port 443) traffic to reach the image repository
- IAM Permission Issues (Task Execution Role)
- Missing ECR Permissions: The role must have ecr:GetDownloadUrlForLayer and ecr:BatchGetImage
- Secrets Manager or SSM Parameter: If Task definition pulls environment variables from Secrets Manager or SSM Parameter Store, the Execution Role must have permissions to read those specific secrets. If it can't, the task stays PENDING
Service Auto Scaling
- Increasing or decreasing the desired_count of tasks in your ECS Service.
- It uses Application Auto Scaling and offers three main strategies:
- Target Tracking
- Step Scaling
- Scheduled Scaling
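Because ECS Service Auto Scaling rides on Application Auto Scaling, enabling Target Tracking is two API calls: register the service's DesiredCount as a scalable target, then attach a policy. A hedged sketch of the boto3 parameter dicts (cluster/service/policy names are hypothetical):

```python
# Parameters for application_autoscaling.register_scalable_target(**scalable_target)
scalable_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/my-cluster/my-service",  # hypothetical names
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": 2,
    "MaxCapacity": 10,
}

# Parameters for application_autoscaling.put_scaling_policy(**target_tracking_policy)
target_tracking_policy = {
    "PolicyName": "cpu-target-tracking",
    "ServiceNamespace": "ecs",
    "ResourceId": "service/my-cluster/my-service",
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 60.0,  # keep average service CPU near 60%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # seconds before another scale-out
        "ScaleInCooldown": 120,   # scale in more conservatively
    },
}
```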
EKS
Infrastructure
EKS Cluster (Control Plane)
- Cluster IAM Role
- Node IAM Role
- VPC
EKS Nodes
EKS Auto Mode
- Auto Mode handles the entire lifecycle of EC2 instances
- Based on integrated AWS managed Karpenter (no ASG)
- Bottlerocket is the only supported OS
Managed Node Groups
- One or more Amazon EC2 instances running the latest EKS-optimized AMIs
- All nodes provisioned as part of an ASG
- Use launch templates to customize the configuration
Self-Managed Nodes
- Nodes created by you and registered to the EKS cluster and managed by an ASG
- You can use prebuilt AMI - Amazon EKS Optimized AMI
Fargate
- Each Pod runs in its own isolated compute environment (a micro-VM)
- You cannot run DaemonSets
Data Volume
- Leverages a Container Storage Interface (CSI) compliant driver
- The CSI driver acts as the "plug-in" between EKS and AWS storage
- Need to apply a StorageClass manifest to your EKS cluster
Supports for:
- EBS
- EFS
- FSx for Lustre (HPC/AI)
- FSx for NetApp ONTAP (Enterprise)
- S3 (S3 CSI Driver)
How Volumes are Provisioned - K8s uses three main abstractions to manage these volumes:
- PersistentVolume (PV): The physical storage resource in your AWS account
- PersistentVolumeClaim (PVC): A "request" for storage made by a pod. Specify size and access mode, and K8s finds or creates a PV to match
- StorageClass (SC): The "template" for storage. It defines which CSI driver to use and parameters like EBS volume type (gp3, io2)
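The SC/PVC pairing above can be sketched as two manifests, here written as Python dicts of the kind you would pass to the Kubernetes Python client (names and parameters are illustrative, not from the source):

```python
# StorageClass: tells K8s to provision gp3 EBS volumes via the EBS CSI driver.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "gp3-sc"},          # hypothetical name
    "provisioner": "ebs.csi.aws.com",        # EBS CSI driver
    "parameters": {"type": "gp3"},
    # Delay volume creation until a pod is scheduled, so the EBS
    # volume lands in the same AZ as the node running the pod.
    "volumeBindingMode": "WaitForFirstConsumer",
}

# PersistentVolumeClaim: a pod-side "request" that K8s satisfies by
# dynamically creating a matching PV from the StorageClass.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data-claim"},      # hypothetical name
    "spec": {
        "accessModes": ["ReadWriteOnce"],    # EBS: single-node access
        "storageClassName": "gp3-sc",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
```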
EKS Auto Mode (storage management is greatly simplified)
- Built-in CSI Drivers: (EBS or EFS CSI drivers) You don't need to manually install or patch CSI drivers; AWS manages them as part of the cluster lifecycle
- Managed EBS: When your pod needs an EBS volume, EKS Auto Mode ensures the node it launches is in the same Availability Zone as the volume, preventing the common "Multi-AZ scheduling" error
Storage options for Fargate:
- Ephemeral
- EFS
- S3 (S3 CSI Driver)
AutoScaling Modes
- Cluster AutoScaler (legacy)
- Automatically adjusts the number of nodes (same size) in your cluster using ASG
- Karpenter
- Launching right-sized compute resources (EC2 instances, Fargate) in response to load changes in < 1 minute
- Changes instance-types
- Auto Mode (AWS-managed Karpenter)
- Automatically scales cluster compute resources, with the ability to consolidate workloads and delete nodes
- Changes instance-types
Monitoring
Control Plane Logging
- Logs are not enabled by default (except in Auto Mode)
- When you enable control plane logging, the logs are sent to CloudWatch Logs
- EKS Control Plane Log Types
- API Server (api)
- Audit (audit)
- Authenticator (authenticator)
- Controller Manager (controllerManager)
- Scheduler (scheduler)
- Ability to select the exact log types to send to CloudWatch Logs
Nodes Logging
- Node logs provide visibility into the health of your EC2 instances (Kubelet, Container Runtime, OS system logs)
- Managed & Self-Managed Nodes (EC2):
- Use the CloudWatch Agent or Fluent Bit deployed as a DaemonSet
- EKS Auto Mode:
- AWS automatically collects system logs from the Bottlerocket OS and sends them to CloudWatch
- No need to install or manage an agent
- Fargate:
- Node logs are not accessible because the nodes are fully managed and abstracted away by AWS
Container Logging
EC2/Auto Mode (Recommended)
- AWS for Fluent Bit
- The industry standard is to run Fluent Bit as a DaemonSet
Fargate
- Fargate Built-in Log Router
- You cannot run DaemonSets on Fargate
- AWS provides a built-in log router based on Fluent Bit
- No need of sidecar container in your Pods
CloudWatch Observability Add-on
- AWS offers a consolidated EKS Add-on that installs the CloudWatch Agent and Fluent Bit
- It sets up "Container Insights," which gives you pre-built dashboards in CloudWatch that correlate your logs with CPU/Memory
- EKS Cluster Insights – detect issues and provide recommendations
- Configuration Insights – identifies misconfiguration in your EKS Cluster (hybrid)
- Upgrade Insights – identifies issues that could impact your ability to upgrade to new Kubernetes version
Upgrade
1) Review Upgrade Insights in EKS Cluster Insights to identify any issues
2) Update Cluster Control Plane
3) Update Cluster Components (e.g. Nodes)
Workload Scalability
Horizontal Pod Autoscaler (HPA)
- Adds or removes Pod replicas
- Monitors CPU/Memory (via Metrics Server) and scales out when usage hits a threshold
- Best for: Most stateless web applications and microservices
Requirements:
- Kubernetes Metrics Server to provide CPU and memory data
- You must define "resources: requests" in your container's deployment manifest
- Configured kubectl and API Access
- RBAC Permissions: user or service account creating HPA needs permissions to read metrics and update the scale subresource of the target (Deployment, StatefulSet, etc.)
- API Version: ensure you are using autoscaling/v2 API version in your manifests for scaling based on memory or multiple metrics
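The HPA's core decision (why it needs the Metrics Server and `resources: requests`) is a single documented formula, sketched here as a simplified model without the controller's tolerance window:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HPA formula:
    desired = ceil(current_replicas * currentMetric / targetMetric).
    Utilization is measured against the container's resource requests,
    which is why the manifest must define them."""
    return math.ceil(current_replicas * (current_metric / target_metric))
```

For example, 3 replicas averaging 80% CPU against a 50% target give ceil(3 * 1.6) = 5 replicas.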
Vertical Pod Autoscaler (VPA)
- Increases or decreases the CPU/Memory requests of existing Pods
- Watches actual usage over time and "rightsizes" Pods
- Best for: Stateful apps (databases) or "monolithic" apps that cannot be easily cloned
- Warning: by default, VPA will restart Pods to apply changes
Requirements:
- Metric Source
- Metrics Server: This is the minimum requirement. It provides short-term metrics (CPU/Memory usage)
- Prometheus (Optional but Recommended): VPA can use Prometheus data for longer-term historical analysis and more precise recommendations
- Deploy VPA as an add-on (VPA Recommender, VPA Updater, VPA Admission Controller)
- Cluster Configuration
- MutatingAdmissionWebhook: the K8s API server must have this admission controller enabled (it is by default)
- Resource Requests: Pods must have some initial "resources: requests" defined in their YAML. VPA uses these as a starting point
- Warnings:
- You cannot use VPA and HPA on the same resource (e.g., both scaling on CPU)
Kubernetes Event-Driven Autoscaling (KEDA)
- Scales based on external events, not just CPU
- Scale Pods based on the number of messages in SQS, Kafka lag, or specific time of day (Cron)
- Best for: Background workers, queue processing, and event-driven architectures
Requisites:
- Cluster & Tools
- kubectl Configuration: your local machine must have kubectl installed and configured to communicate with your cluster
- Helm v3
- Target Resource: Deployment or StatefulSet
- Like HPA, you should define "resources: requests" in your Pod spec (KEDA scales on external events, but it creates an HPA that needs this)
- Network Connectivity to event sources
- IAM / Secret Permissions to access event source
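Putting the SQS example above together, a KEDA ScaledObject ties a Deployment to a queue-length trigger. A hedged sketch as a Python dict (queue URL, deployment name, and replica bounds are hypothetical):

```python
# A KEDA ScaledObject manifest: scale the "sqs-worker" Deployment on
# SQS queue depth, targeting ~5 messages per replica.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "sqs-worker-scaler"},   # hypothetical name
    "spec": {
        "scaleTargetRef": {"name": "sqs-worker"},  # Deployment to scale
        "minReplicaCount": 0,   # KEDA can scale idle workers to zero
        "maxReplicaCount": 20,
        "triggers": [{
            "type": "aws-sqs-queue",
            "metadata": {
                "queueURL": "https://sqs.us-east-1.amazonaws.com/123456789012/jobs",
                "queueLength": "5",    # target messages per replica
                "awsRegion": "us-east-1",
            },
        }],
    },
}
```

Under the hood KEDA turns this into an HPA on the target Deployment, which is why the Pod spec still needs `resources: requests`.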