Amazon CloudWatch¶
1. What is CloudWatch?¶
CloudWatch is AWS's unified observability service — it collects metrics, logs, and traces from your AWS infrastructure and applications, then lets you monitor, alert, visualize, and automatically respond to operational events.
Three pillars of observability:
Metrics → numbers over time (CPU = 78%, RequestCount = 4,320)
Logs → text records of events (application logs, access logs, Lambda output)
Traces → request path across distributed services (X-Ray integration)
CloudWatch covers: Metrics + Logs
AWS X-Ray covers: Traces (distributed tracing)
2. CloudWatch Metrics ⭐¶
A metric is a time-series of data points with a name, namespace, and optional dimensions.
Metric anatomy:
Namespace: AWS/EC2 ← service grouping (or custom: MyApp/Orders)
MetricName: CPUUtilization ← what is measured
Dimensions: InstanceId=i-1234567890abcdef0 ← what it applies to
Unit: Percent
Value: 78.3
Timestamp: 2026-04-08T15:30:00Z
Default vs Custom Metrics¶
| Type | Published By | Cost | Examples |
|---|---|---|---|
| AWS default | AWS services automatically | Free | EC2 CPUUtilization, S3 BucketSizeBytes, ALB RequestCount |
| Detailed monitoring | AWS services (1-min granularity) | Small charge | EC2 with detailed monitoring enabled (default: 5-min) |
| Custom metrics | Your application via PutMetricData API | Per metric per month | OrderCount, ActiveUsers, QueueDepth |
EC2 Default Metrics (Free, 5-minute granularity)¶
AWS/EC2 namespace:
CPUUtilization ← % CPU used
NetworkIn / NetworkOut ← bytes transferred
DiskReadOps / DiskWriteOps ← disk operations
StatusCheckFailed ← instance or system check failure
NOT available by default (need CloudWatch Agent):
Memory (RAM) utilization ← OS-level, AWS cannot see it
Disk space utilization ← OS-level, AWS cannot see it
Process counts ← OS-level
CloudWatch Agent:
Installed on EC2 → reads OS-level metrics → sends to CloudWatch
IAM role required: CloudWatchAgentServerPolicy
Config: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
Metric Resolution¶
Standard resolution: 1 minute (minimum for detailed monitoring)
High resolution: 1 second (custom metrics only — higher cost)
Retention:
< 60 seconds resolution: retained 3 hours
1-minute resolution: retained 15 days
5-minute resolution: retained 63 days
1-hour resolution: retained 455 days (15 months)
(CloudWatch automatically aggregates older data to lower resolution)
Metric Math¶
Combine metrics with mathematical expressions into new virtual metrics:
Example: Error rate % from two metrics
m1: HTTPCode_ELB_5XX_Count
m2: RequestCount
Expression: (m1/m2)*100 → ErrorRate %
→ Create alarm on ErrorRate instead of raw counts
Functions available:
METRICS() → array of all metrics in expression
SUM(METRICS()) → sum across all dimensions (e.g., total CPU across fleet)
AVG(METRICS()) → average across all dimensions
MIN/MAX → minimum/maximum of metrics
ANOMALY_DETECTION_BAND(m1, 2) → ML band for anomaly detection [docs.aws.amazon](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html)
3. CloudWatch Alarms ⭐¶
An alarm watches a metric (or metric math expression) and transitions between states based on whether the metric crosses a threshold:
Alarm States¶
OK → metric within acceptable range
ALARM → metric exceeded threshold → actions triggered
INSUFFICIENT_DATA → not enough data points yet (new alarm or metric gap)
Alarm Components¶
Metric: what to watch (CPUUtilization of i-1234567890abcdef0)
Threshold: > 80%
Period: 300 seconds (5 minutes) — data point interval
Evaluation: 3 out of 3 periods → must breach 3 consecutive periods
Datapoints: N of M evaluation (e.g., 3 out of 5 periods)
Alarm Actions¶
| Action | Use Case |
|---|---|
| SNS notification | Email, SMS, PagerDuty, Slack (via Lambda) |
| EC2 action | Stop, Start, Terminate, Reboot instance |
| Auto Scaling | Add/remove instances from ASG |
| Systems Manager | Run automation runbook |
| Lambda | Custom remediation logic |
# Alarm → SNS → Lambda → auto-remediate
Alarm: EC2 CPU > 90% for 3 periods
→ SNS topic: ops-alerts
→ Email: on-call engineer
→ Lambda: snapshot instance + send PagerDuty alert
Composite Alarms¶
Multiple alarms combined with AND/OR logic:
ALARM if: (CPU > 90% AND Memory > 85%) ← avoid false positives
ALARM if: (5xx errors > 100 OR latency > 5s)
Benefits:
Reduce alert noise — only alarm when multiple signals agree
Create a "service health" rollup alarm from individual metric alarms
Anomaly Detection Alarms ⭐¶
Instead of static threshold → use ML-learned band of expected values:
CloudWatch trains model on 2 weeks of metric history
→ Learns hourly, daily, weekly patterns
→ Creates expected value band (configurable width in std deviations)
Alarm triggers when metric goes OUTSIDE the band:
ANOMALY_DETECTION_BAND(m1, 2) ← 2 = number of standard deviations
Example:
Normal traffic: 1,000–5,000 req/min (varies by time of day)
Static threshold alarm: > 8,000 → many false positives on peak hours
Anomaly detection: > expected range for this time of day → accurate
Use cases:
Traffic spikes that are unusual for the current time
Lambda duration deviating from normal
Error rates deviating from baseline
Business metrics (orders/minute) dropping unexpectedly
4. CloudWatch Logs ⭐¶
Log Hierarchy¶
Log Group → container for a service/application
└── Log Stream → sequence of events from one source (one EC2, one Lambda)
└── Log Events → individual timestamped log entries
Log Sources¶
| Source | How Logs Get to CloudWatch |
|---|---|
| Lambda | Automatic — every Lambda writes to /aws/lambda/<function-name> |
| EC2 | CloudWatch Agent required |
| ECS/EKS | awslogs log driver / Fluent Bit |
| API Gateway | Enable access logging in stage settings |
| VPC Flow Logs | Enable on VPC/subnet/ENI → destination CloudWatch |
| CloudTrail | Enable CloudWatch Logs integration on trail |
| RDS | Enable enhanced monitoring + slow query logs |
| Load Balancer | Enable access logs (goes to S3) — not native CW |
Log Retention¶
Default: logs kept forever (never expire) — accrues cost indefinitely
Set retention: 1 day, 3 days, 5 days, 1 week, 2 weeks, 1/3/6 months, 1/2/5/10 years
Best practice:
Dev log groups: 7 days
Prod app logs: 90 days → then archive to S3 via subscription
Audit logs: 1–7 years (compliance)
Metric Filters¶
Extract metric values FROM log data:
Log group: /aws/lambda/order-processor
Filter pattern: [timestamp, requestId, level="ERROR", ...]
→ Creates metric: LambdaErrors (count of matching lines)
→ Create alarm on this metric
Example patterns:
"ERROR" ← any line containing ERROR
"[ERROR]" ← literal [ERROR]
{ $.statusCode = 500 } ← JSON log: statusCode field = 500
{ $.latency > 1000 } ← JSON log: latency > 1000ms
5. CloudWatch Logs Insights ⭐¶
Interactive query engine for searching and analyzing log data using a purpose-built query language:
-- Most common query pattern: find errors in last 1 hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
-- Top 10 slowest Lambda invocations
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10
-- Error rate percentage
fields @timestamp, @message
| stats
count(*) as totalRequests,
sum(@message like /ERROR/) as errors
| project errors/totalRequests * 100 as errorRate
-- Parse custom log format
fields @message
| parse @message "user=* action=* duration=*ms" as user, action, duration
| stats avg(duration) by action
| sort avg_duration desc
-- Lambda cold starts
fields @timestamp, @initDuration
| filter @initDuration > 0
| stats count() as coldStarts, avg(@initDuration) as avgInitMs
-- VPC Flow Log: top talkers
fields srcAddr, dstAddr, bytes
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 10
Key commands:
fields → select specific fields to return
filter → where clause (like SQL WHERE)
stats → aggregate: count, sum, avg, min, max, percentile
sort → order results
limit → cap result count
parse → extract fields from unstructured log text
dedup → remove duplicates by field
display → rename/format fields in output
Time range: query logs from any time window (up to retention period)
Supports: all log groups simultaneously (cross-log-group query)
6. CloudWatch Logs — Subscriptions and Export¶
Subscription Filter (Real-time Streaming)¶
Stream logs in real-time to:
Lambda → process and forward to Elasticsearch/Splunk/custom
Kinesis Data Streams → high-volume real-time processing
Kinesis Firehose → buffer and deliver to S3/Datadog/Splunk
Use case:
/aws/lambda/api → subscription filter → Kinesis → Firehose → S3
→ Centralized log archive at low cost
S3 Export (Batch)¶
Export historical log data to S3:
CreateExportTask API → exports to S3 (takes minutes to hours)
Not real-time — up to 12 hours delay
Use for: compliance archiving, bulk analysis in Athena
7. CloudWatch Agent ⭐¶
Required for OS-level metrics and custom application logs from EC2:
Collects:
Memory utilization (RAM)
Disk space per mount point
Disk I/O per device
Network metrics per interface
Process metrics
Custom app logs (tail any log file)
StatsD / collectd metrics from applications
Install:
SSM Run Command (recommended for fleets)
Or manually: yum install amazon-cloudwatch-agent
Configure:
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
Or use: aws-cloudwatch-agent-config-wizard (interactive)
IAM role needed:
CloudWatchAgentServerPolicy (attach to EC2 instance profile)
8. CloudWatch Dashboards¶
Global service — view metrics from any region on one dashboard
Share with: AWS accounts, email (public link), third parties (no AWS login)
Widget types:
Line chart → trends over time
Number → current metric value (CPUUtilization: 34%)
Gauge → visual meter
Bar chart → comparison
Text → markdown annotations and runbook links
Alarm status → traffic light for multiple alarms
Log table → embedded Logs Insights query result
Explorer → auto-discover and graph tagged resources
Automatic dashboards:
AWS creates default dashboards for each service → CloudWatch → Dashboards → Automatic
EC2, Lambda, RDS, ELB etc. — pre-built, no configuration needed
9. CloudWatch Synthetics (Canary Monitoring)¶
Runs scripted synthetic transactions against your application endpoints
→ Detects issues BEFORE real users do
→ Monitors availability and performance from outside your system
Canary types:
Heartbeat monitor: GET your URL → check 200 response
API canary: series of API calls → validate responses
Broken link check: crawl pages → find broken links
GUI workflow: Puppeteer script → simulate user login/checkout
Visual monitoring: screenshot comparison (detects UI regressions)
Schedule: every 1 minute to every 1 hour
Stores: screenshots, HAR files, Lambda execution logs in S3
Metrics: SuccessPercent, Duration → create alarms on them
Use case:
Alarm: canary SuccessPercent < 100% for 3 minutes
→ Trigger PagerDuty before your monitoring team notices
10. CloudWatch ServiceLens and X-Ray Integration¶
ServiceLens = CloudWatch Metrics + Logs + X-Ray Traces unified view
→ Map of microservices with health, latency, error rate per service
→ Click any service → drill into its traces
→ Click any trace → see which Lambda/EC2/DynamoDB call is slow
X-Ray adds to CloudWatch:
Distributed tracing: follow one request across Lambda → API Gateway → DynamoDB
Service map: visual dependency graph with latency/error annotations
Segments + subsegments: timing breakdown of each component
Enable tracing:
Lambda: X-Ray active tracing → one checkbox
EC2: install X-Ray daemon
ECS: X-Ray sidecar container
11. CloudWatch Pricing¶
Metrics:
AWS default metrics: free
Custom metrics: $0.30/metric/month (first 10,000 metrics)
High-resolution: $0.02/metric/month additional
Alarms:
Standard resolution: $0.10/alarm/month
High-resolution: $0.30/alarm/month
Composite alarms: $0.50/alarm/month
Logs:
Ingestion: $0.50/GB
Storage: $0.03/GB/month
Insights query: $0.005/GB scanned
Dashboards:
First 3 dashboards: free
After: $3/dashboard/month
Synthetics canaries:
$0.0012 per canary run
12. Common Mistakes¶
| ❌ Wrong | ✅ Correct |
|---|---|
| CloudWatch monitors EC2 RAM by default | RAM is OS-level — requires CloudWatch Agent |
| Logs are automatically deleted | Default retention: forever — always set a retention policy |
| One alarm = one threshold | Use composite alarms to reduce noise; use anomaly detection for dynamic workloads |
| Alarm actions only send emails | Alarms can trigger: EC2 actions, Auto Scaling, Systems Manager, Lambda, SNS |
| Logs Insights only queries one log group | Logs Insights can query multiple log groups simultaneously |
| Alarm on INSUFFICIENT_DATA = problem | INSUFFICIENT_DATA just means not enough data yet — new resources often start here |
| CloudWatch is region-specific only | Dashboards are global — can show metrics from any region in one dashboard |
| Custom metrics cost the same as standard | High-resolution custom metrics (1-second) cost more than standard (1-minute) |
13. Interview Questions Checklist¶
- Three pillars of observability — which does CloudWatch cover?
- Which EC2 metrics are NOT available by default? What do you need?
- Metric retention periods — what happens to 1-second data after 3 hours?
- Three alarm states — what does INSUFFICIENT_DATA mean?
- Five alarm actions — name all of them
- What is a composite alarm? Why use it?
- What is anomaly detection? How does it differ from a static threshold?
- Write a Logs Insights query: top 10 slowest Lambda invocations
- What is a metric filter? What does it do?
- CloudWatch Agent — what does it add? What IAM policy does it need?
- Real-time log streaming — what are the three destinations?
- What is CloudWatch Synthetics? What does a canary do?