Monitoring
AWS provides built-in monitoring through CloudWatch (metrics, logs, alarms, dashboards) and CloudTrail (API audit logging). Together they answer “what’s happening?” and “who did what?”
CloudWatch
CloudWatch collects and visualizes metrics and logs from AWS services and your applications.
Metrics
Most AWS services automatically send metrics to CloudWatch. You can also push custom metrics from your applications.
Built-In Metrics (Examples)
| Service | Metric | What It Measures |
|---|---|---|
| EC2 | CPUUtilization | CPU usage (%) |
| EC2 | NetworkIn / NetworkOut | Network traffic (bytes) |
| RDS | DatabaseConnections | Active DB connections |
| RDS | FreeStorageSpace | Available disk space |
| ALB | RequestCount | Total requests |
| ALB | TargetResponseTime | Latency (seconds) |
| Lambda | Invocations | Function calls |
| Lambda | Duration | Execution time (ms) |
| S3 | BucketSizeBytes | Bucket storage size |
| SQS | ApproximateNumberOfMessagesVisible | Queue depth |
Note: EC2 metrics are at 5-minute intervals by default. Enable detailed monitoring for 1-minute intervals (additional cost).
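Reading a built-in metric back out works through the same API. The sketch below assumes boto3 is installed and credentials are configured; the helper names `cpu_query` and `fetch_cpu` and the instance ID are illustrative, not part of any AWS library.

```python
from datetime import datetime, timedelta, timezone


def cpu_query(instance_id, hours=1, period=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,  # 300 s matches the default 5-minute EC2 interval
        "Statistics": ["Average", "Maximum"],
    }


def fetch_cpu(instance_id):
    """Call CloudWatch and return datapoints ordered by time."""
    import boto3  # imported lazily so cpu_query works without boto3 installed

    response = boto3.client("cloudwatch").get_metric_statistics(**cpu_query(instance_id))
    # The API does not guarantee datapoint order, so sort by timestamp.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling `fetch_cpu("i-abc123")` would return one datapoint per 5-minute period for the last hour, each carrying `Average` and `Maximum` values.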
Custom Metrics
Push your own application metrics to CloudWatch:

    # CLI
    aws cloudwatch put-metric-data --namespace "MyApp" \
      --metric-name "OrdersProcessed" --value 42 --unit Count

    # With dimensions (like Prometheus labels)
    aws cloudwatch put-metric-data --namespace "MyApp" \
      --metric-name "OrdersProcessed" --value 42 --unit Count \
      --dimensions Environment=production,Service=orders-api

Python (boto3):
    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Publish one data point for OrdersProcessed, tagged with two dimensions
    cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[{
            'MetricName': 'OrdersProcessed',
            'Value': 42,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'orders-api'},
            ]
        }]
    )

Key Concepts
| Concept | What It Is |
|---|---|
| Namespace | A container for metrics (e.g. AWS/EC2, AWS/RDS, MyApp) |
| Dimension | A key-value pair that identifies a specific metric stream (e.g. InstanceId=i-abc) |
| Period | The time aggregation window (60s, 300s, etc.) |
| Statistic | How to aggregate: Average, Sum, Minimum, Maximum, SampleCount, percentiles (p99) |
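To make period and statistic concrete, here is a small local model of the aggregation. It is a simplified sketch, not CloudWatch's implementation: the `aggregate` function and the sample values are made up for illustration, and percentiles are left out.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, period, statistic):
    """Group (epoch_seconds, value) samples into period-aligned buckets
    and reduce each bucket with the chosen statistic."""
    reducers = {
        "Average": mean,
        "Sum": sum,
        "Minimum": min,
        "Maximum": max,
        "SampleCount": len,
    }
    reducer = reducers[statistic]
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % period)  # align to the period
        buckets[bucket_start].append(value)
    return {start: reducer(values) for start, values in sorted(buckets.items())}


# Five raw samples, aggregated into 300-second periods
samples = [(0, 10), (60, 30), (120, 20), (300, 90), (360, 70)]
print(aggregate(samples, 300, "Average"))
print(aggregate(samples, 300, "Maximum"))
```

The same raw samples give different series depending on the statistic, which is why an alarm on `Average` can stay quiet while `Maximum` shows a spike.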
Alarms
Alarms watch a metric and trigger actions when a threshold is crossed.

    # Create an alarm: notify when average CPU > 80% for 2 consecutive
    # 5-minute periods (10 minutes in total)
    aws cloudwatch put-metric-alarm \
      --alarm-name "HighCPU" \
      --metric-name CPUUtilization \
      --namespace AWS/EC2 \
      --statistic Average \
      --period 300 \
      --threshold 80 \
      --comparison-operator GreaterThanThreshold \
      --evaluation-periods 2 \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
      --dimensions Name=InstanceId,Value=i-abc123

Alarm States
| State | Meaning |
|---|---|
| OK | Metric is within the threshold |
| ALARM | Metric breached the threshold for the required number of evaluation periods |
| INSUFFICIENT_DATA | Not enough data to determine state |
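The three states can be modelled with a few lines of plain Python. This is a deliberately simplified sketch for a `GreaterThanThreshold` alarm: the `alarm_state` function is hypothetical, and it ignores real CloudWatch options such as "M out of N" datapoints and configurable missing-data treatment.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA'.

    datapoints: one aggregated value per period, oldest first;
    None marks a period with no data. evaluation_periods must be >= 1.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(value is None for value in recent):
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


print(alarm_state([70, 85, 90], threshold=80, evaluation_periods=2))
print(alarm_state([85, 90, 75], threshold=80, evaluation_periods=2))
print(alarm_state([85], threshold=80, evaluation_periods=2))
```

Only the most recent `evaluation_periods` values matter, so a single breaching period does not fire an alarm that requires two.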
Alarm Actions
| Action | What It Does |
|---|---|
| SNS notification | Send to email, Slack (via Lambda), PagerDuty |
| Auto Scaling | Scale EC2 capacity out or in |
| EC2 action | Stop, terminate, reboot, or recover an instance |
| Lambda | Trigger a function for custom remediation |
Composite Alarms
Combine multiple alarms with AND/OR logic:

    # ALARM only when BOTH CPU is high AND memory is high
    aws cloudwatch put-composite-alarm \
      --alarm-name "HighResourceUsage" \
      --alarm-rule 'ALARM("HighCPU") AND ALARM("HighMemory")'

This reduces alert noise: you alert on the combination of symptoms, not on each one individually.
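When the list of child alarms comes from code, the rule string can be generated instead of hand-written. A minimal sketch, assuming the child alarms already exist; `composite_rule` and `create_composite_alarm` are illustrative helper names, while `put_composite_alarm` is the boto3 call behind the CLI command.

```python
def composite_rule(alarm_names, operator="AND"):
    """Build an alarm rule such as 'ALARM("A") AND ALARM("B")'."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(name, alarm_names, operator="AND"):
    """Create or update the composite alarm in CloudWatch."""
    import boto3  # imported lazily so composite_rule works without boto3 installed

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName=name,
        AlarmRule=composite_rule(alarm_names, operator),
    )


print(composite_rule(["HighCPU", "HighMemory"]))
```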
CloudWatch Logs
CloudWatch Logs stores and queries log data from AWS services and your applications.
Log Structure
    Log Group: /aws/lambda/my-function          ← container (one per app/service)
      └── Log Stream: 2026/02/16/[$LATEST]      ← one stream per source instance
            ├── log event: "START RequestId..."
            ├── log event: "Processing order 123..."
            └── log event: "END RequestId..."

| Concept | What It Is |
|---|---|
| Log group | A collection of log streams. Set retention (1 day – indefinite). |
| Log stream | A sequence of log events from one source (one Lambda execution environment, one EC2 instance). |
| Log event | A single log line with a timestamp. |
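The group, stream, and event hierarchy maps directly onto the `put_log_events` call. The sketch below assumes the log group and stream already exist and that boto3 is configured; `make_log_events` and `send_events` are hypothetical helpers.

```python
import time


def make_log_events(messages, now_ms=None):
    """Turn plain strings into log events.

    Each event is a message plus a timestamp in milliseconds since the epoch.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [{"timestamp": now_ms, "message": message} for message in messages]


def send_events(log_group, log_stream, messages):
    """Write the events to one stream inside one log group."""
    import boto3  # imported lazily so make_log_events works without boto3 installed

    boto3.client("logs").put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=make_log_events(messages),
    )
```

For example, `send_events("/myapp/production", "worker-1", ["Processing order 123..."])` would append one event to the `worker-1` stream; both names here are placeholders.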
Sending Logs to CloudWatch
| Source | How |
|---|---|
| Lambda | Automatic (logs go to /aws/lambda/<function-name>) |
| ECS | Use awslogs log driver in task definition |
| EC2 | Install the CloudWatch agent |
| On-premises | Install the CloudWatch agent |
| API Gateway | Enable access logging |
| RDS | Enable log exports on the DB instance (some log types also require parameter group settings) |
CloudWatch Agent config (EC2):

    {
      "logs": {
        "logs_collected": {
          "files": {
            "collect_list": [{
              "file_path": "/var/log/myapp/*.log",
              "log_group_name": "/myapp/production",
              "log_stream_name": "{instance_id}"
            }]
          }
        }
      }
    }

Logs Insights (Querying)
CloudWatch Logs Insights lets you query logs with a purpose-built, pipe-delimited query language:

    # Find the 50 most recent errors (the time range, e.g. the last hour,
    # is set in the console or in the API call, not in the query)
    fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50

    # Count errors per 5 minutes
    fields @timestamp, @message
    | filter @message like /ERROR/
    | stats count(*) as error_count by bin(5m)

    # Find slow Lambda invocations (@duration is in milliseconds)
    fields @timestamp, @duration, @requestId
    | filter @duration > 5000
    | sort @duration desc
    | limit 20

    # Parse JSON logs and filter
    fields @timestamp, @message
    | parse @message '{"level":"*","msg":"*","duration":*}' as level, msg, duration
    | filter level = "error"
    | stats count(*) by msg

Metric Filters
Turn log patterns into CloudWatch metrics:

    # Create a metric every time "ERROR" appears in logs
    aws logs put-metric-filter \
      --log-group-name /myapp/production \
      --filter-name ErrorCount \
      --filter-pattern "ERROR" \
      --metric-transformations \
        metricName=ErrorCount,metricNamespace=MyApp,metricValue=1

Now you can create alarms on ErrorCount to alert when errors spike.
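An alarm on the filter's metric could look like the sketch below. The alarm name, the threshold of 10 errors per 5 minutes, and the helper names are assumptions chosen for illustration; the keyword arguments are those of boto3's `put_metric_alarm`.

```python
def error_alarm_params(topic_arn, threshold=10):
    """Build put_metric_alarm arguments for the ErrorCount metric."""
    return {
        "AlarmName": "HighErrorCount",
        "Namespace": "MyApp",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",  # each matching log line contributes 1
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # A period with no matching log lines has no datapoint at all,
        # so treat missing data as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def create_error_alarm(topic_arn, threshold=10):
    """Create or update the alarm in CloudWatch."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**error_alarm_params(topic_arn, threshold))
```

`Sum` is used rather than `Average` because every datapoint emitted by the filter has the value 1, so the average would never exceed 1.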
CloudWatch Dashboards
Create visual dashboards that combine metrics, logs, and alarms, either in the AWS console or through the API:

    aws cloudwatch put-dashboard --dashboard-name "Production" \
      --dashboard-body '{
        "widgets": [{
          "type": "metric",
          "properties": {
            "title": "CPU Usage",
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-abc"]],
            "period": 300,
            "region": "us-east-1"
          }
        }]
      }'

Tips:
- Group metrics by service (one row per service).
- Put alarms and key stats at the top.
- Use automatic dashboards — CloudWatch auto-generates dashboards for many services.
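Dashboard bodies are plain JSON, so they are easy to generate. The sketch below stacks one CPU widget per instance; the helper names, widget sizes, and instance IDs are illustrative assumptions, and the resulting string is what `put_dashboard` expects as `DashboardBody`.

```python
import json


def cpu_widget(instance_id, region, y=0):
    """One metric widget showing CPU for a single instance."""
    return {
        "type": "metric",
        "x": 0,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"CPU Usage ({instance_id})",
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "period": 300,
            "region": region,
        },
    }


def dashboard_body(instance_ids, region):
    """Stack one widget per instance vertically and serialize to JSON."""
    widgets = [
        cpu_widget(instance_id, region, y=index * 6)
        for index, instance_id in enumerate(instance_ids)
    ]
    return json.dumps({"widgets": widgets})


def publish(name, instance_ids, region):
    """Create or update the dashboard in CloudWatch."""
    import boto3  # imported lazily so the builders work without boto3 installed

    boto3.client("cloudwatch", region_name=region).put_dashboard(
        DashboardName=name,
        DashboardBody=dashboard_body(instance_ids, region),
    )
```

Keeping the body in code makes it straightforward to version dashboards alongside the services they describe.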
CloudWatch vs Prometheus/Grafana
| | CloudWatch | Prometheus + Grafana |
|---|---|---|
| Setup | Built-in (zero config for AWS services) | Self-hosted (more setup) |
| Custom metrics | API calls (can get expensive at high volume) | Pull-based (scrape targets) |
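| Query language | Logs Insights for logs, Metrics Insights and metric math for metrics (more limited) | PromQL for metrics, plus LogQL if you add Grafana Loki for logs (powerful) |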
| Dashboards | Basic | Advanced (Grafana is far more flexible) |
| Cost | Pay per metric, alarm, log volume | Infrastructure cost only |
| Best for | AWS-native monitoring, small setups | Large-scale, multi-cloud, K8s-native |
Many teams use both: CloudWatch for AWS-native metrics (billing, Lambda, RDS) and Prometheus/Grafana for application-level metrics and Kubernetes.
CloudTrail
CloudTrail records API activity in your AWS account: who did what, when, and from where.
What CloudTrail Records
| Field | Example |
|---|---|
| Event time | 2026-02-16T10:30:00Z |
| User | arn:aws:iam::123456789012:user/alice |
| Source IP | 203.0.113.50 |
| Action | RunInstances, DeleteBucket, CreateUser |
| Resource | arn:aws:ec2:us-east-1:123456789012:instance/i-abc |
| Result | Success or error code |
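In the JSON records that CloudTrail delivers, these fields appear under keys such as `eventTime`, `userIdentity`, `sourceIPAddress`, `eventName`, and (only on failures) `errorCode`. The sketch below extracts them from a trimmed, made-up record that reuses the example values from the table; real records carry many more fields.

```python
import json

# Hypothetical, heavily trimmed record for illustration only
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2026-02-16T10:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.50",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::123456789012:user/alice"
  }
}
""")


def summarize(record):
    """Reduce a CloudTrail record to who did what, when, and from where."""
    identity = record.get("userIdentity", {})
    return {
        "time": record["eventTime"],
        "user": identity.get("arn", identity.get("type", "unknown")),
        "source_ip": record.get("sourceIPAddress"),
        "action": record["eventName"],
        # errorCode is absent when the call succeeded
        "result": record.get("errorCode", "Success"),
    }


print(summarize(SAMPLE_RECORD))
```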
Use Cases
- Security audit: Who launched that instance? Who changed that security group?
- Compliance: Prove that only authorized users accessed sensitive data.
- Incident investigation: Trace the sequence of API calls during a breach.
- Change tracking: What changed in the last 24 hours?
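Questions like "who launched that instance?" can be answered from recent management events with the `lookup_events` API. A sketch under the usual assumptions (boto3 installed, credentials configured); `lookup_params` and `find_events` are illustrative names.

```python
from datetime import datetime, timedelta, timezone


def lookup_params(event_name, hours=24, now=None):
    """Build lookup_events arguments for one event name and a time window."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }


def find_events(event_name, hours=24):
    """Yield (time, user, action) for matching events, across all result pages."""
    import boto3  # imported lazily so lookup_params works without boto3 installed

    paginator = boto3.client("cloudtrail").get_paginator("lookup_events")
    for page in paginator.paginate(**lookup_params(event_name, hours)):
        for event in page["Events"]:
            yield event["EventTime"], event.get("Username"), event["EventName"]
```

Iterating over `find_events("RunInstances")` would list instance launches from the last 24 hours. This API only searches the recent event history of management events; older or data-plane activity has to be queried from the trail's S3 or CloudWatch Logs destination.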
Setting Up CloudTrail
    # Create a multi-region trail that delivers events to S3
    # (the bucket must already exist with a policy that lets CloudTrail write to it)
    aws cloudtrail create-trail \
      --name my-trail \
      --s3-bucket-name my-cloudtrail-logs \
      --is-multi-region-trail \
      --enable-log-file-validation

    # Start logging
    aws cloudtrail start-logging --name my-trail

Event Types
| Type | What It Logs | Enabled by Default |
|---|---|---|
| Management events | Control-plane API calls (create, delete, modify resources) | Yes (the last 90 days are viewable in Event history at no charge) |
| Data events | Data-plane operations (S3 GetObject, Lambda Invoke) | No (additional cost) |
| Insights events | Unusual activity patterns (API call spikes) | No (additional cost) |
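Data events are switched on per trail with event selectors. The sketch below enables S3 object-level logging for one bucket; the helper names are hypothetical, the bucket and trail names would be your own, and data events are billed separately.

```python
def s3_data_event_selectors(bucket_name):
    """Event selectors that keep management events and add
    object-level (data) events for every object in one bucket."""
    return [{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            # The trailing slash selects all objects in the bucket
            "Values": [f"arn:aws:s3:::{bucket_name}/"],
        }],
    }]


def enable_s3_data_events(trail_name, bucket_name):
    """Apply the selectors to an existing trail."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudtrail").put_event_selectors(
        TrailName=trail_name,
        EventSelectors=s3_data_event_selectors(bucket_name),
    )
```

Scoping the selector to specific buckets rather than all of S3 is a common way to keep data-event volume, and therefore cost, under control.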
CloudTrail + CloudWatch Logs
Send CloudTrail events to CloudWatch Logs for near-real-time alerting:

    # The log group ARN must end with the ":*" log stream wildcard
    aws cloudtrail update-trail --name my-trail \
      --cloud-watch-logs-log-group-arn "arn:aws:logs:us-east-1:123456789012:log-group:CloudTrail:*" \
      --cloud-watch-logs-role-arn arn:aws:iam::123456789012:role/CloudTrailCloudWatch

Then create metric filters to alert on suspicious activity:
    # Alert on root account usage
    aws logs put-metric-filter \
      --log-group-name CloudTrail \
      --filter-name RootAccountUsage \
      --filter-pattern '{ $.userIdentity.type = "Root" }' \
      --metric-transformations \
        metricName=RootAccountUsage,metricNamespace=Security,metricValue=1

Key Takeaways
- CloudWatch Metrics monitor AWS resources automatically. Add custom metrics for application-level data.
- CloudWatch Alarms trigger notifications or auto-scaling when thresholds are crossed. Use composite alarms to reduce noise.
- CloudWatch Logs collect logs from Lambda, ECS, EC2, and more. Use Logs Insights for querying and metric filters for alerting.
- CloudTrail records API activity for security auditing and compliance. Enable multi-region trails and send events to CloudWatch Logs.
- For Kubernetes and multi-cloud environments, consider supplementing CloudWatch with Prometheus + Grafana.