Well-Architected Framework

By Atif Alam

The AWS Well-Architected Framework is a set of best practices for evaluating and improving cloud architectures. It’s organized into six pillars — use them as a checklist when designing or reviewing any AWS workload.

Pillar | Core Question
Operational Excellence | How do you run and monitor systems to deliver business value?
Security | How do you protect information and systems?
Reliability | How do you prevent failures, and recover quickly when they happen?
Performance Efficiency | How do you use resources efficiently as demand changes?
Cost Optimization | How do you avoid unnecessary costs?
Sustainability | How do you minimize environmental impact?

Operational Excellence

Run and monitor systems to deliver business value and continually improve processes.

Practice | What It Means
Infrastructure as Code | Define everything in Terraform/CloudFormation; no manual console changes
Small, frequent changes | Deploy often with small diffs. Easier to debug and roll back.
Automate runbooks | Turn manual operational tasks into scripts or Lambda functions
Observe everything | Metrics, logs, traces, alarms. You can’t fix what you can’t see.
Learn from failures | Post-incident reviews, game days, chaos engineering

Service | Role
CloudFormation / CDK / Terraform | Infrastructure as Code
CloudWatch | Metrics, logs, alarms, dashboards
X-Ray | Distributed tracing
Systems Manager | Runbooks, patching, parameter management
Config | Track resource configuration changes

Design principles: Perform operations as code. Make frequent, small, reversible changes. Refine operations procedures frequently. Anticipate failure. Learn from all operational failures.
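As a small illustration of "perform operations as code" and "observe everything", an alarm can be described as data and kept in version control. The sketch below builds the keyword arguments for CloudWatch's PutMetricAlarm operation (exposed in boto3 as `put_metric_alarm`). The helper name `cpu_alarm_params`, the 80% threshold, the evaluation window, and the instance ID and SNS topic ARN are illustrative assumptions, not values from this guide.

```python
from typing import Any


def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     threshold_pct: float = 80.0) -> dict[str, Any]:
    """Build PutMetricAlarm arguments for a high-CPU alarm on one EC2 instance.

    The 80% default and the 3 x 5-minute evaluation window are assumptions
    chosen for illustration; tune them to the workload.
    """
    if not 0 < threshold_pct <= 100:
        raise ValueError("threshold_pct must be in (0, 100]")
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # consecutive breaching datapoints required
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = cpu_alarm_params(
        "i-0123456789abcdef0",                            # hypothetical instance
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic
    )
    print(params["AlarmName"])
    # With credentials configured, the dict could be passed on as:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the definition as a plain function makes it reviewable in a pull request and testable without touching an AWS account.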

Security

Protect information, systems, and assets while delivering business value.

Practice | What It Means
Least privilege | Grant only the minimum permissions needed
Defense in depth | Multiple security layers (WAF, SG, NACL, IAM, encryption)
Automate security | Use Config rules, GuardDuty, Security Hub; don’t rely on manual checks
Encrypt everything | At rest (KMS) and in transit (TLS)
Protect data | Classify data, control access, enable logging
Enable traceability | CloudTrail + CloudWatch Logs for full audit trail

Service | Role
IAM | Identity and access management
KMS | Encryption key management
WAF, Shield | Application and DDoS protection
GuardDuty | Threat detection
Secrets Manager | Credential management
CloudTrail | API audit logging
Security Hub | Central security dashboard

Design principles: Implement a strong identity foundation. Enable traceability. Apply security at all layers. Automate security best practices. Protect data in transit and at rest.
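To make "least privilege" concrete, here is a minimal sketch of a narrowly scoped IAM policy document plus a simple check that flags wildcard grants. The helper names `read_only_bucket_policy` and `wildcard_findings` and the bucket name are hypothetical. The check is deliberately simplistic and is not a substitute for tools such as IAM Access Analyzer.

```python
from typing import Any


def read_only_bucket_policy(bucket: str) -> dict[str, Any]:
    """IAM policy that allows only listing and reading objects in one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def wildcard_findings(policy: dict[str, Any]) -> list[str]:
    """Return one message per Allow statement that grants '*' actions or resources."""
    findings: list[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {i}: wildcard action")
        if any(r == "*" for r in resources):
            findings.append(f"statement {i}: wildcard resource")
    return findings


if __name__ == "__main__":
    scoped = read_only_bucket_policy("example-reports-bucket")  # hypothetical bucket
    admin = {"Version": "2012-10-17",
             "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"}}
    print(wildcard_findings(scoped))
    print(wildcard_findings(admin))
```

A check like this could run in CI against policy files, which is one way to "automate security" rather than rely on manual review.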

Reliability

Prevent failures, and recover quickly when they happen.

Practice | What It Means
Multi-AZ / multi-region | Distribute across failure domains
Auto-recovery | Auto Scaling, health checks, automated failover
Limit blast radius | Isolate components so one failure doesn’t cascade
Test recovery | Practice failover, backup restoration, disaster recovery
Manage change | Automate deployments, use canary/blue-green strategies

Service | Role
Auto Scaling | Automatically replace failed instances
RDS Multi-AZ | Database failover
Route 53 health checks | DNS-level failover
S3 (11 nines durability) | Durable storage
Elastic Load Balancing | Distribute traffic, health checks
Backup | Centralized backup management

A typical layered stack, from DNS down to storage:

  • Route 53 (DNS failover)
  • CloudFront (edge caching, origin failover)
  • ALB (health checks, multi-AZ)
  • Auto Scaling Group (self-healing, multi-AZ)
  • RDS Multi-AZ (automatic DB failover)
  • S3 (11 nines durability for backups)

Design principles: Automatically recover from failure. Test recovery procedures. Scale horizontally. Stop guessing capacity. Manage change in automation.
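The case for spreading a workload across failure domains can be shown with a little arithmetic. The sketch below assumes failures are independent, which is optimistic in practice because correlated failures exist. The 99.5% single-AZ figure in the example is a hypothetical input, not an AWS SLA.

```python
import math


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas where any one suffices."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    one_az = 0.995                                # hypothetical single-AZ availability
    two_az = redundant_availability(one_az, 2)
    print(f"one AZ : {downtime_minutes_per_year(one_az):.0f} min/year")
    print(f"two AZs: {downtime_minutes_per_year(two_az):.0f} min/year")
    # Two components in series are less available than either one alone.
    print(f"LB + app in series: {serial_availability(0.999, 0.999):.6f}")
```

Redundancy multiplies failure probabilities together, while each extra dependency in the request path multiplies availabilities together. This is why limiting hard dependencies matters as much as adding replicas.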

Performance Efficiency

Use compute, storage, and networking resources efficiently as demand changes.

Practice | What It Means
Right-size resources | Match instance types to actual workload needs
Use managed services | Let AWS handle scaling (Lambda, DynamoDB, Fargate)
Go global | CloudFront, multi-region deployments for low latency
Experiment | Test different instance types, storage classes, architectures
Mechanical sympathy | Understand the technology to use it best (e.g. DynamoDB access patterns)

Workload | Best Option
Unpredictable traffic | Lambda, DynamoDB on-demand, Fargate
Steady compute | EC2 Reserved Instances / Savings Plans
Static content | CloudFront + S3
High-IOPS database | EBS io2, Aurora
In-memory access | ElastiCache Redis
Global users | CloudFront, Global Accelerator, multi-region

Design principles: Democratize advanced technologies. Go global in minutes. Use serverless architectures. Experiment more often. Consider mechanical sympathy.
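Right-sizing comes down to comparing observed utilization with provisioned capacity. The following is a minimal sketch of that idea, assuming CPU utilization samples (in percent) have already been exported, for example from CloudWatch. The 40% and 80% thresholds and the function names are illustrative assumptions. Real tools such as Compute Optimizer also consider memory, network, and disk.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of `samples` for an integer `pct` in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, 1-based rank
    return ordered[rank - 1]


def rightsizing_hint(cpu_samples: list[float],
                     low: float = 40.0, high: float = 80.0) -> str:
    """Suggest 'downsize', 'upsize', or 'keep' from the 95th-percentile CPU.

    The thresholds are assumptions for illustration, not AWS guidance.
    """
    p95 = percentile(cpu_samples, 95)
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"


if __name__ == "__main__":
    mostly_idle = [10.0] * 95 + [90.0] * 5  # hypothetical samples with rare spikes
    print(rightsizing_hint(mostly_idle))
```

Using a high percentile rather than the mean keeps regular peaks in view while ignoring the rarest spikes.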

Cost Optimization

Avoid unnecessary costs and understand where money is going.

Practice | What It Means
Understand your costs | Cost Explorer, tags, budgets
Adopt consumption models | Pay for what you use (Lambda, DynamoDB on-demand, Fargate)
Measure efficiency | Cost per transaction, cost per user
Stop spending on undifferentiated heavy lifting | Use managed services instead of self-hosting
Analyze and attribute | Tag everything, track cost per team/project

See Cost Management for detailed strategies.
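The "analyze and attribute" practice depends on consistent tagging. As a rough sketch, the function below groups cost line items by a tag and shows how much spend is unattributed. The record shape (`cost` and `tags` keys) and the `team` tag key are hypothetical; real billing exports have their own schema.

```python
from collections import defaultdict
from typing import Any


def cost_by_tag(line_items: list[dict[str, Any]],
                tag_key: str = "team") -> dict[str, float]:
    """Sum cost per value of `tag_key`; items without the tag go to 'UNTAGGED'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical line items; amounts are in dollars.
    items = [
        {"resource": "ec2-a", "cost": 10.0, "tags": {"team": "payments"}},
        {"resource": "rds-b", "cost": 5.5, "tags": {"team": "payments"}},
        {"resource": "ebs-c", "cost": 2.0, "tags": {}},
    ]
    for owner, total in sorted(cost_by_tag(items).items()):
        print(f"{owner}: ${total:.2f}")
```

A growing UNTAGGED bucket is usually the first signal that tagging policy is not being enforced.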

Action | Typical Savings
Delete unused EBS volumes and snapshots | Immediate
Right-size EC2 instances (Compute Optimizer) | 10–30%
Savings Plans for steady workloads | 30–72%
Spot Instances for fault-tolerant work | Up to 90%
S3 lifecycle policies | 40–80% on storage
VPC endpoints for S3/DynamoDB | Eliminate NAT Gateway data charges

Design principles: Implement cloud financial management. Adopt a consumption model. Measure overall efficiency. Stop spending money on undifferentiated heavy lifting. Analyze and attribute expenditure.
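One point worth working through is that the savings percentages in the table compound rather than add when applied to the same spend. The sketch below applies successive reductions to a bill. The $10,000 compute bill and the choice of 20% (within the right-sizing range) followed by 30% (the low end of the Savings Plans range) are illustrative assumptions.

```python
def remaining_after(spend: float, reductions: list[float]) -> float:
    """Spend left after applying each fractional reduction in sequence."""
    for r in reductions:
        if not 0 <= r <= 1:
            raise ValueError("each reduction must be a fraction between 0 and 1")
        spend *= 1 - r
    return spend


def combined_saving_pct(reductions: list[float]) -> float:
    """Overall percentage saved by applying the reductions in sequence."""
    return (1 - remaining_after(1.0, reductions)) * 100


if __name__ == "__main__":
    monthly_compute = 10_000.0   # hypothetical bill in dollars
    steps = [0.20, 0.30]         # right-size first, then commit to a Savings Plan
    print(f"remaining: ${remaining_after(monthly_compute, steps):,.0f}")
    print(f"combined saving: {combined_saving_pct(steps):.0f}%")
```

Right-sizing before committing to a Savings Plan matters for the same reason: the commitment is then sized against the smaller, corrected baseline.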

Sustainability

Minimize the environmental impact of running cloud workloads.

Practice | What It Means
Understand your impact | AWS Customer Carbon Footprint Tool
Right-size | Don’t over-provision; idle resources waste energy
Use efficient resources | Graviton (ARM) instances consume less power
Maximize utilization | Consolidate workloads, use Spot and serverless
Reduce data movement | Cache at the edge, compress data, minimize cross-region transfers

Design principles: Understand your impact. Establish sustainability goals. Maximize utilization. Adopt more efficient hardware and software. Reduce the downstream impact of cloud workloads.

AWS provides a Well-Architected Tool in the console to run a structured review of your workload:

  1. Define the workload: name, description, environment, AWS accounts.
  2. Answer questions: about 60 questions across the six pillars.
  3. Review findings: each answer produces a risk level (High, Medium, None).
  4. Create an improvement plan: a prioritized list of actions to address high-risk items.

The same review can be started from the AWS CLI:

# List workloads
aws wellarchitected list-workloads

# Create a workload for review
# (--description is a required parameter; the text here is a placeholder)
aws wellarchitected create-workload \
  --workload-name "My Production App" \
  --description "Production web application" \
  --environment PRODUCTION \
  --lenses wellarchitected \
  --aws-regions us-east-1

When to run a review:

Trigger | Why
New workload launch | Validate architecture before production
Major changes | New service, migration, scaling event
Annually | Catch drift, adopt new best practices
Post-incident | Identify systemic weaknesses
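Steps 3 and 4 of the review process amount to filtering findings by risk and ordering them. The following is a rough sketch of that step on already-fetched data. The field names (`PillarId`, `QuestionTitle`, `Risk`) are modeled on the improvement summaries returned by the Well-Architected Tool API, but treat the exact shape as an assumption and check the API reference. The sample findings are invented.

```python
RISK_ORDER = {"HIGH": 0, "MEDIUM": 1}


def improvement_plan(findings: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return High- and Medium-risk findings, highest risk first.

    Findings with any other risk value (for example NONE) are dropped.
    Ties are ordered by pillar ID so the output is deterministic.
    """
    actionable = [f for f in findings if f.get("Risk") in RISK_ORDER]
    return sorted(actionable, key=lambda f: (RISK_ORDER[f["Risk"]], f["PillarId"]))


if __name__ == "__main__":
    # Hypothetical findings for illustration only.
    findings = [
        {"PillarId": "security", "QuestionTitle": "Rotate credentials", "Risk": "MEDIUM"},
        {"PillarId": "reliability", "QuestionTitle": "Test failover", "Risk": "HIGH"},
        {"PillarId": "costOptimization", "QuestionTitle": "Tag resources", "Risk": "NONE"},
    ]
    for item in improvement_plan(findings):
        print(f'{item["Risk"]:<6} {item["PillarId"]}: {item["QuestionTitle"]}')
```

Ordering by risk first keeps attention on the high-risk items that the improvement plan is meant to address.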

A real-world architecture that applies all six pillars:

CloudFront (performance, sustainability)
WAF (security)
ALB + Auto Scaling (reliability, performance)
ECS Fargate on Graviton (cost, sustainability, operational)
Aurora Multi-AZ (reliability)
ElastiCache (performance)
S3 with lifecycle (cost)
CloudWatch + X-Ray (operational)
KMS + Secrets Manager (security)
Cost Explorer + Tags + Budgets (cost)

  • The Well-Architected Framework is a checklist, not a prescription: apply the principles that matter most to your workload.
  • The six pillars are: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.
  • Trade-offs exist — optimizing for cost may reduce reliability; optimizing for performance may increase cost. Make conscious decisions.
  • Run a Well-Architected Review before launch, after major changes, and annually.
  • Most principles boil down to: automate everything, encrypt everything, tag everything, use managed services, deploy across AZs, and monitor continuously.