Last year I looked at our AWS bill and realized we were spending $14,000/month on EC2 for workloads that could tolerate interruption. Stateless API servers behind a load balancer, CI/CD runners, batch data processing — all running On-Demand. I switched most of them to Spot Instances and our bill dropped to under $2,500. Same workloads, same performance. This guide is everything I learned doing that — the stuff I wish someone had told me on day one.
This is for engineers running compute on AWS who have never seriously tried Spot, or tried once, got interrupted, and gave up. If you’re running stateless applications, containers, batch jobs, or CI/CD pipelines on On-Demand instances, you’re probably overpaying by 60-90%.
What Spot Instances Actually Are (and Why Engineers Avoid Them)
Spot Instances let you use unused EC2 capacity at steep discounts — typically 60-90% off On-Demand pricing. The catch: AWS can reclaim them with a 2-minute warning when it needs the capacity back. That’s it. That’s the entire tradeoff.
Most engineers avoid Spot because of fear. They imagine their production app going down mid-request. But here’s the thing — interruption rates are far lower than people think. According to AWS’s own Spot Instance Advisor, many instance types in popular regions see interruption rates below 5%. Some are under 2%.
Here are real-world interruption frequency ranges from the AWS Spot Instance Advisor (us-east-1, as of mid-2024):
- m5.xlarge: <5% interruption frequency
- c5.xlarge: <5% interruption frequency
- t3.xlarge: <5% interruption frequency
- r5.xlarge: 5-10% interruption frequency
- m5.large: <5% interruption frequency
These numbers vary by region and AZ — always check the Advisor before choosing. The general rule: diversify across instance types and AZs, and your effective interruption impact drops dramatically.
Real Dollar Comparison: On-Demand vs Spot
Let’s do the math with a concrete example. You’re running 10 x m5.xlarge instances in us-east-1, 24/7 for a month (730 hours).
| Pricing Model | Hourly per Instance | Monthly (10 instances) |
|---|---|---|
| On-Demand | $0.192 | $1,401.60 |
| Spot (typical) | ~$0.065 | ~$474.50 |
| Monthly Savings | | ~$927.10 (66%) |
Spot pricing fluctuates, but m5.xlarge in us-east-1 commonly runs between $0.05 and $0.08/hr. That’s a savings of over $11,000/year — on just 10 instances. If you diversify across m5, m5a, m5n, and m4 families (which I’ll show you how to do), you’ll get even more consistent pricing and availability.
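The table’s numbers are easy to reproduce yourself. Here is the arithmetic as a quick shell sketch, using the example rates from the table (not live prices):

```bash
# Monthly cost = hourly rate x 730 hours x 10 instances (example rates from the table)
on_demand=$(awk 'BEGIN { printf "%.2f", 0.192 * 730 * 10 }')
spot=$(awk 'BEGIN { printf "%.2f", 0.065 * 730 * 10 }')
savings=$(awk -v a="$on_demand" -v b="$spot" 'BEGIN { printf "%.2f", a - b }')
pct=$(awk -v a="$on_demand" -v s="$savings" 'BEGIN { printf "%.0f", 100 * s / a }')
echo "On-Demand: \$${on_demand}/mo, Spot: \$${spot}/mo, savings: \$${savings} (${pct}%)"
```

Swap in the current Spot price from the command below to get your own numbers.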
Check current Spot pricing yourself:
```bash
aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --region us-east-1 \
  --query 'SpotPriceHistory[*].{AZ:AvailabilityZone,Price:SpotPrice,Time:Timestamp}' \
  --output table
```
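That command returns one row per Availability Zone, so picking the cheapest pool is a one-liner. A sketch with sort on sample output (the prices below are made up for illustration — in practice, feed in the real CLI output):

```bash
# Sample "AZ price" pairs, shaped like the table query above reports them
# (made-up prices for illustration)
cheapest=$(printf '%s\n' \
  "us-east-1a 0.0712" \
  "us-east-1b 0.0523" \
  "us-east-1c 0.0689" |
  sort -k2 -g | head -n1)
echo "cheapest pool: $cheapest"
```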
Architectures That Thrive on Spot (and Ones That Don’t)
Perfect for Spot:
- Stateless web/API servers behind an ALB with Auto Scaling — if one instance dies, others absorb traffic while a replacement launches in seconds.
- CI/CD runners (GitHub Actions self-hosted, GitLab runners, Jenkins agents) — jobs are short-lived and retryable.
- Batch processing — MapReduce, video transcoding, data pipelines. Checkpointing makes interruption painless.
- Containerized workloads on ECS or EKS — orchestrators reschedule containers automatically on healthy nodes.
- Dev/test environments — nobody cares if your staging server blips for 30 seconds.
Risky or bad for Spot:
- Databases (RDS, self-managed MySQL/Postgres) — losing a primary mid-transaction is catastrophic.
- Stateful singleton services — anything where exactly one instance must be running at all times with no interruption.
- Long-running HPC jobs without checkpointing — losing 47 hours of a 48-hour simulation is painful.
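The checkpointing point deserves a concrete sketch. The idea: persist progress after each unit of work, so a replacement instance resumes instead of restarting from zero. A minimal bash version (the item count and checkpoint path are illustrative):

```bash
#!/bin/bash
# Resumable batch loop: record the last completed item so a replacement
# Spot instance picks up where the interrupted one left off.
CHECKPOINT="${CHECKPOINT:-/tmp/batch.checkpoint}"
TOTAL=10

last_done=0
[ -f "$CHECKPOINT" ] && last_done=$(cat "$CHECKPOINT")

for i in $(seq $((last_done + 1)) "$TOTAL"); do
  echo "processing item $i"      # stand-in for real work
  echo "$i" > "$CHECKPOINT"      # durable progress marker (use S3/EFS in production)
done
echo "batch complete at item $(cat "$CHECKPOINT")"
```

Run it, kill it midway, run it again: it continues from the last checkpointed item instead of redoing finished work.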
How the 2-Minute Interruption Notice Works
When AWS reclaims a Spot Instance, it does two things: it posts an entry to the instance metadata service and emits an EventBridge event. You get exactly 2 minutes to gracefully shut down.
Your instance can poll the metadata endpoint (if your instance enforces IMDSv2, fetch a session token first and pass it in an X-aws-ec2-metadata-token header):

```bash
curl -s http://169.254.169.254/latest/meta-data/spot/instance-action
```

If no interruption is pending, this returns a 404. When interruption is imminent, it returns JSON like:

```json
{"action": "terminate", "time": "2024-07-15T14:02:00Z"}
```
Here’s a real, working bash script that polls for the interruption notice and handles graceful shutdown. It fetches an IMDSv2 session token on each poll, so it works whether or not the instance enforces IMDSv2:

```bash
#!/bin/bash
# spot-interruption-handler.sh
# Run this as a systemd service or background process on your Spot instances.
METADATA_URL="http://169.254.169.254/latest/meta-data/spot/instance-action"
TOKEN_URL="http://169.254.169.254/latest/api/token"
POLL_INTERVAL=5

echo "$(date): Spot interruption handler started. Polling every ${POLL_INTERVAL}s..."

while true; do
  # IMDSv2 session token (refreshed each loop)
  TOKEN=$(curl -s -X PUT "$TOKEN_URL" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  HTTP_CODE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    -o /tmp/spot-action.json -w "%{http_code}" "$METADATA_URL" 2>/dev/null)

  if [ "$HTTP_CODE" -eq 200 ]; then
    ACTION=$(cat /tmp/spot-action.json)
    echo "$(date): SPOT INTERRUPTION NOTICE RECEIVED: $ACTION"

    # Step 1: Deregister from load balancer (if applicable)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    echo "$(date): Deregistering instance $INSTANCE_ID from target group..."
    # Uncomment and set your target group ARN:
    # aws elbv2 deregister-targets \
    #   --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
    #   --targets Id=$INSTANCE_ID

    # Step 2: Stop accepting new work
    echo "$(date): Sending SIGTERM to application..."
    pkill -TERM -f "my-application-process"

    # Step 3: Wait for in-flight requests to drain (up to 90 seconds)
    sleep 90

    # Step 4: Push any final logs/metrics
    echo "$(date): Flushing logs..."
    # Example: sync logs to S3
    # aws s3 cp /var/log/myapp/ s3://my-log-bucket/final-logs/ --recursive

    echo "$(date): Graceful shutdown complete. Instance will be terminated by AWS."
    exit 0
  fi
  sleep "$POLL_INTERVAL"
done
```
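The script’s header says to run it as a systemd service; a minimal unit file looks like this (the unit name and install path are examples, not an AWS convention). The snippet writes the file to the current directory so you can inspect it before installing:

```bash
# Write a minimal systemd unit for the handler (name and paths are examples)
cat > spot-handler.service <<'EOF'
[Unit]
Description=Spot interruption handler
After=network-online.target

[Service]
ExecStart=/usr/local/bin/spot-interruption-handler.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
echo "wrote spot-handler.service"
# Install with:
#   sudo cp spot-handler.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl enable --now spot-handler
```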
For container workloads on ECS or EKS, AWS handles much of this natively. ECS Spot draining is built in — enable ECS_ENABLE_SPOT_INSTANCE_DRAINING=true in your container agent config, and ECS automatically sets tasks to DRAINING when the interruption notice arrives.
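If you’d rather react off-instance, the same warning arrives as an EventBridge event with detail-type "EC2 Spot Instance Interruption Warning". A sketch of creating a matching rule (the rule name is an example; the put-rule call needs AWS credentials, so it’s left commented out):

```bash
# Event pattern that matches the 2-minute Spot interruption warning
PATTERN='{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'

# Sanity-check the JSON locally before creating the rule
echo "$PATTERN" | python3 -m json.tool > /dev/null && echo "pattern ok"

# Then create the rule and point a target (Lambda, SNS, ...) at it:
# aws events put-rule --name spot-interruption-warning --event-pattern "$PATTERN"
```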
Spot With Auto Scaling Groups: The Right Way
The single most important thing: use a mixed instances policy across multiple instance types and Availability Zones. This is how you make Spot reliable.
Here’s the AWS CLI command to create a mixed-instance ASG:
```bash
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-spot-asg \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5a.xlarge"},
        {"InstanceType": "m5n.xlarge"},
        {"InstanceType": "m4.xlarge"},
        {"InstanceType": "m5d.xlarge"},
        {"InstanceType": "c5.xlarge"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  }' \
  --min-size 6 \
  --max-size 20 \
  --desired-capacity 10 \
  --availability-zones "us-east-1a" "us-east-1b" "us-east-1c"
```
What this does:
- OnDemandBaseCapacity: 2 — always keep 2 On-Demand instances as a baseline (your safety net).
- OnDemandPercentageAboveBaseCapacity: 20 — of the remaining 8 instances, 20% (about 2) are On-Demand, 80% (about 6) are Spot.
- SpotAllocationStrategy: capacity-optimized — AWS picks Spot pools with the most available capacity, minimizing interruptions. This is almost always what you want over lowest-price.
- 6 instance type overrides — gives AWS maximum flexibility to find capacity.
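The On-Demand/Spot split is easy to sanity-check. At a desired capacity of 10 with a base of 2 and 20% above base (note that AWS rounds the actual counts to whole instances, so ~1.6 becomes the "about 2" On-Demand above):

```bash
# Sanity-check the instance distribution from the policy above
desired=10; base=2; pct=20
above=$((desired - base))   # instances subject to the percentage split
od_above=$(awk -v a="$above" -v p="$pct" 'BEGIN { printf "%.1f", a * p / 100 }')
spot=$(awk -v a="$above" -v o="$od_above" 'BEGIN { printf "%.1f", a - o }')
echo "above base: $above, On-Demand above base: ~$od_above, Spot: ~$spot"
```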
To request a single Spot Instance directly (useful for batch jobs or testing):
```bash
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type m5.xlarge \
  --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time","InstanceInterruptionBehavior":"terminate"}}' \
  --key-name my-key-pair \
  --security-group-ids sg-0abc123def456 \
  --subnet-id subnet-0abc123 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=spot-test}]'
```
Common Mistakes That Burn Engineers
1. Using a single instance type. If you request only m5.xlarge Spot, you’re competing with everyone else who wants m5.xlarge. When that pool runs out, you get interrupted. Always specify at least 4-6 instance types with similar vCPU/RAM profiles.
2. Running in a single Availability Zone. Spot capacity varies by AZ. Spreading across 3 AZs dramatically reduces the chance of all your instances being reclaimed simultaneously.
3. Not handling the interruption signal. If you don’t poll the metadata endpoint or listen to EventBridge, your app gets killed mid-request with no cleanup. Implement the handler. It takes 20 minutes and saves you from mystery outages.
4. Using lowest-price allocation strategy. It sounds smart but it concentrates your instances in the cheapest (and most contested) pool. Use capacity-optimized instead — it picks pools with the most spare capacity, which means fewer interruptions.
5. No On-Demand baseline. Running 100% Spot with zero On-Demand fallback means a regional Spot shortage can take your entire fleet offline. Always keep an On-Demand base — even 10-20% — as your safety net.
Conclusion
Spot Instances are the single highest-impact cost optimization available on AWS, and most teams ignore them out of unfounded fear. The interruption risk is real but manageable — and for the right workloads, the 60-90% savings are transformative.
- Spot is ideal for stateless apps, batch jobs, CI/CD runners, and containerized workloads — avoid it for databases and stateful singletons.
- Always diversify across at least 4-6 instance types and 3 Availability Zones with capacity-optimized allocation.
- Handle the 2-minute interruption notice — poll the metadata endpoint or use EventBridge, and drain gracefully.
- Use mixed instances policies in ASGs with an On-Demand baseline to guarantee minimum capacity.
- Real savings are massive: 10 x m5.xlarge drops from ~$1,400/month to ~$475/month — that’s over $11,000/year back in your budget.
Found this helpful? Share it with your team. For more practical AWS and DevOps guides, visit riseofcloud.com.