Using Fargate Spot in Production Without Getting Burned (and Still Save ~60%)

In a recent consulting project, I needed to cut a client’s cloud bill without sacrificing reliability. I ended up reducing compute costs by around 60% using AWS Fargate Spot, with no loss of availability. Here’s how I did it.

Why Fargate Spot?

Fargate Spot runs your tasks on spare AWS capacity at up to ~70% off the standard rate, which makes it a great way to optimise compute costs. The catch? AWS can reclaim Spot capacity at any time, with only a two-minute warning.

You might expect the capacity provider strategy to automatically rebalance workloads between regular Fargate and Fargate Spot. It doesn’t. If AWS reclaims your Spot capacity, the service won’t launch replacement tasks on regular Fargate to compensate. It simply runs under capacity until Spot capacity is available again.

“I thought ECS would fallback to regular Fargate when Spot fails, but nope — it just silently drops tasks. Had to learn the hard way.” — https://www.reddit.com/r/aws/comments/147hwx0/ecs_capacity_providers_fargate_and_fargate_spot/

“Fargate doesn’t replace Spot capacity with on-demand capacity.” — https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html

The Naive Approach

One common way to work around this is to use CloudWatch alarms and Lambda functions to detect low task counts and then dynamically shift the capacity provider strategy.

“I’ve built a Lambda to detect task count drops and flip the service to regular Fargate. It works, but testing it is rough.” — https://gist.github.com/ahmadnassri/1be7a3910c7cf56e65d25b377731e3f1

Testing this reliably is difficult. Spot interruptions are inherently unpredictable and infrequent, which makes it hard to validate whether your failover logic really works. Even AWS acknowledges this limitation in its documentation:

“Fargate Spot runs on spare capacity and there might be no availability at times, which makes testing fallback solutions difficult.” — https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html

The Dual-Service Solution

To overcome these limitations, I devised a more robust solution: run two separate ECS Fargate services behind the same load balancer.

- One service runs entirely on Fargate Spot
- The other runs on regular Fargate

With the auto-scaling policies tuned appropriately, the Spot service handles the majority of traffic. If Spot capacity is reclaimed, the regular Fargate service is already running as a warm backup and scales out to absorb the load.

This design doesn’t rely on custom code reacting to interruption events. The fallback is ordinary auto scaling, which is running all the time anyway.

How It Works

1. Shared Task Definition

Both services share the same TaskDefinition:

TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    RequiresCompatibilities: [FARGATE]
    Cpu: !Ref Cpu
    Memory: !Ref Memory
    NetworkMode: awsvpc
    ...
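
The container definitions are elided above, but they are one place to prepare for Spot interruptions: ECS sends SIGTERM to a task’s containers when the task is being stopped, and `StopTimeout` controls how long it waits before sending SIGKILL. A hedged sketch of what that part might look like, assuming a container named `app` and an `Image` parameter (both hypothetical, not taken from the original template):

```yaml
# Fragment of TaskDefinition.Properties. The container name "app" and the
# Image parameter are assumed names for illustration only.
ContainerDefinitions:
  - Name: app
    Image: !Ref Image
    Essential: true
    StopTimeout: 120   # seconds between SIGTERM and SIGKILL; 120 is the Fargate maximum
```

The application still has to handle SIGTERM itself (stop accepting new work, finish in-flight requests). The timeout only gives it room to do so.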

2. Spot-Only Service

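SpotService:
  Type: AWS::ECS::Service
  Properties:
    TaskDefinition: !Ref TaskDefinition  # same definition as the on-demand service
    CapacityProviderStrategy:
      - CapacityProvider: FARGATE_SPOT   # every task in this service runs on Spot
        Weight: 1
    DesiredCount: !Ref MinCapacity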

3. On-Demand Backup Service

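OnDemandService:
  Type: AWS::ECS::Service
  Properties:
    TaskDefinition: !Ref TaskDefinition  # same definition as the Spot service
    CapacityProviderStrategy:
      - CapacityProvider: FARGATE        # regular (on-demand) Fargate only
        Weight: 1
    DesiredCount: 1                      # one warm task at all times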
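
Neither service snippet shows the load balancer wiring. One way to do it, assuming both services register their tasks in a single shared target group (the text above only says "the same load balancer", so this is an assumption), with the hypothetical names `TargetGroup`, `Listener`, `VpcId`, and a container called `app` on port 8080:

```yaml
# Assumed names (not from the original template): TargetGroup, Listener,
# VpcId, and a container "app" listening on port 8080.
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetType: ip          # awsvpc tasks register by IP address
    Protocol: HTTP
    Port: 8080
    VpcId: !Ref VpcId

SpotService:
  Type: AWS::ECS::Service
  DependsOn: Listener       # the target group must be attached to a listener first
  Properties:
    # ...capacity provider strategy and desired count as in the Spot-only service...
    LoadBalancers:
      - TargetGroupArn: !Ref TargetGroup   # shared with OnDemandService
        ContainerName: app
        ContainerPort: 8080

OnDemandService:
  Type: AWS::ECS::Service
  DependsOn: Listener
  Properties:
    # ...capacity provider strategy and desired count as in the on-demand service...
    LoadBalancers:
      - TargetGroupArn: !Ref TargetGroup   # shared with SpotService
        ContainerName: app
        ContainerPort: 8080
```

With this wiring the load balancer sees one pool of tasks, and a reclaimed Spot task simply drops out of rotation.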

4. Smart Scaling

Each service gets its own ScalableTarget and ScalingPolicy. Here’s an example for CPU scaling:

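SpotCpuScaling:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: spot-cpu-target-tracking  # illustrative name
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref SpotScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 50.0  # More aggressive: scales out first

FargateCpuScaling:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: fargate-cpu-target-tracking  # illustrative name
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref FargateScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 80.0  # Higher threshold: scales out only under pressure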

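With both services’ tasks taking a share of the same traffic, their CPU utilisation tends to move together. The Spot service scales out at 50%, so in normal operation utilisation rarely gets near 80% and the Fargate service stays at its minimum. When Spot tasks are reclaimed, the remaining tasks get busier, utilisation climbs past 80%, and the Fargate service scales out to cover the gap.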
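
The policies reference `SpotScalingTarget` and `FargateScalingTarget`, which aren’t shown. A minimal sketch of what they might look like, assuming a `Cluster` resource or parameter that resolves to the cluster name and a `MaxCapacity` parameter (both are assumptions, not taken from the original template):

```yaml
# Cluster and MaxCapacity are assumed names for illustration only.
SpotScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: !Sub service/${Cluster}/${SpotService.Name}
    MinCapacity: !Ref MinCapacity
    MaxCapacity: !Ref MaxCapacity

FargateScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: !Sub service/${Cluster}/${OnDemandService.Name}
    MinCapacity: 1                 # the warm backup task
    MaxCapacity: !Ref MaxCapacity  # high enough to carry the full load alone
```

Giving the on-demand target the same ceiling as the Spot one matters: if Spot capacity disappears entirely, the backup service has to be allowed to grow to full size.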

Results

This setup has now been running for 6+ months in a production environment. The result?

- ~60% cost savings on compute
- No downtime due to Spot interruptions
- No Lambda glue code or runtime logic
- Easily testable, reproducible deployment via CloudFormation

It’s simple, robust, and safe — and you can deploy it in any ECS cluster with minimal changes.

Try It Yourself

You can adopt this strategy by copying and adapting this CloudFormation template. It uses:

- Dual ECS services with shared task definition
- Tuned CPU auto-scaling policies for service balancing
- Log configuration to combine logs from both services
- Private subnet network configuration

All production-ready and reusable.
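
Combined logging mostly falls out of the shared task definition: tasks from both services write to whatever the container’s log configuration names. A sketch using the `awslogs` driver, assuming a log group with the hypothetical logical ID `LogGroup`, a container named `app`, and an `Image` parameter (none of which come from the original template):

```yaml
# Top-level resource; LogGroup is an assumed logical ID.
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    RetentionInDays: 30   # arbitrary example value

# Fragment of TaskDefinition.Properties (the shared task definition).
# "app" and Image are assumed names.
ContainerDefinitions:
  - Name: app
    Image: !Ref Image
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-group: !Ref LogGroup          # one group for both services
        awslogs-region: !Ref AWS::Region
        awslogs-stream-prefix: app            # a stream prefix is required on Fargate
```

Each task still gets its own log stream, so Spot and on-demand tasks can be told apart by task ID within the single group.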

Final Thoughts

Fargate Spot is underused because of its unpredictability, but this architecture shows that with a bit of redundancy and the right scaling strategy, it’s possible to build cost-optimised and resilient workloads without complexity.

Got questions or want to share how you’re using Spot? Drop me a note!

© 2025 Matt Blackford.
Text content is licensed under CC BY-NC-ND 4.0.
Code snippets are licensed under the MIT License.