Case Study: From 3-Week Releases to Daily Deploys — E-Commerce DevOps Overhaul

Client Overview

The Client

Confidential — a Series C e-commerce platform in the lifestyle and fashion vertical. At the time of engagement: 50 million monthly active users, $200M annualised GMV, 180-person engineering organisation split across 12 product squads, and a monolithic Rails application in the middle of a microservices migration.

The client had been on AWS for 6 years. Their cloud footprint was sprawling: 340 EC2 instances (50% with no auto-scaling), 4 EKS clusters across environments, multiple RDS instances with no read replica strategy, and a $280,000/month AWS bill that nobody had audited in 18 months.

AWS · EKS · ArgoCD · GitHub Actions · Datadog · Terraform · PostgreSQL RDS · Redis ElastiCache · LaunchDarkly · Karpenter · ArgoRollouts

The Challenge

Three Problems Killing Velocity

Problem 1: Release Paralysis

Every release was a 3-week project. Feature freeze 2 weeks before ship date. Manual regression testing across 6 environments. Release night: 8 engineers on a Zoom call from 10pm to 3am, rolling back half the time. In the 6 months before our engagement, 40% of releases required a hotfix within 48 hours.

Problem 2: Reliability Crisis

3 major outages in 6 months. Two were database connection pool exhaustion. One was a bad deployment that took down the payment service for 4 hours during a promotional event — $340K in lost GMV in a single afternoon. MTTR (mean time to recovery) averaged 2.4 hours. No runbooks. No SLOs. No automated rollback.

Problem 3: $280K/Month AWS Bill with 60% Waste

A preliminary audit found 60% of the AWS spend was waste: over-provisioned EC2 instances averaging 12% CPU utilisation, 85 idle RDS read replicas never promoted, NAT Gateway processing $42K/month of traffic that could go through VPC endpoints, and dev/staging environments identical to production running 24/7. Nobody owned FinOps.

The CTO's mandate was unambiguous: ship features at the pace competitors do, stop the outages, and get the AWS bill below $120K/month — without slowing down the 12 product squads. We had 14 weeks.

Solution Architecture

The Architecture

The solution had three pillars: GitOps for deployment safety, EKS optimisation for reliability and cost, and observability-driven incident response. Everything was designed to be operated by the existing engineering team after handover — no proprietary tooling, no vendor lock-in beyond AWS itself.

┌──────────────────────────────────────────────────────────────────────┐
│                         Developer Workflow                           │
│  GitHub PR → GitHub Actions (build+test) → ArgoCD (GitOps deploy)  │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
           ┌────────────────┴───────────────┐
           │                                │
   ┌───────▼───────┐               ┌────────▼────────┐
   │  Staging EKS  │               │ Production EKS  │
   │  (auto-scale  │               │  3 AZs, mixed   │
   │   to zero at  │               │  Spot/On-demand │
   │   night/wknd) │               │  Karpenter      │
   └───────────────┘               └────────┬────────┘
                                            │
              ┌─────────────────────────────┼─────────────────────────┐
              │                             │                          │
    ┌─────────▼────────┐       ┌────────────▼───────┐    ┌────────────▼──────┐
    │ PostgreSQL RDS   │       │ Redis ElastiCache  │    │   Datadog Agent   │
    │ Multi-AZ primary │       │ Cluster mode,      │    │   APM + Logs +    │
    │ + 2 read replicas│       │ 3 shard groups     │    │   SLOs + Alerts   │
    └──────────────────┘       └────────────────────┘    └───────────────────┘
                                            │
                               ┌────────────▼──────────────┐
                               │   ArgoRollouts: Canary     │
                               │   0% → 10% → 50% → 100%  │
                               │   Auto-rollback on error   │
                               └───────────────────────────┘

The key architectural decisions: ArgoCD for GitOps (the Git repository is the single source of truth for what's deployed); ArgoRollouts for canary deployments with automatic rollback based on Datadog SLO breach; Karpenter replacing the previous Cluster Autoscaler for bin-packing and Spot instance management; and LaunchDarkly for feature flags to decouple deployments from feature releases.
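
The canary-with-rollback behaviour can be illustrated with a small simulation. This is a sketch of the control logic only, not the ArgoRollouts configuration used in the engagement: the function names, the `error_rate_at` callback, and the 0.1% threshold (borrowed from the error-rate SLO target) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    final_weight: int                  # % of traffic on the new version at the end
    rolled_back: bool                  # True if the rollout was aborted
    history: List[int] = field(default_factory=list)  # weights applied, in order


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> RolloutResult:
    """Walk through the traffic weights; abort to 0% on the first breach."""
    history: List[int] = []
    for weight in steps:
        history.append(weight)
        # Hypothetical metric check: in practice this would query the
        # monitoring system for the canary's observed error rate.
        if error_rate_at(weight) > max_error_rate:
            history.append(0)  # shift all traffic back to the stable version
            return RolloutResult(final_weight=0, rolled_back=True, history=history)
    final = history[-1] if history else 0
    return RolloutResult(final_weight=final, rolled_back=False, history=history)


if __name__ == "__main__":
    steps = [10, 50, 100]  # the canary steps shown in the diagram
    healthy = run_canary(steps, lambda w: 0.0005, max_error_rate=0.001)
    broken = run_canary(steps, lambda w: 0.02 if w >= 50 else 0.0005, max_error_rate=0.001)
    print(healthy)
    print(broken)
```

The point of the sketch is that a bad release never reaches the final step: the breach at 50% sends traffic back to 0% without anyone having to make the call.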

Execution

Three Phases, 14 Weeks

Phase 1 · Weeks 1–4

Foundation: Infrastructure as Code & CI Pipelines

Full infrastructure audit: mapped every EC2, RDS, ElastiCache, and NAT Gateway resource. Found 85 "zombie" instances not connected to any live service.
Migrated all infrastructure to Terraform using modules — VPC, EKS clusters, RDS, ElastiCache, IAM. Every resource tagged: team, environment, cost-centre.
Replaced Jenkins (a 6-year-old self-hosted instance nobody wanted to maintain) with GitHub Actions. Built reusable workflows for Node, Python, and Go services.
Established trunk-based development: feature branches merge to main daily, release branches for hotfixes only. Eliminated 2-week feature freeze periods.
Deployed ArgoCD: each team got their own ArgoCD Application, scoped to their namespace. GitOps for staging was live by end of week 4.
Deleted 85 zombie instances and 40 orphaned EBS volumes: immediate AWS cost saving of $31,000/month.
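
The tagging rule from this phase (team, environment, cost-centre on every resource) lends itself to an automated check. The following is a minimal sketch, assuming resources have already been exported as a mapping of resource ID to tag dictionary; the resource IDs and tag values are made up, and this is not the tooling used in the engagement.

```python
from typing import Dict, List

# The three tags the case study says every resource carries.
REQUIRED_TAGS = ("team", "environment", "cost-centre")


def missing_tags(resource_tags: Dict[str, str]) -> List[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key, "").strip()]


def audit(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource ID to the tags it is missing."""
    report: Dict[str, List[str]] = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {
        "i-0aaa": {"team": "checkout", "environment": "prod", "cost-centre": "cc-12"},
        "i-0bbb": {"team": "search", "environment": "staging"},
        "vol-0ccc": {},
    }
    for resource_id, gaps in audit(inventory).items():
        print(f"{resource_id}: missing {', '.join(gaps)}")
```

A resource with no tags at all, like the last entry, is also a reasonable first candidate when hunting for zombies.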

Phase 2 · Weeks 5–9

EKS Optimisation & Database Reliability

Replaced 4 static node groups (all m5.2xlarge on-demand) with Karpenter NodePools. Mixed Spot (70%) / On-demand (30%) across c6g.xlarge, m6g.xlarge, and m6a.xlarge instance types.
All stateless services migrated to Graviton (arm64) nodes: 20% compute cost reduction, no performance regression.
Set resource requests/limits on all 890 pod specs using VPA recommendations after 7-day baseline collection.
Deployed Horizontal Pod Autoscalers on all stateless services with custom metrics (requests per second via KEDA, plus SQS queue depth for workers).
Scaled dev/staging to zero from 7pm to 8am on weekdays and throughout weekends via CronJob. Saving: ~$18K/month.
PostgreSQL: routed read traffic to the 2 read replicas that had been provisioned but never used. Configured PgBouncer connection pooling, which eliminated the connection pool exhaustion that caused 2 of the 3 outages.
Redis: migrated from 1 large instance to Cluster mode with 3 shard groups. No more single-point-of-failure on cache.
Added VPC endpoints for S3, ECR, DynamoDB, SSM — NAT Gateway traffic reduced 74%. Saving: ~$31K/month.
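
The dev/staging scale-to-zero window described above can be expressed as a simple predicate. This sketch assumes a naive local-time clock and that the 8am boundary counts as working time; the function names are illustrative, and the actual CronJob schedule used in the engagement is not shown in this case study.

```python
from datetime import datetime, timedelta


def is_off_hours(now: datetime) -> bool:
    """True when dev/staging should be scaled to zero.

    Off-hours are 7pm to 8am on weekdays plus all of Saturday and Sunday.
    """
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return now.hour >= 19 or now.hour < 8


def off_hours_fraction() -> float:
    """Share of a full week (168 hours) spent in the off-hours window."""
    start = datetime(2024, 1, 1)  # a Monday, used only as a reference week
    hours = [start + timedelta(hours=h) for h in range(7 * 24)]
    return sum(is_off_hours(h) for h in hours) / len(hours)


if __name__ == "__main__":
    print(is_off_hours(datetime(2024, 1, 1, 12, 0)))  # Monday midday
    print(is_off_hours(datetime(2024, 1, 6, 12, 0)))  # Saturday midday
    print(f"{off_hours_fraction():.1%} of the week is off-hours")
```

Under these assumptions the environments are idle for 113 of the 168 hours in a week, roughly two thirds of the time, which is why a schedule this simple produces a meaningful saving.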

Phase 3 · Weeks 10–14

GitOps Cutover, Observability & Team Enablement

ArgoRollouts deployed for all production services: canary strategy with automated Datadog SLO checks. If error rate rises above baseline during a rollout, ArgoRollouts auto-rolls back without human intervention.
LaunchDarkly feature flags decoupled deployments from feature releases. Teams can merge and deploy daily; feature flags control who sees what. Eliminated the concept of "release night."
Datadog SLOs configured for every production service: availability (99.95% target), latency p99 (200ms), and error rate (0.1%). Alerts page the on-call engineer rather than flooding Slack.
Runbooks written for all top-10 alert scenarios. Linked directly from Datadog alerts. Engineers know exactly what to do in the first 10 minutes of any incident.
OpenCost deployed: each squad can see their real-time AWS cost per namespace, per day. Monthly FinOps review added to engineering all-hands.
Delivered 3-day GitOps training workshop for all 12 squad leads. Hands-on: write an ArgoCD Application, configure a canary rollout, trigger and observe an auto-rollback.
Handed over complete runbooks, architecture diagrams, and Terraform modules. Self-sufficient from day 1 post-engagement.
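
To make the 99.95% availability target concrete, it helps to convert it into an error budget. The sketch below assumes a 30-day rolling window, which is a common choice but is not stated in this case study; the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed(downtime_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the error budget used by the given downtime (1.0 = all of it)."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(0.9995)
    print(f"Budget at 99.95% over 30 days: {budget:.1f} minutes")
    # MTTR figures from the results table: 11 minutes after, 2.4 hours before.
    print(f"One 11-minute incident uses {budget_consumed(11, 0.9995):.0%} of the budget")
    print(f"One 2.4-hour incident uses {budget_consumed(2.4 * 60, 0.9995):.0%} of the budget")
```

Under the 30-day assumption the budget is about 21.6 minutes, so a single incident at the old 2.4-hour recovery time would overspend it several times over, while one at 11 minutes consumes about half.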

Results

Before & After

Metric	Before	After	Change
Deploy frequency	1 per 3 weeks	8 per day	168× improvement
Release hotfix rate	40% of releases	<3%	93% reduction
MTTR (mean time to recovery)	2.4 hours avg	11 minutes avg	92% faster
Deployment success rate	61%	99.2%	+38 pts
Monthly AWS spend	$280,000	$105,000	63% reduction
Sev-1 incidents	3 in 6 months	0 in 60 days	Eliminated
Node count (production)	87 (all on-demand)	31 avg (70% Spot)	64% fewer nodes
Release night duration	5 hours (10pm–3am)	0 (GitOps, automated)	Eliminated
P99 latency (checkout)	1,240ms	340ms	73% faster

The total AWS saving over the 14-week engagement period was $245,000. Annual run-rate saving at the post-engagement state: $2.1M/year. The engagement fee paid for itself in under 3 weeks of cloud savings, with the engineering velocity improvements (faster releases, eliminated incident cost) representing multiples beyond that.
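
The headline cost figures can be reproduced from the numbers given in this case study. The short calculation below uses only those figures; the split into "itemised" and "not itemised" is our own bookkeeping for illustration, not a breakdown supplied by the client.

```python
MONTHLY_SPEND_BEFORE = 280_000  # USD, from the results table
MONTHLY_SPEND_AFTER = 105_000   # USD, from the results table

# Savings the text attaches a dollar figure to (USD/month; last two are approximate).
ITEMISED_SAVINGS = {
    "zombie instances and orphaned EBS volumes": 31_000,
    "dev/staging scale-to-zero": 18_000,
    "VPC endpoints replacing NAT Gateway traffic": 31_000,
}


def monthly_saving(before: int, after: int) -> int:
    return before - after


def annual_run_rate(monthly: int) -> int:
    return monthly * 12


def percent_reduction(before: int, after: int) -> float:
    return (before - after) / before * 100


saving = monthly_saving(MONTHLY_SPEND_BEFORE, MONTHLY_SPEND_AFTER)
itemised = sum(ITEMISED_SAVINGS.values())
not_itemised = saving - itemised

if __name__ == "__main__":
    print(f"Monthly saving:   ${saving:,}")
    print(f"Annual run-rate:  ${annual_run_rate(saving):,}")
    print(f"Reduction:        {percent_reduction(MONTHLY_SPEND_BEFORE, MONTHLY_SPEND_AFTER):.1f}%")
    print(f"Itemised in text: ${itemised:,}")
    print(f"Not itemised:     ${not_itemised:,}")
```

The itemised line items account for about $80K of the $175K monthly saving. The remainder is not broken down in the text; it plausibly comes from the Phase 2 compute changes (Karpenter, Spot, Graviton, right-sizing), though that attribution is an inference.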

"Before codetoday, every release was a coin flip. We'd spend two weeks preparing and still roll back half the time. Now our engineers merge to main and go home. The canary rollouts catch issues before they affect real users — and when they do, it auto-rolls back before the on-call engineer has even opened their laptop. The AWS saving was a bonus; the velocity change is what's compounding."

— VP of Engineering, Confidential Series C E-Commerce Platform

Key Learnings

What Actually Mattered

1. GitOps trust is cultural before it's technical. ArgoCD was running in week 4. It took until week 11 before squads trusted it enough to stop manually verifying every deployment. The 3-day training workshop was as important as the tooling.

2. Connection pooling prevented 2 of 3 prior outage patterns. The database outages were not PostgreSQL's fault — they were the result of 890 pods all opening direct connections to RDS. PgBouncer, properly configured, eliminated the problem permanently.

3. Feature flags are not optional for high-frequency deployments. Without LaunchDarkly, deploying 8 times a day would mean 8 opportunities to expose incomplete features. Feature flags decouple code from product launch — essential for trunk-based development.

4. Auto-rollback only works if SLOs are pre-defined. ArgoRollouts' automated rollback mechanism is only as good as the Datadog SLOs it checks. Defining those SLOs rigorously — per service, with agreed error budgets — was the most important non-technical task in the engagement.

5. FinOps is a team sport. The $175K/month saving was not achieved by one person doing a cost audit. It was achieved by 12 squads who could see their own costs in OpenCost and took ownership. Visibility drove accountability.
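
Learning 2 is easiest to see as arithmetic. The sketch below compares connection demand with and without a pooler. Only the pod count (890) comes from this case study; the per-pod pool size, the database connection limit, the number of poolers, and the server pool size are hypothetical values chosen for illustration, and the model ignores details such as multiple databases or users.

```python
def direct_connection_demand(pods: int, pool_per_pod: int) -> int:
    """Connections opened when every pod holds its own pool straight to the database."""
    return pods * pool_per_pod


def pooled_connection_demand(poolers: int, server_pool_size: int) -> int:
    """Upper bound on database connections when pods connect through poolers."""
    return poolers * server_pool_size


PODS = 890               # from the case study
POOL_PER_POD = 5         # hypothetical: connections each pod keeps open
MAX_CONNECTIONS = 2_000  # hypothetical database connection limit
POOLERS = 4              # hypothetical number of pooler instances
SERVER_POOL_SIZE = 100   # hypothetical server-side pool per pooler

direct = direct_connection_demand(PODS, POOL_PER_POD)
pooled = pooled_connection_demand(POOLERS, SERVER_POOL_SIZE)

if __name__ == "__main__":
    print(f"Direct: {direct} connections, limit exceeded: {direct > MAX_CONNECTIONS}")
    print(f"Pooled: {pooled} connections, limit exceeded: {pooled > MAX_CONNECTIONS}")
```

The design point is that with direct connections the load on the database grows with the pod count, so every scale-out event moves it closer to the limit. With a pooler in between, the database sees a bounded number of connections regardless of how many pods the autoscaler adds.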

