The AWS Cost Spiral: How SageMaker Zombie Endpoints Quietly Burn $700K/yr (And How to Kill Them)

Ajeet Kumar · Platform Engineering Lead, codetoday.io · May 2025 · 9 min read

It starts with a demo. A model gets deployed to a SageMaker endpoint, the demo goes well, and the ML team moves on to the next sprint. Three months later, nobody remembers the endpoint exists — but it's still running, silently billing $0.736 per hour. Multiply that across dozens of endpoints and three AWS accounts, and you have $58,000 disappearing every month. This is the story of how we found them, killed them safely, and built the system to prevent it from happening again.

The Anatomy of a Zombie Endpoint

A SageMaker zombie endpoint is a real-time inference endpoint in InService status with zero invocations over a sustained period — typically 30 days or more. It looks healthy in the console. It passes all health checks. It is simply doing nothing while the billing meter runs at full speed.

Here is how one is born. An ML engineer is preparing a product demo for stakeholders. They create a SageMaker endpoint with an ml.g4dn.xlarge instance — a sensible choice for a computer vision or NLP model. The demo goes well. Stakeholders are impressed. The project gets greenlit. The engineer starts the real implementation sprint — and the demo endpoint is never mentioned again.

Six months later, the team has trained a better model, deployed it to a new endpoint, and promoted it to production. The demo endpoint still exists. Nobody thinks to delete it because nobody's looking at it. It doesn't appear in any dashboard. It doesn't generate alerts. It just silently processes zero requests at $0.736 per hour.

Do the math: $0.736/hr × 24 hr × 365 days = $6,447 per year for a single ml.g4dn.xlarge endpoint. For larger instances like ml.g5.4xlarge used in LLM serving, that number jumps to $31,000+ per year per zombie.
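
The same arithmetic as a small helper, in case you want to plug in other rates or instance counts. This is a minimal sketch: the function name is ours, and the hourly rates are simply the figures quoted in this article, not values pulled from the AWS pricing API.

```python
HOURS_PER_YEAR = 24 * 365  # an always-on endpoint bills for every hour


def annual_idle_cost(hourly_rate_usd: float, instance_count: int = 1) -> float:
    """Yearly on-demand cost of an endpoint that is never stopped."""
    return round(hourly_rate_usd * HOURS_PER_YEAR * instance_count, 2)


if __name__ == "__main__":
    # Rates as quoted in this article (USD per hour)
    print(annual_idle_cost(0.736))  # ml.g4dn.xlarge
    print(annual_idle_cost(3.553))  # ml.g5.4xlarge
```

An endpoint with two instances behind it costs twice as much, which is why the instance count is a parameter rather than an afterthought.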

The problem compounds because ML teams operate fast. A model improvement cycle might create and abandon 3–4 endpoints per quarter. Across a team of 12 ML engineers working across 3 AWS accounts, you can accumulate 30–40 zombie endpoints in under a year without any malicious intent — just the normal entropy of a fast-moving team with no enforcement mechanisms.

⚠ Warning: SageMaker Multi-Model Endpoints Are Not Immune

A common misconception is that Multi-Model Endpoints (MMEs) solve this problem. They don't. An MME with no invocations is still billed at the full instance rate. We've seen MMEs with 50+ loaded models generating exactly zero predictions while billing $4,000+/month. The zombie problem applies to every endpoint type that keeps capacity provisioned: real-time, multi-model, async (when the auto-scaling minimum is set above zero), and serverless (when provisioned concurrency is configured).

Forensic Investigation: Finding Your Zombies

The fastest manual approach uses two CLI commands in sequence. First, list every InService endpoint. Then query CloudWatch to check whether the Invocations metric sums to zero over the past 30 days.

# Step 1: List all InService endpoints in a region
aws sagemaker list-endpoints \
  --status-equals InService \
  --query "Endpoints[*].{Name:EndpointName,Created:CreationTime,Modified:LastModifiedTime}" \
  --output table \
  --region us-east-1

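# Step 2: Check invocations for a specific endpoint (last 30 days)
# Assumes the variant is named AllTraffic (the SDK default); any other variant name returns no datapoints.
# The date -d syntax is GNU date; on macOS use: date -u -v-30d +%Y-%m-%dT%H:%M:%SZ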
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name Invocations \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
               Name=VariantName,Value=AllTraffic \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 2592000 \
  --statistics Sum \
  --query "Datapoints[*].Sum" \
  --output text

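# If the above returns nothing (no datapoints) or 0.0, the endpoint is a zombie.
# Cross-check with Cost Explorer for the actual billing amount.
# The tag filter only matches if that tag key exists on the endpoint and is
# activated as a cost allocation tag; otherwise the result comes back empty.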
aws ce get-cost-and-usage \
  --time-period Start=2025-04-01,End=2025-05-01 \
  --granularity MONTHLY \
  --filter '{"And":[{"Dimensions":{"Key":"SERVICE","Values":["Amazon SageMaker"]}},{"Tags":{"Key":"aws:sagemaker:endpoint-name","Values":["YOUR_ENDPOINT_NAME"]}}]}' \
  --metrics BlendedCost \
  --output json

The CLI approach works for spot-checking. For fleet-wide detection across dozens or hundreds of endpoints — especially across multiple accounts — you need the automation script below.

The Detection Script: Find Every Zombie in One Run

This Python script uses boto3 to scan every InService SageMaker endpoint in an account, queries CloudWatch for 30-day invocation totals, estimates monthly cost by instance type, and prints a ranked table of zombies sorted by monthly burn.

#!/usr/bin/env python3
"""
sagemaker_zombie_detector.py
Finds SageMaker endpoints with zero invocations in the last 30 days.
Usage: python3 sagemaker_zombie_detector.py --region us-east-1
"""
import argparse
from datetime import datetime, timezone, timedelta
from typing import List, Dict

import boto3
from botocore.exceptions import ClientError

# Approximate hourly cost per instance type (USD, on-demand, us-east-1).
# Check these against current SageMaker pricing before trusting the totals.
INSTANCE_COSTS = {
    "ml.t2.medium": 0.056, "ml.m5.large": 0.134,
    "ml.m5.xlarge": 0.269, "ml.c5.xlarge": 0.238,
    "ml.g4dn.xlarge": 0.736, "ml.g4dn.2xlarge": 1.218,
    "ml.g5.xlarge": 1.408, "ml.g5.4xlarge": 3.553,
    "ml.p3.2xlarge": 4.284,
}
FALLBACK_HOURLY = 0.5  # rough guess for instance types missing from the table

def get_30d_invocations(cw, endpoint_name: str, variant_name: str) -> float:
    """Total Invocations for one endpoint variant over the last 30 days."""
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - timedelta(days=30),
        EndTime=now,
        Period=2592000,  # 30 days in seconds
        Statistics=["Sum"],
    )
    # No datapoints means the variant was never invoked in the window
    return sum(d["Sum"] for d in resp.get("Datapoints", []))

def get_instance_types(sm, config_name: str) -> Dict[str, str]:
    """Map variant name -> instance type; the type lives on the endpoint config."""
    try:
        config = sm.describe_endpoint_config(EndpointConfigName=config_name)
    except ClientError:
        return {}  # the config was deleted after the endpoint was created
    return {
        v["VariantName"]: v.get("InstanceType", "unknown")
        for v in config["ProductionVariants"]
    }

def find_zombies(region: str) -> List[Dict]:
    sm = boto3.client("sagemaker", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    zombies = []
    paginator = sm.get_paginator("list_endpoints")
    for page in paginator.paginate(StatusEquals="InService"):
        for ep in page["Endpoints"]:
            name = ep["EndpointName"]
            detail = sm.describe_endpoint(EndpointName=name)
            variants = detail.get("ProductionVariants", [])
            # Variant names are not always "AllTraffic", so check each one
            invocations = sum(
                get_30d_invocations(cw, name, v["VariantName"]) for v in variants
            )
            if invocations > 0:
                continue
            types = get_instance_types(sm, detail["EndpointConfigName"])
            hourly = sum(
                INSTANCE_COSTS.get(types.get(v["VariantName"], "unknown"), FALLBACK_HOURLY)
                * v.get("CurrentInstanceCount", 1)
                for v in variants
            )
            # Age since creation is an upper bound on idle time, not a measurement
            age_days = (datetime.now(timezone.utc) - ep["CreationTime"]).days
            zombies.append({
                "name": name,
                "instance": ",".join(sorted(set(types.values()))) or "unknown",
                "monthly_cost_usd": round(hourly * 24 * 30, 2),
                "age_days": age_days,
            })
    return sorted(zombies, key=lambda x: x["monthly_cost_usd"], reverse=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--region", default="us-east-1")
    args = parser.parse_args()
    zombies = find_zombies(args.region)
    total_monthly = sum(z["monthly_cost_usd"] for z in zombies)
    print(f"\n{'ENDPOINT NAME':<45} {'INSTANCE':<20} {'$/MO':>8} {'AGE (DAYS)':>10}")
    print("-" * 90)
    for z in zombies:
        print(f"{z['name']:<45} {z['instance']:<20} {z['monthly_cost_usd']:>8.2f} {z['age_days']:>10}")
    print(f"\n  Total zombie burn: ${total_monthly:,.2f}/month  (${total_monthly*12:,.0f}/yr)\n")
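
The script above scans one account with whatever credentials are active. For a multi-account sweep, one possible approach is to assume a read-only role in each account and merge the per-account results. The sketch below is an illustration under assumptions: the account IDs, the role name ZombieAuditReadOnly, and both helper names are hypothetical, and find_zombies would need to accept the returned clients as parameters instead of creating its own.

```python
from typing import Dict, List

# Hypothetical values: substitute your own account IDs and audit role name.
ACCOUNTS = {"prod": "111111111111", "staging": "222222222222"}
AUDIT_ROLE_NAME = "ZombieAuditReadOnly"


def audit_role_arn(account_id: str, role_name: str = AUDIT_ROLE_NAME) -> str:
    """Build the IAM role ARN to assume in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def clients_for_account(account_id: str, region: str):
    """Assume the audit role and return (sagemaker, cloudwatch) clients."""
    import boto3  # imported here so the pure helpers work without AWS access

    creds = boto3.client("sts").assume_role(
        RoleArn=audit_role_arn(account_id),
        RoleSessionName="zombie-audit",
    )["Credentials"]
    session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
        region_name=region,
    )
    return session.client("sagemaker"), session.client("cloudwatch")


def merge_reports(per_account: Dict[str, List[Dict]]) -> List[Dict]:
    """Label each zombie row with its account and rank all rows by monthly cost."""
    rows = [
        dict(zombie, account=account)
        for account, zombies in per_account.items()
        for zombie in zombies
    ]
    return sorted(rows, key=lambda r: r["monthly_cost_usd"], reverse=True)
```

Keeping the merge step separate from the AWS calls means the ranking logic can be checked locally with hand-written rows before any role is assumed.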

How to Kill Them Safely

Never delete a SageMaker endpoint without first confirming the model artifacts are safely persisted. The endpoint itself is just a serving wrapper: the model files live in S3, and deleting the endpoint deletes neither the model nor the endpoint configuration. What is easy to lose during cleanup is the record of how the endpoint was deployed (which model, container image, and custom inference code it used) if the related resources are removed along with it. Follow this four-step process.

Step 1: Snapshot Model Artifacts to a Known S3 Path

# Confirm model artifacts are accessible before deleting the endpoint.
# The endpoint points at an endpoint config, which names the model(s):
aws sagemaker describe-endpoint --endpoint-name YOUR_ZOMBIE_ENDPOINT \
  --query "EndpointConfigName" --output text

aws sagemaker describe-endpoint-config --endpoint-config-name YOUR_ENDPOINT_CONFIG_NAME \
  --query "ProductionVariants[*].{Variant:VariantName,Model:ModelName}"

# Get model data URL from the model object
aws sagemaker describe-model --model-name YOUR_MODEL_NAME \
  --query "PrimaryContainer.ModelDataUrl"

# Copy to a long-term archive prefix (belt-and-suspenders):
aws s3 cp s3://your-bucket/path/to/model.tar.gz \
          s3://your-archive-bucket/zombie-archive/$(date +%Y-%m-%d)/model.tar.gz
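
For more than a handful of endpoints, the same lookup chain (endpoint, then endpoint config, then model) can be scripted. The following is a minimal sketch, assuming `sm` is a boto3 SageMaker client such as boto3.client("sagemaker"); the function name and the shape of the returned records are our own choices for illustration.

```python
from typing import Dict, List


def resolve_model_artifacts(sm, endpoint_name: str) -> List[Dict[str, str]]:
    """Follow endpoint -> endpoint config -> model and collect artifact locations.

    `sm` is expected to behave like a boto3 SageMaker client.
    """
    endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
    config = sm.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    artifacts = []
    for variant in config["ProductionVariants"]:
        model_name = variant.get("ModelName")
        if not model_name:
            continue  # variant does not reference a model directly
        model = sm.describe_model(ModelName=model_name)
        # A model carries either a list of Containers or a single PrimaryContainer
        containers = model.get("Containers") or [model.get("PrimaryContainer", {})]
        for container in containers:
            artifacts.append({
                "variant": variant["VariantName"],
                "model": model_name,
                "image": container.get("Image", ""),
                "model_data_url": container.get("ModelDataUrl", ""),
            })
    return artifacts

# Usage (requires AWS credentials):
#   import boto3
#   print(resolve_model_artifacts(boto3.client("sagemaker"), "YOUR_ZOMBIE_ENDPOINT"))
```

Saving this output next to the archived model.tar.gz keeps a record of the container image and variant layout, which the S3 copy alone does not capture.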

Step 2: Tag with zombie:confirmed Before Deletion

# Tag the endpoint before deletion. Tags disappear with the endpoint, but the
# AddTags call is logged in CloudTrail and serves as the audit record.
aws sagemaker add-tags \
  --resource-arn arn:aws:sagemaker:us-east-1:123456789012:endpoint/YOUR_ZOMBIE_ENDPOINT \
  --tags Key=zombie,Value=confirmed \
         Key=zombie-confirmed-by,Value=ajeet.kumar@codetoday.io \
         Key=zombie-confirmed-date,Value=$(date +%Y-%m-%d)

Step 3: Delete the Endpoint

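# The deletion itself is a single command and is irreversible, so confirm tags first
aws sagemaker delete-endpoint \
  --endpoint-name YOUR_ZOMBIE_ENDPOINT \
  --region us-east-1

# Deletion is asynchronous; wait for it to finish, then confirm:
aws sagemaker wait endpoint-deleted --endpoint-name YOUR_ZOMBIE_ENDPOINT --region us-east-1
aws sagemaker describe-endpoint --endpoint-name YOUR_ZOMBIE_ENDPOINT --region us-east-1 2>&1 | grep -i "could not find"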

Step 4: Implement an AWS Config Rule to Prevent Recurrence

After cleanup, prevent recurrence with an AWS Config custom rule that fires a non-compliant finding for any SageMaker endpoint older than 90 days that lacks an auto-delete-after tag. Non-compliant findings trigger an SNS notification to the owning team.
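
The compliance decision at the heart of such a rule is small. The sketch below shows only that decision, under assumptions: the function name and constants are ours, tags are passed in as a plain dict, and the surrounding plumbing (the rule's Lambda handler, fetching tags, reporting results back to AWS Config) is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

MAX_UNTAGGED_AGE = timedelta(days=90)  # threshold described in the text
REQUIRED_TAG = "auto-delete-after"


def evaluate_endpoint(
    creation_time: datetime,
    tags: Dict[str, str],
    now: Optional[datetime] = None,
) -> str:
    """Return an AWS Config compliance type for one endpoint.

    Non-compliant only when the endpoint is older than 90 days AND
    has no auto-delete-after tag.
    """
    now = now or datetime.now(timezone.utc)
    if REQUIRED_TAG in tags:
        return "COMPLIANT"
    if now - creation_time > MAX_UNTAGGED_AGE:
        return "NON_COMPLIANT"
    return "COMPLIANT"
```

Passing `now` explicitly makes the rule easy to test against fixed dates before it is wired to real resources.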

The Prevention System: Never Get Here Again

Detection and deletion are reactive. The real goal is a system where zombie endpoints can't accumulate in the first place. We implement three components: auto-tagging on creation, weekly anomaly alerts, and an SCP-enforced tag policy.

EventBridge + Lambda Auto-Tagger

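# EventBridge rule (CloudFormation): fires when a SageMaker endpoint is created.
# Requires CloudTrail management events in the region, plus an AWS::Lambda::Permission
# that lets events.amazonaws.com invoke the function (not shown here).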
SageMakerEndpointCreatedRule:
  Type: AWS::Events::Rule
  Properties:
    Name: sagemaker-endpoint-created-tagger
    EventPattern:
      source: ["aws.sagemaker"]
      detail-type: ["AWS API Call via CloudTrail"]
      detail:
        eventName: ["CreateEndpoint"]
    State: ENABLED
    Targets:
      - Arn: !GetAtt EndpointTaggerLambda.Arn
        Id: EndpointTaggerTarget

---
# Lambda function body (Python 3.12)
import boto3
from datetime import datetime, timedelta, timezone

def handler(event, context):
    detail = event['detail']
    # Failed calls (e.g. denied by an SCP) also emit events; there is nothing to tag
    if detail.get('errorCode'):
        return
    sm = boto3.client('sagemaker')
    endpoint_name = detail['requestParameters']['endpointName']
    created_by = detail['userIdentity']['arn']
    # datetime.utcnow() is deprecated in Python 3.12; use an aware UTC timestamp
    auto_delete_after = (datetime.now(timezone.utc) + timedelta(days=90)).strftime('%Y-%m-%d')
    endpoint_arn = f"arn:aws:sagemaker:{detail['awsRegion']}:{detail['recipientAccountId']}:endpoint/{endpoint_name}"
    sm.add_tags(
        ResourceArn=endpoint_arn,
        Tags=[
            {'Key': 'created-by', 'Value': created_by},
            {'Key': 'last-invoked', 'Value': 'never'},
            {'Key': 'auto-delete-after', 'Value': auto_delete_after},
        ]
    )
    print(f"Tagged {endpoint_name}: auto-delete-after={auto_delete_after}, created-by={created_by}")
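
The tags written by the tagger only help if something reads them. One way a weekly sweep could pick out endpoints past their auto-delete-after date is sketched below. It assumes the endpoint names and tags have already been collected into plain dicts (for example with the SageMaker list_tags call); the function names and record shape are hypothetical.

```python
from datetime import date, datetime
from typing import Dict, Iterable, List, Optional


def parse_expiry(tags: Dict[str, str]) -> Optional[date]:
    """Read the auto-delete-after tag (YYYY-MM-DD); None if absent or malformed."""
    raw = tags.get("auto-delete-after")
    if not raw:
        return None
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None


def expired_endpoints(endpoints: Iterable[Dict], today: date) -> List[str]:
    """Return names of endpoints whose expiry date is strictly before `today`.

    Each item is a dict with 'name' (str) and 'tags' (dict of tag key -> value).
    """
    expired = []
    for ep in endpoints:
        expiry = parse_expiry(ep["tags"])
        if expiry is not None and expiry < today:
            expired.append(ep["name"])
    return sorted(expired)
```

Endpoints with a missing or malformed tag are deliberately not reported as expired here; a real sweep would probably want to list them separately so they get fixed rather than ignored.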

Service Control Policy: Enforce the Tag Mandate

The SCP below, attached at the organization root or to an OU, prevents any IAM principal in the affected member accounts from creating a SageMaker endpoint unless the request includes an auto-delete-after tag. No tag, no endpoint: the request is denied during authorization, before SageMaker creates anything. Note that SCPs do not apply to the organization's management account.

# SCP — require auto-delete-after tag on all SageMaker CreateEndpoint calls
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "RequireAutoDeleteAfterTag",
    "Effect": "Deny",
    "Action": "sagemaker:CreateEndpoint",
    "Resource": "*",
    "Condition": {
      "Null": {
        "aws:RequestTag/auto-delete-after": "true"
      }
    }
  }]
}
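
From the ML engineer's side, complying with this policy means passing the tag in the CreateEndpoint request itself rather than adding it afterwards. A minimal sketch of building such a request follows; the helper name, the 90-day default, and the example endpoint names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def tagged_create_endpoint_args(
    endpoint_name: str,
    endpoint_config_name: str,
    ttl_days: int = 90,
    now: Optional[datetime] = None,
) -> Dict:
    """Build keyword arguments for create_endpoint that carry the required tag."""
    now = now or datetime.now(timezone.utc)
    expiry = (now + timedelta(days=ttl_days)).strftime("%Y-%m-%d")
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": endpoint_config_name,
        "Tags": [{"Key": "auto-delete-after", "Value": expiry}],
    }

# Usage (requires AWS credentials and an existing endpoint config):
#   import boto3
#   boto3.client("sagemaker").create_endpoint(
#       **tagged_create_endpoint_args("demo-endpoint", "demo-endpoint-config")
#   )
```

Because the tag travels in the request, the aws:RequestTag condition in the SCP sees it and the call is allowed.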

The $700K Story: Real Numbers From a Real Engagement

In Q1 2025, a data & AI platform team at a Series D fintech brought us in for a cloud cost review. They were running an MLOps platform across three AWS accounts: prod, staging, and experimentation. Their monthly SageMaker bill had grown from $12K to $67K over 14 months. Leadership assumed it was model traffic growth. It wasn't.

We ran the detection script across all three accounts. The results:

  • 38 zombie endpoints found across the three accounts
  • $58,200/month in idle endpoint costs — 87% of the total SageMaker bill
  • Largest single zombie: an ml.p3.2xlarge endpoint serving a BERT-based fraud model that was superseded 8 months prior — $3,085/month for 8 months = $24,680 total waste
  • Oldest zombie: 14 months old, from a POC that was shelved before it launched

38 zombie endpoints across 3 accounts · $58K monthly burn before cleanup · $4K monthly burn after 48 hours · $648K projected annual savings

Within 48 hours of detection, 36 of the 38 zombies were deleted (two required additional sign-off from data science leads who weren't available immediately). The monthly SageMaker bill dropped from $67K to $9K, an 87% reduction in two days. The remaining $9K was legitimate active model serving.

The prevention system (auto-tagger Lambda + SCP + weekly anomaly alert) was deployed within the same week. Six weeks post-cleanup, zero new zombies had accumulated. The first SCP-denied endpoint creation attempt happened on day 12 — a data scientist creating a demo endpoint without tags — which is exactly the scenario the policy is designed to catch.

FinOps Maturity Model for ML Endpoints

Most teams land somewhere in the "Crawl" phase of this model. Here's a structured path to "Run."

Stage: 🐛 Crawl
  • What you do: Run the detection script manually, delete zombies manually, repeat quarterly
  • Tools: boto3 script, AWS CLI, spreadsheet
  • Time to implement: 1–2 days (one-time)

Stage: 🚶 Walk
  • What you do: Auto-tagging on creation, weekly CloudWatch dashboard review, SNS alert when endpoint has 0 invocations for 14 days
  • Tools: EventBridge + Lambda, CloudWatch alarms, SNS
  • Time to implement: 1–2 weeks

Stage: 🏃 Run
  • What you do: SCP-enforced tag mandate, real-time cost attribution per ML team, automated zombie deletion workflow (with approval gate), monthly FinOps review per team
  • Tools: AWS Organizations SCP, Cost Explorer, Step Functions approval flow, Slack integration
  • Time to implement: 2–4 weeks

The jump from Walk to Run is mostly organizational, not technical. The tooling is straightforward. The hard part is getting ML teams to accept that creating an endpoint without tagging it is the same as expensing a cloud cost with no business justification. Once the SCP is in place and the first denied request happens, the culture shift is immediate.


How much are zombie endpoints costing your team right now?

We'll run our detection script across your AWS accounts, identify every zombie, and give you a prioritized cleanup plan — free, no commitment. Most teams find $20K–$80K/month in savings in the first session.
