The AWS Cost Spiral: How SageMaker Zombie Endpoints Quietly Burn $700K/yr (And How to Kill Them)
It starts with a demo. A model gets deployed to a SageMaker endpoint, the demo goes well, and the ML team moves on to the next sprint. Three months later, nobody remembers the endpoint exists — but it's still running, silently billing $0.736 per hour. Multiply that across dozens of endpoints and three AWS accounts, and you have $58,000 disappearing every month. This is the story of how we found them, killed them safely, and built the system to prevent it from happening again.
The Anatomy of a Zombie Endpoint
A SageMaker zombie endpoint is a real-time inference endpoint in InService status
with zero invocations over a sustained period — typically 30 days or more. It looks healthy in
the console. It passes all health checks. It is simply doing nothing while the billing meter runs
at full speed.
Here is how one is born. An ML engineer is preparing a product demo for stakeholders. They create
a SageMaker endpoint with an ml.g4dn.xlarge instance — a sensible choice for a
computer vision or NLP model. The demo goes well. Stakeholders are impressed. The project gets
greenlit. The engineer starts the real implementation sprint — and the demo endpoint is never
mentioned again.
Six months later, the team has trained a better model, deployed it to a new endpoint, and promoted it to production. The demo endpoint still exists. Nobody thinks to delete it because nobody's looking at it. It doesn't appear in any dashboard. It doesn't generate alerts. It just silently processes zero requests at $0.736 per hour.
Do the math: $0.736/hr × 24 hr × 365 days ≈ $6,447 per year for a single
ml.g4dn.xlarge endpoint. For larger instances like ml.g5.4xlarge, used
in LLM serving, that number jumps to $31,000+ per year per zombie.
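The same arithmetic works for any instance type. A minimal sketch, using the approximate on-demand us-east-1 rates quoted in this article; the helper name `annual_idle_cost` is ours, not part of any AWS tooling:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_idle_cost(hourly_usd: float, instance_count: int = 1) -> float:
    """Yearly cost in USD of an endpoint that runs around the clock."""
    return round(hourly_usd * HOURS_PER_YEAR * instance_count, 2)


if __name__ == "__main__":
    # Hourly rates are the approximate figures used elsewhere in this article.
    for name, rate in [("ml.g4dn.xlarge", 0.736), ("ml.g5.4xlarge", 3.553)]:
        print(f"{name}: ${annual_idle_cost(rate):,.2f}/yr")
```

Check current pricing for your region before relying on these numbers; the rates drift over time.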
The problem compounds because ML teams operate fast. A model improvement cycle might create and abandon 3–4 endpoints per quarter. Across a team of 12 ML engineers working across 3 AWS accounts, you can accumulate 30–40 zombie endpoints in under a year without any malicious intent — just the normal entropy of a fast-moving team with no enforcement mechanisms.
Forensic Investigation: Finding Your Zombies
The fastest manual approach uses two CLI commands in sequence. First, list every InService
endpoint. Then query CloudWatch to check whether the Invocations metric has summed to
zero over the past 30 days.
# Step 1: List all InService endpoints in a region
aws sagemaker list-endpoints \
  --status-equals InService \
  --query "Endpoints[*].{Name:EndpointName,Created:CreationTime,Modified:LastModifiedTime}" \
  --output table \
  --region us-east-1

# Step 2: Check invocations for a specific endpoint (last 30 days)
# AllTraffic is a common default variant name; replace it if your variant is named differently.
# The date -d syntax below is GNU date (Linux); macOS needs a different invocation.
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name Invocations \
  --dimensions Name=EndpointName,Value=YOUR_ENDPOINT_NAME \
    Name=VariantName,Value=AllTraffic \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 2592000 \
  --statistics Sum \
  --query "Datapoints[*].Sum" \
  --output text
# If the above returns nothing (no datapoints) or 0.0, the endpoint is a zombie.

# Cross-check with Cost Explorer for the actual billing amount.
# This filter only matches if the tag key below exists on the endpoint and is
# activated as a cost allocation tag in your billing settings.
aws ce get-cost-and-usage \
  --time-period Start=2025-04-01,End=2025-05-01 \
  --granularity MONTHLY \
  --filter '{"And":[{"Dimensions":{"Key":"SERVICE","Values":["Amazon SageMaker"]}},{"Tags":{"Key":"aws:sagemaker:endpoint-name","Values":["YOUR_ENDPOINT_NAME"]}}]}' \
  --metrics BlendedCost \
  --output json
The CLI approach works for spot-checking. For fleet-wide detection across dozens or hundreds of endpoints — especially across multiple accounts — you need the automation script below.
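A single 30-day sum tells you whether an endpoint was idle, but not for how long. If the Step 2 query is rerun with a daily period (`--period 86400`), the datapoints can be reduced to a "last invoked" date. The sketch below assumes input shaped like the `Datapoints` entries CloudWatch returns (a `Timestamp` and a `Sum` per entry); the function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def last_invoked(datapoints: List[Dict]) -> Optional[datetime]:
    """Most recent datapoint timestamp with a non-zero Sum, or None."""
    active = [d["Timestamp"] for d in datapoints if d.get("Sum", 0) > 0]
    return max(active) if active else None


def idle_days(datapoints: List[Dict], now: datetime) -> Optional[int]:
    """Whole days since the last invocation; None if the window saw no traffic."""
    last = last_invoked(datapoints)
    return None if last is None else (now - last).days


if __name__ == "__main__":
    now = datetime(2025, 5, 1, tzinfo=timezone.utc)
    sample = [
        {"Timestamp": now - timedelta(days=20), "Sum": 3.0},
        {"Timestamp": now - timedelta(days=5), "Sum": 0.0},
    ]
    print(idle_days(sample, now))
```

A result of `None` means no traffic anywhere in the queried window, which is the zombie case; a number means the endpoint was used that many days ago and deserves a conversation rather than deletion.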
The Detection Script: Find Every Zombie in One Run
This Python script uses boto3 to scan every InService SageMaker endpoint in an
account, queries CloudWatch for 30-day invocation totals, estimates monthly cost by instance
type, and prints a ranked table of zombies sorted by monthly burn.
#!/usr/bin/env python3
"""
sagemaker_zombie_detector.py
Finds SageMaker endpoints with zero invocations in the last 30 days.
Usage: python3 sagemaker_zombie_detector.py --region us-east-1
"""
import argparse
from datetime import datetime, timezone, timedelta
from typing import List, Dict

import boto3

# Approximate hourly cost per instance type (USD, on-demand, us-east-1)
INSTANCE_COSTS = {
    "ml.t2.medium": 0.056, "ml.m5.large": 0.134,
    "ml.m5.xlarge": 0.269, "ml.c5.xlarge": 0.238,
    "ml.g4dn.xlarge": 0.736, "ml.g4dn.2xlarge": 1.218,
    "ml.g5.xlarge": 1.408, "ml.g5.4xlarge": 3.553,
    "ml.p3.2xlarge": 4.284,
}
DEFAULT_HOURLY = 0.5  # fallback guess for instance types missing from the table


def get_30d_invocations(cw, endpoint_name: str, variant_name: str) -> float:
    """Sum the Invocations metric for one endpoint variant over the last 30 days."""
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - timedelta(days=30),
        EndTime=now,
        Period=2592000,  # 30 days in seconds
        Statistics=["Sum"],
    )
    # No datapoints at all means CloudWatch recorded no invocations in the window.
    return sum(d["Sum"] for d in resp.get("Datapoints", []))


def find_zombies(region: str) -> List[Dict]:
    sm = boto3.client("sagemaker", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    zombies = []
    paginator = sm.get_paginator("list_endpoints")
    for page in paginator.paginate(StatusEquals="InService"):
        for ep in page["Endpoints"]:
            name = ep["EndpointName"]
            detail = sm.describe_endpoint(EndpointName=name)
            # Instance types live on the endpoint config, not on the endpoint itself.
            config = sm.describe_endpoint_config(
                EndpointConfigName=detail["EndpointConfigName"]
            )
            variants = config["ProductionVariants"]
            # Check every variant by its real name instead of assuming "AllTraffic".
            invocations = sum(
                get_30d_invocations(cw, name, v["VariantName"]) for v in variants
            )
            if invocations == 0:
                instance_type = variants[0].get("InstanceType", "unknown")
                hourly = INSTANCE_COSTS.get(instance_type, DEFAULT_HOURLY)
                # boto3 returns timezone-aware datetimes, so subtract directly.
                # Endpoint age is only an upper bound on how long it has been idle.
                age_days = (datetime.now(timezone.utc) - ep["CreationTime"]).days
                zombies.append({
                    "name": name,
                    "instance": instance_type,
                    "monthly_cost_usd": round(hourly * 24 * 30, 2),
                    "age_days": age_days,
                })
    return sorted(zombies, key=lambda x: x["monthly_cost_usd"], reverse=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--region", default="us-east-1")
    args = parser.parse_args()
    zombies = find_zombies(args.region)
    total_monthly = sum(z["monthly_cost_usd"] for z in zombies)
    print(f"\n{'ENDPOINT NAME':<45} {'INSTANCE':<20} {'$/MO':>8} {'AGE (DAYS)':>10}")
    print("-" * 90)
    for z in zombies:
        print(f"{z['name']:<45} {z['instance']:<20} {z['monthly_cost_usd']:>8.2f} {z['age_days']:>10}")
    print(f"\n Total zombie burn: ${total_monthly:,.2f}/month (${total_monthly*12:,.0f}/yr)\n")
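The script above scans one account and one region per run. For several accounts, one possible approach is to assume an audit role in each account and merge the results. The sketch below is an illustration under assumptions: the role name `ZombieAuditRole` is hypothetical and must exist in every target account, and `scan` stands for a variant of `find_zombies` that accepts a boto3 session and a region and builds its clients from that session.

```python
from typing import Callable, Dict, Iterable, List


def audit_role_arn(account_id: str, role_name: str = "ZombieAuditRole") -> str:
    """ARN of the (hypothetical) read-only audit role in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assumed_session(account_id: str, role_name: str = "ZombieAuditRole"):
    """Return a boto3 session backed by temporary credentials for the role."""
    import boto3  # imported lazily so the pure helpers work without boto3

    creds = boto3.client("sts").assume_role(
        RoleArn=audit_role_arn(account_id, role_name),
        RoleSessionName="zombie-scan",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def scan_fleet(
    account_ids: Iterable[str],
    regions: Iterable[str],
    scan: Callable,
    make_session: Callable = assumed_session,
) -> List[Dict]:
    """Run scan(session, region) everywhere and rank findings by monthly cost."""
    regions = list(regions)
    findings = []
    for account_id in account_ids:
        session = make_session(account_id)
        for region in regions:
            for zombie in scan(session, region):
                findings.append({**zombie, "account": account_id, "region": region})
    return sorted(findings, key=lambda z: z["monthly_cost_usd"], reverse=True)
```

Passing `make_session` as a parameter keeps the ranking logic testable without AWS credentials.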
How to Kill Them Safely
Never delete a SageMaker endpoint without first confirming the model artifacts are safely persisted. The endpoint itself is just a serving wrapper: the model files live in S3, and deleting the endpoint does not delete the model. But once the endpoint is gone, it is easy to lose track of which endpoint configuration, model, and custom inference code it was serving. Follow this four-step process.
Step 1: Snapshot Model Artifacts to a Known S3 Path
# Confirm model artifacts are accessible before deleting the endpoint.
# First find the endpoint config behind the endpoint:
aws sagemaker describe-endpoint --endpoint-name YOUR_ZOMBIE_ENDPOINT \
  --query "EndpointConfigName" --output text
# Model names are recorded on the endpoint config, one per variant:
aws sagemaker describe-endpoint-config --endpoint-config-name YOUR_ENDPOINT_CONFIG_NAME \
  --query "ProductionVariants[*].{Variant:VariantName,Model:ModelName}"
# Get model data URL from the model object
aws sagemaker describe-model --model-name YOUR_MODEL_NAME \
  --query "PrimaryContainer.ModelDataUrl"
# Copy to a long-term archive prefix (belt-and-suspenders):
aws s3 cp s3://your-bucket/path/to/model.tar.gz \
  s3://your-archive-bucket/zombie-archive/$(date +%Y-%m-%d)/model.tar.gz
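When several zombies are archived on the same day, a fixed destination such as `.../model.tar.gz` means each copy overwrites the previous one. One way around that, sketched here with a hypothetical helper, is to keep the source bucket and key inside the archive path:

```python
from datetime import date
from urllib.parse import urlparse


def archive_destination(model_data_url: str, archive_bucket: str, day: date) -> str:
    """Build a collision-free archive URL that preserves the source bucket and key."""
    parsed = urlparse(model_data_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {model_data_url}")
    key = parsed.path.lstrip("/")
    return (
        f"s3://{archive_bucket}/zombie-archive/"
        f"{day.isoformat()}/{parsed.netloc}/{key}"
    )


if __name__ == "__main__":
    print(archive_destination(
        "s3://your-bucket/path/to/model.tar.gz",
        "your-archive-bucket",
        date(2025, 5, 1),
    ))
```

The returned string can be used as the destination argument of the `aws s3 cp` command in this step.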
Step 2: Tag with zombie:confirmed Before Deletion
# Tag the endpoint to create an audit record before deletion
aws sagemaker add-tags \
  --resource-arn arn:aws:sagemaker:us-east-1:123456789012:endpoint/YOUR_ZOMBIE_ENDPOINT \
  --tags Key=zombie,Value=confirmed \
    Key=zombie-confirmed-by,Value=ajeet.kumar@codetoday.io \
    Key=zombie-confirmed-date,Value=$(date +%Y-%m-%d)
Step 3: Delete the Endpoint
# The deletion itself is a single command and cannot be undone, so confirm tags first
aws sagemaker delete-endpoint \
  --endpoint-name YOUR_ZOMBIE_ENDPOINT \
  --region us-east-1
# Confirm deletion: this waiter blocks until SageMaker reports the endpoint as gone
aws sagemaker wait endpoint-deleted \
  --endpoint-name YOUR_ZOMBIE_ENDPOINT \
  --region us-east-1
Step 4: Add an AWS Config Rule to Prevent Recurrence
After cleanup, prevent recurrence with an AWS Config custom rule that fires a non-compliant
finding for any SageMaker endpoint older than 90 days that lacks an auto-delete-after
tag. Non-compliant findings trigger an SNS notification to the owning team.
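The decision at the heart of such a rule is small. The sketch below covers only that decision, with hypothetical names throughout; a real custom rule would run inside a Lambda function, read the endpoint's creation time and tags, and report the verdict back to AWS Config.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

MAX_UNTAGGED_AGE = timedelta(days=90)
REQUIRED_TAG = "auto-delete-after"


def evaluate_endpoint(created: datetime, tags: Dict[str, str], now: datetime) -> str:
    """Return an AWS Config style compliance verdict for one endpoint."""
    if now - created <= MAX_UNTAGGED_AGE:
        return "COMPLIANT"  # young endpoints are not yet subject to the rule
    # Older than 90 days: compliant only if the tag exists with a non-empty value.
    return "COMPLIANT" if tags.get(REQUIRED_TAG) else "NON_COMPLIANT"


if __name__ == "__main__":
    now = datetime(2025, 5, 1, tzinfo=timezone.utc)
    old = now - timedelta(days=120)
    print(evaluate_endpoint(old, {}, now))
    print(evaluate_endpoint(old, {"auto-delete-after": "2025-06-01"}, now))
```

Keeping the verdict in a pure function makes the 90-day boundary easy to unit test before the rule is deployed.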
The Prevention System: Never Get Here Again
Detection and deletion are reactive. The real goal is a system where zombie endpoints can't accumulate in the first place. We implement three components: auto-tagging on creation, weekly anomaly alerts, and an SCP-enforced tag policy.
EventBridge + Lambda Auto-Tagger
# EventBridge rule (CloudFormation): fires when a SageMaker endpoint is created
SageMakerEndpointCreatedRule:
  Type: AWS::Events::Rule
  Properties:
    Name: sagemaker-endpoint-created-tagger
    EventPattern:
      source: ["aws.sagemaker"]
      detail-type: ["AWS API Call via CloudTrail"]
      detail:
        eventName: ["CreateEndpoint"]
    State: ENABLED
    Targets:
      - Arn: !GetAtt EndpointTaggerLambda.Arn
        Id: EndpointTaggerTarget
---
# Lambda function body (Python 3.12)
import boto3
from datetime import datetime, timedelta, timezone


def handler(event, context):
    sm = boto3.client('sagemaker')
    detail = event['detail']
    endpoint_name = detail['requestParameters']['endpointName']
    created_by = detail['userIdentity']['arn']
    # datetime.utcnow() is deprecated as of Python 3.12; use an aware UTC timestamp
    auto_delete_after = (datetime.now(timezone.utc) + timedelta(days=90)).strftime('%Y-%m-%d')
    endpoint_arn = f"arn:aws:sagemaker:{detail['awsRegion']}:{detail['recipientAccountId']}:endpoint/{endpoint_name}"
    # Note: add_tags overwrites the value of any tag key that already exists,
    # including an auto-delete-after value supplied at creation time.
    sm.add_tags(
        ResourceArn=endpoint_arn,
        Tags=[
            {'Key': 'created-by', 'Value': created_by},
            {'Key': 'last-invoked', 'Value': 'never'},
            {'Key': 'auto-delete-after', 'Value': auto_delete_after},
        ]
    )
    print(f"Tagged {endpoint_name}: auto-delete-after={auto_delete_after}, created-by={created_by}")
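The auto-delete-after tag only helps if something reads it later. A minimal sketch of that check follows, assuming tags have already been fetched into a plain dict per endpoint (for example with the SageMaker `list_tags` call); the function name and input shape are hypothetical.

```python
from datetime import date
from typing import Dict, List


def expired_endpoints(endpoint_tags: Dict[str, Dict[str, str]], today: date) -> List[str]:
    """Names of endpoints whose auto-delete-after date is strictly in the past."""
    expired = []
    for name, tags in endpoint_tags.items():
        raw = tags.get("auto-delete-after")
        if raw is None:
            continue  # untagged endpoints are a separate finding
        try:
            deadline = date.fromisoformat(raw)
        except ValueError:
            continue  # malformed date: leave it for a human to review
        if deadline < today:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    tags = {
        "demo-a": {"auto-delete-after": "2025-01-15"},
        "prod-b": {"auto-delete-after": "2026-01-01"},
        "old-c": {},
    }
    print(expired_endpoints(tags, date(2025, 5, 1)))
```

Feeding the result into a review queue rather than straight into deletion keeps a human in the loop, in line with the approval gate described in the maturity model.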
Service Control Policy: Enforce the Tag Mandate
The SCP below, applied at the AWS Organization level, prevents any IAM principal from creating
a SageMaker endpoint unless the request includes an auto-delete-after tag. No tag,
no endpoint — the request is denied at the AWS Organizations layer before it reaches SageMaker.
# SCP: require auto-delete-after tag on all SageMaker CreateEndpoint calls
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "RequireAutoDeleteAfterTag",
    "Effect": "Deny",
    "Action": "sagemaker:CreateEndpoint",
    "Resource": "*",
    "Condition": {
      "Null": {
        "aws:RequestTag/auto-delete-after": "true"
      }
    }
  }]
}
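With this SCP attached, the tag has to travel in the CreateEndpoint request itself; tagging after the fact is too late. A minimal sketch of building a compliant request follows. The helper name and the 90-day default are assumptions, while `EndpointName`, `EndpointConfigName`, and `Tags` are the boto3 `create_endpoint` parameters.

```python
from datetime import date, timedelta
from typing import Dict


def tagged_create_endpoint_kwargs(
    endpoint_name: str,
    endpoint_config_name: str,
    today: date,
    ttl_days: int = 90,
) -> Dict:
    """Keyword arguments for create_endpoint that satisfy the tag mandate."""
    expiry = (today + timedelta(days=ttl_days)).isoformat()
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": endpoint_config_name,
        "Tags": [{"Key": "auto-delete-after", "Value": expiry}],
    }


if __name__ == "__main__":
    kwargs = tagged_create_endpoint_kwargs("demo-endpoint", "demo-config", date(2025, 5, 1))
    print(kwargs["Tags"])
    # To actually create the endpoint (requires AWS credentials):
    # import boto3
    # boto3.client("sagemaker").create_endpoint(**kwargs)
```

Wrapping this in a shared internal helper gives data scientists a single call that always passes the policy.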
The $700K Story: Real Numbers From a Real Engagement
In Q1 2025, a data & AI platform team at a Series D fintech brought us in for a cloud cost review. They were running an MLOps platform across three AWS accounts: prod, staging, and experimentation. Their monthly SageMaker bill had grown from $12K to $67K over 14 months. Leadership assumed it was model traffic growth. It wasn't.
We ran the detection script across all three accounts. The results:
- 38 zombie endpoints found across the three accounts
- $58,200/month in idle endpoint costs — 87% of the total SageMaker bill
- Largest single zombie: an ml.p3.2xlarge endpoint serving a BERT-based fraud model that was superseded 8 months prior; $3,085/month for 8 months = $24,680 total waste
- Oldest zombie: 14 months old, from a POC that was shelved before it launched
Within 48 hours of detection, 36 of the 38 zombies were deleted (two required additional sign-off from data science leads who weren't available immediately). The monthly SageMaker bill dropped from $67K to $9K, an 87% reduction in two days. The remaining $9K was legitimate active model serving.
The prevention system (auto-tagger Lambda + SCP + weekly anomaly alert) was deployed within the same week. Six weeks post-cleanup, zero new zombies had accumulated. The first SCP-denied endpoint creation attempt happened on day 12 — a data scientist creating a demo endpoint without tags — which is exactly the scenario the policy is designed to catch.
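The headline figures in this section can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers reported above; the helper name is ours:

```python
def percent_of(part: float, whole: float) -> float:
    """Share of whole represented by part, as a percentage with one decimal."""
    return round(part / whole * 100, 1)


ZOMBIE_MONTHLY = 58_200  # idle endpoint cost per month (USD)
BILL_BEFORE = 67_000     # monthly SageMaker bill before cleanup (USD)
BILL_AFTER = 9_000       # monthly SageMaker bill after cleanup (USD)

if __name__ == "__main__":
    print(f"Zombie share of bill: {percent_of(ZOMBIE_MONTHLY, BILL_BEFORE)}%")
    print(f"Bill reduction: {percent_of(BILL_BEFORE - BILL_AFTER, BILL_BEFORE)}%")
    print(f"Annualized zombie burn: ${ZOMBIE_MONTHLY * 12:,}")
```

Both percentages round to the 87% quoted above, and the annualized burn comes to just under the $700K in the title.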
FinOps Maturity Model for ML Endpoints
Most teams land somewhere in the "Crawl" phase of this model. Here's a structured path to "Run."
| Stage | What You Do | Tools | Time to Implement |
|---|---|---|---|
| 🐛 Crawl | Run the detection script manually, delete zombies manually, repeat quarterly | boto3 script, AWS CLI, spreadsheet | 1–2 days (one-time) |
| 🚶 Walk | Auto-tagging on creation, weekly CloudWatch dashboard review, SNS alert when endpoint has 0 invocations for 14 days | EventBridge + Lambda, CloudWatch alarms, SNS | 1–2 weeks |
| 🏃 Run | SCP-enforced tag mandate, real-time cost attribution per ML team, automated zombie deletion workflow (with approval gate), monthly FinOps review per team | AWS Organizations SCP, Cost Explorer, Step Functions approval flow, Slack integration | 2–4 weeks |
The jump from Walk to Run is mostly organizational, not technical. The tooling is straightforward. The hard part is getting ML teams to accept that creating an endpoint without tagging it is the same as expensing a cloud cost with no business justification. Once the SCP is in place and the first denied request happens, the culture shift is immediate.
How much are zombie endpoints costing your team right now?
We'll run our detection script across your AWS accounts, identify every zombie, and give you a prioritized cleanup plan — free, no commitment. Most teams find $20K–$80K/month in savings in the first session.