One AWS Organization Was Leaking $104,724 a Year. Nothing Was Broken.

The scan took 1 minute and 52 seconds. One AWS organization, 12 regions, 123 rules. It came back with 93 findings worth $104,724 a year in recoverable spend.

Here’s the part that matters: nothing was broken. No outage, no bill spike, no angry email from finance. Every one of those resources was healthy and billing. That’s exactly why nobody had looked. Working and optimized are different things, and the gap between them in this org was $8,727 a month.

I’ve spent ten years running AWS infrastructure, and I built CostPatrol to automate the checks I kept doing by hand. This post is the anatomy of that one scan: where the money actually goes, with the real findings and the commands that fix them.

The shape of the waste

Six findings carried the dollars. The other 87 were micro-findings, individually too small to be worth a human’s time, which is why the scanner suppresses them instead of burying you in noise.

| Finding | Monthly | Yearly |
| --- | --- | --- |
| 17 Aurora clusters, only 3 serving traffic | $6,496 | $77.9K |
| 3 DynamoDB tables on the wrong capacity mode | $1,168 | $14.0K |
| Aurora storage not on I/O-Optimized for its profile | $520 | $6.2K |
| One EBS volume, unattached for 1,790 days | $284 | $3.4K |
| 4,688 GB routed through NAT, routable free | $190 | $2.3K |
| An m5.large at under 2% CPU for 14 days | $69 | $0.8K |

Notice what’s NOT on this list. No exotic misconfiguration, no obscure service. Databases, storage, network defaults, and one forgotten instance. The same four categories I find everywhere.
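
A quick sanity check on those six rows, as a sketch: sum the monthly figures and multiply by twelve. The helper name scan_totals is mine and exists only for this example.

```shell
# Sum the six monthly figures from the table and annualize them.
# scan_totals is a hypothetical helper name used only for this example.
scan_totals() {
  printf '%s\n' 6496 1168 520 284 190 69 |
    awk '{ monthly += $1 } END { printf "monthly=%d yearly=%d\n", monthly, monthly * 12 }'
}

scan_totals
```

It prints monthly=8727 yearly=104724, which is the headline number.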

Databases: $8,184 a month, most of it staging

The single biggest leak was 17 Aurora PostgreSQL clusters across 4 regions with only 3 serving traffic. $6,496 a month.

Seventeen clusters don’t appear overnight. Someone spins up a copy for a migration test. A team clones staging for a load test and moves on. A proof of concept gets a database and the proof of concept dies but the database doesn’t. Each cluster made sense the day it was created. Nobody owns the day after.

The check is one command:

aws rds describe-db-clusters \
  --query "DBClusters[].{ID:DBClusterIdentifier,Status:Status,Engine:Engine}" \
  --output table

Then look at connections per cluster in CloudWatch. Zero connections for 14 days is not a database, it’s a subscription.
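
One way to pull that CloudWatch number from the CLI, as a minimal sketch. It assumes Aurora publishes DatabaseConnections under the DBClusterIdentifier dimension for your clusters, and it assumes GNU date for the "days ago" form. The helper name max_connections is illustrative, not an AWS command.

```shell
# Peak DatabaseConnections for one cluster over the last N days (default 14).
# Assumptions: GNU date for the "-d ... days ago" form, and a cluster
# identifier you pass in. max_connections is an illustrative helper name.
max_connections() (
  cluster="$1"
  days="${2:-14}"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name DatabaseConnections \
    --dimensions "Name=DBClusterIdentifier,Value=${cluster}" \
    --start-time "$(date -u -d "${days} days ago" +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 86400 \
    --statistics Maximum \
    --query "max(Datapoints[].Maximum)" \
    --output text
)
```

Call it with a cluster identifier from the table above it, for example max_connections my-staging-cluster (a placeholder name). A result of 0, or None when there are no datapoints at all, is the signal to go ask who owns it.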

Two smaller database leaks rode along. Three DynamoDB tables sat on the wrong capacity mode for their traffic, $1,168 a month. And one production Aurora cluster with heavy I/O wasn’t on I/O-Optimized storage, $520 a month for a setting most teams don’t know exists. That last one stings because the fix is a single modification, no downtime.
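
To see which tables and clusters those two checks apply to, something like the following should work. Treat it as a sketch: the helper names are mine, it only looks at the current region, and a table that has always been provisioned may have no BillingModeSummary at all, which the CLI prints as None.

```shell
# Print each DynamoDB table with its billing mode (current region only).
# A table with no BillingModeSummary prints None; that usually means it has
# been on provisioned capacity since creation.
table_billing_modes() {
  for table in $(aws dynamodb list-tables --query "TableNames[]" --output text); do
    mode="$(aws dynamodb describe-table --table-name "$table" \
      --query "Table.BillingModeSummary.BillingMode" --output text)"
    printf '%s\t%s\n' "$table" "$mode"
  done
}

# Show the storage type of every Aurora cluster in the current region.
cluster_storage_types() {
  aws rds describe-db-clusters \
    --query "DBClusters[].{ID:DBClusterIdentifier,Storage:StorageType}" \
    --output table
}
```

Whether a switch pays off depends on the traffic profile, so these only tell you where to look. For the Aurora case, the modification is along the lines of aws rds modify-db-cluster with --storage-type aurora-iopt1 against the cluster identifier.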

Storage: the volume that outlived four re-orgs

One EBS volume had been unattached for 1,790 days. Almost five years of paying for a disk connected to nothing. It was part of $284 a month found in a single region, alongside idle RDS instances and orphaned Elastic IPs.

Test resources become permanent costs the moment nobody cleans up. The instance gets terminated, the volume survives, and from that day it’s pure rent. At $0.08 to $0.10 per GB-month it never costs enough to trigger an alarm. It just compounds.

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query "Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}" \
  --output table

Run that in every region you’ve ever used. I’ve yet to see a mature account where it comes back empty.
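
A rough way to do the every-region pass in one go. This is a sketch, not the scanner's method: it loops over the regions enabled for the account, needs a default region configured so describe-regions can run, and the function name is illustrative.

```shell
# Run the unattached-volume check in every region enabled for the account.
# unattached_volumes_all_regions is an illustrative name, not an AWS command.
unattached_volumes_all_regions() {
  for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
    echo "== ${region}"
    aws ec2 describe-volumes \
      --region "$region" \
      --filters Name=status,Values=available \
      --query "Volumes[].[VolumeId,Size,VolumeType,CreateTime]" \
      --output text
  done
}
```

Sort the output by CreateTime and the oldest survivors float to the top.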

Network: the default that bills by the gigabyte

The org pushed 4,688 GB through a NAT Gateway that could have routed free. NAT data processing is $0.045/GB, which looks like nothing until a chatty service talks to S3 through it all day. $190 a month here, and I’ve seen the same pattern burn $2K a month elsewhere.

The fix is VPC Gateway Endpoints for S3 and DynamoDB. They’re free and faster. AWS will not set them up for you, because the default route just works, and the default route is what ships.
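
Creating one looks roughly like this. A sketch under assumptions: the VPC ID, route table ID, and region are placeholders you supply, and the function name is mine.

```shell
# Create an S3 gateway endpoint and associate it with one route table.
# The VPC ID, route table ID, and region are values you pass in.
# add_s3_gateway_endpoint is an illustrative name, not an AWS command.
add_s3_gateway_endpoint() (
  vpc_id="$1"
  route_table_id="$2"
  region="$3"
  aws ec2 create-vpc-endpoint \
    --region "$region" \
    --vpc-id "$vpc_id" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.${region}.s3" \
    --route-table-ids "$route_table_id"
)
```

Swap the .s3 suffix for .dynamodb to cover DynamoDB. Before creating anything, aws ec2 describe-vpc-endpoints with the filter Name=vpc-endpoint-type,Values=Gateway shows what already exists.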

Compute: the sensor nobody read

An m5.large ran for two weeks at under 2% CPU. A monitoring sensor that had stopped reporting anything useful. $69 a month, $828 a year, for a machine whose job had ended without telling anyone.

Small number, but it’s the purest example of the whole problem: the instance was GREEN on every dashboard. Health checks passed. It was perfectly operational and completely pointless.
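
Dashboards won't tell you this, but the CPU metric will. A minimal sketch, assuming GNU date and an instance ID you pass in; peak_cpu is a hypothetical helper name.

```shell
# Highest hourly-average CPUUtilization for one instance over the last
# N days (default 14). Assumes GNU date for the "-d ... days ago" form.
# peak_cpu is an illustrative helper name, not an AWS command.
peak_cpu() (
  instance_id="$1"
  days="${2:-14}"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$(date -u -d "${days} days ago" +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 3600 \
    --statistics Average \
    --query "max(Datapoints[].Average)" \
    --output text
)
```

If the busiest hour in two weeks is still in the low single digits, the instance is a candidate for a conversation, not yet for termination.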

The three patterns underneath all of it

After enough of these scans, the findings stop looking random. The same three forces produce almost every dollar:

“It works, don’t touch it.” The 17 Aurora clusters ran fine. The sensor was healthy. Working infrastructure repels scrutiny, which means the most expensive waste is always the most stable.

Defaults are generous and silent. NAT routing, gp2 volumes, log groups that never expire, send-everything metric streams. AWS defaults optimize for “it works on day one,” and every one of them bills quietly forever until someone goes back to tighten it.

Temporary resources have no expiry date. The 1,790-day volume started as somebody’s afternoon experiment. Five years later it was still on the invoice. Nothing in AWS asks “are you done with this?”
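
Two of the silent defaults named above are easy to surface from the CLI. A sketch for the current region only; the function names are illustrative.

```shell
# Log groups with no retention policy set, which keep their data indefinitely.
log_groups_without_retention() {
  aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' \
    --output text
}

# Volumes still on gp2, with their size in GiB.
gp2_volumes() {
  aws ec2 describe-volumes \
    --filters Name=volume-type,Values=gp2 \
    --query "Volumes[].[VolumeId,Size]" \
    --output text
}
```

Neither list is a bill by itself. It is a to-do list of settings nobody chose on purpose.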

Why the bill never flagged it

$8,727 a month and not a single alert fired. AWS Budgets watches for spend crossing a threshold, and Cost Explorer explains totals after the fact. Neither tool failed. Neither has this job. Steady waste never crosses a new threshold and never moves the total. It was in last month’s bill and the bill before that, so it looks like baseline. The anomaly detector has nothing to detect.

Structural waste hides in plain sight precisely because it’s stable.

Check your own account

You don’t need a tool to start. The commands above, plus ten more I keep in a free CLI guide, will surface the common patterns in under an hour. Real money, no spend required.

The catch is the second month. Manual audits catch waste once, and new waste starts accumulating the day after. That’s the gap CostPatrol closes: the scan in this post ran 123 rules across 12 regions in under two minutes, through a read-only role, and every finding lands with the dollar amount and the exact fix command. Free under $5K/month of spend.

Either way, run the volume check today. 1,790 days started as day one.
