Cloud bills don't lie, but they do hide. Every month, money quietly drains out of your AWS, GCP, or Azure account through decisions that made perfect sense at the time. A database sized for peak Black Friday traffic. A NAT gateway nobody remembered to delete. Logs that have been rotating into S3 for three years without anyone ever opening them.
This isn't about big mistakes. It's about the slow, invisible tax that accumulates until your CFO asks a question you can't answer.
Let me show you exactly where the money goes, and how to stop the bleeding.
Problem #1: You're Paying for Availability You Don't Need
Multi-AZ RDS. Cross-region replication. Active-active load balancers across three data centers. These sound responsible. They are, for some workloads. But I've watched startups with 200 daily active users pay $4,000/month for infrastructure that would survive a nuclear strike.
The question nobody asks is: what does downtime actually cost us?
If your answer is "maybe a few angry emails," you don't need five-nines. You need to do the math:
Monthly downtime cost = (hourly revenue loss) × (expected downtime hours per month)
A SaaS with $50k ARR across 500 users brings in roughly $5.70 per hour. Even a full day of downtime costs about $137 in direct revenue, a small fraction of a $4,000/month redundancy bill. Churn and reputational damage add to that figure, so estimate them too before you decide. Size your reliability to your actual risk, not to what AWS makes easy to check.
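To make that arithmetic concrete, here is a minimal shell sketch. The function name `downtime_cost` is made up for illustration, and it assumes revenue accrues evenly across the 8,760 hours in a year, which ignores churn, SLA credits, and peak-hour effects.

```shell
#!/bin/sh
# downtime_cost ARR_DOLLARS OUTAGE_HOURS
# Prints the direct revenue lost during an outage, assuming revenue
# accrues evenly across the 8760 hours in a year (a simplification).
downtime_cost() {
  awk -v arr="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", arr / 8760 * hours }'
}

downtime_cost 50000 2     # two-hour outage at $50k ARR, prints 11.42
downtime_cost 50000 24    # full-day outage at $50k ARR, prints 136.99
```

Compare the output against the monthly price of the redundancy you are considering before you buy it.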
Problem #2: NAT Gateways Are a Silent Budget Killer
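This one will make you angry. NAT gateway pricing on AWS is $0.045 per GB of data processed in us-east-1, on top of an hourly charge for the gateway itself. Doesn't sound like much. Then you realize that every request your private subnets send to an AWS public endpoint (S3 fetches, DynamoDB queries, SQS polls) is going through that gateway. Traffic between services inside the same VPC does not touch it, but traffic to AWS services does unless you say otherwise.
The fix? Use VPC endpoints for AWS services. S3 and DynamoDB support Gateway endpoints, which are free. SQS and dozens of other services support Interface endpoints, which carry an hourly charge and a lower per-GB rate than the NAT gateway. Both route traffic privately, without touching your NAT gateway. A single S3 Gateway endpoint on a data-heavy workload can cut hundreds of dollars per month in data processing fees.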
Before: EC2 → NAT Gateway ($0.045/GB) → S3
After: EC2 → Gateway VPC Endpoint (free) → S3
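Creating the S3 Gateway endpoint is a single AWS CLI call. The sketch below wraps it in a function; the function name, the default region, and the IDs in the example call are placeholders, and the function is only defined here, not invoked.

```shell
#!/bin/sh
# create_s3_gateway_endpoint VPC_ID ROUTE_TABLE_ID [REGION]
# Adds an S3 Gateway endpoint to the VPC and associates it with one route
# table, so S3 traffic from subnets using that table bypasses the NAT gateway.
create_s3_gateway_endpoint() {
  region="${3:-us-east-1}"
  aws ec2 create-vpc-endpoint \
    --region "$region" \
    --vpc-id "$1" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.${region}.s3" \
    --route-table-ids "$2"
}

# Example call with placeholder IDs (requires AWS credentials):
# create_s3_gateway_endpoint vpc-0123456789abcdef0 rtb-0123456789abcdef0
```

Only subnets whose route table is associated with the endpoint benefit, so repeat the association for each private route table.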
Check your VPC Flow Logs. If your NAT gateway is processing more than a few gigabytes per day, there's almost certainly money to recover here.
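To put a number on it, a rough estimate helps. This sketch assumes a 30-day month and defaults to the $0.045/GB processing rate quoted above; the function name is illustrative and the gateway's hourly charge is left out.

```shell
#!/bin/sh
# nat_processing_cost GB_PER_DAY [PRICE_PER_GB]
# Estimates the monthly NAT gateway data-processing charge over 30 days.
# The hourly gateway charge is not included.
nat_processing_cost() {
  awk -v gb="$1" -v price="${2:-0.045}" 'BEGIN { printf "%.2f\n", gb * 30 * price }'
}

nat_processing_cost 50     # prints 67.50
nat_processing_cost 500    # prints 675.00
```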
Problem #3: Your Instances Are the Wrong Shape, Not the Wrong Size
Teams obsess over downsizing instances. They ignore instance family entirely. This is backwards.
A c6i.2xlarge (compute-optimized) has twice the vCPUs and half the RAM of an r6i.xlarge (memory-optimized): 8 vCPUs and 16 GiB versus 4 vCPUs and 32 GiB, for an on-demand price roughly a third higher in us-east-1. If you're running a CPU-bound API server on an r-family instance because it was the default, you're paying a memory premium for resources your application never touches.
Collect CloudWatch metrics for a week on your production instances. CPU utilization is reported by default; memory utilization requires the CloudWatch agent. If memory utilization on an r-family instance sits below 40% consistently, you're in the wrong family. If CPU is constantly pegged but memory is idle, you're definitely in the wrong family.
| Workload | Right Family | Wrong Family |
|---|---|---|
| API servers, batch jobs | c (compute) | r (memory) |
| In-memory caches, databases | r (memory) | c (compute) |
| ML inference | inf or g | anything general |
The right instance shape can give you better performance and a lower bill simultaneously.
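The rule of thumb above can be sketched as a small shell function that takes a week's average CPU and memory utilization. The 40% memory threshold comes from the text; the 70% "busy" threshold and the function name are assumptions chosen for illustration, not AWS guidance.

```shell
#!/bin/sh
# suggest_family AVG_CPU_PERCENT AVG_MEM_PERCENT
# Rough heuristic only: flags a clear CPU-heavy or memory-heavy profile.
suggest_family() {
  awk -v cpu="$1" -v mem="$2" 'BEGIN {
    if (cpu >= 70 && mem < 40)      print "c (compute-optimized)"
    else if (mem >= 70 && cpu < 40) print "r (memory-optimized)"
    else                            print "no clear signal; keep current family"
  }'
}

suggest_family 85 20   # prints: c (compute-optimized)
suggest_family 15 80   # prints: r (memory-optimized)
```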
Problem #4: Idle Non-Production Resources Are Running 24/7
Development databases. Staging environments. Load testing clusters that ran once in January. QA environments that mirror production "just in case." They're all sitting there, at full price, doing absolutely nothing at 3am on a Sunday.
The fix is embarrassingly simple: scheduled shutdowns.
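# EventBridge rule that fires at 18:00 UTC on weekdays (cron expressions are evaluated in UTC).
# A rule only defines the schedule: attach a target that stops the database
# (for example a Lambda function, via `aws events put-targets`) before relying on it.
aws events put-rule \
  --schedule-expression "cron(0 18 ? * MON-FRI *)" \
  --name StopDevDatabase
# Matching rule that fires at 08:00 UTC on weekdays, for the target that starts it again.
aws events put-rule \
  --schedule-expression "cron(0 8 ? * MON-FRI *)" \
  --name StartDevDatabase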
Non-production environments that run 24/7 are environments running 168 hours a week. Constrain them to business hours — 50 hours a week — and you've just cut that environment's compute cost by 70% overnight, without changing a single line of application code.
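The percentage is easy to check for whatever schedule you pick. A minimal sketch (the function name is illustrative):

```shell
#!/bin/sh
# schedule_savings HOURS_ON_PER_WEEK
# Prints the percentage of always-on compute hours avoided by running
# only HOURS_ON_PER_WEEK out of the 168 hours in a week.
schedule_savings() {
  awk -v on="$1" 'BEGIN { printf "%.1f\n", (1 - on / 168) * 100 }'
}

schedule_savings 50   # weekdays, 10 hours a day: prints 70.2
schedule_savings 84   # every day, 12 hours a day: prints 50.0
```

Storage is still billed while an RDS instance is stopped, so the figure applies to the compute portion of the bill only.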
Problem #5: Data Transfer Fees Are the Fine Print Nobody Reads
Compute is priced prominently. Data transfer is buried in a footnote until your bill arrives.
The rules that catch most teams off-guard:
- Inter-AZ traffic costs money. Traffic between two EC2 instances in the same region but different availability zones is charged per GB, in both directions.
- Egress to the internet is expensive. Ingress is free. If your application serves large files directly from EC2, you're paying egress on every byte.
- Cross-region replication is egress. Replicating your RDS snapshot from us-east-1 to eu-west-1 for disaster recovery? That's a per-GB charge, every time.
The architectural response: push data delivery to CloudFront (or equivalent CDNs). Egress from CloudFront is cheaper than egress from EC2 or S3 directly, and CDN cache hits mean you're not re-serving the same bytes at full price repeatedly.
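For a sense of scale, here is a rough estimator for the three transfer types listed above. The per-GB rates are assumptions based on approximate us-east-1 list prices and may be out of date; check the current pricing page before relying on them. The function name is illustrative.

```shell
#!/bin/sh
# transfer_cost KIND GB
# KIND is one of: inter-az, cross-region, internet.
# Rates are assumed us-east-1 list prices in USD per GB and may be outdated.
transfer_cost() {
  awk -v kind="$1" -v gb="$2" 'BEGIN {
    rate["inter-az"]     = 0.02   # 0.01 out plus 0.01 in
    rate["cross-region"] = 0.02
    rate["internet"]     = 0.09
    if (!(kind in rate)) { print "unknown kind: " kind > "/dev/stderr"; exit 1 }
    printf "%.2f\n", gb * rate[kind]
  }'
}

transfer_cost inter-az 1000   # prints 20.00
transfer_cost internet 1000   # prints 90.00
```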
Problem #6: You're Using On-Demand Pricing for Predictable Workloads
On-demand instances exist for unpredictable, spiky workloads. Your production API server that handles consistent traffic 24 hours a day is not that.
Reserved Instances and Savings Plans offer discounts of 30-60% over on-demand for committing to one or three years. Yet most teams either don't use them, or use them incorrectly — buying reservations for instance types they then change six months later.
The pragmatic approach:
- Run on-demand for 2-3 months on new workloads
- Analyze your baseline usage in Cost Explorer
- Buy Compute Savings Plans (not instance-specific reservations) — they apply to any EC2 instance family and Fargate automatically
- Cover only your baseline with commitments; use Spot for burst capacity above that
A Compute Savings Plan on a steady-state workload is not a lock-in risk. It's just arithmetic.
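One conservative way to find the baseline, sketched below, is to take the minimum of your hourly usage samples: a commitment at that level is never left unused. The function name and the sample values are hypothetical, and treating the minimum as the baseline is an assumption; a low percentile is a common alternative.

```shell
#!/bin/sh
# baseline_usage: reads one usage sample per line on stdin (for example
# hourly on-demand spend in USD exported from Cost Explorer) and prints
# the minimum, the level a commitment can cover without going unused.
baseline_usage() {
  awk 'NF { v = $1 + 0; if (!seen || v < min) { min = v; seen = 1 } }
       END { if (seen) print min }'
}

# Hypothetical hourly spend samples in USD.
printf '%s\n' 4.20 3.80 6.50 3.10 5.00 | baseline_usage   # prints 3.1
```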
Problem #7: Logs Are Stored Like They're Made of Gold
Logs are cheap to generate and expensive to store forever. Yet most teams pipe everything into CloudWatch Logs or S3 with no retention policy, no tiering, and no expiry. Three years later, the storage bill is a significant line item and nobody has opened a single log file from 2021.
Set retention policies immediately:
# Set CloudWatch log group retention to 30 days
aws logs put-retention-policy \
--log-group-name /app/production \
--retention-in-days 30
For logs you genuinely need to keep — compliance, audit trails, security events — tier them aggressively:
0–30 days: S3 Standard (hot access)
30–90 days: S3 Standard-Infrequent Access (roughly 45% cheaper than Standard)
90+ days: S3 Glacier Instant Retrieval (roughly 80% cheaper than Standard)
You do not need hot-tier storage for a log from fourteen months ago. Stop paying for it.
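In S3, that tiering is a lifecycle rule. The sketch below writes a lifecycle configuration with the 30-day and 90-day transitions described above; the `logs/` prefix, the rule ID, the file name, and the bucket name in the commented command are all placeholders.

```shell
#!/bin/sh
# Write an S3 lifecycle configuration that moves objects under logs/
# to Standard-IA after 30 days and Glacier Instant Retrieval after 90.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ]
    }
  ]
}
EOF

# Apply it to a bucket (placeholder name; requires AWS credentials):
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket my-log-bucket \
#   --lifecycle-configuration file://lifecycle.json
```

Applying a lifecycle configuration replaces any existing one on the bucket, so merge rules rather than overwriting them.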
The Bill You Should Actually Be Reading
Most engineers look at the total cloud bill and feel vaguely guilty. The right habit is different: look at the unit economics.
Cost per API request
Cost per active user per month
Cost per GB stored per year
These numbers expose inefficiency that raw dollar amounts hide. If your cost per API request is rising while traffic is flat, something is wrong — and it's almost never the obvious thing.
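Computing these takes one division per metric. A minimal sketch with made-up figures (the function name and the monthly numbers are hypothetical):

```shell
#!/bin/sh
# unit_cost TOTAL_MONTHLY_COST UNITS
# Prints cost per unit (request, active user, GB) with six decimal places.
unit_cost() {
  awk -v cost="$1" -v units="$2" 'BEGIN {
    if (units <= 0) { print "units must be positive" > "/dev/stderr"; exit 1 }
    printf "%.6f\n", cost / units
  }'
}

# Hypothetical month: $12,000 bill, 40 million API requests, 8,000 active users.
unit_cost 12000 40000000   # cost per API request, prints 0.000300
unit_cost 12000 8000       # cost per active user, prints 1.500000
```

Track the same figures month over month; the trend matters more than any single value.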
Your cloud infrastructure isn't just infrastructure. It's a slow drain or a tight ship, depending entirely on the decisions you make about things that seem too small to matter.
They compound. They always do.