The Hosting Dilemma
When deploying models like Llama 3.1 70B, developers face a critical choice: use a serverless API provider (like DeepInfra, Together AI, or OpenRouter) or self-host the model on dedicated cloud GPUs (via AWS, GCP, RunPod, or Lambda Labs).
This article outlines the mathematical inflection point where self-hosting becomes financially viable.
Cost Breakdown
Option A: Serverless APIs (Pay-as-you-go)
- Average Cost: $0.70 per 1M tokens (blended input/output)
- Monthly Cost Formula:
Tokens_per_Month * $0.0000007
- Advantages: Zero maintenance, instant scaling, no cold starts.
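The monthly cost formula above can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the function name `serverless_monthly_cost` and the sample volumes are hypothetical, and the only figure taken from the article is the blended price of $0.70 per 1M tokens.

```python
# Blended serverless price quoted in the article: $0.70 per 1M tokens.
PRICE_PER_MILLION_TOKENS_USD = 0.70


def serverless_monthly_cost(tokens_per_month: float,
                            price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Return the pay-as-you-go cost in USD for one month's token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


# Sample volumes are illustrative assumptions, not figures from the article.
for volume in (100e6, 1e9, 5e9):
    cost = serverless_monthly_cost(volume)
    print(f"{volume:,.0f} tokens/month -> ${cost:,.2f}")
```

Because the cost is linear in volume, doubling the token count doubles the bill; there is no fixed component to amortise.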
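Option B: Dedicated GPU Instances (Self-hosted)
The FP16 weights of a 70B-parameter model occupy about 140 GB (70B parameters × 2 bytes), so running Llama 70B in FP16 with high throughput requires at least 2x A100 (80GB) or 4x L40S GPUs to hold the weights and leave room for KV cache overhead.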
- RunPod / Lambda Labs Dedicated A100 (80GB) Node: ~$3.20/hour per GPU, so 2 GPUs × $3.20 × 720 hours ≈ $4,600/month (24/7 run)
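- AWS EC2 g5.12xlarge (4x A10G): ~$5.67/hour = $4,080/month (on-demand). Note that 4x A10G provides 96 GB of GPU memory in total, so a 70B model fits on this instance only when quantized, not in FP16.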
- Advantages: Custom model weights, guaranteed privacy, zero rate limits.
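The monthly figures above follow from multiplying the hourly rate by the number of GPUs and the hours in a month. The sketch below reproduces that arithmetic, assuming a 30-day month (720 hours) and two A100s for the dedicated node; the function name `dedicated_monthly_cost` is hypothetical, and the hourly rates are the approximate ones quoted in the list.

```python
# Assumption: a 30-day month of continuous (24/7) operation.
HOURS_PER_MONTH = 24 * 30  # 720 hours


def dedicated_monthly_cost(hourly_rate_usd: float,
                           gpu_count: int = 1,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Return the fixed monthly cost in USD of an always-on GPU node."""
    return hourly_rate_usd * gpu_count * hours


# 2x A100 (80GB) billed per GPU at ~$3.20/hour.
a100_node = dedicated_monthly_cost(3.20, gpu_count=2)

# g5.12xlarge is billed per instance (~$5.67/hour), so gpu_count stays 1.
g5_instance = dedicated_monthly_cost(5.67)

print(f"2x A100 node:  ${a100_node:,.0f}/month")
print(f"g5.12xlarge:   ${g5_instance:,.0f}/month")
```

Unlike serverless pricing, this cost does not depend on how many tokens the node actually processes, which is why a break-even volume exists.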
Finding the Inflection Point
To justify a fixed on-demand cost of roughly $4,000/month against serverless pricing of $0.70 per 1M tokens, your monthly token volume must reach at least the following threshold:
Tokens Threshold = $4,000 / $0.0000007 ≈ 5,714,285,714 tokens/month
That equates to roughly 5.71 billion tokens per month, or about 2,200 tokens per second (2.2 tokens per millisecond) sustained around the clock over a 30-day month.
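The break-even calculation can be checked with a short script. This is a sketch under the article's own assumptions ($4,000/month fixed cost, $0.70 per 1M tokens) plus a 30-day month; the function names `break_even_tokens` and `required_tokens_per_second` are hypothetical.

```python
# Assumption: a 30-day month, i.e. 2,592,000 seconds.
SECONDS_PER_MONTH = 30 * 24 * 3600


def break_even_tokens(monthly_fixed_cost_usd: float,
                      price_per_million_usd: float) -> float:
    """Monthly token volume at which serverless spend equals the fixed cost."""
    return monthly_fixed_cost_usd / price_per_million_usd * 1_000_000


def required_tokens_per_second(tokens_per_month: float) -> float:
    """Sustained throughput needed to process the volume within one month."""
    return tokens_per_month / SECONDS_PER_MONTH


threshold = break_even_tokens(4_000, 0.70)
rate = required_tokens_per_second(threshold)

print(f"Break-even volume: {threshold / 1e9:.2f}B tokens/month")
print(f"Sustained rate:    {rate:,.0f} tokens/second")
```

Swapping in a different fixed cost or serverless price moves the threshold proportionally, so the same two functions can be reused for other providers or instance types.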
Summary Analysis:
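- Monthly Volume < 5B tokens: Serverless APIs are cheaper, and the gap widens as volume falls.
- Monthly Volume between 5B and 6B tokens: The two options cost roughly the same; the break-even sits near 5.7B tokens.
- Monthly Volume > 6B tokens: Self-hosting saves money and provides dedicated execution bandwidth.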