Open-Source Self-Hosting vs Serverless APIs: A Financial Analysis

The Hosting Dilemma

When deploying models like Llama 3.1 70B, developers face a critical choice: use a serverless API provider (like DeepInfra, Together AI, or OpenRouter) or self-host the model on dedicated cloud GPUs (via AWS, GCP, RunPod, or Lambda Labs).

This article estimates the break-even volume (the "inflection point") at which self-hosting becomes cheaper than paying per token.

Cost Breakdown

#### Option A: Serverless APIs (Pay-as-you-go)

* Average Cost: $0.70 per 1M tokens (blended input/output)
* Monthly Cost Formula: Tokens_per_Month * $0.0000007
* Advantages: Zero maintenance, instant scaling, no cold starts.
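The monthly cost formula for Option A can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the function name `serverless_monthly_cost` is hypothetical, and the $0.70 blended rate is the article's average figure rather than any specific provider's price.

```python
# Blended price per 1M tokens, taken from the article's average figure.
PRICE_PER_MILLION_TOKENS_USD = 0.70


def serverless_monthly_cost(
    tokens_per_month: float,
    price_per_million: float = PRICE_PER_MILLION_TOKENS_USD,
) -> float:
    """Return the monthly pay-as-you-go bill in USD for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Print the bill for a few illustrative monthly volumes.
    for volume in (100e6, 1e9, 5e9):
        cost = serverless_monthly_cost(volume)
        print(f"{volume:>15,.0f} tokens/month -> ${cost:,.2f}")
```

Because the cost is strictly linear in volume, doubling the traffic doubles the bill; there is no fixed component to amortise.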

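#### Option B: Dedicated GPU Instances (Self-hosted)

To run a Llama 70B model in FP16 with high throughput, you need at least 2x A100 (80GB) or 4x L40S GPUs: the weights alone occupy about 140 GB, and the remaining memory is needed for KV cache overhead.

* RunPod / Lambda Labs Dedicated A100 (80GB) Node: ~$3.20/hour per GPU, so ~$6.40/hour for 2 GPUs = ~$4,600/month (24/7 run)
* AWS EC2 g5.12xlarge (4x A10G): ~$5.67/hour = ~$4,080/month (on-demand). Note that 4x A10G provides 96 GB of VRAM in total, so this instance can hold a 70B model only in quantized form, not in FP16.
* Advantages: Custom model weights, guaranteed privacy, zero rate limits.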
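The monthly figures above follow from simple arithmetic, sketched below. The helper names (`fp16_weight_gb`, `monthly_gpu_cost`) are hypothetical, and the 720-hour month (30 days, running 24/7) is an assumption; actual invoices depend on the provider's billing granularity and the length of the month.

```python
# Assumption: a 30-day month with the node running around the clock.
HOURS_PER_MONTH = 30 * 24  # 720


def fp16_weight_gb(params_billions: float) -> float:
    """Approximate VRAM for the weights alone: 2 bytes per parameter in FP16."""
    return params_billions * 2.0


def monthly_gpu_cost(
    hourly_rate_usd: float,
    gpu_count: int = 1,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Return the monthly cost in USD of renting `gpu_count` GPUs continuously."""
    return hourly_rate_usd * gpu_count * hours


if __name__ == "__main__":
    print(f"70B weights in FP16: ~{fp16_weight_gb(70):.0f} GB")
    print(f"2x A100 at $3.20/GPU-hour: ${monthly_gpu_cost(3.20, gpu_count=2):,.0f}/month")
    print(f"g5.12xlarge at $5.67/hour: ${monthly_gpu_cost(5.67):,.0f}/month")
```

The weight estimate ignores KV cache and activation memory, which is why the hardware recommendation leaves headroom above 140 GB.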

Finding the Inflection Point

To justify a fixed cost of roughly $4,000/month for a dedicated node compared to serverless pricing of $0.70/1M tokens, you must process at least the following number of tokens per month:

$$
\text{Tokens Threshold} = \frac{\$4{,}000}{\$0.0000007} \approx 5{,}714{,}285{,}714 \text{ tokens/month}
$$

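That equates to about 5.71 billion tokens/month, or roughly 2.2 tokens per millisecond (about 2,200 tokens per second) sustained around the clock over a 30-day month. The comparison only holds if the self-hosted node can actually deliver that throughput continuously.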
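The break-even arithmetic can be reproduced with a short Python sketch. The function names are hypothetical, the $4,000 fixed cost and $0.70 per 1M tokens are the article's figures, and the 30-day month is an assumption.

```python
# Assumption: a 30-day month of continuous operation.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000


def break_even_tokens(monthly_fixed_cost_usd: float, price_per_million_usd: float) -> float:
    """Monthly token volume at which the serverless bill equals the fixed GPU cost."""
    return monthly_fixed_cost_usd / price_per_million_usd * 1_000_000


def required_tokens_per_second(tokens_per_month: float) -> float:
    """Sustained throughput needed to process the given volume in one month."""
    return tokens_per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    threshold = break_even_tokens(4_000, 0.70)
    rate = required_tokens_per_second(threshold)
    print(f"Break-even volume: {threshold:,.0f} tokens/month")
    print(f"Sustained throughput: {rate:,.0f} tokens/s ({rate / 1000:.1f} tokens/ms)")
```

Substituting a different hourly rate or API price shifts the threshold proportionally, so the calculation is worth rerunning whenever either price changes.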

```yaml
Summary Analysis:
  - Monthly Volume < 5B tokens: Serverless APIs are significantly cheaper.
  - Monthly Volume > 6B tokens: Self-hosting saves money and provides dedicated execution bandwidth.
```