Infrastructure · Published June 10, 2026 · Updated June 22, 2026 · 8 min read · By whattAI Editorial Team

Open-Source Self-Hosting vs Serverless APIs: A Financial Analysis

Break down server hosting costs (AWS, RunPod) versus pay-as-you-go serverless endpoints to find your inflection point for open-weights hosting.

The Hosting Dilemma

When deploying models like Llama 3.1 70B, developers face a critical choice: use a serverless API provider (like DeepInfra, Together AI, or OpenRouter) or self-host the model on dedicated cloud GPUs (via AWS, GCP, RunPod, or Lambda Labs).

This article outlines the mathematical inflection point where self-hosting becomes financially viable.


Cost Breakdown

Option A: Serverless APIs (Pay-as-you-go)

  • Average Cost: $0.70 per 1M tokens (blended input/output)
  • Monthly Cost Formula: Tokens_per_Month * $0.0000007
  • Advantages: Zero maintenance, instant scaling, no cold starts.
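The monthly cost formula above can be written as a small function. This is a minimal sketch that assumes the $0.70 blended rate applies uniformly to every token. Real providers usually price input and output tokens separately. The function name is illustrative.

```python
def serverless_monthly_cost(tokens_per_month: float, price_per_million: float = 0.70) -> float:
    """Monthly serverless bill in USD, assuming a single blended price per 1M tokens."""
    # Dividing by 1M first keeps the arithmetic equivalent to tokens * $0.0000007
    # while avoiding a tiny per-token float.
    return tokens_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    for volume in (100_000_000, 1_000_000_000, 5_000_000_000):
        print(f"{volume:>13,} tokens -> ${serverless_monthly_cost(volume):,.2f}")
```

With this formula the bill scales linearly with volume, and there is no fixed cost at zero traffic.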

Option B: Dedicated GPU Instances (Self-hosted)

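In FP16, each parameter takes 2 bytes, so the weights of Llama 3.1 70B alone occupy about 140GB (70B × 2 bytes). Serving it with high throughput therefore requires at least 2x A100 (80GB) or 4x L40S (48GB) GPUs, which leaves room for the KV cache.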
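The hardware requirement comes from simple arithmetic on weight size. The rough sketch below assumes decimal gigabytes and ignores activation memory and framework overhead, so real headroom will be somewhat smaller. The precision table, configurations, and function names are illustrative.

```python
# Approximate bytes per parameter for common inference precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Total VRAM per configuration: GPU count x memory per GPU (decimal GB).
CONFIGS = {
    "2x A100 80GB": 2 * 80,
    "4x L40S 48GB": 4 * 48,
    "4x A10G 24GB": 4 * 24,
}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_headroom_gb(total_vram_gb: float, params_billion: float, precision: str = "fp16") -> float:
    """VRAM left for the KV cache after loading weights (negative means the weights don't fit)."""
    return total_vram_gb - weight_memory_gb(params_billion, precision)


if __name__ == "__main__":
    for name, vram in CONFIGS.items():
        headroom = kv_cache_headroom_gb(vram, 70)
        print(f"{name}: {vram} GB total, {headroom:+.0f} GB left for KV cache at FP16")
```

Under these assumptions, a 4x A10G setup cannot hold the FP16 weights at all. At INT8 (about 70GB of weights), the same 96GB leaves room for the KV cache.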

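  • RunPod / Lambda Labs dedicated A100 (80GB): ~$3.20/hour per GPU × 2 GPUs = ~$4,600/month (24/7, 720 hours)
  • AWS EC2 g5.12xlarge (4x A10G, 96GB total VRAM): ~$5.67/hour = ~$4,080/month (on-demand, 720 hours). Note that 96GB cannot hold FP16 70B weights, so this instance requires 8-bit or 4-bit quantization.
  • Advantages: Custom model weights, data that stays on infrastructure you control, and no provider-imposed rate limits (throughput is bounded only by your hardware).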
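Each monthly figure is the hourly rate × the number of billed units × hours. This minimal sketch assumes a 30-day, 720-hour month, which reproduces the figures above, and on-demand pricing with no reserved or spot discounts. The function name is hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours; a 730-hour average month adds about 1.4%


def dedicated_monthly_cost(hourly_rate: float, units: int = 1, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly USD cost for GPUs or instances billed per hour and left running 24/7."""
    return hourly_rate * units * hours


if __name__ == "__main__":
    # RunPod / Lambda Labs: ~$3.20/hour per A100, two GPUs
    print(f"2x A100: ${dedicated_monthly_cost(3.20, units=2):,.0f}/month")
    # AWS g5.12xlarge: ~$5.67/hour for the whole instance
    print(f"g5.12xlarge: ${dedicated_monthly_cost(5.67):,.0f}/month")
```

Unlike the serverless formula, this cost is fixed. It is owed whether the GPUs are busy or idle.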

Finding the Inflection Point

Taking roughly $4,000/month as the on-demand cost of a dedicated setup, self-hosting only matches serverless pricing of $0.70/1M tokens once you process the following number of tokens per month:

Tokens Threshold = $4,000 / $0.0000007 ≈ 5,714,285,714 tokens/month
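The same threshold can be expressed as a function, so it can be recomputed whenever either price changes. This minimal sketch assumes a flat monthly GPU bill and a single blended serverless rate. The name is illustrative.

```python
def break_even_tokens(monthly_fixed_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which a flat GPU bill equals the serverless bill."""
    # fixed_cost / (price_per_million / 1M), rearranged to keep the float math stable
    return monthly_fixed_cost / price_per_million * 1_000_000


if __name__ == "__main__":
    for fixed in (4000, 4608):
        print(f"${fixed:,}/month -> {break_even_tokens(fixed, 0.70):,.0f} tokens/month")
```

If you plug in the two-GPU A100 figure of ~$4,608/month instead of $4,000, the threshold rises to roughly 6.6 billion tokens/month.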

That equates to 5.71 billion tokens/month. Over a 30-day month, this is roughly 2.2 tokens per millisecond (about 2,200 tokens per second) of continuous, unbroken server utilisation.
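Converting the monthly threshold into a sustained rate shows what the hardware has to deliver. This sketch assumes a 30-day month and perfectly even traffic. Real traffic is bursty, so peak capacity would need to be higher. The function name is illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month


def required_tokens_per_second(tokens_per_month: float) -> float:
    """Average throughput needed to process a monthly volume with no idle time."""
    return tokens_per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    rate = required_tokens_per_second(5_714_285_714)
    print(f"{rate:,.0f} tokens/s ({rate / 1000:.2f} tokens/ms)")
```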

Summary Analysis:
  - Monthly volume below ~5.7B tokens: Serverless APIs are cheaper, and the gap widens as volume falls.
  - Monthly volume above ~5.7B tokens: Self-hosting saves money and provides dedicated execution bandwidth, provided the GPUs can sustain roughly 2,200 tokens per second.
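Another way to read the summary is that a flat GPU bill gets cheaper per token as volume grows, while the serverless rate stays constant. This minimal sketch compares the two at a given volume. It assumes the $4,000/month figure, the $0.70 blended rate, and hardware that can actually sustain the volume. The function names are illustrative.

```python
def self_hosted_cost_per_million(monthly_fixed_cost: float, tokens_per_month: float) -> float:
    """Effective USD per 1M tokens when a flat monthly bill is spread over actual volume."""
    return monthly_fixed_cost / (tokens_per_month / 1_000_000)


def cheaper_option(tokens_per_month: float, monthly_fixed_cost: float = 4000.0,
                   price_per_million: float = 0.70) -> str:
    """Return which option costs less at this volume (ties go to serverless)."""
    serverless_bill = tokens_per_month / 1_000_000 * price_per_million
    return "self-hosted" if monthly_fixed_cost < serverless_bill else "serverless"


if __name__ == "__main__":
    for billions in (1, 5, 6, 10):
        volume = billions * 1_000_000_000
        per_million = self_hosted_cost_per_million(4000, volume)
        print(f"{billions}B tokens: self-hosted ${per_million:.2f}/1M vs serverless $0.70/1M "
              f"-> {cheaper_option(volume)}")
```

At 5B tokens, the self-hosted rate works out to $0.80 per 1M, about 14% above serverless. Close to the threshold, the difference between the two options is modest.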

Sources and Notes

Each fact in this article is grounded in the sources below. Always check vendor pages before purchasing, since pricing and terms can change.

  • OpenRouter model pricing
  • RunPod GPU pricing
  • AWS EC2 pricing

