The True Cost of Retrieval-Augmented Generation (RAG)

Beyond the LLM Call

Retrieval-Augmented Generation (RAG) is a widely used approach for grounding LLMs in custom documentation. Estimating a RAG bill, however, requires looking beyond the core generation API call. The pipeline introduces three additional cost layers:

RAG Cost Layers

1. Embedding Costs

Before documents can be searched, they must be converted into vector embeddings.

OpenAI text-embedding-3-small: $0.02 / 1M tokens.
OpenAI text-embedding-3-large: $0.13 / 1M tokens.
Note: Embedding is a one-time write cost per document revision, plus a small recurring cost to embed each user query at runtime.
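The split between the one-time and recurring embedding costs can be sketched as below. The prices come from the figures quoted above; the corpus size, query volume, and tokens per query are hypothetical values chosen only for illustration.

```python
# Prices in USD per 1M tokens, as quoted in this guide.
EMBEDDING_PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(tokens: int, model: str) -> float:
    """Return the USD cost of embedding `tokens` tokens with `model`."""
    return tokens / 1_000_000 * EMBEDDING_PRICE_PER_1M[model]


# Hypothetical workload (assumptions, not figures from this guide).
corpus_tokens = 50_000_000      # total tokens across all documents
queries_per_month = 100_000     # user queries embedded at runtime
tokens_per_query = 30           # average query length in tokens

one_time = embedding_cost(corpus_tokens, "text-embedding-3-small")
monthly = embedding_cost(queries_per_month * tokens_per_query,
                         "text-embedding-3-small")

print(f"One-time corpus embedding: ${one_time:.2f}")
print(f"Monthly query embedding:   ${monthly:.2f}")
```

Under these assumptions the embedding layer is usually the smallest of the three, but it is paid again whenever a document is revised and re-embedded.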

2. Vector Database Hosting

Storing and querying vector embeddings requires a vector database or a vector-capable extension to an existing database.

Pinecone Serverless: ~$0.07 per GB of storage, plus read/write index units.
Dedicated pgvector (AWS RDS): ~$50 - $300/month depending on instance memory.
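To relate a per-GB storage price to a corpus, a rough size estimate helps. The sketch below assumes vectors are stored as 32-bit floats at 1,536 dimensions (the default output size of text-embedding-3-small) and ignores metadata and index overhead, so real usage will be higher. The vector count is hypothetical, and the rate is the approximate Pinecone Serverless storage figure quoted above; read/write units are not modeled.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_storage_gb(num_vectors: int, dimensions: int) -> float:
    """Estimate raw vector storage in GB (10^9 bytes), excluding overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9


# Hypothetical corpus: one million chunks (an assumption for illustration).
num_vectors = 1_000_000
dimensions = 1536                # assumed default for text-embedding-3-small
storage_price_per_gb = 0.07      # approximate rate quoted in this guide

size_gb = raw_vector_storage_gb(num_vectors, dimensions)
storage_cost = size_gb * storage_price_per_gb

print(f"Raw vector storage: {size_gb:.3f} GB")
print(f"Storage cost at the quoted rate: ${storage_cost:.2f}")
```

For a corpus this size, storage alone is small next to the fixed monthly cost of a dedicated instance, which is why the choice between serverless and dedicated hosting tends to hinge on query volume rather than data size.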

3. LLM Prompt Inflation

Retrieved context inflates the input prompt. If you retrieve 5 context snippets averaging 500 tokens each, you append 2,500 input tokens to every query. At an input price of $3.00 per 1M tokens, that adds 2,500 × $3.00 / 1,000,000 = $0.0075 per query in input inflation alone.
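The same arithmetic can be parameterized to see how the overhead scales with traffic. This is a minimal sketch: the function name and the monthly query volume are assumptions for illustration, while the snippet count, snippet size, and input price are the ones used above.

```python
def prompt_inflation_cost(snippets: int,
                          tokens_per_snippet: int,
                          input_price_per_1m: float) -> float:
    """Return the extra USD input cost per query from retrieved context."""
    added_tokens = snippets * tokens_per_snippet
    return added_tokens / 1_000_000 * input_price_per_1m


# Figures from the example above: 5 snippets x 500 tokens at $3.00 / 1M.
per_query = prompt_inflation_cost(5, 500, 3.00)

# Hypothetical traffic level (an assumption, not a figure from this guide).
queries_per_month = 100_000
per_month = per_query * queries_per_month

print(f"Added input cost per query: ${per_query:.4f}")
print(f"Added input cost per month: ${per_month:,.2f}")
```

Because this cost is charged on every query, retrieving fewer or shorter snippets reduces it directly, whereas the embedding cost of the corpus is unaffected by query volume.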

