The True Cost of Retrieval-Augmented Generation (RAG)

Beyond the LLM Call

Retrieval-Augmented Generation (RAG) is a widely used approach to grounding LLMs in custom documentation. However, estimating a RAG bill requires looking beyond the core generation API call. The pipeline introduces three additional cost layers:

RAG Cost Layers

#### 1. Embedding Costs

Before documents can be searched, they must be converted into vector embeddings.

* OpenAI text-embedding-3-small: $0.02 / 1M tokens.
* OpenAI text-embedding-3-large: $0.13 / 1M tokens.
* *Note: Embedding is a one-time write cost per document revision, plus a recurring cost to embed each user query at runtime.*
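As a rough sketch of the arithmetic, the snippet below estimates the one-time cost of embedding a corpus at the per-token prices listed above. The 10M-token corpus size and the function name `embedding_cost` are illustrative assumptions, not figures from a real deployment.

```python
# Prices taken from the list above, in USD per 1M tokens.
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(num_tokens: int, model: str) -> float:
    """Return the USD cost of embedding num_tokens with the given model."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION[model]


if __name__ == "__main__":
    corpus_tokens = 10_000_000  # hypothetical corpus size
    for model in PRICE_PER_MILLION:
        print(f"{model}: ${embedding_cost(corpus_tokens, model):.2f}")
```

Each document revision incurs the same per-token price again for the re-embedded text, so a frequently edited corpus multiplies this figure by its revision rate.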

#### 2. Vector Database Hosting

Storing and querying vector embeddings requires specialized databases.

* Pinecone Serverless: ~$0.07 per GB of storage, plus read/write index units.
* Dedicated pgvector (AWS RDS): ~$50 - $300/month depending on instance memory.
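To get a feel for the storage side, here is a minimal sketch that estimates the raw size of a vector index and prices it per GB. It assumes 32-bit floats (4 bytes per dimension), decimal gigabytes, and 1,536 dimensions (the default output size of text-embedding-3-small). It ignores metadata, index overhead, and read/write units, so treat the result as a lower bound. The vector count and function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_storage_gb(num_vectors: int, dimensions: int) -> float:
    """Raw vector payload in decimal GB, excluding metadata and index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9


def storage_cost(num_vectors: int, dimensions: int, price_per_gb: float) -> float:
    """USD storage cost for the raw vectors at a given per-GB price."""
    return raw_vector_storage_gb(num_vectors, dimensions) * price_per_gb


if __name__ == "__main__":
    num_vectors = 1_000_000  # hypothetical number of stored chunks
    dimensions = 1536        # default for text-embedding-3-small
    size_gb = raw_vector_storage_gb(num_vectors, dimensions)
    cost = storage_cost(num_vectors, dimensions, price_per_gb=0.07)
    print(f"{size_gb:.3f} GB of raw vectors, about ${cost:.2f} in storage")
```

Under these assumptions the storage charge stays small even at a million vectors, which suggests that usage-based read/write units or a fixed instance fee are the terms to examine first.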

#### 3. LLM Prompt Inflation

Retrieving context documents inflates the input prompt. If you retrieve 5 context snippets averaging 500 tokens each, you append 2,500 tokens of input to every query. At an input price of $3.00 per 1M tokens, this adds $0.0075 per query in input costs alone.
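The calculation above can be sketched as a small helper that also scales the per-query figure to a monthly volume. The 100,000 queries per month is a hypothetical traffic level chosen for illustration, and the function names are not from any library.

```python
def inflation_cost_per_query(
    num_snippets: int,
    tokens_per_snippet: int,
    input_price_per_million: float,
) -> float:
    """USD added to each query by the retrieved context tokens."""
    added_tokens = num_snippets * tokens_per_snippet
    return added_tokens / 1_000_000 * input_price_per_million


def monthly_inflation_cost(queries_per_month: int, cost_per_query: float) -> float:
    """Scale the per-query inflation cost to a monthly query volume."""
    return queries_per_month * cost_per_query


if __name__ == "__main__":
    per_query = inflation_cost_per_query(5, 500, 3.00)
    monthly = monthly_inflation_cost(100_000, per_query)  # hypothetical volume
    print(f"per query: ${per_query:.4f}, per month: ${monthly:.2f}")
```

Because the cost is linear in both snippet count and snippet length, halving either one halves the inflation charge.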