Benchmarks & Costs
Published May 28, 2026
Updated June 22, 2026
6 min read
By whattAI Editorial Team

Understanding Latency: TTFT vs. Throughput (t/s)

Why Time to First Token (TTFT) matters for conversational user interfaces, and how to evaluate real-time response speeds against total batch throughput.

Defining API Latency

When measuring LLM speed, developers must distinguish between two separate metrics: Time to First Token (TTFT) and Throughput.


Metric Breakdown

1. Time to First Token (TTFT)

  • What it is: The time, in milliseconds, from sending the request until the client receives the first output token.
  • Why it matters: Crucial for real-time conversational UIs. A high TTFT (e.g., 800ms) makes the application feel laggy and unresponsive, even if the subsequent generation is fast.
  • Best Performers: GPT-4o-mini (~180ms), Llama 3.1 8B (~190ms).
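To make the definition concrete, here is a minimal sketch of timing TTFT on the client side. It assumes the response arrives as an iterable of text chunks; `send_request` and `fake_stream` are hypothetical stand-ins, not a real vendor client, and with a real API the first chunk may carry more or less than exactly one token.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft_ms(
    send_request: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, str, Iterator[str]]:
    """Time from issuing the request until the first streamed chunk arrives."""
    start = clock()                      # timestamp taken just before the request is sent
    stream = iter(send_request())        # hypothetical: returns an iterable of text chunks
    first_chunk = next(stream)           # blocks until the first chunk is available
    ttft_ms = (clock() - start) * 1000.0
    return ttft_ms, first_chunk, stream  # the rest of the stream is still unread


def fake_stream(first_delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a streaming API response; not a real client."""
    time.sleep(first_delay_s)            # simulated queueing and prompt processing
    for chunk in ("Hel", "lo", "!"):
        yield chunk


if __name__ == "__main__":
    ttft_ms, first, rest = measure_ttft_ms(fake_stream)
    print(f"TTFT: {ttft_ms:.0f} ms, first chunk: {first!r}")
    print("remaining:", "".join(rest))
```

Measured this way, TTFT includes network round-trip time as well as server-side queueing, so the same model can show different values from different regions or providers.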

2. Throughput (Tokens per Second)

  • What it is: The rate, in tokens per second (t/s), at which the server streams output once generation has started.
  • Why it matters: Important for background automation, code generation, and batch jobs such as summarizing large documents.
  • Best Performers: Gemini 1.5 Flash (~120 t/s), Llama 3.1 8B hosted on high-concurrency clusters (~150 t/s).
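As a rough sketch, throughput can be computed from the arrival time of each token. The function below is illustrative only: `decode_throughput_tps` is a hypothetical name, and it assumes you have recorded one timestamp per token. It excludes the wait for the first token so that TTFT is not counted twice; some benchmarks divide total tokens by total request time instead, which yields a lower figure.

```python
from typing import Sequence


def decode_throughput_tps(arrival_times_s: Sequence[float]) -> float:
    """Tokens per second after generation has started.

    arrival_times_s holds one timestamp (in seconds) per received token,
    in arrival order. The interval before the first token is TTFT and is
    deliberately left out of the calculation.
    """
    if len(arrival_times_s) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed_s = arrival_times_s[-1] - arrival_times_s[0]
    if elapsed_s <= 0:
        raise ValueError("timestamps must be strictly increasing overall")
    # N timestamps span N - 1 inter-token intervals.
    return (len(arrival_times_s) - 1) / elapsed_s


if __name__ == "__main__":
    # Illustrative timestamps: one token every 0.5 s after the first.
    print(decode_throughput_tps([1.0, 1.5, 2.0, 2.5]))
```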

Selecting the Right Model

  • For Chatbots: Prioritize low TTFT (under 300ms) so the UI feels responsive as soon as the user sends a message.
  • For Code Refactoring: Prioritize high throughput (t/s) and a larger context window, so whole files fit in a single request and long outputs finish quickly.
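The trade-off between the two metrics can be estimated with simple arithmetic: total response time is roughly TTFT plus output length divided by throughput. The sketch below uses made-up figures for two hypothetical models (they are not measurements of any model named in this article) and finds the output length at which a higher-throughput model overtakes a lower-TTFT one.

```python
def total_latency_s(ttft_ms: float, throughput_tps: float, output_tokens: int) -> float:
    """Approximate end-to-end time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000.0 + output_tokens / throughput_tps


def break_even_tokens(
    ttft_a_ms: float, tps_a: float, ttft_b_ms: float, tps_b: float
) -> float:
    """Output length at which models A and B finish at the same moment.

    Assumes A has the lower TTFT and B has the higher throughput; below the
    returned length A finishes first, above it B finishes first.
    """
    if not (ttft_a_ms < ttft_b_ms and tps_a < tps_b):
        raise ValueError("expected A to have lower TTFT and B higher throughput")
    ttft_gap_s = (ttft_b_ms - ttft_a_ms) / 1000.0
    per_token_gap_s = 1.0 / tps_a - 1.0 / tps_b  # extra seconds A spends per token
    return ttft_gap_s / per_token_gap_s


if __name__ == "__main__":
    # Hypothetical profiles, chosen only to illustrate the calculation.
    a = {"ttft_ms": 180.0, "tps": 60.0}    # responsive but slower to stream
    b = {"ttft_ms": 400.0, "tps": 150.0}   # slower to start but faster to stream
    for n in (20, 2000):
        print(n, total_latency_s(a["ttft_ms"], a["tps"], n),
              total_latency_s(b["ttft_ms"], b["tps"], n))
    print("break-even:", break_even_tokens(a["ttft_ms"], a["tps"], b["ttft_ms"], b["tps"]))
```

Finishing first is not the whole story for chat: users perceive the delay before the first token appears, so a low-TTFT model can still feel faster even past the break-even point.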

Sources and Notes

Each fact in this article is grounded in the sources below. Always check vendor pages before purchase since pricing and terms can change.

OpenRouter model pricing
