Understanding Latency: TTFT vs. Throughput (t/s)

Defining API Latency

When measuring LLM speed, developers must distinguish between two separate metrics: Time to First Token (TTFT) and Throughput.

Metric Breakdown

#### 1. Time to First Token (TTFT)

* What it is: The time, in milliseconds, from sending the request to the client receiving the first output token.
* Why it matters: Crucial for real-time conversational UIs. A high TTFT (e.g., 800ms) makes the application feel laggy and unresponsive, even if the subsequent generation is fast.
* Best Performers: GPT-4o-mini (~180ms), Llama 3.1 8B (~190ms).

#### 2. Throughput (Tokens per Second)

* What it is: The rate at which the server outputs text after generation starts.
* Why it matters: Important for background automation, code generation, and batch jobs (summarizing large documents).
* Best Performers: Gemini 1.5 Flash (~120 t/s), Llama 3.1 8B hosted on high-concurrency clusters (~150 t/s).
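Both metrics can be read off a single streamed response. The following is a minimal sketch, not tied to any particular provider SDK: `measure_stream` and `fake_stream` are hypothetical names, and the code assumes that each item yielded by the stream is exactly one token (real APIs often deliver chunks that contain several tokens, in which case the count would need a tokenizer).

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT (ms) and throughput (tokens/s).

    Assumption: every item yielded by `tokens` is one token.
    """
    start = clock()                    # taken just before the stream is consumed
    first: Optional[float] = None      # arrival time of the first token
    last: Optional[float] = None       # arrival time of the most recent token
    count = 0

    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1

    if first is None or last is None:
        raise ValueError("stream produced no tokens")

    # Throughput is measured after generation starts, so the wait for the
    # first token is excluded and only the tokens that followed it are counted.
    generation_time = last - first
    tokens_per_s = (count - 1) / generation_time if generation_time > 0 else float("nan")

    return {
        "ttft_ms": (first - start) * 1000.0,
        "tokens": count,
        "tokens_per_s": tokens_per_s,
    }


def fake_stream(n_tokens: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a streaming API response (hypothetical, for illustration)."""
    time.sleep(first_delay_s)          # simulated wait before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)    # simulated gap between later tokens
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(n_tokens=20, first_delay_s=0.2, per_token_s=0.01)))
```

In real use, `fake_stream` would be replaced by the iterator that a streaming client returns. The `clock` parameter exists only so the timing logic can be checked with a deterministic clock.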

Selecting the Right Model

* For Chatbots: Prioritize low TTFT (under 300ms) so the UI feels responsive as soon as the user sends a message.
* For Code Refactoring: Prioritize high throughput (t/s) and a larger context window, so entire files can be passed in a single request.
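One way to reason about this trade-off is a simplified model in which the total response time is roughly TTFT plus the output length divided by throughput. The sketch below assumes constant throughput and uses two made-up latency profiles. `LatencyProfile`, `total_seconds`, and `pick` are hypothetical names, and the numbers are illustrative rather than measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyProfile:
    name: str
    ttft_ms: float        # time to first token, in milliseconds
    tokens_per_s: float   # throughput once generation has started


def total_seconds(profile: LatencyProfile, output_tokens: int) -> float:
    """Estimated time to receive the full response (simplified model)."""
    return profile.ttft_ms / 1000.0 + output_tokens / profile.tokens_per_s


def pick(profiles: Sequence[LatencyProfile], output_tokens: int, streaming_ui: bool) -> LatencyProfile:
    """Choose by TTFT for a streaming UI, otherwise by estimated total time."""
    if streaming_ui:
        # The user starts reading at the first token, so TTFT dominates.
        return min(profiles, key=lambda p: p.ttft_ms)
    # Background jobs wait for the whole output, so total time dominates.
    return min(profiles, key=lambda p: total_seconds(p, output_tokens))


if __name__ == "__main__":
    # Illustrative profiles only; these are not benchmark results.
    candidates = [
        LatencyProfile("low-ttft", ttft_ms=200, tokens_per_s=50),
        LatencyProfile("high-throughput", ttft_ms=800, tokens_per_s=150),
    ]
    for n in (40, 2000):
        for p in candidates:
            print(f"{p.name:>16} | {n:>4} tokens | {total_seconds(p, n):6.2f} s")
    print(pick(candidates, 40, streaming_ui=True).name)
    print(pick(candidates, 2000, streaming_ui=False).name)
```

Under these assumed numbers, the low-TTFT profile is slightly ahead for a 40-token chat reply, while the high-throughput profile finishes a 2,000-token output in roughly a third of the time.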