Benchmarking Guide

This benchmarking guide is dynamic and will be updated as the usage and capabilities of AI accelerators evolve. The objective is to capture the comparative performance and economics of accelerators in commercially representative settings. Inevitably this involves subjectivity. While ChipBench retains full discretion in defining benchmarks, accelerator manufacturers are encouraged to provide input and recommendations on this guide.

Accelerator Coverage

In choosing which accelerators to benchmark, significant weight is given to whether an accelerator is available on-demand to the median developer.

Initial Coverage:

  • Nvidia
      • H100 SXM
      • H200 SXM

  • AMD
      • MI300X

  • Google TPUs
      • v6e-8

Upcoming:

  • Trainium
      • TRN1
      • INF2

  • Nvidia
      • B200

The following accelerators are also being researched:

  • Intel Gaudi family

  • Trainium 2 - capacity is currently unavailable.

Use Cases

ChipBench aims to provide coverage of one primary use case, as well as limited coverage of a secondary use case. Currently ChipBench is focused on inference applications, with plans to expand to training in the future.

Non-reasoning

Parameter | Value
Input length (tokens) | 1,000 ± 50
Output length (tokens) | 1,000 ± 50

Reasoning

Parameter | Value
Input length (tokens) | 1,000 ± 50
Output length (tokens) | 5,000 ± 250

Notes:

  • Values above are approximate, based on discussions with inference and hardware providers, and are subject to change.
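
For concreteness, the sketch below shows one way per-request lengths could be drawn from these profiles. It is illustrative only: the WORKLOADS dictionary and the sample_lengths helper are our own names, not ChipBench tooling, and the 5% standard deviation follows the Benchmarking Library section below.

```python
# Illustrative only: sample per-request input/output token counts for the
# two workload profiles, with standard deviation equal to 5% of the mean.
import numpy as np

WORKLOADS = {
    "non-reasoning": {"input_mean": 1_000, "output_mean": 1_000},
    "reasoning":     {"input_mean": 1_000, "output_mean": 5_000},
}

def sample_lengths(workload: str, n_requests: int, seed: int = 0):
    """Draw per-request input and output token counts for a workload."""
    rng = np.random.default_rng(seed)
    cfg = WORKLOADS[workload]
    inputs = rng.normal(cfg["input_mean"], 0.05 * cfg["input_mean"], n_requests)
    outputs = rng.normal(cfg["output_mean"], 0.05 * cfg["output_mean"], n_requests)
    return inputs.round().astype(int), outputs.round().astype(int)

# Example: 256 requests for the reasoning workload (~1,000 in / ~5,000 out tokens).
in_lens, out_lens = sample_lengths("reasoning", n_requests=256)
```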

Benchmarking Library

Benchmarking is performed with llm-perf, using:

  • random input sequences of text (except for the long-context benchmark, which uses repeated inputs).

  • input and output lengths following a normal distribution with a standard deviation of 5% of mean.

  • concurrency levels increasing geometrically (1, 8, 64, 256, 1024) until KV-cache capacity is exceeded or throughput flatlines, as sketched below.
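
The sketch below is a schematic harness for this sweep, not llm-perf's actual interface; send_request is a placeholder coroutine assumed to return the number of tokens generated for one request.

```python
# Schematic concurrency sweep (not the llm-perf API): raise the concurrency
# cap geometrically and stop once throughput stops improving.
import asyncio
import time

CONCURRENCY_LEVELS = [1, 8, 64, 256, 1024]

async def run_level(send_request, concurrency: int, n_requests: int) -> float:
    """Run n_requests with at most `concurrency` in flight; return tokens/s."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> int:
        async with sem:
            return await send_request()  # placeholder: tokens generated

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(one() for _ in range(n_requests))))
    return tokens / (time.perf_counter() - start)

async def sweep(send_request, n_requests: int = 512) -> dict[int, float]:
    """Sweep concurrency levels; stop when throughput gains fall below 5%."""
    results, best = {}, 0.0
    for level in CONCURRENCY_LEVELS:
        results[level] = await run_level(send_request, level, n_requests)
        if results[level] <= best * 1.05:  # <5% gain: treat as flatlined
            break
        best = results[level]
    return results

# Usage: asyncio.run(sweep(my_send_request))
```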

Inference Library

Inference is run using vLLM.

Hyper-parameters (e.g. tensor parallelism, number of cards per node) are chosen according to recommendations in accelerator manufacturer documentation available online, or from practitioners. Accelerator manufacturers may choose to submit hyper-parameter recommendations, provided that a) on-demand accelerators supporting those configurations are available, and b) the hyper-parameters may be published (necessary for reproducibility).
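
As a rough sketch only (the model and tensor_parallel_size below are illustrative choices, not published ChipBench configurations), an offline vLLM run on an 8-card node might look like:

```python
# Sketch of an offline vLLM run; adjust the model and parallelism per accelerator.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example "speed"-class model
    tensor_parallel_size=8,                     # e.g. one 8-card node
)
sampling = SamplingParams(max_tokens=1000)      # matches the ~1,000-token output target
outputs = llm.generate(["Example prompt text ..."], sampling)
```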

Representative Models

Benchmarking will be done on open-source models, with the objective of capturing the performance of leading-edge “quality” and “speed” models:

Category | Commercial models | Open source model
Quality | GPT-5, Opus, Gemini-Pro | DeepSeek v3
Speed | GPT-5-mini, Sonnet*, Gemini-Flash | Llama 3.3 70B FP8

*Sonnet is likely larger in size and perhaps Llama 3 70B is more representative.

Throughput Benchmarking

Quantitative metrics include:

  • Time to first token (TTFT) per request

  • Time per output token (TPOT) per request

  • Token throughput

Economic Benchmarking

In calculating accelerator unit costs per million tokens, hourly on-demand (not spot) rental prices will be used. For a provider’s rental price to be considered, capacity must be readily available on a reliable and ongoing basis. A range of prices will typically be presented, along with a rationale for the chosen “representative price”.
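
As a worked example of the unit-cost arithmetic (the price and throughput figures are placeholders, not measured or representative values):

```python
# Hypothetical figures, used only to show the cost-per-million-tokens arithmetic.
hourly_price_usd = 2.50        # on-demand rental price per accelerator-hour
throughput_tok_per_s = 1_500   # sustained output-token throughput per accelerator

tokens_per_hour = throughput_tok_per_s * 3_600              # 5.4M tokens/hour
cost_per_million = hourly_price_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.3f} per million output tokens")  # ≈ $0.463
```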

Soft Benchmarks

Commentary will also be provided on softer factors including:

  • Startup time - from initial instance rental to the readiness of a hosted endpoint.

  • Clarity of documentation and difficulty of configuration.

  • Accelerator availability.