Benchmarking Guide

This benchmarking guide is dynamic and will be updated as the usage and capabilities of AI accelerators evolve. The objective is to capture the comparative performance and economics of accelerators in commercially representative settings. Inevitably this involves subjectivity. While ChipBench retains full discretion in defining benchmarks, accelerator manufacturers are encouraged to provide input and recommendations on this guide.

Accelerator Coverage

In choosing which accelerators to benchmark, significant weight is given to whether an accelerator is available on-demand to the median developer.

Initial Coverage:

  • Nvidia
      • H100 SXM
      • H200 SXM

  • AMD
      • MI300X

  • Google TPUs
      • v6e-8

Upcoming:

  • Trainium
      • TRN1
      • INF2

  • Nvidia
      • B200

The following accelerators are also being researched:

  • Intel Gaudi family

  • Trainium 2 - capacity is currently unavailable.

Use Cases

ChipBench aims to provide coverage of one primary use case, as well as limited coverage of a secondary use case. Currently ChipBench is focused on inference applications, with plans to expand to training in the future.

Non-reasoning

Parameter | Value
Input length (tokens) | 1,000 ± 50
Output length (tokens) | 1,000 ± 50

Reasoning

Parameter | Value
Input length (tokens) | 1,000 ± 50
Output length (tokens) | 5,000 ± 250

Notes:

  • Values above are approximate, based on discussions with inference and hardware providers, and are subject to change.
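
For concreteness, the sketch below shows one way per-request lengths could be drawn from these profiles. It is illustrative only: the WORKLOADS dictionary and the sample_lengths helper are our own names, not ChipBench tooling, and the 5% standard deviation follows the Benchmarking Library section below.

```python
# Illustrative only: sample per-request input/output token counts for the
# two workload profiles, with standard deviation equal to 5% of the mean.
import numpy as np

WORKLOADS = {
    "non-reasoning": {"input_mean": 1_000, "output_mean": 1_000},
    "reasoning":     {"input_mean": 1_000, "output_mean": 5_000},
}

def sample_lengths(workload: str, n_requests: int, seed: int = 0):
    """Draw per-request input and output token counts for a workload."""
    rng = np.random.default_rng(seed)
    cfg = WORKLOADS[workload]
    inputs = rng.normal(cfg["input_mean"], 0.05 * cfg["input_mean"], n_requests)
    outputs = rng.normal(cfg["output_mean"], 0.05 * cfg["output_mean"], n_requests)
    return inputs.round().astype(int), outputs.round().astype(int)

# Example: 256 requests for the reasoning workload (~1,000 in / ~5,000 out tokens).
in_lens, out_lens = sample_lengths("reasoning", n_requests=256)
```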

Benchmarking Library

Benchmarking is performed with llm-perf, using:

  • random input sequences of text (except for the long-context benchmark, which uses repeated inputs).

  • input and output lengths following a normal distribution with a standard deviation of 5% of mean.

  • concurrency levels increasing geometrically (1, 8, 64, 256, 1024) until KV-cache capacity is exceeded or throughput flatlines, as sketched below.
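
The sketch below is a schematic harness for this sweep, not llm-perf's actual interface; send_request is a placeholder coroutine assumed to return the number of tokens generated for one request.

```python
# Schematic concurrency sweep (not the llm-perf API): raise the concurrency
# cap geometrically and stop once throughput stops improving.
import asyncio
import time

CONCURRENCY_LEVELS = [1, 8, 64, 256, 1024]

async def run_level(send_request, concurrency: int, n_requests: int) -> float:
    """Run n_requests with at most `concurrency` in flight; return tokens/s."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> int:
        async with sem:
            return await send_request()  # placeholder: tokens generated

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(one() for _ in range(n_requests))))
    return tokens / (time.perf_counter() - start)

async def sweep(send_request, n_requests: int = 512) -> dict[int, float]:
    """Sweep concurrency levels; stop when throughput gains fall below 5%."""
    results, best = {}, 0.0
    for level in CONCURRENCY_LEVELS:
        results[level] = await run_level(send_request, level, n_requests)
        if results[level] <= best * 1.05:  # <5% gain: treat as flatlined
            break
        best = results[level]
    return results

# Usage: asyncio.run(sweep(my_send_request))
```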

Inference Library

Inference is run using vLLM.

Hyper-parameters (e.g. tensor parallelism, number of cards per node) are chosen according to recommendations in accelerator manufacturer documentation available online, or from practitioners. Accelerator manufacturers may choose to submit hyper-parameter recommendations, provided that a) on-demand accelerators supporting those configurations are available, and b) the hyper-parameters may be published (necessary for reproducibility).
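
As a rough sketch only (the model and tensor_parallel_size below are illustrative choices, not published ChipBench configurations), an offline vLLM run on an 8-card node might look like:

```python
# Sketch of an offline vLLM run; adjust the model and parallelism per accelerator.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example "speed"-class model
    tensor_parallel_size=8,                     # e.g. one 8-card node
)
sampling = SamplingParams(max_tokens=1000)      # matches the ~1,000-token output target
outputs = llm.generate(["Example prompt text ..."], sampling)
```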

Representative Models

Benchmarking will be done on open-source models, with the objective of capturing the performance of leading-edge “quality” and “speed” models:

Category | Commercial models | Open source model
Quality | GPT-5, Opus, Gemini-Pro | DeepSeek v3
Speed | GPT-5-mini, Sonnet*, Gemini-Flash | Llama 3.3 70B FP8

*Sonnet is likely larger in size and perhaps Llama 3 70B is more representative.

Throughput Benchmarking

Quantitative metrics include:

  • Time to first token (TTFT) per request

  • Time per output token (TPOT) per request

  • Token throughput

Economic Benchmarking

In calculating accelerator unit costs per million tokens, hourly on-demand (not spot) rental prices will be used. For a provider’s rental price to be considered, capacity must be readily available on a reliable and ongoing basis. A range of prices will typically be presented, along with a rationale for the chosen “representative price”.
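
As a worked example of the unit-cost arithmetic (the price and throughput figures are placeholders, not measured or representative values):

```python
# Hypothetical figures, used only to show the cost-per-million-tokens arithmetic.
hourly_price_usd = 2.50        # on-demand rental price per accelerator-hour
throughput_tok_per_s = 1_500   # sustained output-token throughput per accelerator

tokens_per_hour = throughput_tok_per_s * 3_600              # 5.4M tokens/hour
cost_per_million = hourly_price_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.3f} per million output tokens")  # ≈ $0.463
```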

Soft Benchmarks

Commentary will also be provided on softer factors including:

  • Startup time - from initial instance rental to the readiness of a hosted endpoint.

  • Clarity of documentation and difficulty of configuration.

  • Accelerator availability.