Benchmarking Guide
This benchmarking guide is dynamic and will be updated as the usage and capabilities of AI accelerators evolve. The objective is to capture the comparative performance and economics of accelerators in commercially representative settings. Inevitably this involves subjectivity. While ChipBench retains full discretion in defining benchmarks, accelerator manufacturers are encouraged to provide input and recommendations on this guide.
Accelerator Coverage
In choosing accelerators to benchmark, significant weight is given to whether an accelerator is available on-demand to the median developer.
| Initial Coverage | | |
|---|---|---|
| Nvidia | H100 SXM | H200 SXM |
| AMD | MI300X | |
| Google TPUs | v6e-8 | |
| Upcoming | | |
|---|---|---|
| Trainium | TRN1 | INF2 |
| Nvidia | B200 | |
The following accelerators are also being researched:
- Intel Gaudi family
- Trainium 2 (capacity is currently unavailable)
Use Cases
ChipBench aims to provide coverage of one primary use case, as well as limited coverage of a secondary use case. Currently, ChipBench is focused on inference applications, with plans to expand to training in the future.
Non-reasoning
| Parameter | Unit | Value |
|---|---|---|
| Input length | tokens | 1,000 ± 50 |
| Output length | tokens | 1,000 ± 50 |
Reasoning
| Parameter | Unit | Value |
|---|---|---|
| Input length | tokens | 1,000 ± 50 |
| Output length | tokens | 5,000 ± 250 |
Notes:

- Values above are approximate, based on discussions with inference and hardware providers, and subject to change.
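
For reference, the two workload profiles above can be written down as a small configuration structure. This is an illustrative sketch only; the class and field names are not part of any ChipBench or llm-perf schema.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    """Token-length targets for one benchmark workload (names are illustrative)."""
    name: str
    mean_input_tokens: int      # mean prompt length, in tokens
    mean_output_tokens: int     # mean completion length, in tokens
    length_std_fraction: float  # standard deviation as a fraction of the mean (5% here)

# The two inference workloads described in the tables above.
WORKLOADS = [
    WorkloadProfile("non-reasoning", 1_000, 1_000, 0.05),
    WorkloadProfile("reasoning", 1_000, 5_000, 0.05),
]
```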
Benchmarking Library
Benchmarking is performed with llm-perf, with:

- random input sequences of text (except in the case of the long-context benchmark, which uses repeated inputs).
- input and output lengths following a normal distribution with a standard deviation of 5% of the mean.
- concurrency levels stepped up through 1, 8, 64, 256 and 1024 until KV-cache capacity is exceeded or throughput flatlines (a sketch of this logic follows below).
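
The sketch below illustrates the sampling and sweep logic described above: per-request lengths drawn from a normal distribution with a standard deviation of 5% of the mean, and a stepped concurrency ladder. It is not the llm-perf implementation; the `run_at_concurrency` helper and the 1% flatline threshold are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_lengths(mean_tokens: int, n_requests: int) -> np.ndarray:
    """Per-request token counts: normal distribution, std = 5% of the mean."""
    lengths = rng.normal(loc=mean_tokens, scale=0.05 * mean_tokens, size=n_requests)
    return np.maximum(1, np.round(lengths)).astype(int)

def run_at_concurrency(concurrency: int, input_lens, output_lens) -> float:
    """Placeholder for one benchmark run at a fixed concurrency; returns
    aggregate output tokens/second. In practice this would drive llm-perf."""
    raise NotImplementedError

def concurrency_sweep(mean_input: int = 1_000, mean_output: int = 1_000) -> float:
    """Step concurrency up until throughput flatlines; return the best throughput.
    (KV-cache exhaustion would surface as errors from the serving engine.)"""
    best_throughput = 0.0
    for concurrency in (1, 8, 64, 256, 1024):
        in_lens = sample_lengths(mean_input, n_requests=concurrency)
        out_lens = sample_lengths(mean_output, n_requests=concurrency)
        throughput = run_at_concurrency(concurrency, in_lens, out_lens)
        if throughput <= best_throughput * 1.01:  # <1% gain treated as a flatline (assumption)
            break
        best_throughput = throughput
    return best_throughput
```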
Inference Library
Inference is run using vLLM.
Hyper-parameters (e.g. tensor parallelism, number of cards per node) are chosen according to recommendations in accelerator manufacturer documentation available online, or from practitioners. Accelerator manufacturers may choose to submit hyper-parameter recommendations, provided that a) on-demand accelerators supporting these configurations are available, and b) those hyper-parameters may be published (necessary for reproducibility).
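
As an illustration of the kind of hyper-parameters involved, the sketch below starts a vLLM engine with an explicit tensor-parallel degree using vLLM's offline `LLM` API. The model name and parallelism value are placeholders, not a configuration ChipBench has chosen or published.

```python
from vllm import LLM, SamplingParams

# Example configuration only: the model and tensor-parallel degree here are
# placeholders, not the hyper-parameters ChipBench will publish.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,  # split weights across 8 cards on the node (requires 8 devices)
)

sampling = SamplingParams(max_tokens=1_000, temperature=0.0)
outputs = llm.generate(["Explain KV-cache reuse in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```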
Representative Models
Benchmarking will be done on open-source models, with the objective of capturing the performance of leading-edge “quality” and “speed” models:
| Category | Commercial models | Open-source model |
|---|---|---|
| Quality | GPT-5, Opus, Gemini-Pro | |
| Speed | GPT-5-mini, Sonnet*, Gemini-Flash | |

*Sonnet is likely larger in size, and perhaps Llama 3 70B is more representative.
Throughput Benchmarking
Quantitative metrics include:

- Time to first token, per request
- Time per output token, per request
- Token throughput
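
A minimal sketch of how these metrics might be derived from per-request timestamps, assuming the load generator records when each request is sent and when its first and last tokens arrive; the class and field names are illustrative, not llm-perf's.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) and token count for one request (illustrative)."""
    sent_at: float
    first_token_at: float
    last_token_at: float
    output_tokens: int

def time_to_first_token(r: RequestTrace) -> float:
    return r.first_token_at - r.sent_at

def time_per_output_token(r: RequestTrace) -> float:
    # Decode time spread across the tokens generated after the first one.
    return (r.last_token_at - r.first_token_at) / max(1, r.output_tokens - 1)

def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second across the whole run."""
    start = min(r.sent_at for r in traces)
    end = max(r.last_token_at for r in traces)
    return sum(r.output_tokens for r in traces) / (end - start)
```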
Economic Benchmarking
In calculating accelerator unit costs per million tokens, hourly on-demand (not spot) rental prices will be used. For a provider’s rental price to be considered, capacity must be readily available on a reliable and ongoing basis. A range of prices will typically be presented, along with a rationale for the chosen “representative price”.
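
The unit-cost arithmetic reduces to dividing the hourly rental price by the number of tokens produced in an hour. The sketch below makes that explicit; the price and throughput figures are placeholders chosen purely to show the calculation, not measured results.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """On-demand rental price divided by hourly token output, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_second * 3_600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Placeholder figures, purely to show the arithmetic: an 8-accelerator node
# rented at $20/hour sustaining 10,000 output tokens/second.
print(cost_per_million_tokens(hourly_price_usd=20.0, tokens_per_second=10_000))  # ≈ $0.56
```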
Soft Benchmarks
Commentary will also be provided on softer factors including:
- Startup time: from initial instance rental to the readiness of a hosted endpoint.
- Clarity of documentation and difficulty of configuration.
- Accelerator availability.