forked from mlcommons/inference
Merge pull request #6 from anandhu-eng/cm_readme_inference_update
Cm readme inference update
Showing 4 changed files with 147 additions and 35 deletions.
---
hide:
  - toc
---

# MLPerf Inference Benchmarks

## Overview
This document provides details on the various MLPerf Inference Benchmarks, categorized by task, model, and dataset. Each section lists the models performing similar tasks, with details on datasets, accuracy, and server latency constraints.

Please visit the individual benchmark links to see the run commands using the unified CM interface.
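As a rough illustration of the command pattern (the exact tags, model names, and flags differ per benchmark and are documented on each benchmark page; the values below are placeholders):

```bash
# Illustrative sketch only -- copy the exact command from the benchmark page.
# Assumes the CM CLI and the MLPerf automation scripts are already installed.
cm run script --tags=run-mlperf,inference \
    --model=resnet50 \
    --implementation=reference \
    --framework=onnxruntime \
    --category=edge \
    --scenario=Offline \
    --device=cpu \
    --quiet
```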
---

## 1. Image Classification
### [ResNet50-v1.5](image_classification/resnet50.md)
- **Dataset**: Imagenet-2012 (224x224) Validation
- **Size**: 50,000
- **QSL Size**: 1,024
- **Reference Model Accuracy**: 76.46%
- **Server Scenario Latency Constraint**: 15ms (see the config sketch below)
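The QSL size and server latency constraint above are LoadGen settings. The lines below are only a sketch in the style of the repository's `mlperf.conf`; verify the exact key names and values against that file:

```
# Sketch in the style of mlperf.conf -- verify against the real file.
resnet50.Server.target_latency = 15                    # server scenario latency constraint, in ms
resnet50.*.performance_sample_count_override = 1024    # QSL size
```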
---

## 2. Text to Image
### [Stable Diffusion](text_to_image/sdxl.md)
- **Dataset**: Subset of Coco2014
- **Size**: 5,000
- **QSL Size**: 5,000
- **Required Accuracy (Closed Division)**:
  - FID: 23.01085758 ≤ FID ≤ 23.95007626
  - CLIP: 31.68631873 ≤ CLIP ≤ 31.81331801
---

## 3. Object Detection
### [Retinanet](object_detection/retinanet.md)
- **Dataset**: OpenImages
- **Size**: 24,781
- **QSL Size**: 64
- **Reference Model Accuracy**: 0.3755 mAP
- **Server Scenario Latency Constraint**: 100ms
---

## 4. Medical Image Segmentation
### [3d-unet](medical_imaging/3d-unet.md)
- **Dataset**: KiTS2019
- **Size**: 42
- **QSL Size**: 42
- **Reference Model Accuracy**: 0.86330 Mean DICE Score
- **Server Scenario**: Not Applicable
---

## 5. Language Tasks

### 5.1. Question Answering

### [Bert-Large](language/bert.md)
- **Dataset**: Squad v1.1 (384 Sequence Length)
- **Size**: 10,833
- **QSL Size**: 10,833
- **Reference Model Accuracy**: F1 Score = 90.874%
- **Server Scenario Latency Constraint**: 130ms

### [LLAMA2-70B](language/llama2-70b.md)
- **Dataset**: OpenORCA (GPT-4 split, max_seq_len=1024)
- **Size**: 24,576
- **QSL Size**: 24,576
- **Reference Model Accuracy**:
  - Rouge1: 44.4312
  - Rouge2: 22.0352
  - RougeL: 28.6162
  - Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint** (TTFT = time to first token, TPOT = time per output token; see the sketch below):
  - TTFT: 2000ms
  - TPOT: 200ms
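The TTFT/TPOT limits above are likewise LoadGen server settings. The snippet below is an assumed sketch (key names recalled from `mlperf.conf`; confirm them there before relying on them):

```
# Assumed sketch -- confirm key names against mlperf.conf.
llama2-70b.Server.ttft_latency = 2000   # time to first token, in ms
llama2-70b.Server.tpot_latency = 200    # time per output token, in ms
```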
### 5.2. Text Summarization

### [GPT-J](language/gpt-j.md)
- **Dataset**: CNN Daily Mail v3.0.0
- **Size**: 13,368
- **QSL Size**: 13,368
- **Reference Model Accuracy**:
  - Rouge1: 42.9865
  - Rouge2: 20.1235
  - RougeL: 29.9881
  - Gen_len: 4,016,878
- **Server Scenario Latency Constraint**: 20s

### 5.3. Mixed Tasks (Question Answering, Math, and Code Generation)
### [Mixtral-8x7B](language/mixtral-8x7b.md)
- **Datasets**:
  - OpenORCA (5k samples of GPT-4 split, max_seq_len=2048)
  - GSM8K (5k samples of the validation split, max_seq_len=2048)
  - MBXP (5k samples of the validation split, max_seq_len=2048)
- **Size**: 15,000
- **QSL Size**: 15,000
- **Reference Model Accuracy**:
  - Rouge1: 45.4911
  - Rouge2: 23.2829
  - RougeL: 30.3615
  - GSM8K Accuracy: 73.78%
  - MBXP Accuracy: 60.12%
  - Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint**:
  - TTFT: 2000ms
  - TPOT: 200ms
---

## 6. Recommendation
### [DLRMv2](recommendation/dlrm-v2.md)
- **Dataset**: Synthetic Multihot Criteo
- **Size**: 204,800
- **QSL Size**: 204,800
- **Reference Model Accuracy**: AUC = 80.31%
- **Server Scenario Latency Constraint**: 60ms

---
### Participation Categories
- **Datacenter Category**: All nine benchmarks can participate.
- **Edge Category**: All benchmarks except DLRMv2, LLAMA2-70B, and Mixtral-8x7B can participate.

### High Accuracy Variants
- **Benchmarks**: `bert`, `llama2-70b`, `dlrm_v2`, and `3d-unet`
- **Requirement**: Must achieve at least 99.9% of the FP32 reference model accuracy, compared to the default 99% accuracy requirement (an example CM invocation is sketched below).
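As an illustration of selecting a high accuracy variant and a submission category through the CM interface (the `-99.9` model suffix and the flags below are assumptions; confirm them on the individual benchmark pages):

```bash
# Illustrative sketch only -- the exact tags and flags are on each benchmark page.
cm run script --tags=run-mlperf,inference \
    --model=bert-99.9 \
    --implementation=reference \
    --category=datacenter \
    --scenario=Server \
    --device=cpu \
    --quiet
```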