Commit

Merge pull request #6 from anandhu-eng/cm_readme_inference_update
Cm readme inference update
anandhu-eng authored Aug 13, 2024
2 parents a6429f3 + 7e37072 commit acf5591
Showing 4 changed files with 147 additions and 35 deletions.
126 changes: 109 additions & 17 deletions docs/benchmarks/index.md
@@ -1,32 +1,124 @@
# MLPerf Inference Benchmarks

## Overview
This document provides details on various MLPerf Inference Benchmarks categorized by tasks, models, and datasets. Each section lists the models performing similar tasks, with details on datasets, accuracy, and server latency constraints.

---
hide:
- toc

## 1. Image Classification
### [ResNet50-v1.5](image_classification/resnet50.md)
- **Dataset**: Imagenet-2012 (224x224) Validation
- **Size**: 50,000
- **QSL Size**: 1,024
- **Reference Model Accuracy**: 76.46%
- **Server Scenario Latency Constraint**: 15ms

---

# MLPerf Inference Benchmarks
## 2. Text to Image
### [Stable Diffusion](text_to_image/sdxl.md)
- **Dataset**: Subset of Coco2014
- **Size**: 5,000
- **QSL Size**: 5,000
- **Required Accuracy (Closed Division)**:
- FID: 23.01085758 ≤ FID ≤ 23.95007626
- CLIP: 31.68631873 ≤ CLIP ≤ 31.81331801

Please visit the individual benchmark links to see the run commands using the unified CM interface.
---

1. [Image Classification](image_classification/resnet50.md) using ResNet50-v1.5 model and Imagenet-2012 (224x224) validation dataset. Dataset size is 50,000 and QSL size is 1024. Reference model accuracy is 76.46%. Server scenario latency constraint is 15ms.
## 3. Object Detection
### [Retinanet](object_detection/retinanet.md)
- **Dataset**: OpenImages
- **Size**: 24,781
- **QSL Size**: 64
- **Reference Model Accuracy**: 0.3755 mAP
- **Server Scenario Latency Constraint**: 100ms

2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and subset of Coco2014 dataset. Dataset size is 5000 amd QSL size is the same. Required accuracy for closed division is (23.01085758 <= FID <= 23.95007626, 32.68631873 <= CLIP <= 31.81331801).
---

3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset.Dataset size is 24781 and QSL size is 64. Reference model accuracy is 0.3755 mAP. Server scenario latency constraint is 100ms.
## 4. Medical Image Segmentation
### [3d-unet](medical_imaging/3d-unet.md)
- **Dataset**: KiTS2019
- **Size**: 42
- **QSL Size**: 42
- **Reference Model Accuracy**: 0.86330 Mean DICE Score
- **Server Scenario**: Not Applicable

4. [Medical Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS2019 dataset. Dataset size is 42 and QSL size is the same. Reference model accuracy is 0.86330 mean DIXE score. Server scenario is not applicable.
---

5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset with 384 sequence length. Dataset size is 10833 and QSL size is the same. Reference model accuracy is f1 score = 90.874%. Server scenario latency constraint is 130ms.
## 5. Language Tasks

6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail v3.0.0 dataset. Dataset size is 13368 amd QSL size is the same. Reference model accuracy is (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881, gen_len=4016878). Server scenario latency sconstraint is 20s.
### 5.1. Question Answering

7. [Question Answering](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA (GPT-4 split, max_seq_len=1024) dataset. Dataset size is 24576 and QSL size is the same. Reference model accuracy is (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.
### [Bert-Large](language/bert.md)
- **Dataset**: Squad v1.1 (384 Sequence Length)
- **Size**: 10,833
- **QSL Size**: 10,833
- **Reference Model Accuracy**: F1 Score = 90.874%
- **Server Scenario Latency Constraint**: 130ms

8. [Question Answering, Math and Code Generation](language/mixtral-8x7b.md) using Mixtral-8x7B model and OpenORCA (5k samples of GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) datasets. Dataset size is 15000 and QSL size is the same. Reference model accuracy is (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, gsm8k accuracy = 73.78, mbxp accuracy = 60.12, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.
### [LLAMA2-70B](language/llama2-70b.md)
- **Dataset**: OpenORCA (GPT-4 split, max_seq_len=1024)
- **Size**: 24,576
- **QSL Size**: 24,576
- **Reference Model Accuracy**:
- Rouge1: 44.4312
- Rouge2: 22.0352
- RougeL: 28.6162
- Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint**:
- TTFT: 2000ms
- TPOT: 200ms

9. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Synthetic Multihot Criteo dataset. Dataset size is 204800 and QSL size is the same. Reference model accuracy is AUC=80.31%. Server scenario latency constraint is 60 ms.
### 5.2. Text Summarization

### [GPT-J](language/gpt-j.md)
- **Dataset**: CNN Daily Mail v3.0.0
- **Size**: 13,368
- **QSL Size**: 13,368
- **Reference Model Accuracy**:
- Rouge1: 42.9865
- Rouge2: 20.1235
- RougeL: 29.9881
- Gen_len: 4,016,878
- **Server Scenario Latency Constraint**: 20s

### 5.3. Mixed Tasks (Question Answering, Math, and Code Generation)

### [Mixtral-8x7B](language/mixtral-8x7b.md)
- **Datasets**:
- OpenORCA (5k samples of GPT-4 split, max_seq_len=2048)
- GSM8K (5k samples of the validation split, max_seq_len=2048)
- MBXP (5k samples of the validation split, max_seq_len=2048)
- **Size**: 15,000
- **QSL Size**: 15,000
- **Reference Model Accuracy**:
- Rouge1: 45.4911
- Rouge2: 23.2829
- RougeL: 30.3615
- GSM8K Accuracy: 73.78%
- MBXP Accuracy: 60.12%
- Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint**:
- TTFT: 2000ms
- TPOT: 200ms

---

## 6. Recommendation
### [DLRMv2](recommendation/dlrm-v2.md)
- **Dataset**: Synthetic Multihot Criteo
- **Size**: 204,800
- **QSL Size**: 204,800
- **Reference Model Accuracy**: AUC = 80.31%
- **Server Scenario Latency Constraint**: 60ms

---

All the nine benchmarks can participate in the datacenter category.
All the nine benchmarks except DLRMv2, LLAMA2 and Mixtral-8x7B and can participate in the edge category.
### Participation Categories
- **Datacenter Category**: All nine benchmarks can participate.
- **Edge Category**: All benchmarks except DLRMv2, LLAMA2, and Mixtral-8x7B can participate.

`bert`, `llama2-70b`, `dlrm_v2` and `3d-unet` has a high accuracy (99.9%) variant, where the benchmark run must achieve a higher accuracy of at least `99.9%` of the FP32 reference model
in comparison with the `99%` default accuracy requirement.
### High Accuracy Variants
- **Benchmarks**: `bert`, `llama2-70b`, `dlrm_v2`, and `3d-unet`
- **Requirement**: Must achieve at least 99.9% of the FP32 reference model accuracy, compared to the default 99% accuracy requirement.
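
The run commands themselves come from the unified CM interface documented on the individual benchmark pages. As a rough, illustrative sketch only (the tags, framework, category, scenario, and device values below are assumptions for demonstration; copy the exact command from the relevant benchmark page), running the high-accuracy BERT variant could look like:

```bash
# Illustrative only: exact tags and option values come from each benchmark page.
# bert-99.9 selects the high-accuracy (99.9%) variant of the BERT benchmark.
cm run script --tags=run-mlperf,inference \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet
```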
6 changes: 0 additions & 6 deletions docs/benchmarks/text_to_image/sdxl.md
@@ -20,9 +20,3 @@ hide:
## Intel MLPerf Implementation
{{ mlperf_inference_implementation_readme (4, "sdxl", "intel") }}


=== "Qualcomm"
## Qualcomm AI100 MLPerf Implementation

{{ mlperf_inference_implementation_readme (4, "sdxl", "qualcomm") }}

7 changes: 7 additions & 0 deletions docs/install/index.md
@@ -20,5 +20,12 @@ CM needs `git`, `python3-pip` and `python3-venv` installed on your system. If an
pip install cm4mlops
```

## To work on custom GitHub repo and branch

```bash
pip install cmind && cm init --quiet --repo=mlcommons@cm4mlops --branch=mlperf-inference
```

Here, the repo is specified in the format `githubUsername@githubRepo`.
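
For example, to work from a personal fork on a development branch, the same command takes values of that form (the username and branch name below are placeholders):

```bash
# <yourGithubUsername> and <yourBranchName> are placeholders for your fork and branch.
pip install cmind && cm init --quiet --repo=<yourGithubUsername>@cm4mlops --branch=<yourBranchName>
```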

Now you are ready to use the `cm` commands to run MLPerf inference as given in the [benchmarks](../benchmarks/index.md) page.
43 changes: 31 additions & 12 deletions main.py
@@ -24,12 +24,12 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
elif model.lower() == "retinanet":
frameworks = [ "Onnxruntime", "Pytorch" ]
elif "bert" in model.lower():
frameworks = [ "Onnxruntime", "Pytorch", "Tensorflow" ]
frameworks = [ "Pytorch" ]
else:
frameworks = [ "Pytorch" ]

elif implementation == "nvidia":
if model in [ "sdxl", "mixtral-8x7b" ]:
if model in [ "mixtral-8x7b" ]:
return pre_space+" WIP"
devices = [ "CUDA" ]
frameworks = [ "TensorRT" ]
@@ -39,7 +39,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
frameworks = [ "pytorch" ]

elif implementation == "intel":
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9", "dlrm-v2-99", "dlrm-v2-99.9" ]:
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9", "dlrm-v2-99", "dlrm-v2-99.9", "sdxl" ]:
return pre_space+" WIP"
if model in [ "bert-99", "bert-99.9", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
code_version="r4.0"
@@ -115,6 +115,13 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
test_query_count=get_test_query_count(model, implementation, device)

if "99.9" not in model: #not showing docker command as it is already done for the 99% variant
if implementation == "neuralmagic":
content += f"{cur_space3}####### Run the Inference Server\n"
content += get_inference_server_run_cmd(spaces+16,implementation)
# tips regarding the running of the Neural Magic server
content += f"\n{cur_space3}!!! tip\n\n"
content += f"{cur_space3} - The host and port number of the server can be configured through `--host` and `--port`. Otherwise, the server will run on the default host `localhost` and port `8000`.\n\n"

if execution_env == "Native": # Native implementation steps through virtual environment
content += f"{cur_space3}####### Setup a virtual environment for Python\n"
content += get_venv_command(spaces+16)
@@ -159,7 +166,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
#content += run_suffix

content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", scenarios, code_version)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
content += run_suffix

@@ -195,6 +202,16 @@ def get_readme_prefix(spaces, model, implementation):

return readme_prefix

def get_inference_server_run_cmd(spaces, implementation):
indent = " "*spaces + " "
if implementation == "neuralmagic":
pre_space = " "*spaces
return f"""\n
{pre_space}```bash
{pre_space}cm run script --tags=run,vllm-server \\
{indent}--model=nm-testing/Llama-2-70b-chat-hf-FP8
{pre_space}```\n"""

def get_venv_command(spaces):
pre_space = " "*spaces
return f"""\n
@@ -260,7 +277,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
scenario_option = f"\\\n{pre_space} --scenario={scenario}"

if scenario == "Server" or (scenario == "All Scenarios" and "Server" in scenarios):
scenario_option = f"\\\n{pre_space} --server_target_qps=<SERVER_TARGET_QPS>"
scenario_option += f"\\\n{pre_space} --server_target_qps=<SERVER_TARGET_QPS>"

run_cmd_extra = get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scenarios)

@@ -269,11 +286,12 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
docker_cmd_suffix += f" \\\n{pre_space} --test_query_count={test_query_count}"

if "llama2-70b" in model:
if implementation != "neuralmagic":
docker_cmd_suffix += f" \\\n{pre_space} --tp_size=<TP_SIZE>"
if implementation == "nvidia":
docker_cmd_suffix += f" \\\n{pre_space} --tp_size=2"
docker_cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKE_FILE>"
else:
docker_cmd_suffix += f" \\\n{pre_space} --api_server=<API_SERVER_URL>"
elif implementation == "neuralmagic":
docker_cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000"
docker_cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"

if "dlrm-v2" in model and implementation == "nvidia":
docker_cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=<PATH_TO_CRITEO_DAY23_RAW_DATA>"
@@ -298,11 +316,12 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
cmd_suffix += f" \\\n {pre_space} --test_query_count={test_query_count}"

if "llama2-70b" in model:
if implementation != "neuralmagic":
if implementation == "nvidia":
cmd_suffix += f" \\\n{pre_space} --tp_size=<TP_SIZE>"
cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKE_FILE>"
else:
cmd_suffix += f" \\\n{pre_space} --api_server=<API_SERVER_URL>"
elif implementation == "neuralmagic":
cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000"
cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"

if "dlrm-v2" in model and implementation == "nvidia":
cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=<PATH_TO_CRITEO_DAY23_RAW_DATA>"
