Commit

Merge pull request #6 from anandhu-eng/cm_readme_inference_update
Cm readme inference update
anandhu-eng authored Aug 13, 2024
2 parents a6429f3 + 7e37072 commit acf5591
Showing 4 changed files with 147 additions and 35 deletions.
126 changes: 109 additions & 17 deletions docs/benchmarks/index.md
@@ -1,32 +1,124 @@
# MLPerf Inference Benchmarks

## Overview
This document provides details on various MLPerf Inference Benchmarks categorized by tasks, models, and datasets. Each section lists the models performing similar tasks, with details on datasets, accuracy, and server latency constraints.

---
hide:
- toc

## 1. Image Classification
### [ResNet50-v1.5](image_classification/resnet50.md)
- **Dataset**: Imagenet-2012 (224x224) Validation
- **Size**: 50,000
- **QSL Size**: 1,024
- **Reference Model Accuracy**: 76.46%
- **Server Scenario Latency Constraint**: 15ms

---

# MLPerf Inference Benchmarks
## 2. Text to Image
### [Stable Diffusion](text_to_image/sdxl.md)
- **Dataset**: Subset of Coco2014
- **Size**: 5,000
- **QSL Size**: 5,000
- **Required Accuracy (Closed Division)**:
- FID: 23.01085758 ≤ FID ≤ 23.95007626
- CLIP: 31.68631873 ≤ CLIP ≤ 31.81331801

Please visit the individual benchmark links to see the run commands using the unified CM interface.
---

1. [Image Classification](image_classification/resnet50.md) using ResNet50-v1.5 model and Imagenet-2012 (224x224) validation dataset. Dataset size is 50,000 and QSL size is 1024. Reference model accuracy is 76.46%. Server scenario latency constraint is 15ms.
## 3. Object Detection
### [Retinanet](object_detection/retinanet.md)
- **Dataset**: OpenImages
- **Size**: 24,781
- **QSL Size**: 64
- **Reference Model Accuracy**: 0.3755 mAP
- **Server Scenario Latency Constraint**: 100ms

2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and subset of Coco2014 dataset. Dataset size is 5000 amd QSL size is the same. Required accuracy for closed division is (23.01085758 <= FID <= 23.95007626, 32.68631873 <= CLIP <= 31.81331801).
---

3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset.Dataset size is 24781 and QSL size is 64. Reference model accuracy is 0.3755 mAP. Server scenario latency constraint is 100ms.
## 4. Medical Image Segmentation
### [3d-unet](medical_imaging/3d-unet.md)
- **Dataset**: KiTS2019
- **Size**: 42
- **QSL Size**: 42
- **Reference Model Accuracy**: 0.86330 Mean DICE Score
- **Server Scenario**: Not Applicable

4. [Medical Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS2019 dataset. Dataset size is 42 and QSL size is the same. Reference model accuracy is 0.86330 mean DIXE score. Server scenario is not applicable.
---

5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset with 384 sequence length. Dataset size is 10833 and QSL size is the same. Reference model accuracy is f1 score = 90.874%. Server scenario latency constraint is 130ms.
## 5. Language Tasks

6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail v3.0.0 dataset. Dataset size is 13368 amd QSL size is the same. Reference model accuracy is (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881, gen_len=4016878). Server scenario latency sconstraint is 20s.
### 5.1. Question Answering

7. [Question Answering](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA (GPT-4 split, max_seq_len=1024) dataset. Dataset size is 24576 and QSL size is the same. Reference model accuracy is (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.
### [Bert-Large](language/bert.md)
- **Dataset**: Squad v1.1 (384 Sequence Length)
- **Size**: 10,833
- **QSL Size**: 10,833
- **Reference Model Accuracy**: F1 Score = 90.874%
- **Server Scenario Latency Constraint**: 130ms

8. [Question Answering, Math and Code Generation](language/mixtral-8x7b.md) using Mixtral-8x7B model and OpenORCA (5k samples of GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) datasets. Dataset size is 15000 and QSL size is the same. Reference model accuracy is (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, gsm8k accuracy = 73.78, mbxp accuracy = 60.12, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.
### [LLAMA2-70B](language/llama2-70b.md)
- **Dataset**: OpenORCA (GPT-4 split, max_seq_len=1024)
- **Size**: 24,576
- **QSL Size**: 24,576
- **Reference Model Accuracy**:
- Rouge1: 44.4312
- Rouge2: 22.0352
- RougeL: 28.6162
- Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint**:
- TTFT: 2000ms
- TPOT: 200ms

9. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Synthetic Multihot Criteo dataset. Dataset size is 204800 and QSL size is the same. Reference model accuracy is AUC=80.31%. Server scenario latency constraint is 60 ms.
### 5.2. Text Summarization

### [GPT-J](language/gpt-j.md)
- **Dataset**: CNN Daily Mail v3.0.0
- **Size**: 13,368
- **QSL Size**: 13,368
- **Reference Model Accuracy**:
- Rouge1: 42.9865
- Rouge2: 20.1235
- RougeL: 29.9881
- Gen_len: 4,016,878
- **Server Scenario Latency Constraint**: 20s

### 5.3. Mixed Tasks (Question Answering, Math, and Code Generation)

### [Mixtral-8x7B](language/mixtral-8x7b.md)
- **Datasets**:
- OpenORCA (5k samples of GPT-4 split, max_seq_len=2048)
- GSM8K (5k samples of the validation split, max_seq_len=2048)
- MBXP (5k samples of the validation split, max_seq_len=2048)
- **Size**: 15,000
- **QSL Size**: 15,000
- **Reference Model Accuracy**:
- Rouge1: 45.4911
- Rouge2: 23.2829
- RougeL: 30.3615
- GSM8K Accuracy: 73.78%
- MBXP Accuracy: 60.12%
- Tokens_per_sample: 294.45
- **Server Scenario Latency Constraint**:
- TTFT: 2000ms
- TPOT: 200ms

---

## 6. Recommendation
### [DLRMv2](recommendation/dlrm-v2.md)
- **Dataset**: Synthetic Multihot Criteo
- **Size**: 204,800
- **QSL Size**: 204,800
- **Reference Model Accuracy**: AUC = 80.31%
- **Server Scenario Latency Constraint**: 60ms

---

All the nine benchmarks can participate in the datacenter category.
All the nine benchmarks except DLRMv2, LLAMA2 and Mixtral-8x7B and can participate in the edge category.
### Participation Categories
- **Datacenter Category**: All nine benchmarks can participate.
- **Edge Category**: All benchmarks except DLRMv2, LLAMA2, and Mixtral-8x7B can participate.

`bert`, `llama2-70b`, `dlrm_v2` and `3d-unet` has a high accuracy (99.9%) variant, where the benchmark run must achieve a higher accuracy of at least `99.9%` of the FP32 reference model
in comparison with the `99%` default accuracy requirement.
### High Accuracy Variants
- **Benchmarks**: `bert`, `llama2-70b`, `dlrm_v2`, and `3d-unet`
- **Requirement**: Must achieve at least 99.9% of the FP32 reference model accuracy, compared to the default 99% accuracy requirement.
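
The run commands themselves come from the unified CM interface documented on the individual benchmark pages. As a rough, illustrative sketch only (the tags, framework, category, scenario, and device values below are assumptions for demonstration; copy the exact command from the relevant benchmark page), running the high-accuracy BERT variant could look like:

```bash
# Illustrative only: exact tags and option values come from each benchmark page.
# bert-99.9 selects the high-accuracy (99.9%) variant of the BERT benchmark.
cm run script --tags=run-mlperf,inference \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet
```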
6 changes: 0 additions & 6 deletions docs/benchmarks/text_to_image/sdxl.md
@@ -20,9 +20,3 @@ hide:
## Intel MLPerf Implementation
{{ mlperf_inference_implementation_readme (4, "sdxl", "intel") }}


=== "Qualcomm"
## Qualcomm AI100 MLPerf Implementation

{{ mlperf_inference_implementation_readme (4, "sdxl", "qualcomm") }}

7 changes: 7 additions & 0 deletions docs/install/index.md
@@ -20,5 +20,12 @@ CM needs `git`, `python3-pip` and `python3-venv` installed on your system. If an
pip install cm4mlops
```

## To work on custom GitHub repo and branch

```bash
pip install cmind && cm init --quiet --repo=mlcommons@cm4mlops --branch=mlperf-inference
```

Here, the repo is specified in the format `githubUsername@githubRepo`.
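
For example, to work from a personal fork on a development branch, the same command takes values of that form (the username and branch name below are placeholders):

```bash
# <yourGithubUsername> and <yourBranchName> are placeholders for your fork and branch.
pip install cmind && cm init --quiet --repo=<yourGithubUsername>@cm4mlops --branch=<yourBranchName>
```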

Now you are ready to use the `cm` commands to run MLPerf inference as given in the [benchmarks](../benchmarks/index.md) page.
43 changes: 31 additions & 12 deletions main.py
@@ -24,12 +24,12 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
elif model.lower() == "retinanet":
frameworks = [ "Onnxruntime", "Pytorch" ]
elif "bert" in model.lower():
frameworks = [ "Onnxruntime", "Pytorch", "Tensorflow" ]
frameworks = [ "Pytorch" ]
else:
frameworks = [ "Pytorch" ]

elif implementation == "nvidia":
if model in [ "sdxl", "mixtral-8x7b" ]:
if model in [ "mixtral-8x7b" ]:
return pre_space+" WIP"
devices = [ "CUDA" ]
frameworks = [ "TensorRT" ]
@@ -39,7 +39,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
frameworks = [ "pytorch" ]

elif implementation == "intel":
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9", "dlrm-v2-99", "dlrm-v2-99.9" ]:
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9", "dlrm-v2-99", "dlrm-v2-99.9", "sdxl" ]:
return pre_space+" WIP"
if model in [ "bert-99", "bert-99.9", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
code_version="r4.0"
@@ -115,6 +115,13 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
test_query_count=get_test_query_count(model, implementation, device)

if "99.9" not in model: #not showing docker command as it is already done for the 99% variant
if implementation == "neuralmagic":
content += f"{cur_space3}####### Run the Inference Server\n"
content += get_inference_server_run_cmd(spaces+16,implementation)
# tips regarding the running of the Neural Magic server
content += f"\n{cur_space3}!!! tip\n\n"
content += f"{cur_space3} - The host and port number of the server can be configured through `--host` and `--port`. Otherwise, the server will run on the default host `localhost` and port `8000`.\n\n"

if execution_env == "Native": # Native implementation steps through virtual environment
content += f"{cur_space3}####### Setup a virtual environment for Python\n"
content += get_venv_command(spaces+16)
@@ -159,7 +166,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
#content += run_suffix

content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", scenarios, code_version)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
content += run_suffix

@@ -195,6 +202,16 @@ def get_readme_prefix(spaces, model, implementation):

return readme_prefix

def get_inference_server_run_cmd(spaces, implementation):
indent = " "*spaces + " "
if implementation == "neuralmagic":
pre_space = " "*spaces
return f"""\n
{pre_space}```bash
{pre_space}cm run script --tags=run,vllm-server \\
{indent}--model=nm-testing/Llama-2-70b-chat-hf-FP8
{pre_space}```\n"""

def get_venv_command(spaces):
pre_space = " "*spaces
return f"""\n
@@ -260,7 +277,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
scenario_option = f"\\\n{pre_space} --scenario={scenario}"

if scenario == "Server" or (scenario == "All Scenarios" and "Server" in scenarios):
scenario_option = f"\\\n{pre_space} --server_target_qps=<SERVER_TARGET_QPS>"
scenario_option += f"\\\n{pre_space} --server_target_qps=<SERVER_TARGET_QPS>"

run_cmd_extra = get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scenarios)

@@ -269,11 +286,12 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
docker_cmd_suffix += f" \\\n{pre_space} --test_query_count={test_query_count}"

if "llama2-70b" in model:
if implementation != "neuralmagic":
docker_cmd_suffix += f" \\\n{pre_space} --tp_size=<TP_SIZE>"
if implementation == "nvidia":
docker_cmd_suffix += f" \\\n{pre_space} --tp_size=2"
docker_cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKE_FILE>"
else:
docker_cmd_suffix += f" \\\n{pre_space} --api_server=<API_SERVER_URL>"
elif implementation == "neuralmagic":
docker_cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000"
docker_cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"

if "dlrm-v2" in model and implementation == "nvidia":
docker_cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=<PATH_TO_CRITEO_DAY23_RAW_DATA>"
@@ -298,11 +316,12 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ
cmd_suffix += f" \\\n {pre_space} --test_query_count={test_query_count}"

if "llama2-70b" in model:
if implementation != "neuralmagic":
if implementation == "nvidia":
cmd_suffix += f" \\\n{pre_space} --tp_size=<TP_SIZE>"
cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKE_FILE>"
else:
cmd_suffix += f" \\\n{pre_space} --api_server=<API_SERVER_URL>"
elif implementation == "neuralmagic":
cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000"
cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"

if "dlrm-v2" in model and implementation == "nvidia":
cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=<PATH_TO_CRITEO_DAY23_RAW_DATA>"
