Extending TGI benchmarking and documentation (#621)
* Initial Llama3-70b test

* Missing .env file. .gitignore strikes again!

* Adding script to run multiple batch sizes at once

* changed mode to +x on shell script

* fixing my poor bash syntax

* Renaming directory to test on Trainium

* adding trainium

* Trainium compose example added

* Readme changes

* More Readme changes

* Adding BS1 numbers

* misspelling in export_model.mdx

* misspelling in benchmark/text-generation-inference/README.md

Co-authored-by: David Corvoysier <[email protected]>

* Removing redundant HF_BATCH_SIZE and HF_SEQUENCE_LENGTH settings from .env and docker compose.

* Trainium batch size 8 numbers added.

---------

Co-authored-by: David Corvoysier <[email protected]>
jimburtoft and dacorvo authored Jun 5, 2024
1 parent a3bb344 commit af0506f
Showing 15 changed files with 257 additions and 12 deletions.
103 changes: 102 additions & 1 deletion benchmark/text-generation-inference/README.md
@@ -1,16 +1,62 @@
# NeuronX TGI benchmark using multiple replicas

## Local environment setup

These configurations were tested on an inf2.48xlarge instance running the Hugging Face Deep Learning AMI from the AWS Marketplace.

Get the configurations by cloning the repository:

```shell
$ git clone https://github.com/huggingface/optimum-neuron.git
$ cd optimum-neuron/benchmark/text-generation-inference/
```


## Select model and configuration

Edit the `.env` file to select the model to use for the benchmark and its configuration.
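
For reference, the smaller-model configuration shipped in this repository looks roughly like the sketch below (an excerpt of `llama-7b/.env`; additional settings may be present in the actual file):

```shell
# Excerpt of llama-7b/.env (a sketch; see the file itself for all settings)
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
```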

The following instructions assume that you are testing a locally built image, so Docker will have stored it as `neuronx-tgi:latest`.

You can confirm this by running:

```shell
$ docker image ls
```

If you have not built the image locally, you can pull it and retag it with the following commands:

```shell
$ docker pull ghcr.io/huggingface/neuronx-tgi:latest
$ docker tag ghcr.io/huggingface/neuronx-tgi:latest neuronx-tgi:latest
```
You should then see a single IMAGE ID with two different tags:

```shell
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
neuronx-tgi latest f5ba57f8517b 12 hours ago 11.3GB
ghcr.io/huggingface/neuronx-tgi latest f5ba57f8517b 12 hours ago 11.3GB
```


Alternatively, you can edit the appropriate docker-compose.yaml to supply the fully qualified image name by changing `neuronx-tgi:latest` to `ghcr.io/huggingface/neuronx-tgi:latest`.
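
For example, the image reference in the service definition would then look like this (a minimal excerpt; the rest of the service stays unchanged):

```yaml
services:
  tgi-1:
    # use the published image instead of a locally built one
    image: ghcr.io/huggingface/neuronx-tgi:latest
```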

## Start the servers

For smaller models, you can use the multi-server configuration with a load balancer:

```shell
$ docker compose --env-file llama-7b/.env up
```

Note: replace the `.env` file to change the model configuration.

For larger models, use their dedicated docker compose files:

```shell
$ docker compose -f llama3-70b/docker-compose.yaml --env-file llama3-70b/.env up
```

Note: edit the `.env` file to change the model configuration.

## Run the benchmark

@@ -33,4 +79,59 @@ The `model_id` must match the one corresponding to the selected `.env` file.
$ ./benchmark.sh NousResearch/Llama-2-7b-chat-hf 128
```

If you would like to run the benchmark script multiple times with different numbers of concurrent users, you can use:

```shell
$ ./run_all.sh NousResearch/Meta-Llama-3-70B-Instruct
```

### Compiling the model

If you are trying to run a configuration or a model that is not available in the cache, you can compile the model before you run it, then load it locally.

See the [llama3-70b-trn1.32xlarge](llama3-70b-trn1.32xlarge) as an example.

It is best to compile the model with the software in the container you will be using to ensure all library versions match.

As an example, you can compile with the following command. **If you make changes, make sure the batch size, sequence length, and num_cores used for compilation match the corresponding settings in the `.env` file.**

```shell
docker run -p 8080:80 \
-v $(pwd):/data \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
--device=/dev/neuron12 \
--device=/dev/neuron13 \
--device=/dev/neuron14 \
--device=/dev/neuron15 \
-ti \
--entrypoint "optimum-cli" neuronx-tgi:latest \
export neuron --model NousResearch/Meta-Llama-3-70B-Instruct \
--sequence_length 4096 \
--batch_size 4 \
--num_cores 32 \
/data/exportedmodel/
```
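
Once the export finishes, the compiled model should appear under `exportedmodel/` in your current working directory, since `/data` inside the container is mapped there. A quick check might look like this (a sketch; the exact file layout depends on the optimum-neuron version):

```shell
$ ls exportedmodel/
```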

Note that the `.env` file sets MODEL_ID to a path so that the model is loaded from the `/data` directory.

Also, the docker-compose.yaml file includes a volume mapping to the current working directory, as well as additional Neuron device mappings, because a trn1.32xlarge has 32 cores (16 devices).
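
As a rough sketch, the relevant pieces of the trn1.32xlarge compose file look like this (abbreviated from the file added in this commit; its `.env` file points MODEL_ID at `/data/exportedmodel`):

```yaml
# llama3-70b-trn1.32xlarge/docker-compose.yaml (abbreviated)
services:
  tgi-1:
    image: neuronx-tgi:latest
    volumes:
      - $PWD:/data            # current working directory is visible as /data in the container
    devices:
      - "/dev/neuron0"
      # ... one entry per device, through /dev/neuron15 (16 devices = 32 cores)
```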

Make sure you run the export command and the docker compose command from the same directory, since the `/data` volume is mapped to the current working directory.

For this example:
```shell
$ docker compose -f llama3-70b-trn1.32xlarge/docker-compose.yaml --env-file llama3-70b-trn1.32xlarge/.env up
```


6 changes: 0 additions & 6 deletions benchmark/text-generation-inference/docker-compose.yaml
@@ -8,8 +8,6 @@ services:
environment:
- PORT=8081
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
@@ -29,8 +27,6 @@ services:
environment:
- PORT=8082
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
@@ -50,8 +46,6 @@ services:
environment:
- PORT=8083
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
2 changes: 0 additions & 2 deletions benchmark/text-generation-inference/llama-7b/.env
@@ -1,6 +1,4 @@
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
@@ -0,0 +1,7 @@
MODEL_ID='NousResearch/Meta-Llama-3-70B-Instruct'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=4
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=4096
# MESSAGES_API_ENABLED='true' # Enable the messages API

@@ -0,0 +1,29 @@
version: '3.7'

services:
tgi-1:
image: neuronx-tgi:latest
ports:
- "8080:8080"
environment:
- PORT=8080
- MODEL_ID=${MODEL_ID}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=24
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
- MAX_INPUT_LENGTH=${MAX_INPUT_LENGTH}
- MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
- MAX_CONCURRENT_REQUESTS=512
devices:
- "/dev/neuron0"
- "/dev/neuron1"
- "/dev/neuron2"
- "/dev/neuron3"
- "/dev/neuron4"
- "/dev/neuron5"
- "/dev/neuron6"
- "/dev/neuron7"
- "/dev/neuron8"
- "/dev/neuron9"
- "/dev/neuron10"
- "/dev/neuron11"
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,30.170455300639418,0.7694021150018671,31.60879417184807
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,30.942167505908383,3.5238446079965797,42.88674224324184
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,31.216016638279726,11.17000110349909,70.63270124966144
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,31.442002397963858,28.047803349007154,138.61752441316904
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,31.622091010175804,60.1780687940045,290.1370155727129
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,31.734201827193452,123.7196121570014,523.1909448482422
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,31.72544803588566,250.5079138929941,1010.6170931343223
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,31.805759572717598,512.6742304505024,1997.340511562319
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,31.776200117214845,1025.654853393993,3954.0575741908333
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,31.715118036351587,2034.146784478493,8002.648273082725
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,18.818667211424472,1.3884793975012144,51.46871325828836
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,32.22257477833452,2.0121661404991755,56.734265583687296
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,50.19917175671667,5.205651430500438,66.04042245148653
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,52.13272738944358,9.568476632499369,97.32615035298838
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,53.59997031445967,26.087651531999654,191.19227161475598
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,56.08684244759754,61.25285707449984,310.16900484570965
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,57.40338464731561,129.3146581359997,560.2474255463762
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,58.39025853766574,267.3882590960002,1094.9986170264501
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,58.589480601098536,541.6153878579971,2147.5413489446523
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,58.69645477077839,1085.1772966810022,4231.7554182432905
@@ -0,0 +1,8 @@
#MODEL_ID='NousResearch/Meta-Llama-3-70B-Instruct'
MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=4
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=4096
# MESSAGES_API_ENABLED='true' # Enable the messages API

@@ -0,0 +1,36 @@
version: '3.7'

services:
tgi-1:
image: neuronx-tgi:latest
ports:
- "8080:8080"
environment:
- PORT=8080
- MODEL_ID=${MODEL_ID}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=32
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
- MAX_INPUT_LENGTH=${MAX_INPUT_LENGTH}
- MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
- MAX_CONCURRENT_REQUESTS=512
volumes:
- $PWD:/data
devices:
- "/dev/neuron0"
- "/dev/neuron1"
- "/dev/neuron2"
- "/dev/neuron3"
- "/dev/neuron4"
- "/dev/neuron5"
- "/dev/neuron6"
- "/dev/neuron7"
- "/dev/neuron8"
- "/dev/neuron9"
- "/dev/neuron10"
- "/dev/neuron11"
- "/dev/neuron12"
- "/dev/neuron13"
- "/dev/neuron14"
- "/dev/neuron15"

@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,38.29638310438374,0.5521726660008426,24.784959740501066
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,38.98036959617541,2.72243953349971,32.827924415254174
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,39.39299322930307,8.926065296996967,63.795771842799695
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,39.85480734427003,22.479033984491252,110.33245410384168
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,39.797703130119444,48.74777327400079,218.4971534548553
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,39.88112179496438,98.32968477499526,419.0164926030421
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,40.021570341867225,201.50347035600862,787.0418267487788
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,40.15190355766733,412.9219288924942,1608.1377339868322
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,40.10404829156176,831.7238280020028,3167.7755826448656
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,39.94606130182408,1654.066714687011,6348.469898092637
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,17.8322790536497,0.9939256490033586,54.45429111182844
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,31.140113024869468,1.418605798491626,58.17940704286386
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,52.71447508703364,3.691673280511168,65.510341492747
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,85.23757246875635,7.40343523149204,79.86574747355823
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,83.41704442714865,12.134337133495137,119.80365178993138
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,86.31413401709217,33.19637775150477,221.51387761253872
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,91.54051788296289,78.17263232148252,378.5575452672668
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,93.59227409861985,163.85781266850245,709.4836254794548
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,94.49695504491365,332.89309809000406,1342.054465909721
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,94.76202310893393,671.8385370509932,2633.1926459323054
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,27.321283482983713,0.9897541589998582,34.53017190612728
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,47.14780790833105,1.4317841799993403,38.47682874008382
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,75.46880157534952,3.7293467640001836,45.219761063884626
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,76.656177664245,6.710071522500584,67.5562098563004
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,78.10745154737947,18.174910198499674,130.32796764867985
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,80.94695720514072,42.99618862100033,211.52529640942643
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,83.41961944293132,90.68870028399942,387.7336944140728
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,84.68410927601217,187.20342993849863,761.1909438667759
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,85.08930039980858,376.98190486400017,1484.3806421055476
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,84.99711473871804,758.8232675055006,2947.3092666464
2 changes: 0 additions & 2 deletions benchmark/text-generation-inference/mistral-7b/.env
@@ -1,6 +1,4 @@
MODEL_ID='mistralai/Mistral-7B-Instruct-v0.2'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
19 changes: 19 additions & 0 deletions benchmark/text-generation-inference/run_all.sh
@@ -0,0 +1,19 @@
#!/bin/bash -v
# This is made to be run on the Hugging Face DLAMI on an inferentia/trainium system

# at the end of this script, run
# python generate_csv.py

# change the modelname on the next line.
modelname=${1:-NousResearch/Llama-2-7b-chat-hf}
echo on
#set for your environment if not already set
#export LLMPerf=/home/ubuntu/llmperf

for concurrency in 1 2 4 8 16 32 64 128 256 512
do

./benchmark.sh ${modelname} ${concurrency}


done
2 changes: 1 addition & 1 deletion docs/source/guides/export_model.mdx
@@ -421,7 +421,7 @@ The NeuronX TGI image includes not only NeuronX runtime, but also all packages a
Use the following command to export a model to Neuron using a TGI image:

```
docker run --emtrypoint optimum-cli \
docker run --entrypoint optimum-cli \
-v $(pwd)/data:/data \
--privileged \
ghcr.io/huggingface/neuronx-tgi:latest \