Extending TGI benchmarking and documentation (#621)
* Initial Llama3-70b test

* Missing .env file. .gitignore strikes again!

* Adding script to run multiple batch sizes at once

* changed mode to +x on shell script

* fixing my poor bash syntax

* Renaming directory to test on Trainium

* adding trainium

* Trainium compose example added

* Readme changes

* More Readme changes

* Adding BS1 numbers

* misspelling in export_model.mdx

* misspelling in benchmark/text-generation-inference/README.md

Co-authored-by: David Corvoysier <[email protected]>

* Removing redundant HF_BATCH_SIZE and HF_SEQUENCE_LENGTH settings from .env and docker compose.

* Trainium batch size 8 numbers added.

---------

Co-authored-by: David Corvoysier <[email protected]>
jimburtoft and dacorvo authored Jun 5, 2024
1 parent a3bb344 commit af0506f
Showing 15 changed files with 257 additions and 12 deletions.
103 changes: 102 additions & 1 deletion benchmark/text-generation-inference/README.md
@@ -1,16 +1,62 @@
# NeuronX TGI benchmark using multiple replicas

## Local environment setup

These configurations were tested on an inf2.48xlarge instance running the Hugging Face Deep Learning AMI from the AWS Marketplace.

Get the configurations by cloning the repository:

```shell
$ git clone https://github.com/huggingface/optimum-neuron.git
$ cd optimum-neuron/benchmark/text-generation-inference/
```


## Select model and configuration

Edit the `.env` file to select the model to use for the benchmark and its configuration.
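
For reference, the smaller-model configuration shipped in this repository looks roughly like the sketch below (an excerpt of `llama-7b/.env`; additional settings may be present in the actual file):

```shell
# Excerpt of llama-7b/.env (a sketch; see the file itself for all settings)
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
```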

The following instructions assume that you are testing a locally built image, so Docker will have stored it as `neuronx-tgi:latest`.

You can confirm this by running:

```shell
$ docker image ls
```

If you have not built the image locally, you can pull it and retag it with the following commands:

```shell
$ docker pull ghcr.io/huggingface/neuronx-tgi:latest
$ docker tag ghcr.io/huggingface/neuronx-tgi:latest neuronx-tgi:latest
```
You should then see a single IMAGE ID with two different tags:

```shell
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
neuronx-tgi latest f5ba57f8517b 12 hours ago 11.3GB
ghcr.io/huggingface/neuronx-tgi latest f5ba57f8517b 12 hours ago 11.3GB
```


Alternatively, you can edit the appropriate docker-compose.yaml to supply the fully qualified image name by changing `neuronx-tgi:latest` to `ghcr.io/huggingface/neuronx-tgi:latest`.
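
For example, the image reference in the service definition would then look like this (a minimal excerpt; the rest of the service stays unchanged):

```yaml
services:
  tgi-1:
    # use the published image instead of a locally built one
    image: ghcr.io/huggingface/neuronx-tgi:latest
```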

## Start the servers

For smaller models, you can use the multi-server configuration with a load balancer:

```shell
$ docker compose --env-file llama-7b/.env up
```

Note: replace the `.env` file to change the model configuration.

For larger models, use their dedicated docker compose files:

```shell
$ docker compose -f llama3-70b/docker-compose.yaml --env-file llama3-70b/.env up
```

Note: edit the `.env` file to change the model configuration.

## Run the benchmark

@@ -33,4 +79,59 @@ The `model_id` must match the one corresponding to the selected `.env` file.
$ ./benchmark.sh NousResearch/Llama-2-7b-chat-hf 128
```

If you would like to run the benchmark script multiple times with different numbers of concurrent users, you can use:

```shell
$ ./run_all.sh NousResearch/Meta-Llama-3-70B-Instruct
```

### Compiling the model

If you are trying to run a configuration or a model that is not available in the cache, you can compile the model before you run it, then load it locally.

See the [llama3-70b-trn1.32xlarge](llama3-70b-trn1.32xlarge) as an example.

It is best to compile the model with the software in the container you will be using to ensure all library versions match.

As an example, you can compile with the following command. **If you make changes, make sure the batch size, sequence length, and num_cores used for compilation match the corresponding settings in the `.env` file.**

```shell
docker run -p 8080:80 \
-v $(pwd):/data \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
--device=/dev/neuron12 \
--device=/dev/neuron13 \
--device=/dev/neuron14 \
--device=/dev/neuron15 \
-ti \
--entrypoint "optimum-cli" neuronx-tgi:latest \
export neuron --model NousResearch/Meta-Llama-3-70B-Instruct \
--sequence_length 4096 \
--batch_size 4 \
--num_cores 32 \
/data/exportedmodel/
```
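
Once the export finishes, the compiled model should appear under `exportedmodel/` in your current working directory, since `/data` inside the container is mapped there. A quick check might look like this (a sketch; the exact file layout depends on the optimum-neuron version):

```shell
$ ls exportedmodel/
```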

Note that the `.env` file sets MODEL_ID to a path so that the model is loaded from the `/data` directory.

Also, the docker-compose.yaml file includes a volume mapping to the current working directory, as well as additional Neuron device mappings, because a trn1.32xlarge has 32 cores (16 devices).
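
As a rough sketch, the relevant pieces of the trn1.32xlarge compose file look like this (abbreviated from the file added in this commit; its `.env` file points MODEL_ID at `/data/exportedmodel`):

```yaml
# llama3-70b-trn1.32xlarge/docker-compose.yaml (abbreviated)
services:
  tgi-1:
    image: neuronx-tgi:latest
    volumes:
      - $PWD:/data            # current working directory is visible as /data in the container
    devices:
      - "/dev/neuron0"
      # ... one entry per device, through /dev/neuron15 (16 devices = 32 cores)
```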

Make sure you run the export command and the docker compose command from the same directory, since the `/data` volume is mapped to the current working directory.

For this example:
```shell
$ docker compose -f llama3-70b-trn1.32xlarge/docker-compose.yaml --env-file llama3-70b-trn1.32xlarge/.env up
```


6 changes: 0 additions & 6 deletions benchmark/text-generation-inference/docker-compose.yaml
@@ -8,8 +8,6 @@ services:
environment:
- PORT=8081
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
@@ -29,8 +27,6 @@ services:
environment:
- PORT=8082
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
@@ -50,8 +46,6 @@ services:
environment:
- PORT=8083
- MODEL_ID=${MODEL_ID}
- HF_BATCH_SIZE=${HF_BATCH_SIZE}
- HF_SEQUENCE_LENGTH=${HF_SEQUENCE_LENGTH}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=8
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
2 changes: 0 additions & 2 deletions benchmark/text-generation-inference/llama-7b/.env
@@ -1,6 +1,4 @@
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
@@ -0,0 +1,7 @@
MODEL_ID='NousResearch/Meta-Llama-3-70B-Instruct'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=4
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=4096
# MESSAGES_API_ENABLED='true' # Enable the messages API

@@ -0,0 +1,29 @@
version: '3.7'

services:
tgi-1:
image: neuronx-tgi:latest
ports:
- "8080:8080"
environment:
- PORT=8080
- MODEL_ID=${MODEL_ID}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=24
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
- MAX_INPUT_LENGTH=${MAX_INPUT_LENGTH}
- MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
- MAX_CONCURRENT_REQUESTS=512
devices:
- "/dev/neuron0"
- "/dev/neuron1"
- "/dev/neuron2"
- "/dev/neuron3"
- "/dev/neuron4"
- "/dev/neuron5"
- "/dev/neuron6"
- "/dev/neuron7"
- "/dev/neuron8"
- "/dev/neuron9"
- "/dev/neuron10"
- "/dev/neuron11"
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,30.170455300639418,0.7694021150018671,31.60879417184807
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,30.942167505908383,3.5238446079965797,42.88674224324184
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,31.216016638279726,11.17000110349909,70.63270124966144
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,31.442002397963858,28.047803349007154,138.61752441316904
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,31.622091010175804,60.1780687940045,290.1370155727129
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,31.734201827193452,123.7196121570014,523.1909448482422
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,31.72544803588566,250.5079138929941,1010.6170931343223
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,31.805759572717598,512.6742304505024,1997.340511562319
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,31.776200117214845,1025.654853393993,3954.0575741908333
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,31.715118036351587,2034.146784478493,8002.648273082725
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,18.818667211424472,1.3884793975012144,51.46871325828836
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,32.22257477833452,2.0121661404991755,56.734265583687296
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,50.19917175671667,5.205651430500438,66.04042245148653
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,52.13272738944358,9.568476632499369,97.32615035298838
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,53.59997031445967,26.087651531999654,191.19227161475598
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,56.08684244759754,61.25285707449984,310.16900484570965
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,57.40338464731561,129.3146581359997,560.2474255463762
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,58.39025853766574,267.3882590960002,1094.9986170264501
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,58.589480601098536,541.6153878579971,2147.5413489446523
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,58.69645477077839,1085.1772966810022,4231.7554182432905
@@ -0,0 +1,8 @@
#MODEL_ID='NousResearch/Meta-Llama-3-70B-Instruct'
MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='fp16'
MAX_BATCH_SIZE=4
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=4096
# MESSAGES_API_ENABLED='true' # Enable the messages API

@@ -0,0 +1,36 @@
version: '3.7'

services:
tgi-1:
image: neuronx-tgi:latest
ports:
- "8080:8080"
environment:
- PORT=8080
- MODEL_ID=${MODEL_ID}
- HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
- HF_NUM_CORES=32
- MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
- MAX_INPUT_LENGTH=${MAX_INPUT_LENGTH}
- MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
- MAX_CONCURRENT_REQUESTS=512
volumes:
- $PWD:/data
devices:
- "/dev/neuron0"
- "/dev/neuron1"
- "/dev/neuron2"
- "/dev/neuron3"
- "/dev/neuron4"
- "/dev/neuron5"
- "/dev/neuron6"
- "/dev/neuron7"
- "/dev/neuron8"
- "/dev/neuron9"
- "/dev/neuron10"
- "/dev/neuron11"
- "/dev/neuron12"
- "/dev/neuron13"
- "/dev/neuron14"
- "/dev/neuron15"

@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,38.29638310438374,0.5521726660008426,24.784959740501066
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,38.98036959617541,2.72243953349971,32.827924415254174
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,39.39299322930307,8.926065296996967,63.795771842799695
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,39.85480734427003,22.479033984491252,110.33245410384168
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,39.797703130119444,48.74777327400079,218.4971534548553
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,39.88112179496438,98.32968477499526,419.0164926030421
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,40.021570341867225,201.50347035600862,787.0418267487788
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,40.15190355766733,412.9219288924942,1608.1377339868322
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,40.10404829156176,831.7238280020028,3167.7755826448656
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,39.94606130182408,1654.066714687011,6348.469898092637
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,17.8322790536497,0.9939256490033586,54.45429111182844
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,31.140113024869468,1.418605798491626,58.17940704286386
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,52.71447508703364,3.691673280511168,65.510341492747
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,85.23757246875635,7.40343523149204,79.86574747355823
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,83.41704442714865,12.134337133495137,119.80365178993138
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,86.31413401709217,33.19637775150477,221.51387761253872
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,91.54051788296289,78.17263232148252,378.5575452672668
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,93.59227409861985,163.85781266850245,709.4836254794548
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,94.49695504491365,332.89309809000406,1342.054465909721
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,94.76202310893393,671.8385370509932,2633.1926459323054
@@ -0,0 +1,11 @@
model_id,concurrent requests,throughput (t/s),Time-to-first-token @ P50 (s),average latency (ms)
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,1,27.321283482983713,0.9897541589998582,34.53017190612728
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,2,47.14780790833105,1.4317841799993403,38.47682874008382
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,4,75.46880157534952,3.7293467640001836,45.219761063884626
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,8,76.656177664245,6.710071522500584,67.5562098563004
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,16,78.10745154737947,18.174910198499674,130.32796764867985
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,32,80.94695720514072,42.99618862100033,211.52529640942643
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,64,83.41961944293132,90.68870028399942,387.7336944140728
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,128,84.68410927601217,187.20342993849863,761.1909438667759
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,256,85.08930039980858,376.98190486400017,1484.3806421055476
huggingface/NousResearch/Meta-Llama-3-70B-Instruct,512,84.99711473871804,758.8232675055006,2947.3092666464
2 changes: 0 additions & 2 deletions benchmark/text-generation-inference/mistral-7b/.env
@@ -1,6 +1,4 @@
MODEL_ID='mistralai/Mistral-7B-Instruct-v0.2'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=32
MAX_INPUT_LENGTH=3072
19 changes: 19 additions & 0 deletions benchmark/text-generation-inference/run_all.sh
@@ -0,0 +1,19 @@
#!/bin/bash -v
# This is made to be run on the Hugging Face DLAMI on an inferentia/trainium system

# at the end of this script, run
# python generate_csv.py

# change the modelname on the next line.
modelname=${1:-NousResearch/Llama-2-7b-chat-hf}
echo on
#set for your environment if not already set
#export LLMPerf=/home/ubuntu/llmperf

for concurrency in 1 2 4 8 16 32 64 128 256 512
do

./benchmark.sh ${modelname} ${concurrency}


done
2 changes: 1 addition & 1 deletion docs/source/guides/export_model.mdx
@@ -421,7 +421,7 @@ The NeuronX TGI image includes not only NeuronX runtime, but also all packages a
Use the following command to export a model to Neuron using a TGI image:

```
docker run --emtrypoint optimum-cli \
docker run --entrypoint optimum-cli \
-v $(pwd)/data:/data \
--privileged \
ghcr.io/huggingface/neuronx-tgi:latest \