In this project, I implemented a single-GPU training flow for GPT using Hugging Face Accelerate, exported the trained model to ONNX for faster inference, and deployed it to Hugging Face Spaces. Training logs are available on Weights & Biases (wandb).
https://huggingface.co/spaces/prerana1205/GPT-Inference
| Inference Type | Time Taken for 1000 Tokens |
|---|---|
| PyTorch Model | 83 secs |
| Quantized Model | 81 secs |
| ONNX Quantized | 56 secs |
Clone the project

```shell
git clone https://github.com/kurchi1205/GPT-Scratch.git
```

Go to the project directory

```shell
cd GPT-Scratch
```

Install the dependencies

```shell
pip install -r requirements_train.txt
```

Start the training

```shell
./train.sh
```
Export and quantize the model

```shell
python export.py
```

This will export the model to ONNX and quantize it.
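The idea behind the quantization step can be illustrated in plain NumPy: weights are stored as int8 plus a per-tensor scale, roughly quartering their size, and are dequantized on the fly. This is only a sketch of the concept, not the export script's actual code (which would go through ONNX Runtime's quantization tooling).

```python
# Hedged illustration of per-tensor symmetric int8 quantization.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)  # a float32 weight matrix

scale = np.abs(w).max() / 127.0                       # per-tensor scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale         # rebuilt at compute time

max_err = float(np.abs(w - w_dequant).max())          # bounded by scale / 2
```

The int8 tensor takes a quarter of the float32 storage, and the reconstruction error is at most half the scale factor.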
Run inference

```shell
python generate.py       # for the PyTorch model
python generate_onnx.py  # for the ONNX model
```
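Both generate scripts run the same autoregressive loop: feed the context through the model, take the logits at the last position, append the chosen next token, and repeat. A minimal sketch of that loop, with a toy stand-in for the model (the real scripts would call the PyTorch module or an ONNX Runtime session instead):

```python
# Greedy autoregressive decoding loop with a placeholder "model".
import numpy as np

VOCAB = 10

def toy_logits(tokens):
    """Placeholder model: puts all its weight on (last_token + 1) mod VOCAB."""
    logits = np.zeros(VOCAB)
    logits[(tokens[-1] + 1) % VOCAB] = 1.0
    return logits

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        next_id = int(np.argmax(toy_logits(tokens)))  # greedy pick
        tokens.append(next_id)
    return tokens

out = generate([3], 4)  # → [3, 4, 5, 6, 7]
```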
I have also implemented the FlashAttention flow: the CUDA kernel is not included, but the matrix-slicing (tiling) part has been implemented.