Accelerate your machine learning and deep learning models by up to 10X.
🔥 UPDATE: Stable Diffusion/DreamBooth acceleration. Up to 2.5X speedup in inference 🔥
voltaML is an open-source, lightweight library to accelerate your machine learning and deep learning models. VoltaML can optimize, compile, and deploy your models to your target CPU and GPU devices with just one line of code.
✅ FP16 Quantization
✅ Int8 Quantization*
✅ Hardware-specific compilation
voltaML has compilation support via three compilers: `VoltaGPUCompiler` (NVIDIA GPUs, TensorRT-based), `VoltaCPUCompiler` (CPUs), and `TVMCompiler` (TVM targets).
Requirements:
- CUDA version > 11.x
- TensorRT == 8.4.1.2
- PyTorch == 1.12 (cu11.x build)
- NVIDIA driver version > 510
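A quick way to check these versions from Python before installing (this snippet assumes `torch` and `tensorrt` are already importable in your environment):

```python
import torch
import tensorrt

print("PyTorch:", torch.__version__)            # expect 1.12 (cu11.x build)
print("CUDA in torch build:", torch.version.cuda)
print("TensorRT:", tensorrt.__version__)        # expect 8.4.1.2
print("CUDA device visible:", torch.cuda.is_available())
```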
```bash
git clone https://github.com/VoltaML/voltaML.git
cd voltaML
python setup.py install
```
```bash
docker pull voltaml/voltaml:v0.4
docker run -it --gpus=all -p "8888:8888" voltaml/voltaml:v0.4 \
	jupyter lab --port=8888 --no-browser --ip 0.0.0.0 --allow-root
```
```python
import torch
from voltaml.compile import VoltaGPUCompiler, VoltaCPUCompiler, TVMCompiler
from voltaml.inference import gpu_performance, cpu_performance

model = torch.load("path/to/model/dir")

# compile the model by giving paths
compiler = VoltaGPUCompiler(
    model=model,
    output_dir="destination/path/of/compiled/model",
    input_shape=(1, 3, 224, 224),  # example input shape
    precision="fp16",  # one of "fp32", "fp16", "int8" - GPU compiler only
    target="llvm",  # target device - TVM compiler only
)

# returns the compiled model
compiled_model = compiler.compile()

# compute and compare performance
gpu_performance(compiled_model, model, input_shape=(1, 3, 224, 224))
cpu_performance(compiled_model, model, compiler="voltaml", input_shape=(1, 3, 224, 224))
cpu_performance(compiled_model, model, compiler="tvm", input_shape=(1, 3, 224, 224))
```
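The benchmark tables below report per-inference latency in milliseconds. If you want to sanity-check such numbers yourself, a minimal hand-rolled GPU benchmark could look like this (a sketch; it assumes both `model` and `compiled_model` are callable like a PyTorch module on CUDA tensors, which may not hold for every backend):

```python
import time
import torch

@torch.no_grad()
def latency_ms(m, input_shape=(1, 3, 224, 224), warmup=10, iters=100):
    x = torch.randn(input_shape).cuda()
    for _ in range(warmup):          # warm-up runs (kernel selection, caches)
        m(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        m(x)
    torch.cuda.synchronize()         # wait until all queued GPU work is done
    return (time.perf_counter() - start) / iters * 1e3

print("PyTorch  (ms):", latency_ms(model.cuda().eval()))
print("compiled (ms):", latency_ms(compiled_model))
```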
The following models have been benchmarked:
- ResNet-50 image classification
- DeeplabV3_MobileNet_v3_Large segmentation
- YOLOv5 object detection
- YOLOv6 object detection
- Bert_Base_Uncased (Hugging Face)
Classification was benchmarked on ImageNet data with batch size = 1 and image size = 224, on an NVIDIA RTX 2080 Ti. In terms of top-1 and top-5 accuracy, we have not seen a drop of more than 1% for the int8 models.
Model | PyTorch (ms) | VoltaGPU FP16 (ms) | VoltaGPU Int8 (ms) | Speedup (PyTorch vs Int8) |
---|---|---|---|---|
squeezenet1_1 | 1.6 | 0.2 | 0.2 | 8.4x |
resnet18 | 2.7 | 0.4 | 0.3 | 9.0x |
resnet34 | 4.5 | 0.7 | 0.5 | 9.0x |
resnet50 | 6.6 | 0.7 | 0.5 | 13.2x |
resnet101 | 13.6 | 1.3 | 1.0 | 13.6x |
densenet121 | 15.7 | 2.4 | 2.0 | 7.9x |
densenet169 | 22.0 | 4.4 | 3.8 | 5.8x |
densenet201 | 26.8 | 6.3 | 5.0 | 5.4x |
vgg11 | 2.0 | 0.9 | 0.5 | 4.0x |
vgg16 | 3.5 | 1.2 | 0.7 | 5.0x |
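The sub-1% accuracy claim above refers to standard top-1/top-5 accuracy. For reference, a minimal top-k accuracy metric in PyTorch looks like this (a generic sketch, not a voltaML API; `fp32_logits`, `int8_logits`, and `labels` are hypothetical tensors):

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices             # (batch, k) predicted class ids
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # is the true label in the top k?
    return hits.float().mean().item()

# usage sketch: compare fp32 vs int8 outputs on the same validation batch
# drop = topk_accuracy(fp32_logits, labels, k=1) - topk_accuracy(int8_logits, labels, k=1)
```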
Object detection inference was run on dummy data with image size = 640 and batch size = 1, on an NVIDIA RTX 2080 Ti.
Model | PyTorch (ms) | VoltaGPU FP16 (ms) | Speedup (PyTorch vs FP16) |
---|---|---|---|
YOLOv5n | 5.2 | 1.2 | 4.3x |
YOLOv5s | 5.1 | 1.6 | 3.2x |
YOLOv5m | 9.1 | 3.2 | 2.8x |
YOLOv5l | 15.3 | 5.1 | 3.0x |
YOLOv5x | 30.8 | 6.4 | 4.8x |
YOLOv6s | 8.8 | 3.0 | 2.9x |
YOLOv6l_relu | 23.4 | 5.5 | 4.3x |
YOLOv6l | 18.1 | 4.1 | 4.4x |
YOLOv6n | 9.1 | 1.6 | 5.7x |
YOLOv6t | 8.6 | 2.4 | 3.6x |
YOLOv6m | 15.5 | 3.5 | 4.4x |
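These detection numbers were presumably produced with the same `VoltaGPUCompiler` API shown above, just with the detector's 640x640 input shape. A sketch (loading YOLOv5 via `torch.hub` is illustrative, and whether a hub-loaded module can be passed to the compiler unchanged is an assumption):

```python
import torch
from voltaml.compile import VoltaGPUCompiler

# illustrative: fetch a pretrained YOLOv5s from torch.hub
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

compiler = VoltaGPUCompiler(
    model=yolo,
    output_dir="compiled/yolov5s",
    input_shape=(1, 3, 640, 640),  # batch size 1, image size 640, as benchmarked
    precision="fp16",
)
compiled_yolo = compiler.compile()
```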
Segmentation inference was run on dummy data with image size = 224 and batch size = 1, on an NVIDIA RTX 2080 Ti.
Model | PyTorch (ms) | VoltaGPU FP16 (ms) | VoltaGPU Int8 (ms) | Speedup (PyTorch vs Int8) |
---|---|---|---|---|
FCN_Resnet50 | 8.3 | 2.3 | 1.8 | 3.6x |
FCN_Resnet101 | 14.7 | 3.5 | 2.5 | 5.9x |
DeeplabV3_Resnet50 | 12.1 | 2.5 | 1.3 | 9.3x |
DeeplabV3_Resnet101 | 18.7 | 3.6 | 2.0 | 9.4x |
DeeplabV3_MobileNetV3_Large | 6.1 | 1.5 | 0.8 | 7.6x |
DeeplabV3Plus_ResNet50 | 6.1 | 1.1 | 0.8 | 7.6x |
DeeplabV3Plus_ResNet34 | 4.7 | 0.9 | 0.8 | 5.9x |
UNet_ResNet50 | 6.2 | 1.3 | 1.0 | 6.2x |
UNet_ResNet34 | 4.3 | 1.1 | 0.8 | 5.4x |
FPN_ResNet50 | 5.5 | 1.2 | 1.0 | 5.5x |
FPN_ResNet34 | 4.2 | 1.1 | 1.0 | 4.2x |
We're adding support for accelerating Hugging Face NLP models with voltaML. This work was inspired by ELS-RD's work. It is still in the early stages, and only the few models listed in the table below are supported. We're working to add more models soon.
```python
from voltaml.compile import VoltaNLPCompile
from voltaml.inference import nlp_performance

model = "bert-base-cased"
backend = ["tensorrt", "onnx"]  # backends to compile and benchmark
seq_len = [1, 1, 1]
task = "classification"
batch_size = [1, 1, 1]

# compile the model, then compare latency against the PyTorch baseline
VoltaNLPCompile(model=model, device="cuda", backend=backend, seq_len=seq_len)
nlp_performance(model=model, device="cuda", backend=backend, seq_len=seq_len)
```
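The three-element `seq_len` (and `batch_size`) lists presumably follow the (min, optimal, max) shape convention that ELS-RD's transformer-deploy uses for TensorRT optimization profiles; treat that reading as an assumption. Under it, compiling for longer inputs would look like:

```python
# assumption: seq_len entries are (min, optimal, max) for the TensorRT profile
VoltaNLPCompile(
    model="bert-base-cased",
    device="cuda",
    backend=["tensorrt", "onnx"],
    seq_len=[16, 128, 256],
)
```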
Model | PyTorch (ms) | VoltaML FP16 (ms) | Speedup |
---|---|---|---|
bert-base-uncased | 6.4 | 1.0 | 6.4x |
Jean-Baptiste/camembert-ner | 6.3 | 1.0 | 6.3x |
gpt2 | 6.6 | 1.2 | 5.5x |
xlm-roberta-base | 6.4 | 1.08 | 5.9x |
roberta-base | 6.6 | 1.09 | 6.1x |
bert-base-cased | 6.2 | 0.9 | 6.9x |
distilbert-base-uncased | 3.5 | 0.6 | 5.8x |
roberta-large | 11.9 | 2.4 | 5.0x |
deepset/xlm-roberta-base-squad2 | 6.2 | 1.08 | 5.7x |
cardiffnlp/twitter-roberta-base-sentiment | 6.0 | 1.07 | 5.6x |
sentence-transformers/all-MiniLM-L6-v2 | 3.2 | 0.42 | 7.6x |
bert-base-chinese | 6.3 | 0.97 | 6.5x |
distilbert-base-uncased-finetuned-sst-2-english | 3.4 | 0.6 | 5.7x |
albert-base-v2 | 6.7 | 1.0 | 6.7x |
voltaTrees ⚡🌴 -> [VoltaML/volta-trees](https://github.com/VoltaML/volta-trees)

An LLVM-based compiler for XGBoost and LightGBM decision trees. `voltatrees` converts trained XGBoost and LightGBM models to optimized machine code, speeding up prediction by ≥10x.
```python
import voltatrees as vt

# load a trained model dump and compile it to native machine code
model = vt.XGBoostRegressor.Model(model_file="NYC_taxi/model.txt")
model.compile()

# df: a pandas DataFrame (or array) of feature rows
model.predict(df)
```
```bash
git clone https://github.com/VoltaML/volta-trees.git
cd volta-trees/
pip install -e .
```
On smaller datasets, voltaTrees is 2-3X faster than DMLC's Treelite. Testing on large-scale datasets has yet to be conducted.
Enterprise customers who would like a fully managed solution hosted on their own cloud can contact us at [email protected]
- Fully managed and cloud-hosted optimization engine.
- Hardware-targeted, optimized Docker images for maximum performance.
- One-click deployment of the compiled models.
- Cost-benefit analysis dashboard for optimal deployment.
- NVIDIA Triton optimized Docker images for large-scale GPU deployment.
- Quantization-Aware Training (QAT)