Releases: huggingface/optimum-intel
v1.6.1: Patch release
v1.6.0: INC API refactorization
Refactorization of the INC API for neural-compressor v2.0 (#118)
The INCQuantizer should be used to apply post-training (dynamic or static) quantization.
from transformers import AutoModelForQuestionAnswering
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer
model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
# Apply dynamic quantization and save the resulting model in the given directory
quantizer.quantize(quantization_config=quantization_config, save_directory="quantized_model")
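Static post-training quantization additionally needs a calibration dataset from which activation statistics are collected. Below is a minimal sketch of how this could look, assuming the get_calibration_dataset helper and the calibration_dataset argument of quantize (check the optimum-intel documentation for the exact signatures); the SQuAD preprocessing is only illustrative.
from functools import partial
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer
model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Illustrative preprocessing: tokenize question/context pairs from SQuAD
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["question"], examples["context"], padding="max_length", max_length=384, truncation=True)
# Static quantization requires calibration data to compute the activation ranges
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer = INCQuantizer.from_pretrained(model)
# Assumed helper building a small calibration set from the SQuAD training split
calibration_dataset = quantizer.get_calibration_dataset(
    "squad",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and save the resulting model in the given directory
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory="quantized_model_static",
)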
The INCTrainer should be used to apply and combine compression techniques such as pruning, quantization and distillation during training.
from transformers import TrainingArguments, default_data_collator
-from transformers import Trainer
+from optimum.intel.neural_compressor import INCTrainer
+from neural_compressor import QuantizationAwareTrainingConfig
# Load the quantization configuration detailing the quantization we wish to apply
+quantization_config = QuantizationAwareTrainingConfig()
-trainer = Trainer(
+trainer = INCTrainer(
model=model,
+ quantization_config=quantization_config,
args=TrainingArguments("quantized_model", num_train_epochs=3.0),
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
tokenizer=tokenizer,
data_collator=default_data_collator,
)
train_result = trainer.train()
trainer.save_model()
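Since the INCTrainer is designed to combine compression methods, a pruning or distillation configuration can in principle be passed alongside the quantization one. The sketch below is only an illustration and assumes a pruning_config argument together with neural-compressor's WeightPruningConfig; the exact argument names and defaults may differ.
from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.config import WeightPruningConfig
# Assumed combination: quantization aware training together with magnitude pruning
quantization_config = QuantizationAwareTrainingConfig()
pruning_config = WeightPruningConfig(pruning_type="magnitude", target_sparsity=0.2)
trainer = INCTrainer(
    model=model,
    quantization_config=quantization_config,
    pruning_config=pruning_config,  # assumed argument name
    args=TrainingArguments("compressed_model", num_train_epochs=3.0),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)
train_result = trainer.train()
trainer.save_model()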
To load a quantized model, you can just replace your AutoModelForXxx class with the corresponding INCModelForXxx class.
from optimum.intel.neural_compressor import INCModelForSequenceClassification
loaded_model_from_hub = INCModelForSequenceClassification.from_pretrained(
"Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
)
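The loaded model can then be used like the original transformers model; a small sketch, assuming the quantized model returns the usual sequence classification output with logits:
import torch
from transformers import AutoTokenizer
model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("He's a dreadful magician.", return_tensors="pt")
with torch.no_grad():
    # Assumes the output exposes logits like the original model
    outputs = loaded_model_from_hub(**inputs)
predicted_class_id = outputs.logits.argmax(dim=-1).item()
print(loaded_model_from_hub.config.id2label[predicted_class_id])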
v1.5.5: Patch release
v1.5.4: Patch release
Fix the IPEX context manager enabling inference mode by returning the original model when IPEX cannot optimize it (#132)
v1.5.3: Patch release
v1.5.2: Patch release
v1.5.1: Patch release
v1.5.0: OpenVINO quantization
Quantization
- Add OVQuantizer enabling OpenVINO NNCF post-training static quantization (#50)
- Add OVTrainer enabling OpenVINO NNCF quantization aware training (#67)
- Add OVConfig, the configuration containing the information about the quantization process (#65)
The quantized models resulting from the OVQuantizer and the OVTrainer are exported to the OpenVINO IR and can be loaded with the corresponding OVModelForXxx class to perform inference with OpenVINO Runtime.
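For illustration, post-training static quantization with the OVQuantizer could look like the sketch below; the get_calibration_dataset helper and the quantize arguments follow the optimum-intel documentation of that release and may have evolved since.
from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel.openvino import OVQuantizer
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Illustrative preprocessing for the SST-2 sentences
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)
quantizer = OVQuantizer.from_pretrained(model)
# Build a small calibration set used by NNCF to collect activation statistics
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the quantized model to the OpenVINO IR
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="ov_quantized_model")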
OVModel
- Add OVModelForCausalLM enabling OpenVINO Runtime for models with a causal language modeling head (#76)
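As an illustration, the new class can be used for text generation in the usual way (GPT-2 is used here only as an example checkpoint):
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
model_id = "gpt2"  # example checkpoint
# Export the PyTorch checkpoint to the OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("OpenVINO Runtime makes inference", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))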
v1.4.0: OVModels for OpenVINO inference
OVModel classes were integrated with the 🤗 Hub in order to easily export models to the OpenVINO IR, save and load the resulting models, and easily perform inference.
- Add OVModel classes enabling OpenVINO inference #21
Below is an example that downloads a DistilBERT model from the Hub, exports it to the OpenVINO IR and saves it:
from optimum.intel.openvino import OVModelForSequenceClassification
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = OVModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
# The directory where the exported model will be saved
save_directory = "a_local_path"
model.save_pretrained(save_directory)
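The exported model can then be reloaded from the local directory without going through the export step again, for example:
# Reload the exported OpenVINO IR model from the local directory
model = OVModelForSequenceClassification.from_pretrained(save_directory)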
The currently supported model topologies are the following:
- OVModelForSequenceClassification
- OVModelForTokenClassification
- OVModelForQuestionAnswering
- OVModelForFeatureExtraction
- OVModelForMaskedLM
- OVModelForImageClassification
- OVModelForSeq2SeqLM
Pipelines
Support for Transformers pipelines was added, providing an easy way to use OVModels for inference.
-from transformers import AutoModelForSeq2SeqLM
+from optimum.intel.openvino import OVModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline
model_id = "Helsinki-NLP/opus-mt-en-fr"
-model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+model = OVModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
text = "He never went out without a book under his arm, and he often came back with two."
outputs = pipe(text)
By default, OVModels support dynamic shapes, enabling inputs of any shape (without any constraint on the batch size or sequence length). To decrease latency, static shapes can be enabled by giving the desired input shapes.
- Add OVModel static shapes #41
# Fix the batch size to 1 and the sequence length to 20
model.reshape(1, 20)
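Once the shapes are fixed, the inputs fed to the model have to match them; a small sketch, reusing the sequence classification model and model_id from the export example above:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The inputs must now match the static shapes, so pad (and truncate) to the fixed length of 20
inputs = tokenizer(
    "He never went out without a book under his arm.",
    padding="max_length",
    max_length=20,
    truncation=True,
    return_tensors="pt",
)
outputs = model(**inputs)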
FP16 precision can also be enabled.
- Add OVModel fp16 support #45
model.half()