Fixed load issue and update docs for weight-only quantization with intel-extension-for-transformers #666
Conversation
Signed-off-by: Cheng, Penghui <[email protected]>
```diff
@@ -281,6 +289,69 @@ def main():
     )
     parser.add_argument("--dataset_name", nargs="?", default="NeelNanda/pile-10k", const="NeelNanda/pile-10k")
     parser.add_argument("--calib_iters", default=100, type=int, help="calibration iters.")
+    parser.add_argument(
```
I would prefer to keep this example for post-training quantization only, as ITREX is currently not a required dependency. What do you think about adding this example directly to https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch and adding a link to these examples in the README? For example, I see https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation.py
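For readers skimming the truncated hunk above: the 69 added lines extend the script's CLI with weight-only quantization options. A minimal sketch of what such options could look like follows; the flag names, defaults, and help strings are illustrative assumptions, not the PR's actual arguments.

```python
import argparse

# Illustrative sketch only: flag names and defaults are assumptions,
# since the hunk above truncates the 69 added lines.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--apply_woq",
    action="store_true",
    help="Apply weight-only quantization with intel-extension-for-transformers.",
)
parser.add_argument(
    "--woq_bits",
    default=4,
    type=int,
    help="Bit width of the quantized weights.",
)
parser.add_argument(
    "--woq_group_size",
    default=128,
    type=int,
    help="Group size for group-wise weight quantization (-1 for per-channel).",
)
```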
```diff
@@ -126,6 +126,33 @@ mpirun -np <number_of_processes> <RUN_CMD>

 Please refer to INC [documentation](https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#distributed-tuning) and [text-classification](https://github.com/huggingface/optimum-intel/tree/main/examples/neural_compressor/text-classification) example for more details.

+## Weight-only quantization
```
Let's wait for this feature to be more stable before adding it to the documentation (it is currently not compatible with the latest optimum-intel release).
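For context on what the proposed section would document, a weight-only quantization call through the `INCQuantizer` API would look roughly like the sketch below. The `WeightOnlyQuantConfig` import path and its `weight_dtype` argument are assumptions based on the ITREX integration under discussion, not a confirmed stable API.

```python
# Rough sketch of the flow under discussion; the ITREX config class name,
# import path, and arguments are assumptions, not a stable API.
from transformers import AutoModelForCausalLM
from optimum.intel import INCQuantizer
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig  # assumed

model = AutoModelForCausalLM.from_pretrained("gpt2")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=WeightOnlyQuantConfig(weight_dtype="int4"),  # assumed signature
    save_directory="gpt2-woq",
)
```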
```diff
-        if is_intel_extension_for_transformers_version("!=", INTEL_EXTENSION_FOR_TRANSFORMERS_MINIMUM_VERSION):
+        if is_intel_extension_for_transformers_version("<", INTEL_EXTENSION_FOR_TRANSFORMERS_MINIMUM_VERSION):
             raise ImportError(
                 f"Found an incompatible version of `intel-extension-for-transformers`. Found version {_intel_extension_for_transformers_version}, "
-                f"but only version {INTEL_EXTENSION_FOR_TRANSFORMERS_MINIMUM_VERSION} is supported."
+                f"but only version {INTEL_EXTENSION_FOR_TRANSFORMERS_MINIMUM_VERSION} or higher is supported."
```
I think it makes sense to pin the ITREX version at the moment, as it will avoid any undesired impact resulting from potential breaking changes in ITREX (as was the case for ITREX v1.3.0 -> v1.4.0).
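For reference, helpers like the one in the diff above are typically implemented by parsing the installed distribution's version and dispatching on an operator string. A self-contained sketch (the helper name mirrors the diff; the implementation itself is assumed):

```python
# Minimal sketch of an operator-based version gate, assuming the helper
# compares the installed ITREX version against a reference with `packaging`.
import importlib.metadata
import operator

from packaging import version

_OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
        "!=": operator.ne, ">=": operator.ge, ">": operator.gt}

def is_intel_extension_for_transformers_version(op: str, ref: str) -> bool:
    """Return True if the installed ITREX version satisfies `<op> <ref>`."""
    try:
        installed = importlib.metadata.version("intel-extension-for-transformers")
    except importlib.metadata.PackageNotFoundError:
        return False
    return _OPS[op](version.parse(installed), version.parse(ref))
```

Under this pattern, the original `"!="` check rejects everything but the single pinned release, which is what the reviewer suggests to guard against breakages like v1.3.0 -> v1.4.0, while the proposed `"<"` check accepts any newer release.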
```diff
@@ -297,6 +297,7 @@ def quantize(
             )

             self._quantized_model.quantization_config = quantization_config
+            self._quantized_model.config.quantization_config = quantization_config
```
I think we should keep the model configs separated from the quantization config.
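To make the distinction concrete: attributes set on `model.config` are serialized into `config.json` by `save_pretrained`, whereas plain attributes on the model object are not persisted. A sketch using a neutral attribute name (`quantization_config` itself may receive special handling in `transformers`, so the field name here is hypothetical):

```python
# Sketch of the serialization difference the comment points at: config
# attributes survive a save/load round trip, plain model attributes do not.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.my_field = {"bits": 4}          # hypothetical; lives only on this object
model.config.my_field = {"bits": 4}   # hypothetical; serialized into config.json

model.save_pretrained("gpt2-saved")
reloaded = AutoModelForCausalLM.from_pretrained("gpt2-saved")
print(getattr(reloaded, "my_field", None))         # None: not persisted
print(getattr(reloaded.config, "my_field", None))  # {'bits': 4}
```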
What does this PR do?
This PR fixes the load issue for weight-only quantized models and updates the documentation for weight-only quantization with intel-extension-for-transformers.
This PR depends on #658.
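Since the fix targets reloading, the intended round trip is presumably along the lines of the sketch below; `INCModelForCausalLM` is the optimum-intel loading entry point, but its weight-only reload behavior in this PR is an assumption here.

```python
# Hedged sketch of the reload path this PR fixes; weight-only specifics assumed.
from optimum.intel import INCModelForCausalLM

# Load a previously saved weight-only quantized model from disk.
model = INCModelForCausalLM.from_pretrained("gpt2-woq")
```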
Before submitting