
Fix wrong scale eps applied #1770

Open
wants to merge 1 commit into main
Conversation

alexsamardzic
Collaborator

Fixes #1766.


pytorch-bot bot commented Feb 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1770

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a66f34d with merge base 2a3fbff:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 24, 2025
@alexsamardzic alexsamardzic added float8 topic: bug fix Use this tag for PRs that fix bugs labels Feb 24, 2025
@alexsamardzic alexsamardzic marked this pull request as draft February 24, 2025 18:50
@alexsamardzic
Collaborator Author

Please don't merge yet, this isn't good enough...

@alexsamardzic alexsamardzic marked this pull request as ready for review February 24, 2025 21:48
@alexsamardzic
Collaborator Author

OK, I think it's ready for review now. Basically, calculate_scale_eps_for_dtype() emulates the scale calculation and returns the minimum value whose reciprocal does not overflow to Inf. This should give choose_qparams_affine() an eps value such that the computed scale maximizes the range of quantized values, while the scale's reciprocal, used when the given tensor is actually quantized, does not become Inf.
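A minimal sketch of that idea, using NumPy float16 as a stand-in (`min_safe_eps` is a hypothetical illustration, not the PR's `calculate_scale_eps_for_dtype()`, which also has to emulate the rest of the scale computation):

```python
import numpy as np

def min_safe_eps(dtype=np.float16):
    """Return the smallest positive `dtype` value whose reciprocal,
    computed in `dtype`, is still finite (does not overflow to Inf)."""
    one = dtype(1.0)
    x = np.nextafter(dtype(0.0), one)    # smallest positive subnormal
    with np.errstate(over="ignore"):     # 1/x overflows until x is large enough
        while not np.isfinite(one / x):
            x = np.nextafter(x, one)     # step to the next representable value
    return x
```

For float16 this lands just above 1/65504 (the reciprocal of the largest finite float16), i.e. around 1.53e-5, so clamping the scale to this eps keeps both the scale and its reciprocal finite.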

Now, this is all probably overkill: it's really only relevant for float16 inputs, it may fix only one of several quantization code paths, etc. So maybe I should just put this in my #1671 for now, specifically for the quantization type where I encountered the issue while testing it?

Successfully merging this pull request may close these issues:

[QST] About NaNs generated during FP16->FP8 quantization