Temporary C microtile is too small sometimes #850

Open
devinamatthews opened this issue Feb 5, 2025 · 1 comment

Comments

@devinamatthews
Member

A temporary C microtile is used in various places, such as on the diagonal of a symmetric matrix (GEMMT), where care must be taken not to write to the unstored portion. The size of this microtile is assumed to be no larger than twice the combined size of all vector registers (on the assumption that "real" microtiles fit in registers, plus some slack). However, several conditions cause a larger microtile to be written:

  • The CRR complex case, where a complex matrix is updated by a real product. The computation is performed in the real domain, and the result then updates a complex microtile that is twice as large.
  • Mixed-precision computation, for example when the computation is performed in single precision and the output is then written in double precision (which is again twice as large).

Currently, this means that a microtile may be "inflated" by as much as 4x. In the future, with a wider range of data types, this factor could be even larger.
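
As a rough illustration of the arithmetic above, here is a minimal C sketch. It is not BLIS code: the helper names are hypothetical, and the register count (32 registers of 64 bytes) and the 32x12 microtile shape are illustrative assumptions chosen only to make the numbers concrete. It compares the assumed temporary buffer size (twice the register file) against the bytes needed once a single-precision real microtile is written out as double-precision complex.

```c
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical helpers for illustration, not BLIS APIs. */

/* Bytes the temporary buffer is assumed to provide: twice the register file. */
static size_t tmp_buf_bytes(size_t num_vec_regs, size_t vec_reg_bytes)
{
    return 2 * num_vec_regs * vec_reg_bytes;
}

/* Bytes needed to hold an mr x nr microtile with elements of elem_bytes each. */
static size_t c_tile_bytes(size_t mr, size_t nr, size_t elem_bytes)
{
    return mr * nr * elem_bytes;
}

/* How much larger the written tile is than the computed one.
   Assumes out_elem_bytes is a multiple of comp_elem_bytes. */
static size_t inflation_factor(size_t comp_elem_bytes, size_t out_elem_bytes)
{
    return out_elem_bytes / comp_elem_bytes;
}

/* True if the written microtile fits in the assumed temporary buffer. */
static bool tile_fits(size_t mr, size_t nr, size_t out_elem_bytes,
                      size_t num_vec_regs, size_t vec_reg_bytes)
{
    return c_tile_bytes(mr, nr, out_elem_bytes)
           <= tmp_buf_bytes(num_vec_regs, vec_reg_bytes);
}
```

With these assumed numbers, a 32x12 single-precision real tile needs 1536 bytes and fits in the 4096-byte buffer, but the same tile written as double-precision complex (16 bytes per element, a 4x inflation) needs 6144 bytes and does not.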

This problem occurs concretely for the SKX configuration when doing zsssgemmt.

@devinamatthews
Member Author

Note to self: Altra has a hard-coded max stack buf size which is bad.

devinamatthews added a commit that referenced this issue Feb 5, 2025
Details:
- See #850 for details on the problem.
- This is a temporary fix which should work for sdcz data types.
- Altra architectures may still not fully work for MP/MD as the stack buffer size is hard-coded.
devinamatthews added a commit that referenced this issue Feb 5, 2025
Details:
- See #850 for details on the problem.
- This is a temporary fix which should work for sdcz data types.
- Altra architectures may still not fully work for MP/MD as the stack buffer size is hard-coded.

(cherry picked from commit 5ad37a8)
devinamatthews added a commit that referenced this issue Feb 8, 2025
Details:
- This PR adds CircleCI testing in addition to TravisCI and Appveyor.
- All of the same tests as on Travis are run, except that different hardware typically ends up being used (usually Zen on Travis, Xeon Platinum on Circle). This has actually exposed a couple of bugs (see #850 and #852).
- The `travis` directory has been renamed to `ci` as it is now shared.
- Running SDE on CircleCI is a bit problematic because glibc changed how CPUID detection is done. This requires running some architectures with different hardware definition files and forcing a config via `BLIS_ARCH_TYPE`.