
README improvement #125

Merged — 2 commits merged into JuliaSIMD:master on Nov 10, 2023

Conversation

carstenbauer (Contributor)

See #124.

I've tried to clarify the README and have updated/rerun the examples on Julia 1.9 on a Linux machine with 8 threads.

The sections about `threadlocal` and disabling Polyester threads haven't gotten much attention (but the benchmarks in the latter have been rerun as well).

(I think the package needs an overhaul but this is beyond this PR which just tries to communicate the status quo a bit better.)

@chriselrod (Member) left a comment:

Thanks, looks great!


function axpy_polyester!(y, a, x)
    @batch for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
end
@chriselrod (Member) commented:

Why no muladd?
Because Julia takes a strong stance against unexpectedly giving people better performance and accuracy, I'm pro-normalizing-muladd.

Zen CPUs probably don't execute muladd much faster than a separate mul and add, because they have two adders and two multiply units that can also perform FMA.
For the muladd version on Zen, the adders sit idle, so backend throughput is the same (but fewer instructions and uops make muladd friendlier to the frontend).
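For reference, the muladd variant under discussion would look like the sketch below. The function name `axpy_muladd!` is illustrative, not from the PR; the only change from the README's example is replacing `a * x[i] + y[i]` with `muladd`, which permits the compiler to emit a fused multiply-add where the hardware supports it.

```julia
using Polyester: @batch

# Hypothetical muladd variant of the README's axpy example.
# muladd(a, b, c) computes a*b + c, possibly as a single fma instruction.
function axpy_muladd!(y, a, x)
    @batch for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end
```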

carstenbauer (Contributor, Author) commented:

I'm not strongly opposed to using muladd, I just don't think it matters here. The purpose of this example is to demonstrate how to use @batch. Optimizing the axpy kernel itself is an orthogonal issue IMO and only a distraction.

@chriselrod (Member) commented:

That's fine. FWIW, even on my Intel system, `muladd` wasn't faster; axpy isn't compute bound.


## Important Notes

* `Polyester.@batch` moves arrays to threads by turning them into [StrideArraysCore.PtrArray](https://github.com/JuliaSIMD/StrideArraysCore.jl)s. This means that, under `@batch`, slices create `view`s by default(!). You may want to start Julia with `--check-bounds=yes` while debugging.
@chriselrod (Member) commented:

There is no automatic `@inbounds` as of StrideArraysCore.jl 0.5, but using `--check-bounds=yes` is good advice anyway, especially because `@batch` itself adds `@inbounds`.
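A small sketch of the view semantics the README note describes (the function name `scale_columns!` is illustrative; this assumes the documented behavior that arrays become `PtrArray`s inside `@batch`, so slices act as views):

```julia
using Polyester: @batch

# Inside @batch, arrays are passed to threads as PtrArrays, so a slice
# like A[:, j] behaves as a view rather than a copy: mutating it writes
# back into A. Start Julia with `--check-bounds=yes` while debugging to
# override the @inbounds that @batch inserts.
function scale_columns!(A, factors)
    @batch for j in axes(A, 2)
        col = A[:, j]          # a view under @batch, not a copy
        col .*= factors[j]     # mutates A in place
    end
    return A
end
```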

carstenbauer (Contributor, Author) commented:

Not sure I understand.

  • Will arrays still be turned into PtrArrays?
  • Will slices still default to views?

I'll add a note about @batch automatically adding @inbounds.

(I have to say I'm not a fan of the automatic @inbounds and of defaulting to views inside @batch. For me, @batch is about low-overhead threading; the other two are optimizations that users can add themselves if they want to. But it's your package.)

@chriselrod (Member) commented:

> I think the package needs an overhaul but this is beyond this PR which just tries to communicate the status quo a bit better.

What do you have in mind?

codecov bot commented Nov 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f29a42e) 91.13% compared to head (7832539) 91.13%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #125   +/-   ##
=======================================
  Coverage   91.13%   91.13%           
=======================================
  Files           3        3           
  Lines         451      451           
=======================================
  Hits          411      411           
  Misses         40       40           


carstenbauer (Contributor, Author) commented:

> What do you have in mind?

Apart from what I said above (I'd personally drop the automatic @inbounds and the view behavior): threadlocal is a misnomer and very much broken, and per is somewhat of a misnomer to me as well. Additionally, disabling Polyester threads no longer shows the timing differences it used to, which warrants reconsidering that section/feature. Statements about nesting should also be revisited, given that @threads defaults to :dynamic these days, which is nestable with OK-ish performance.

@chriselrod chriselrod merged commit 0952d11 into JuliaSIMD:master Nov 10, 2023
30 of 31 checks passed
@chriselrod (Member) commented:

Thanks!

I don't really want to change PtrArray. I don't like copy-by-default, so even though it is now documented that slices of AbstractArrays are supposed to copy, maybe I'll make PtrArray no longer subtype AbstractArray.
We could drop the StrideArraysCore dependency and just make our own PtrArray type for passing arrays to the tasks.
