RFC for histogram CPU implementation #1930
Determining/assuming a certain number of threads is not necessary, and should be avoided for the sake of parallel composability. We need some sort of thread-local storage, perhaps one that is accessible by other threads for data aggregation. It certainly can be done without knowing the number of threads.
This is a good point for parallel composability.
However, I am curious about the details of making this happen in practice.
What do you think is the best way to be agnostic of the number of threads while still having some reasonable control over the following:
I think these may have different answers for TBB and OpenMP.
There is some precedent for utilizing the maximum number of threads for accumulation in the OpenMP backend with `transform_reduce` (link). This is of course not to say it is correct or best.
On the OpenMP side, perhaps we shouldn't be using `parallel_for` but rather extending some of its building blocks for use by `histogram` and `parallel_for`. With more control over `__chunk_partitioner` and the parallel tasks / loop, we should have what we need, and we can understand our number of local temporary copies / threads at that level.
For TBB, I'm more in the dark since that is more recursively decomposed into chunks, but I am looking further...
For TBB, does it make sense to do something similar to this code?
The plan would be to isolate, get the max concurrency of the task arena, and allocate temporaries based on that. Then we would use `current_thread_index()` within a function brick which is decomposed directly or via a `parallel_for` pattern, followed by a launch of tasks to accumulate those temporaries into the output.
It seems like this is not thinking very "tbb-like", because we would be interacting directly with thread indices and the number of threads, which is discouraged. However, in this case I think it's important to control the allocation while not sacrificing the ability to use a small grain size for work stealing and load balance.
Is there some more tbb-like way of doing this which would have similar control over the footprint?
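Roughly, something like the following sketch is what I am picturing (the function name, `get_bin`, and the iterator handling are placeholders for illustration, not existing oneDPL code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

// Placeholder sketch: get_bin stands in for the real binning brick.
template <typename _RandomIt, typename _OutIt, typename _BinFn>
void histogram_tbb_sketch(_RandomIt first, _RandomIt last, _OutIt out, std::size_t num_bins, _BinFn get_bin)
{
    tbb::this_task_arena::isolate([&] {
        // Upper bound on concurrency in this arena, not an exact thread count.
        const std::size_t max_threads = tbb::this_task_arena::max_concurrency();
        std::vector<std::vector<std::uint32_t>> tmp(max_threads,
                                                    std::vector<std::uint32_t>(num_bins, 0));

        tbb::parallel_for(tbb::blocked_range<_RandomIt>(first, last), [&](const auto& r) {
            // The arena slot index is < max_concurrency and stable within a task body.
            auto& local = tmp[tbb::this_task_arena::current_thread_index()];
            for (auto it = r.begin(); it != r.end(); ++it)
                ++local[get_bin(*it)];
        });

        // Second pass: accumulate the per-thread temporaries into the output.
        tbb::parallel_for(std::size_t(0), num_bins, [&](std::size_t bin) {
            for (std::size_t t = 0; t < max_threads; ++t)
                out[bin] += tmp[t][bin];
        });
    });
}
```

The grain size of the first `parallel_for` can stay small for load balance, since the temporary footprint scales with `max_concurrency()` rather than with the number of chunks.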
Aha, this is what I think I was missing...
https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneTBB/source/thread_local_storage/enumerable_thread_specific_cls.rst
It may make it harder to determine the appropriate heuristic for choosing between algorithms, but I think this is the "tbb way" of doing this which I was missing.
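For reference, a minimal sketch of how `enumerable_thread_specific` could cover the histogram case (again, `get_bin` and the surrounding names are just placeholders):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>

template <typename _RandomIt, typename _OutIt, typename _BinFn>
void histogram_ets_sketch(_RandomIt first, _RandomIt last, _OutIt out, std::size_t num_bins, _BinFn get_bin)
{
    using bins_t = std::vector<std::uint32_t>;
    // Each thread that touches tls lazily gets its own zero-initialized copy of the bins.
    tbb::enumerable_thread_specific<bins_t> tls([num_bins] { return bins_t(num_bins, 0); });

    tbb::parallel_for(tbb::blocked_range<_RandomIt>(first, last), [&](const auto& r) {
        bins_t& local = tls.local();
        for (auto it = r.begin(); it != r.end(); ++it)
            ++local[get_bin(*it)];
    });

    // Combine however many thread-local copies were actually created.
    for (const bins_t& local : tls)
        for (std::size_t bin = 0; bin < num_bins; ++bin)
            out[bin] += local[bin];
}
```

The number of temporary copies then depends on how many worker threads actually participate, which is the part that makes an up-front heuristic for algorithm selection harder.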
Yes, in TBB there is `enumerable_thread_specific`. But even without TBB, something like this could be done with e.g. a hash map protected by a lock. A thread ID / native handle would be the key to the map; and if it is guaranteed that insertions do not invalidate iterators/references to the map nodes, the lock would only need to be taken for getting such a reference, which could then be used to update the content without data races.
I am not sure if it's better to have a generic implementation like I outlined above, or if a special "interface" with backend-specific implementations is necessary for performance reasons.
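A minimal sketch of that generic fallback, assuming C++17 and using `std::unordered_map` (node-based, so insertion does not invalidate references to existing elements):

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Backend-agnostic thread-local-ish storage: the lock is only held while
// finding/creating a thread's slot, not while updating the bins themselves.
struct thread_local_bins
{
    explicit thread_local_bins(std::size_t num_bins) : _num_bins(num_bins) {}

    std::vector<std::uint32_t>& local()
    {
        std::lock_guard<std::mutex> lock(_mtx);
        auto [it, inserted] = _map.try_emplace(std::this_thread::get_id(), _num_bins, 0u);
        return it->second; // stays valid even if other threads insert later
    }

    // Single-threaded combine step, run once all workers are done.
    template <typename _OutIt>
    void combine(_OutIt out) const
    {
        for (const auto& [tid, bins] : _map)
            for (std::size_t b = 0; b < _num_bins; ++b)
                out[b] += bins[b];
    }

private:
    std::unordered_map<std::thread::id, std::vector<std::uint32_t>> _map;
    std::mutex _mtx;
    std::size_t _num_bins;
};
```

Each worker would call `local()` once per chunk, increment its bins without further locking, and a final pass would call `combine()` into the real output.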
This is acceptable, as it uses the upper limit on the number of threads, not an exact number of threads. Similarly, for OpenMP it is OK to query and use the number of threads in the "current" parallel region - it is not going to change during the algorithm execution.
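For example, under that constraint the OpenMP path could look something like this sketch (hypothetical names; a real implementation would go through the backend's chunking machinery rather than a raw `omp for`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <omp.h>

template <typename _RandomIt, typename _OutIt, typename _BinFn>
void histogram_omp_sketch(_RandomIt first, _RandomIt last, _OutIt out, std::size_t num_bins, _BinFn get_bin)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    std::vector<std::vector<std::uint32_t>> tmp;

#pragma omp parallel
    {
        // The thread count of this region is fixed for its whole lifetime.
#pragma omp single
        tmp.assign(omp_get_num_threads(), std::vector<std::uint32_t>(num_bins, 0));
        // Implicit barrier after single: tmp is fully sized before any writes.

        auto& local = tmp[omp_get_thread_num()];
#pragma omp for
        for (std::size_t i = 0; i < n; ++i)
            ++local[get_bin(first[i])];

        // Implicit barrier after the loop, then reduce the bins across threads.
#pragma omp for
        for (std::size_t bin = 0; bin < num_bins; ++bin)
            for (const auto& t : tmp)
                out[bin] += t[bin];
    }
}
```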
I guess you mean that if the output data type is not `std::atomic`, in C++17 there is no standard way to do atomic operations on these data.
I think we would not need a full-fledged `atomic_ref`; we would need something like an "atomic_cast", and - given that the output data are counters of integral type - that should not be a huge deal.
Yes, you are correct. I haven't taken a deep look at all compiler / standard library implementations in our support matrix at this point. It may be as simple as a `reinterpret_cast` in most cases, but that isn't guaranteed by the standard to work. I agree this is worth looking at, and it may not be so bad.
I added a little about my proposed strategy for handling atomics. I think the best option is to only worry about what we need, which is explicitly an atomic increment, and to do this using the best method available. This allows us the simplest path forward.
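To make that concrete, the kind of helper I have in mind only covers an increment, roughly like the sketch below (the dispatch shown is illustrative, not the exact set of compilers/paths we would ship):

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Sketch: we only ever need "atomically add 1 to this counter", so no
// general-purpose atomic_ref replacement is required.
template <typename _T>
void atomic_increment(_T& counter)
{
    static_assert(std::is_integral_v<_T>, "histogram counters are integral");
#if defined(__cpp_lib_atomic_ref)
    // C++20 path: the straightforward standard way.
    std::atomic_ref<_T>(counter).fetch_add(_T(1), std::memory_order_relaxed);
#elif defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin; valid for suitably aligned integral types.
    __atomic_fetch_add(&counter, _T(1), __ATOMIC_RELAXED);
#else
    // Other compilers would need their own intrinsic (e.g. the Interlocked
    // family on MSVC) or a lock-based fallback; left out of this sketch.
    static_assert(sizeof(_T) == 0, "no atomic increment path in this sketch");
#endif
}
```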
A couple of papers that might be of interest:
Thanks, these are interesting, and mainly follow similar lines to the approaches outlined here. I will look more into the `pcwar` SIMD instruction to understand if it is viable to use here.
While it's interesting to explore, I believe we basically depend on OpenMP SIMD to provide our SIMD operations. We won't be able to serve more complex intrinsic operations if we want to stick to that.
Further, I'm not actually finding anything like the pcwar instruction(s) they refer to in https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. I do find some intrinsics for conflict detection across lanes, which could be useful, but again those won't be available through the OpenMP interface.
I think our options are basically to decline SIMD or to keep a duplicate of the output for each SIMD lane. I think even that may be more "hands on" with SIMD details than we have been thus far in oneDPL.
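To illustrate the second option, a sketch of what per-lane output duplication might look like using only OpenMP SIMD (`VLEN` and `get_bin` are assumptions for illustration; whether the compiler actually vectorizes the inner loop would need to be verified):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename _RandomIt, typename _OutIt, typename _BinFn>
void histogram_simd_sketch(_RandomIt first, std::size_t n, _OutIt out, std::size_t num_bins, _BinFn get_bin)
{
    constexpr std::size_t VLEN = 8; // assumed vector width
    std::vector<std::uint32_t> lane_bins(VLEN * num_bins, 0);

    std::size_t i = 0;
    for (; i + VLEN <= n; i += VLEN)
    {
        // Each lane updates its own copy of the bins, so two lanes hitting the
        // same bin never touch the same memory location.
#pragma omp simd
        for (std::size_t lane = 0; lane < VLEN; ++lane)
            ++lane_bins[lane * num_bins + get_bin(first[i + lane])];
    }
    for (; i < n; ++i) // scalar remainder goes into lane 0's copy
        ++lane_bins[get_bin(first[i])];

    // Reduce the per-lane copies into the real output; the extra footprint is
    // VLEN * num_bins counters.
    for (std::size_t bin = 0; bin < num_bins; ++bin)
        for (std::size_t lane = 0; lane < VLEN; ++lane)
            out[bin] += lane_bins[lane * num_bins + bin];
}
```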