Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for histogram CPU implementation #1930

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
73 changes: 73 additions & 0 deletions rfcs/proposed/host_backend_histogram/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Host Backends Support for the Histogram APIs

## Introduction
In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and classify them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be straightforward.
akukanov marked this conversation as resolved.
Show resolved Hide resolved

## Motivations
Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification.

## Design Considerations

### Key Requirements
Provide support for the `histogram` APIs with the following policies and backends:
- Policies: `seq`, `unseq`, `par`, `par_unseq`
- Backends: `serial`, `tbb`, `openmp`

Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which they can select from when using oneDPL. It is important that all combinations of these options have support for the `histogram` APIs.

### Performance
As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely have a high impact on overall performance.

### Memory Footprint
There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because, as mentioned above, this will very likely be a memory bandwidth-bound API. In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage.

### Code Reuse
Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else, this must be balanced with performance considerations.

### unseq Backend
As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may not provide much benefit, especially when we account for the extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results.

## Existing Patterns

### count_if
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `reduce` is a very well-matched pattern performance-wise.

### parallel_for
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads.

## Proposal
I believe there are two competing options for `histogram`, which may both have utility in the final implementation depending on the use case.

### Implementation One (Embarrassingly Parallel)
This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`.
1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the backend.
2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can use the user-provided output data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Determining/assuming a certain number of threads is not necessary, and should be avoided for the sake of parallel composability. We need a sort of thread-local storage, maybe the one that is accessible by other threads for data aggregation. It certainly can be done without knowing the number of threads.

Copy link
Contributor Author

@danhoeflinger danhoeflinger Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point for parallel composability.
However, I am curious about the details of making this happen in practice.
What do you think is the best way to be agnostic of the number of threads, but also having some reasonable control over the following:

  1. limiting memory footprint
  2. aggregating the data after accumulation

I think these may have different answers for TBB and OpenMP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some precedent for utilizing the maximum number of threads for accumulation in the OpenMP backend with transform_reduce (link). This is of course not to say it is correct or best.

Copy link
Contributor Author

@danhoeflinger danhoeflinger Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the OpenMP side, perhaps we shouldn't be using parallel_for but rather extending some of its building blocks for use by histogram and parallel_for. With more control over __chunk_partitioner and the parallel tasks / loop we should have what we need, and we can understand our number of local temporary copies / threads at that level.

For TBB, I'm more in the dark since that is more recursively decomposing into chunks , but looking further...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TBB, does it make sense to do something similar to this code?

The plan would be to Isolate then get the max concurrency of the task arena and allocating temporaries based on that. Then using current_thread_index() within a function brick which is decomposed directly or via a parallel_for pattern followed by a launch of tasks to accumulate those into the output.

It seems like this is not thinking very "tbb-like", because we would be interacting directly with thread indices and number of threads, which is discouraged. However, I this case I think its important to control the allocation while not sacrificing the ability to use a small grain size for work stealing and load balance.

Is there some more tbb-way of doing this which would have a similar control on footprint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, this is what I think I was missing...
https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneTBB/source/thread_local_storage/enumerable_thread_specific_cls.rst

It may make it difficult to understand where the appropriate heuristic for choosing between algorithms, but I think this is the "tbb way" of doing this which I was missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in TBB there is enumerable_thread_specific. But even without TBB, something like this could be done with e.g. a hash map protected by a lock. A thread ID/native handle would be the key to the map; and if it is guaranteed that insertions do not invalidate iterators/references to the map nodes, the lock would only need to be taken for getting such a reference, which could then be used to update the content without data races.

I am not sure if it's better to have a generic implementation like I outlined above, or if a special "interface" with backend-specific implementations is necessary for performance reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TBB, does it make sense to do something similar to this code?

This is acceptable, as it uses the upper limit on the number of threads, not an exact number of threads. Similarly, for OpenMP it is OK to query and use the number of threads in the "current" parallel region - it is not going to change during the algorithm execution.

3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own copy of the output sequence using the temporary storage to remove any race conditions.
4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the histogram into the output histogram sequence. This step is also embarrassingly parallel.
5) Deallocate temporary storage.

New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the specific backends which would allow a unified implementation for all host backends.

### Implementation Two (Atomics)
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern.

To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref<T>` which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own `atomic_ref<T>` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend.
Copy link
Contributor

@akukanov akukanov Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++17 provides std::atomic<T>, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data.

I guess you mean that if the output data type is not std::atomic, in C++17 there is no standard way to do atomic operations with these data.

We could look to implement our own atomic_ref<T> for C++17, but that would require specialization for individual compilers.

I think we would not need a full-fledged atomic_ref, we would need something like "atomic_cast", and - given that the output data are counters of integral type - that should not be a huge deal.

Copy link
Contributor Author

@danhoeflinger danhoeflinger Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. I haven't done a deep look into all compilers / standard library implementations in our matrix at this point. It may be as simple as a reinterpret cast in most cases, but that isn't guaranteed by the standard to work. I agree this is worth looking at, and may not be so bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a little about my proposed strategy for handling atomics. I think the best option is to only worry about what we need which is explicitly an atomic increment, and do this using in the best method available. This allows us the simplest path forward.


It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend on the different approaches available.

### Selecting Between Algorithms
It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output.

## Alternative Approaches
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to modifying them. It's possible such an approach could provide a similar approach to atomics, but with different overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring.

* Another possible approach could be to do something like the proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback.

## Open Questions
* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require specializations for each of our supported compilers.

* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating?

* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)?

* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector instructions?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, these are interesting, and mainly follow similar lines to the approaches outlined here. I will look more into the pcwar SIMD instruction to understand if that is viable to use here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While its interesting to explore, I believe we basically depend on OpenMP simd to provide our SIMD operations. We won't be able to serve more complex intrinsic operations if we want to stick to that.

Further, I'm not actually finding anything like the pcwar instruction(s) they refer to here in https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. I do find some for conflict detection across lane which could be useful, but again those won't be available through the interface of OpenMP.

I think our options are basically to decline SIMD or to have duplicates of the output for each SIMD lane. I think even that may be more "hands on" with SIMD details than we have done thus far from oneDPL.

Loading