More on OpenMP.
athas committed Dec 21, 2024
1 parent df2aa64 commit c1a2404
Showing 3 changed files with 347 additions and 15 deletions.

notes.bib
@book{muller2018handbook,
year={2018},
publisher={Springer}
}

@incollection{theoryoflists,
  title={An Introduction to the Theory of Lists},
  author={Bird, Richard},
  booktitle={Logic of Programming and Calculi of Discrete Design},
  publisher={Springer},
  year={1987},
  doi={10.1007/978-3-642-87374-4_1}
}

@techreport{gibbons1995third,
title={The third homomorphism theorem},
author={Gibbons, Jeremy},
year={1995},
institution={Department of Computer Science, The University of Auckland, New Zealand}
}
openmp.tex

\section{Reductions}
\end{figure}

Note that all iterations of the loop update the same \texttt{sum}
variable. The reduction clause \texttt{reduction(+:sum)} that we added
to the OpenMP directive indicates that this update is done with the
\lstinline{+} operator. The compiler will transform this loop such
that each thread gets its own \emph{private} copy of \texttt{sum},
which they then update independently. At the end, these per-thread
\emph{partial results} are then combined to obtain the final result.
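The listing referenced in the figure above is included from an external file and not reproduced here. As an illustration of the pattern just described, a summation loop with a reduction clause might look like the following minimal sketch (the function name \texttt{sum\_reduction} and array \texttt{A} are hypothetical, not taken from the actual listing):

```c
// Minimal sketch of a summation loop with a reduction clause.
// Compile with -fopenmp; without it the pragma is simply ignored
// and the loop runs sequentially, producing the same result.
int sum_reduction(int n, const int *A) {
  int sum = 0;
#pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++) {
    sum += A[i];  // each thread updates its private copy of sum
  }
  return sum;  // per-thread partial results combined with +
}
```

Note that the loop body is written exactly as in the sequential case; the reduction clause alone tells the compiler how to privatise and combine \texttt{sum}.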

Reduction clauses work directly only with a small set of built-in
binary operators: \lstinline{+}, \lstinline{*},
\section{Parallel Partitioning}
then combine them at the end.

\subsection{List homomorphisms}
\label{sec:list-homomorphisms}

\newcommand\concat{\mathbin{+\mkern-10mu+}}

\]
\end{example}

This does not, strictly speaking, extend the expressivity of list
homomorphisms, as this behaviour could otherwise be encoded in
$\odot$.

One useful property when implementing list homomorphisms on a computer
is that the ``recursive'' invocations of the homomorphism need not be
implemented in the same way, by splitting the input into tiny parts.
When computing a list homomorphism we can split the input into any
number of chunks, sequentially compute a result per chunk (perhaps
using a different algorithm), and then combine the results into a
final result for the whole list. Each chunk can be processed
independently of the others, in parallel. We will now turn our
attention to how to implement this idea with OpenMP.
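In symbols: if $h$ is a list homomorphism with combining operator
$\odot$, then for a split of the input into chunks $xs$, $ys$, and
$zs$ we have
\[
  h(xs \concat ys \concat zs) = (h(xs) \odot h(ys)) \odot h(zs),
\]
where each application of $h$ on the right-hand side may be computed
by any algorithm, sequential or parallel, that produces the same
result, and the associativity of $\odot$ lets us combine the partial
results in any grouping.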

For further information on list homomorphisms, see Bird's \emph{An
Introduction to the Theory of Lists}~\cite{theoryoflists} and
Gibbons' \emph{The Third Homomorphism
Theorem}~\cite{gibbons1995third}.

\subsection{Explicit Work Partitioning in OpenMP}

will be visible to all threads, including any changes made to them.
Any memory allocated inside or outside the parallel region is also
accessible to all threads. We can use this to construct arbitrarily
complicated per-thread partial results.

\subsection{Summation with Parallel Regions}
\label{sec:parallel-summation}

We will now look at problems that are slightly less parallel, where
all elements of the input potentially contribute to each element of
the result.

The basic technique behind explicit work partitioning is to allocate
a \emph{results array} with one element per (potential) thread, enter
a parallel region where each thread processes a chunk of the input
and writes a result to its corresponding element of the results
array, and then, after the parallel region, run a sequential loop
that aggregates the results array into a single final result.

\Cref{lst:openmp-partition-sum} shows an implementation of vector
summation using this technique. The integer array \texttt{sums}
contains an element for each potential thread. We use
\texttt{omp\_get\_max\_threads()} to retrieve the maximum number of
threads (\texttt{P}) that might be used for the parallel region. We
must be careful \emph{not} to use \texttt{omp\_get\_num\_threads()},
as this function returns the number of threads used for the
\emph{current} parallel region---and as we are still outside the
parallel region when constructing the array, it would return 1. The
logic for
distributing the elements of the vector \texttt{A} is the same as we
saw in \cref{lst:openmp-vecadd-explicit}.

\begin{figure}
\lstinputlisting[
Expand All @@ -666,6 +700,175 @@ \subsection{Thread-Local Partial Results}
{src/openmp-partition-sum.c}
\end{figure}
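Since \cref{lst:openmp-partition-sum} is included from an external source file, the technique can also be sketched in a self-contained form. In the sketch below, the function name \texttt{sum\_openmp} and the no-OpenMP fallback shims are hypothetical additions, and the actual listing may differ in its details:

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback shims so the sketch also compiles without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

int sum_openmp(int n, const int *A) {
  int P = omp_get_max_threads();       /* potential thread count */
  int *sums = calloc(P, sizeof(int));  /* one slot per potential thread */

#pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();    /* threads actually running */
    int start = t * n / nt;
    int end = (t == nt - 1) ? n : (t + 1) * n / nt;
    int s = 0;
    for (int i = start; i < end; i++) {
      s += A[i];
    }
    sums[t] = s;                       /* per-thread partial result */
  }

  int sum = 0;                         /* sequential aggregation */
  for (int t = 0; t < P; t++) {
    sum += sums[t];
  }
  free(sums);
  return sum;
}
```

Unused slots of \texttt{sums} are zero-initialised by \texttt{calloc}, so the final aggregation loop is correct even when fewer than \texttt{P} threads actually run.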

\subsection{Filtering with Parallel Regions}

We have still accomplished nothing that could not also be done with
the reductions explained in \cref{sec:reductions}. Let us now turn to
a problem that cannot be solved with reductions---specifically, let us
see how to filter an array to remove all negative elements, which as
we saw in \cref{sec:list-homomorphisms} is a list homomorphism.

First let us look at a sequential function for filtering lists. We
assume we are given an input array \texttt{src} of size \texttt{n}, as
well as an output array \texttt{dst} of at least size \texttt{n}. We
will copy the non-negative elements of \texttt{src} to \texttt{dst},
tallying how many have been copied in the variable \texttt{p}, and
finally return the number of elements written. The array \texttt{dst}
can then be seen as a length-\texttt{p} array (with some trailing
garbage).

\lstinputlisting[
caption={Sequential filtering.},
label={lst:filter-sequential},
language=C,
frame=single,
firstline=6,
lastline=16]
{src/openmp-filter.c}

The idea behind a parallel filter is then to break the overall input
array apart into chunks, filter each chunk separately, then
concatenate the filtered chunks. We will represent the per-thread
results as \emph{two} arrays: an array of pointers to filtered arrays,
and an array of sizes of those arrays:

\begin{lstlisting}[
caption={Constructing results arrays for parallel filtering.},
frame=single
]
int filter_openmp_limited(int n, int *dst, int *src) {
  int P = omp_get_max_threads();
  int *parts[P];
  int parts_n[P];

  for (int i = 0; i < P; i++) {
    parts[i] = NULL;
    parts_n[i] = 0;
  }

  // parallel region here...

  // combine parallel results here...
}
\end{lstlisting}

Now let us turn to the main parallel region. Computing which part of
the input array should be filtered by each thread is done largely as
in previous examples, with a computation based on the thread number
and the total number of threads. To do the actual filtering, we reuse
the sequential \texttt{filter} function from
\cref{lst:filter-sequential}. It is quite common in this kind of
parallel programming to reuse sequential components within each
thread, in order to take advantage of already-optimised (and correct)
code. Each thread must allocate memory for its output array, and to be
able to handle the case where \emph{no} elements are removed, the size
of this array must be equal to the chunk size.

\begin{lstlisting}[
caption={The main parallel region for a parallel filter.},
label=lst:parallel-filter-region,
frame=single
]
#pragma omp parallel
{
  int t = omp_get_thread_num();
  int start = t * n/P;
  int end = (t+1) * n/P;
  if (t == omp_get_num_threads()-1) { end = n; }
  int chunk_size = end - start;

  parts[t] = malloc(sizeof(int) * chunk_size);
  parts_n[t] = filter(chunk_size, parts[t], src+start);
}
\end{lstlisting}

We now face the problem of concatenating the \texttt{parts} arrays.
One straightforward solution is to simply do so sequentially, as
follows, where we also take care to deallocate the partial results
when we are done with them.

\begin{lstlisting}[
caption={Concatenating the filtered arrays, while also tallying up the total size of the result.},
frame=single
]
int p = 0;
for (int i = 0; i < P; i++) {
  memcpy(dst+p, parts[i], parts_n[i]*sizeof(int));
  p += parts_n[i];
  free(parts[i]);
}
return p;
\end{lstlisting}

Although it only runs for \texttt{P} iterations, the \texttt{memcpy}
operation is likely to be expensive. Because this loop both reads and
writes the output index \texttt{p}, it \emph{must} be sequential. For
some filtering problems this may be acceptable: if we expect that the
vast majority of elements have been removed, or if the filtering
predicate itself is very expensive, then it may be sufficient to
parallelise the filtering operation as we did in
\cref{lst:parallel-filter-region}. But for other workloads,
concatenating the partial results will be just as computationally
costly as the filtering itself, and so we would be well served to
parallelise it as well.

To do so, we break the loop into two. First we compute an integer
array \texttt{offsets}, where \texttt{offsets[t]} tells us where
thread \texttt{t} should write its partial result in \texttt{dst}.
This can be done with a sequential loop with \texttt{P} iterations,
but because we do very little work in each iteration of the loop, this
is unlikely to have much impact.

\begin{minipage}{\linewidth}
\begin{lstlisting}[
caption={Constructing the offset array.},
frame=single
]
int offsets[P];
offsets[0] = 0;
for (int t = 1; t < P; t++) {
  offsets[t] = offsets[t-1] + parts_n[t-1];
}
\end{lstlisting}
\end{minipage}

Note that \lstinline{offsets[P-1]+parts_n[P-1]} is now the size of the
complete filtered array. For instance, if \texttt{parts\_n} is
$\{3, 0, 2\}$, then \texttt{offsets} is $\{0, 3, 3\}$ and the complete
result has $3 + 2 = 5$ elements.

Now that we know the offsets for where each partial result should be
written in the destination array, it is straightforward to write a
parallel loop that does so.

\begin{lstlisting}[
caption={Concatenating the filtered arrays with a parallel loop.},
frame=single
]
#pragma omp parallel for
for (int t = 0; t < P; t++) {
  memcpy(dst+offsets[t], parts[t],
         parts_n[t]*sizeof(int));
  free(parts[t]);
}
\end{lstlisting}

We now have a fully parallelised filter. Although the details differ,
the overall structure is the same as we saw for summation in
\cref{sec:parallel-summation}---split the input into parts, compute an
array of partial results, then combine the partial results. For
filtering, the combination of partial results is particularly
complicated, and most problems you will encounter will be easier. The
full function is shown in \cref{lst:filter-parallel}.

\begin{figure}
\lstinputlisting[
caption={Fully parallel filtering.},
label={lst:filter-parallel},
language=C,
frame=single,
firstline=50,
lastline=86]
{src/openmp-filter.c}
\end{figure}
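As \cref{lst:filter-parallel} is likewise included from an external file, the fragments above can be assembled into one self-contained sketch. The function name \texttt{filter\_openmp} and the no-OpenMP fallback shims below are hypothetical, and the actual \texttt{openmp-filter.c} may differ:

```c
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback shims so the sketch also compiles without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Sequential filter, as in lst:filter-sequential: copy the
   non-negative elements of src to dst, returning how many. */
static int filter(int n, int *dst, int *src) {
  int p = 0;
  for (int i = 0; i < n; i++) {
    if (src[i] >= 0) {
      dst[p++] = src[i];
    }
  }
  return p;
}

int filter_openmp(int n, int *dst, int *src) {
  int P = omp_get_max_threads();
  int **parts = calloc(P, sizeof(int *)); /* NULL for unused slots */
  int *parts_n = calloc(P, sizeof(int));

#pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int start = t * n / nt;
    int end = (t == nt - 1) ? n : (t + 1) * n / nt;
    int chunk_size = end - start;
    parts[t] = malloc(sizeof(int) * chunk_size);
    parts_n[t] = filter(chunk_size, parts[t], src + start);
  }

  /* Exclusive scan of per-thread sizes gives the write offsets. */
  int *offsets = malloc(P * sizeof(int));
  offsets[0] = 0;
  for (int t = 1; t < P; t++) {
    offsets[t] = offsets[t - 1] + parts_n[t - 1];
  }
  int p = offsets[P - 1] + parts_n[P - 1]; /* total result size */

#pragma omp parallel for
  for (int t = 0; t < P; t++) {
    if (parts[t] != NULL) {
      memcpy(dst + offsets[t], parts[t], parts_n[t] * sizeof(int));
      free(parts[t]);
    }
  }

  free(parts);
  free(parts_n);
  free(offsets);
  return p;
}
```

Because the chunks are concatenated in thread order, the surviving elements appear in \texttt{dst} in their original relative order, as required of a filter.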

\section{The Big Concurrency Problems}
