More on OpenMP.
athas committed Dec 21, 2024
1 parent df2aa64 commit c1a2404
Showing 3 changed files with 347 additions and 15 deletions.

notes.bib
@book{muller2018handbook,
year={2018},
publisher={Springer}
}

@incollection{theoryoflists,
  title={An Introduction to the Theory of Lists},
  author={Bird, Richard},
  booktitle={Logic of Programming and Calculi of Discrete Design},
  publisher={Springer},
  year={1987},
  doi={10.1007/978-3-642-87374-4_1}
}

@techreport{gibbons1995third,
title={The third homomorphism theorem},
author={Gibbons, Jeremy},
year={1995},
institution={Department of Computer Science, The University of Auckland, New Zealand}
}
openmp.tex

\section{Reductions}
\end{figure}

Note that all iterations of the loop update the same \texttt{sum}
variable. The reduction clause \texttt{reduction(+:sum)} that we added
to the OpenMP directive indicates that this update is done with the
\lstinline{+} operator. The compiler will transform this loop such
that each thread gets its own \emph{private} copy of \texttt{sum},
which they then update independently. At the end, these per-thread
\emph{partial results} are then combined to obtain the final result.
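The listing referenced in the figure above is included from an external file and not reproduced here. As an illustration of the pattern just described, a summation loop with a reduction clause might look like the following minimal sketch (the function name \texttt{sum\_reduction} and array \texttt{A} are hypothetical, not taken from the actual listing):

```c
// Minimal sketch of a summation loop with a reduction clause.
// Compile with -fopenmp; without it the pragma is simply ignored
// and the loop runs sequentially, producing the same result.
int sum_reduction(int n, const int *A) {
  int sum = 0;
#pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++) {
    sum += A[i];  // each thread updates its private copy of sum
  }
  return sum;  // per-thread partial results combined with +
}
```

Note that the loop body is written exactly as in the sequential case; the reduction clause alone tells the compiler how to privatise and combine \texttt{sum}.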

Reduction clauses work directly only with a small set of built-in
binary operators: \lstinline{+}, \lstinline{*},
\section{Parallel Partitioning}
then combine them at the end.

\subsection{List homomorphisms}
\label{sec:list-homomorphisms}

\newcommand\concat{\mathbin{+\mkern-10mu+}}

\]
\end{example}

This does not, strictly speaking, extend the expressivity of list
homomorphisms, as this behaviour could otherwise be encoded in
$\odot$.

One useful property when implementing list homomorphisms on a computer
is that the ``recursive'' invocations of the homomorphism need not be
implemented in the same way, by splitting the input into tiny parts.
When computing a list homomorphism we can split the input into any
number of chunks, sequentially compute a result per chunk (perhaps
using a different algorithm), and then combine the results into a
final result for the whole list. Each chunk can be processed
independently of the others, in parallel. We will now turn our
attention to how to implement this idea with OpenMP.
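In symbols: if $h$ is a list homomorphism with combining operator
$\odot$, then for a split of the input into chunks $xs$, $ys$, and
$zs$ we have
\[
  h(xs \concat ys \concat zs) = (h(xs) \odot h(ys)) \odot h(zs),
\]
where each application of $h$ on the right-hand side may be computed
by any algorithm, sequential or parallel, that produces the same
result, and the associativity of $\odot$ lets us combine the partial
results in any grouping.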

For further information on list homomorphisms, see Bird's \emph{An
Introduction to the Theory of Lists}~\cite{theoryoflists} and
Gibbons' \emph{The Third Homomorphism
Theorem}~\cite{gibbons1995third}.

\subsection{Explicit Work Partitioning in OpenMP}

will be visible to all threads, including any changes made to them.
Any memory allocated inside or outside the parallel region is also
accessible to all threads. We can use this to construct arbitrarily
complicated per-thread partial results.

\subsection{Summation with Parallel Regions}
\label{sec:parallel-summation}

We will now look at problems that are slightly less parallel, where
all elements of the input potentially contribute to each element of
the result.

The basic technique behind explicit work partitioning is to allocate
a \emph{results array} with one element per (potential) thread, enter
a parallel region where each thread processes a chunk of the input
and writes a result to its corresponding element of the results
array, and then, after the parallel region, run a sequential loop
that aggregates the results array into a single final result.

\Cref{lst:openmp-partition-sum} shows an implementation of vector
summation using this technique. The integer array \texttt{sums}
contains an element for each potential thread. We use
\texttt{omp\_get\_max\_threads()} to retrieve the maximum number of
threads (\texttt{P}) that might be used for the parallel region. We
must be careful \emph{not} to use \texttt{omp\_get\_num\_threads()},
as this function returns the number of threads used for the
\emph{current} parallel region---and as we are still outside the
parallel region when constructing the array, it would return 1. The
logic for
distributing the elements of the vector \texttt{A} is the same as we
saw in \cref{lst:openmp-vecadd-explicit}.

\begin{figure}
\lstinputlisting[
Expand All @@ -666,6 +700,175 @@ \subsection{Thread-Local Partial Results}
{src/openmp-partition-sum.c}
\end{figure}
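Since \cref{lst:openmp-partition-sum} is included from an external source file, the technique can also be sketched in a self-contained form. In the sketch below, the function name \texttt{sum\_openmp} and the no-OpenMP fallback shims are hypothetical additions, and the actual listing may differ in its details:

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback shims so the sketch also compiles without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

int sum_openmp(int n, const int *A) {
  int P = omp_get_max_threads();       /* potential thread count */
  int *sums = calloc(P, sizeof(int));  /* one slot per potential thread */

#pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();    /* threads actually running */
    int start = t * n / nt;
    int end = (t == nt - 1) ? n : (t + 1) * n / nt;
    int s = 0;
    for (int i = start; i < end; i++) {
      s += A[i];
    }
    sums[t] = s;                       /* per-thread partial result */
  }

  int sum = 0;                         /* sequential aggregation */
  for (int t = 0; t < P; t++) {
    sum += sums[t];
  }
  free(sums);
  return sum;
}
```

Unused slots of \texttt{sums} are zero-initialised by \texttt{calloc}, so the final aggregation loop is correct even when fewer than \texttt{P} threads actually run.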

\subsection{Filtering with Parallel Regions}

We have still accomplished nothing that could not also be done with
the reductions explained in \cref{sec:reductions}. Let us now turn to
a problem that cannot be solved with reductions---specifically, let us
see how to filter an array to remove all negative elements, which as
we saw in \cref{sec:list-homomorphisms} is a list homomorphism.

First let us look at a sequential function for filtering lists. We
assume we are given an input array \texttt{src} of size \texttt{n}, as
well as an output array \texttt{dst} of at least size \texttt{n}. We
will copy the non-negative elements of \texttt{src} to \texttt{dst},
tallying how many have been copied in the variable \texttt{p}, and
finally return the number of elements written. The array \texttt{dst}
can then be seen as a length-\texttt{p} array (with some trailing
garbage).

\lstinputlisting[
caption={Sequential filtering.},
label={lst:filter-sequential},
language=C,
frame=single,
firstline=6,
lastline=16]
{src/openmp-filter.c}

The idea behind a parallel filter is then to break the overall input
array apart into chunks, filter each chunk separately, then
concatenate the filtered chunks. We will represent the per-thread
results as \emph{two} arrays: an array of pointers to filtered arrays,
and an array of sizes of those arrays:

\begin{lstlisting}[
caption={Constructing results arrays for parallel filtering.},
frame=single
]
int filter_openmp_limited(int n, int *dst, int *src) {
  int P = omp_get_max_threads();
  int *parts[P];
  int parts_n[P];

  for (int i = 0; i < P; i++) {
    parts[i] = NULL;
    parts_n[i] = 0;
  }

  // parallel region here...

  // combine parallel results here...
}
\end{lstlisting}

Now let us turn to the main parallel region. Computing which part of
the input array should be filtered by each thread is done largely as
in previous examples, with a computation based on the thread number
and the total number of threads. To do the actual filtering, we reuse
the sequential \texttt{filter} function from
\cref{lst:filter-sequential}. It is quite common in this kind of
parallel programming to reuse sequential components within each
thread, in order to take advantage of already-optimised (and correct)
code. Each thread must allocate memory for its output array, and to be
able to handle the case where \emph{no} elements are removed, the size
of this array must be equal to the chunk size.

\begin{lstlisting}[
caption={The main parallel region for a parallel filter.},
label=lst:parallel-filter-region,
frame=single
]
#pragma omp parallel
{
  int t = omp_get_thread_num();
  int start = t * n/P;
  int end = (t+1) * n/P;
  if (t == omp_get_num_threads()-1) { end = n; }
  int chunk_size = end - start;

  parts[t] = malloc(sizeof(int) * chunk_size);
  parts_n[t] = filter(chunk_size, parts[t], src+start);
}
\end{lstlisting}

We now face the problem of concatenating the \texttt{parts} arrays.
One straightforward solution is to simply do so sequentially, as
follows, where we also take care to deallocate the partial results
when we are done with them.

\begin{lstlisting}[
caption={Concatenating the filtered arrays, while also tallying up the total size of the result.},
frame=single
]
int p = 0;
for (int i = 0; i < P; i++) {
  memcpy(dst+p, parts[i], parts_n[i]*sizeof(int));
  p += parts_n[i];
  free(parts[i]);
}
return p;
\end{lstlisting}

Although it only runs for \texttt{P} iterations, the \texttt{memcpy}
operation is likely to be expensive. Because this loop both reads and
writes the output index \texttt{p}, it \emph{must} be sequential. For
some filtering problems this may be acceptable: if we expect that the
vast majority of elements have been removed, or if the filtering
predicate itself is very expensive, then it may be sufficient to
parallelise the filtering operation as we did in
\cref{lst:parallel-filter-region}. But for other workloads,
concatenating the partial results will be just as computationally
costly as the filtering itself, and so we would be well served to
parallelise it as well.

To do so, we break the loop into two. First we compute an integer
array \texttt{offsets}, where \texttt{offsets[t]} tells us where
thread \texttt{t} should write its partial result in \texttt{dst}.
This can be done with a sequential loop with \texttt{P} iterations,
but because we do very little work in each iteration of the loop, this
is unlikely to have much impact.

\begin{minipage}{\linewidth}
\begin{lstlisting}[
caption={Constructing the offset array.},
frame=single
]
int offsets[P];
offsets[0] = 0;
for (int t = 1; t < P; t++) {
  offsets[t] = offsets[t-1] + parts_n[t-1];
}
\end{lstlisting}
\end{minipage}

Note that \lstinline{offsets[P-1]+parts_n[P-1]} is now the size of the
complete filtered array. For instance, if \texttt{parts\_n} is
$\{3, 0, 2\}$, then \texttt{offsets} is $\{0, 3, 3\}$ and the complete
result has $3 + 2 = 5$ elements.

Now that we know the offsets for where each partial result should be
written in the destination array, it is straightforward to write a
parallel loop that does so.

\begin{lstlisting}[
caption={Concatenating the filtered arrays with a parallel loop.},
frame=single
]
#pragma omp parallel for
for (int t = 0; t < P; t++) {
  memcpy(dst+offsets[t], parts[t],
         parts_n[t]*sizeof(int));
  free(parts[t]);
}
\end{lstlisting}

We now have a fully parallelised filter. Although the details differ,
the overall structure is the same as we saw for summation in
\cref{sec:parallel-summation}---split the input into parts, compute an
array of partial results, then combine the partial results. For
filtering, the combination of partial results is particularly
complicated, and most problems you will encounter will be easier. The
full function is shown in \cref{lst:filter-parallel}.

\begin{figure}
\lstinputlisting[
caption={Fully parallel filtering.},
label={lst:filter-parallel},
language=C,
frame=single,
firstline=50,
lastline=86]
{src/openmp-filter.c}
\end{figure}
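As \cref{lst:filter-parallel} is likewise included from an external file, the fragments above can be assembled into one self-contained sketch. The function name \texttt{filter\_openmp} and the no-OpenMP fallback shims below are hypothetical, and the actual \texttt{openmp-filter.c} may differ:

```c
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback shims so the sketch also compiles without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Sequential filter, as in lst:filter-sequential: copy the
   non-negative elements of src to dst, returning how many. */
static int filter(int n, int *dst, int *src) {
  int p = 0;
  for (int i = 0; i < n; i++) {
    if (src[i] >= 0) {
      dst[p++] = src[i];
    }
  }
  return p;
}

int filter_openmp(int n, int *dst, int *src) {
  int P = omp_get_max_threads();
  int **parts = calloc(P, sizeof(int *)); /* NULL for unused slots */
  int *parts_n = calloc(P, sizeof(int));

#pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int start = t * n / nt;
    int end = (t == nt - 1) ? n : (t + 1) * n / nt;
    int chunk_size = end - start;
    parts[t] = malloc(sizeof(int) * chunk_size);
    parts_n[t] = filter(chunk_size, parts[t], src + start);
  }

  /* Exclusive scan of per-thread sizes gives the write offsets. */
  int *offsets = malloc(P * sizeof(int));
  offsets[0] = 0;
  for (int t = 1; t < P; t++) {
    offsets[t] = offsets[t - 1] + parts_n[t - 1];
  }
  int p = offsets[P - 1] + parts_n[P - 1]; /* total result size */

#pragma omp parallel for
  for (int t = 0; t < P; t++) {
    if (parts[t] != NULL) {
      memcpy(dst + offsets[t], parts[t], parts_n[t] * sizeof(int));
      free(parts[t]);
    }
  }

  free(parts);
  free(parts_n);
  free(offsets);
  return p;
}
```

Because the chunks are concatenated in thread order, the surviving elements appear in \texttt{dst} in their original relative order, as required of a filter.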

\section{The Big Concurrency Problems}
