diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..5d5756c --- /dev/null +++ b/.gitignore @@ -0,0 +1,8 @@ +*.aux +*.bbl +*.blg +*.log +*.out +*.pdf +*.thm +*.toc diff --git a/README.md b/README.md new file mode 100644 index 0000000..8b67b2b --- /dev/null +++ b/README.md @@ -0,0 +1,3 @@ +# Course notes for HPPS + +A slowly growing collection of material for use in HPPS. diff --git a/compiled_interpreted.tex b/compiled_interpreted.tex new file mode 100644 index 0000000..9e2ba28 --- /dev/null +++ b/compiled_interpreted.tex @@ -0,0 +1,595 @@ +\chapter{Compiled and Interpreted Languages} + +A computer can directly execute only machine code, consisting of raw +numeric data. Machine code can be written by humans, but we usually +use symbolic \textit{assembly languages} to make it more approachable. +However, even when using an assembly language, this form of +programming is very tedious. This is because assembly languages are +(almost) a transparent layer of syntax on top of the raw machine code, +and the machine code has been designed to be efficient to +\textit{execute}, not to be a pleasant programming experience. +Specifically, we are programming at a very low level of abstraction +when we use assembly languages, and with no good ability to build new +abstractions. In practice, almost all progamming is conducted in +\textit{high-level languages}. + +\section{Low-level and High-Level Languages} + +For the purpose of this chapter, a high-level programming language is +a language that is designed not to directly represent the capabilities +and details of some machine, but rather to \textit{abstract} the +mechanical details, in order to make programming simpler. However, we +should note that ``high-level'' is a spectrum. In general, the +meaning of the term ``high-level programming language'' depends on the +speaker and the context (\cref{fig:koronkebitch}). The pioneering +computer scientist Alan Perlis said: \textit{``A programming language + is low-level when its programs require attention to the + irrelevant''}. During the course you will gain familiarity with the +programming language C, which \textit{definitely} requires you to pay +attention to things that are often considered irrelevant, which makes +it low-level in Perlis's eyes. However, we will see that the control +offered by C provides some capabilities, mostly the ability to +\textit{tune} our code for high performance, that are for some +problems \textit{not} irrelevant. The term \textit{mid-level + programming language} might be a good description of C, as it fills +a niche between low-level assembly languages, and high-level languages +such as Python and F\#. + +\begin{figure} + \centering + \includegraphics[width=\textwidth]{img/highleveltweet.png} + \caption{A remark on the clarity of terms in computer science.} + \label{fig:koronkebitch} +\end{figure} + +Generally speaking, low-level languages tend to be more +\textit{difficult} to program in, while offering greater potential +\textit{performance} (i.e. programs written in them are faster). +Higher-level languages are much easier to program in, but run slower +and require more machine resources (e.g. memory). Given the speed of +modern computers, this is a price we are often willing to +pay---especially in the common case where the slowest part of our +program is waiting for information from disk or network. 
Do not make
+the mistake of assuming that a program written in a low-level language
+is \textit{always} faster than one written in a high-level language.
+Choice of algorithm is often more important than choice of language.
+Further, some high-level languages are specifically designed to
+execute very efficiently. But there is no free lunch: these languages
+make tradeoffs in other areas. There is no objective measure of where
+a language lies on the scale of ``level-ness'', so while a statement
+such as \textit{``Python is more high-level than C''} is unlikely to
+raise any objections, it is usually pointless to try to rank very
+similar languages on this spectrum.
+
+\section{Compilers and Interpreters}
+
+As the computer natively understands only its machine code, other
+languages must be \textit{translated} to machine code in order to run.
+For historical reasons, this is called \textit{compilation}. We say
+that a compiler takes as input a file with a \textit{source program},
+and produces a file containing an executable \textit{machine program}
+that can be run directly. This is a very simplified model, for the
+following reasons:
+
+\begin{enumerate}
+\item Strictly speaking, a compiler does not have to produce machine
+  code. A compiler can also produce code in a different high-level
+  language. For example, with the rise of browsers, it has become
+  common to write compilers that produce JavaScript code.
+\item The machine program normally cannot be \textit{directly}
+  executed, as modern systems have many layers of abstraction on top
+  of the processor. While the compiler does produce machine code, it
+  is usually stored in a special file format that is understood by the
+  \textit{operating system}, which is responsible for making the
+  machine code available to the processor.
+\item The actual compiler contains many internal steps. Further,
+  large programs are typically not compiled all at once, but rather in
+  chunks. Typically, each \textit{source file} is compiled to one
+  \textit{object file}, and the object files are finally
+  \textit{linked} to form an executable program.
+\end{enumerate}
+
+While compilers are a fascinating subject in their own right, we will
+discuss them only at a superficial level. For a more in-depth
+treatment, you are encouraged to read a book such as Torben Mogensen's
+\textit{Basics of Compiler
+  Design}\footnote{\url{http://hjemmesider.diku.dk/~torbenm/Basics/}}.
+
+In contrast, an \textit{interpreter} is a program that executes code
+directly, without first translating it. The interpreter can be a
+compiled program, or itself be interpreted. At the bottom level, we
+always have a CPU executing machine code, but there is no fundamental
+limit to how many layers of interpreters we can build on top.
+However, the most common case is that the interpreter is a machine
+code program, typically produced by a compiler. For example, Python
+is an interpreted language, and the \texttt{python} interpreter
+program used by most people is written in C, and compiled to machine
+code.
+
+Interpreters are generally easier to construct than compilers,
+especially for very dynamic languages such as Python. The downside is
+that code that is interpreted generally runs much slower than machine
+code. This is called the \textit{interpretive overhead}. When a C
+compiler encounters an integer expression \texttt{x + y}, this can
+likely be translated to a single machine code
+instruction---possibly preceded by instructions to read \texttt{x} and
+\texttt{y} from memory.
+In contrast, whenever the Python interpreter
+encounters this expression, it has to analyse it and figure out what
+is supposed to happen (integer addition), and then dispatch to an
+implementation of that operation. This is usually at least an order
+of magnitude slower than actually doing the work, which is why
+interpreted languages are usually slower than compiled languages.
+However, many programs spend most of their time waiting for user
+input, for a network request, or for data from the file system. Such
+programs are not greatly affected by interpretive overhead.
+
+As an example of interpretive overhead, let us try writing programs
+for investigating the Collatz conjecture. The Collatz conjecture
+states that if we repeatedly apply the function
+\[
+  f(n) =
+  \begin{cases}
+    \frac{n}{2} & \text{if}~n~\text{is even} \\
+    3n+1 & \text{if}~n~\text{is odd}
+  \end{cases}
+\]
+to some initial number greater than $1$, then we will eventually reach
+$1$. To investigate this function, the Python program
+\texttt{collatz.py} in \cref{lst:pycollatz} takes an initial $k$ from
+the user, then for every $1\leq n<k$ prints the number of applications
+of $f$ needed to reach $1$.
+
+\begin{figure}
+\lstinputlisting[
+caption={A Python program for investigating the Collatz conjecture.},
+label={lst:pycollatz},
+language=Python,
+frame=single]
+{src/collatz.py}
+\end{figure}
+
+We can time its execution as follows:
+
+\begin{lstlisting}
+$ time python3 collatz.py 100000 >/dev/null
+
+real 0m1.368s
+user 0m1.361s
+sys 0m0.007s
+\end{lstlisting}
+
+The \texttt{real} measurement tells us that the program took a little
+more than $1.3s$ to run in the real world (we'll talk about the
+difference between \texttt{user} and \texttt{sys} in
+\cref{chap:openmp}).
+
+Now let us consider the same program, but written in C, which we call
+\texttt{collatz.c}, and which is shown in \cref{lst:ccollatz}.
+
+\begin{figure}
+\lstinputlisting[
+caption={A C program for investigating the Collatz conjecture.},
+label={lst:ccollatz},
+language=C,
+frame=single]
+{src/collatz.c}
+\end{figure}
+
+C is a compiled language, so we have to compile \texttt{collatz.c}:
+
+\begin{lstlisting}
+$ gcc collatz.c -o collatz
+\end{lstlisting}
+
+And then we can run it:
+
+\begin{lstlisting}
+$ time ./collatz 100000 >/dev/null
+
+real 0m0.032s
+user 0m0.030s
+sys 0m0.002s
+\end{lstlisting}
+
+Only $0.032s$! This means that our C program is
+\[
+  \frac{1.368}{0.032} = 42.75
+\]
+times faster than the Python program\footnote{This is the
+  \emph{speedup in latency}---a concept we will return to in
+  \cref{sec:speedup-in-latency}.}. This is not unexpected. The ease
+of use of interpreted languages comes with a significant overhead.
+
+\subsection{Advantages of interpreters}
+
+People implement interpreters because they are easy to construct,
+especially for advanced or dynamic languages, and because they are
+easier to work with. For example, when we are compiling a program to
+machine code, the compiler discards information about the source code,
+which makes it difficult to relate the generated machine code to the
+code originally written by the human. This makes debugging harder,
+because the connection between what the machine physically
+\textit{does}, and what the programmer \textit{wrote}, is not
+explicit. In contrast, an interpreter more or less executes the
+program as written by the programmer, so when things go wrong, it is
+easier to explain where in the source code the problem occurs.
+
+In practice, to help with debugging, good compilers can generate
+significant amounts of extra information in order to let special
+\textit{debugger} programs map the generated machine code to the
+original source code. However, this does tend to affect the
+performance of the generated code.
+
+Another typical advantage of interpreters is that they are
+straightforwardly \textit{portable}.
+When writing a compiler that
+generates machine code, we must explicitly write a code generator for
+every CPU architecture we wish to target. An interpreter can be
+written once in a portable programming language (say, C), and then
+compiled to any architecture for which we have a C compiler (which is
+essentially all of them).
+
+As a rule of thumb, very high-level languages tend to be interpreted,
+and low-level languages are almost always compiled. Unfortunately,
+things are not always so clear cut in practice, and \textit{any}
+language can in principle be compiled---it may just be very difficult
+for some languages.
+
+\subsection{Blurring the lines}
+
+Very few production language implementations are \textit{pure}
+interpreters, in the sense that they do no processing of the source
+program before executing it. Even Python, which is our main example
+of an interpreted language, does in fact compile Python source code to
+Python \textit{bytecode}, which is a kind of invented machine code
+that is then interpreted by the Python \textit{virtual machine}, which
+is an interpreter written in C. We can in fact ask Python to show us
+the bytecode corresponding to a function:
+
+\begin{lstlisting}
+>>> import dis
+>>> def add(a,b,c):
+...     return a + b + c
+...
+>>> dis.dis(add)
+  2           0 LOAD_FAST                0 (a)
+              2 LOAD_FAST                1 (b)
+              4 BINARY_ADD
+              6 LOAD_FAST                2 (c)
+              8 BINARY_ADD
+             10 RETURN_VALUE
+\end{lstlisting}
+
+This is not machine code for any processor that has ever been
+physically constructed, but rather an invented machine code that is
+interpreted by Python's bytecode interpreter. This is a common design
+because it is faster than interpreting raw Python source code, but it
+is still much slower than true machine code.
+
+\subsubsection{JIT Compilation}
+
+An even more advanced implementation technique is
+\textit{just-in-time} (JIT) compilation, which is notably used for
+languages such as C\#, F\# and JavaScript. Here, the source program
+is first compiled to some kind of intermediate bytecode, but this
+bytecode is then further compiled \textit{at run-time} to actual
+machine code. The technique is called just-in-time compilation
+because the final compilation typically occurs on the user's own
+machine, immediately prior to the program running.
+
+The main advantage of JIT compilation is that programs run much faster
+than when interpreting bytecode, because we ultimately do end up
+executing a machine code version of the program. Because JIT
+compilation takes place while the program is running, it is also able
+to inspect the actual run-time behaviour of the program and tailor the
+code generation to the actual data encountered in use. This is useful
+for highly dynamic languages, where traditional \textit{ahead-of-time}
+(AOT) compilers have difficulty generating good code. In theory, a
+JIT compiler can always be \textit{at least as good} as an AOT
+compiler, but in practice, AOT compilers tend to generate better code,
+as they can afford to spend more time on compilation. In practice,
+JIT compilers are only used to compile those parts of the program that
+are ``hot'' (where a lot of time is spent), and an interpreter is used
+for the rest. This tends to work well, due to the maxim
+that 80\% of the run-time is spent in 20\% of the code. An AOT
+compiler will not know which 20\% of the code is actually hot, and so
+must dedicate equal effort to every part, while a JIT compiler can
+measure the run-time behaviour of the program, and see where it is
+worth putting in extra effort.
+ +The main downside of JIT compilation is that it is difficult to +implement. It has been claimed that AOT compilers are $10\times$ as +difficult to write as interpreters, and JIT compilers are $10\times$ +as difficult to write as AOT compilers. + +\section{Tombstone Diagrams} + +\input{tombstone.tex} + +Interpreters and compilers allow us to consider programs as input and +output of other programs. That is, they are \textit{data}. +\textit{Tombstone diagrams} (sometimes called \textit{T-diagrams}) are +a visual notation that lets us describe how a program is translated +between different languages (\textit{compiled}), and when execution +takes place (either through a software interpreter or a hardware +processor). They are not a completely formal notation, nor can they +express every kind of translation or execution, but they are useful +for gaining an appreciation of the big picture. + +As the most fundamental concept, we have programs, which are written +in some language. Suppose we have a sorting program written in +Python, which we draw as follows: + +\begin{center} +\tprog(\textit{Sort} in Py) +\end{center} + +This is an incomplete diagram, since it contains programs we have not +described how to execute. A machine that executes some language, say +x86 machine code is illustrated as a downward-pointing triangle: + +\begin{center} +\tmachine(x86) +\end{center} + +We can say that the Python program is executed on this machine, by +stacking them: + +\begin{center} + \begin{picture}(50,65) + \put(0, 25){\tprog(\textit{Sort} in Py)} + \put(10,0){\tmachine(x86)} + \end{picture} +\end{center} + +But this diagram is \textit{wrong} --- we are saying that a program +written in Python is running on a machine that executes only x86. +When putting together a tombstone diagram, we must ensure that the +languages of the components match. While on paper we can just assume +a Python machine, this is not very realistic. Instead, we use an +interpreter for Python, written in x86, which as a tombstone diagram +is drawn like this: + +\begin{center} + \tinter(Py in x86) +\end{center} + +We can then stack the Python program on top of the interpreter, which +we then stack on top of the x86 machine: + +\begin{center} + \begin{picture}(50,115) + \put(0, 75){\tprog(\textit{Sort} in Py)} + \put(10, 25){\tinter(Py in x86)} + \put(10,0){\tmachine(x86)} + \end{picture} +\end{center} + +But maybe we are actually running on an ARM machine (as can be found +in most phones), but still only have a Python interpreter in x86. As +long as we have an x86 interpreter written in ARM machine code, this +is no problem: + +\begin{center} + \begin{picture}(50,165) + \put(0, 125){\tprog(\textit{Sort} in Py)} + \put(10, 75){\tinter(Py in x86)} + \put(10, 25){\tinter(x86 in ARM)} + \put(10,0){\tmachine(ARM)} + \end{picture} +\end{center} + +There is no limit to how tall we can stack interpreters. All that +matters is that at the end, we have either a machine that can run the +implementation language of the bottommost interpreter. Of course, in +practice, each level of interpretation adds overhead, so while +tombstone diagrams show what is \textit{possible}, they do not +necessarily show what is a good idea. Tall interpreter stacks mostly +occur in retrocomputing or data archaeology, where we are simulating +otherwise dead hardware. + +The diagrams above are a bit misleading, because the Python +interpreter is not actually written in machine code---it is written in +C, which is then translated by a compiler. 
With a tombstone diagram, +a compiler from C to x86, where the compiler is itself also written in +x86, is illustrated as follows: + +\begin{center} + \tcompiler(from C to x86 in x86) +\end{center} + +We can now put together a full diagram showing how the Python +interpreter is translated from C to x86, and then used to run a Python +program: + +\begin{center} + \begin{picture}(160,150) + \put(0, 50){\tinter(Py in C)} + \put(30,25){\tcompiler(from C to x86 in x86)} + \put(130, 50){\tinter(Py in x86)} + \put(120, 100){\tprog(\textit{Sort} in Py)} + \put(65,0){\tmachine(x86)} + \put(130,25){\tmachine(x86)} + \end{picture} +\end{center} + +For a diagram to be valid, every program, interpreter, or compiler, +must either be stacked on top of an interpreter or machine, or must be +to the left of a compiler, as with the Python interpreter above. + +Compilers are also just programs, and must either be executed directly +by an appropriate machine, or interpreted. For example, the following +diagram shows how to run a C compiler in Python, on top of a Python +interpreter in x86 machine code: + +\begin{center} + \begin{picture}(100,125) + \put(0,75){\tcompiler(from C to x86 in Py)} + \put(35,25){\tinter(Py in x86)} + \put(35,0){\tmachine(x86)} + \end{picture} +\end{center} + +How the Python interpreter has been obtained, whether written by hand +or compiled from another language, is not visible in the diagram. + +We can also use diagrams to show compilation pipelines that chain +multiple compilers. For example, programs written in the +Futhark\footnote{\url{https://futhark-lang.org}} programming language +are typically compiled first to C, and then uses a C compiler to +generate machine code, which we can then finally run: + +\begin{center} + \begin{picture}(295,95) + \put(0,50){\tprog(\textit{prog} in Fut)} + \put(40,25){\tcompiler(from Fut to C in x86)} + \put(130,50){\tprog(\textit{prog} in C)} + \put(170,25){\tcompiler(from C to x86 in x86)} + \put(260,50){\tprog(\textit{prog} in x86)} + \put(75,0){\tmachine(x86)} + \put(205,0){\tmachine(x86)} + \put(270,25){\tmachine(x86)} + \end{picture} +\end{center} + +Many compilers have multiple internal steps---for example, a C +compiler does not usually generate machine code directly, but rather +generates symbolic assembly code, which an \textit{assembler} then +translates to binary machine code. Typically tombstone diagrams do +not include such details, but we can include them if we wish, such as +with the Futhark compiler above. + +Tombstone diagrams can get awkward in complex cases (sometimes there +will be no room!), but they can be a useful illustration of complex +setups of compilers and interpreters. Also, if we loosen the +definition of ``machine'' to include ``operating systems'', then we +can use these diagrams to show how we can emulate Windows or DOS +programs on a GNU/Linux system. + +Tombstone diagrams hide many details that we normally consider +important. For example, a JIT compiler is simply considered an +interpreter in a tombstone diagram, since that is how it appears to +the outside. Also, tombstone diagrams cannot easily express programs +written in multiple languages, like the example shown in +\cref{sec:python-calling-c}. Always be aware that tombstone diagrams +are a very high-level visualisation. In practice, such diagrams are +mostly used for describing \textit{bootstrapping} processes, by which +we make compilers available on new machines. The tombstone diagram +components are summarised in \cref{fig:tombstone}. 
+ +\begin{figure}[b] + \centering + + \begin{subfigure}[b]{0.45\textwidth} + \centering + \tprog(Prog in $L$) + \caption{A program written in language + $L$.} + \label{fig:tprog} + \end{subfigure} + \hfill + \begin{subfigure}[b]{0.45\textwidth} + \centering + \tinter($F$ in $T$) + \caption{An interpreter for $F$, written in $T$.} + \label{fig:tinter} + \end{subfigure} + + \bigskip + + \begin{subfigure}[b]{0.45\textwidth} + \centering + \tmachine($L$) + \caption{A machine that runs $L$ programs.} + \label{fig:tmachine} + \end{subfigure} + \hfill + \begin{subfigure}[b]{0.45\textwidth} + \centering + \tcompiler(from $F$ to $T$ in $L$) + \caption{A compiler from $F$ to $T$, written in $L$.} + \label{fig:tcompiler} + \end{subfigure} + + \caption{A summary of tombstone diagram building blocks.} + \label{fig:tombstone} +\end{figure} + +\section{Combining Python and C} +\label{sec:python-calling-c} + +As discussed above, interpreted languages are typically substantially +slower than compiled languages, especially for languages with high +\textit{computational intensity}. By this term, we mean how much of +the execution time is spent directly executing program code, and how +much is spent waiting for data (e.g. user input or network data). For +programs with low computational intensity, an interpreted language +like Python is an excellent choice, as the interpretive overhead has +little impact. However, Python is also very widely used for +computationally heavy programs, such as data analysis. Do we just +accept that these programs are much slower than a corresponding +program written in C? Not exactly. Instead, we use high-performance +languages, such as C, to write \textit{computational kernels} in the +form of C functions. These C functions can then be called from Python +using a so-called \textit{foreign function interface} (FFI). + +As a simple example, let us consider the \texttt{collatz.c} program +from \cref{lst:ccollatz}. Instead of compiling the C program to an +executable, we compile it to a so-called \textit{shared object}, which +allows it to be loaded by Python\footnote{Don't worry about the + details of the command line options here---the technical details are + less important than the concept.}: + +\begin{lstlisting} +$ gcc collatz.c -fPIC -shared -o libcollatz.so +\end{lstlisting} + +We can now write a Python program that uses the \texttt{ctypes} +library to access the compiled code in the \texttt{libcollatz.so} +file, and call the \texttt{collatz} function we wrote in C: + +\lstinputlisting[ +caption={A Python program that uses a C implementation of \texttt{collatz}.}, +label={lst:pycollatz-ffi}, +language=Python, +frame=single] +{src/collatz-ffi.py} + +Let's time it as before: + +\begin{lstlisting} +$ time python3 ./collatz-ffi.py 100000 >/dev/null + +real 0m0.165s +user 0m0.163s +sys 0m0.003s +\end{lstlisting} + +The pure Python program ran in $1.3s$, the pure C in $0.032s$, and +this mixture in $0.165s$ - significantly faster than Python, but +slower than C by itself. The difference is mostly down to the work +required to convert Python values to C values when calling +\texttt{c\_lib.collatz}. The overhead is particularly acute for this +program, because each call to \texttt{collatz} does relatively little +work. + +While this example is very simple, the basic idea is +\textit{fundamental} to Python's current status as perhaps the most +popular language for working data scientists and students. 
Ubiquitous +libraries such as NumPy and SciPy have their computational core +written in high-performance C or Fortran, which is then exposed in a +user-friendly way through Python functions and objects. While a +program that uses NumPy is certainly much slower than a tightly +optimised C program, it is \textit{much} faster than a pure Python +program would be, and \textit{far} easier to write than a +corresponding C program. + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: "notes" +%%% End: diff --git a/data_layout.tex b/data_layout.tex new file mode 100644 index 0000000..02c28f7 --- /dev/null +++ b/data_layout.tex @@ -0,0 +1,334 @@ +\lstset{language=C} + +\chapter{Data Layout} + +One of the things that makes C a difficult programming language is +that it does not provide many built-in data types, and provides poor +support for making it convenient to work with new data types. In +particular, C has notoriously poor support for multi-dimensional +arrays. Given that multi-dimensional arrays are perhaps the single +most important data structure for scientific computing, this is not +good. In this chapter we will look at how we \textit{encode} +mathematical objects such as matrices (two-dimensional arrays) with +the tools that C makes available to us. One key point is that there +are often multiple ways to represent the same object, with different +pros and cons, depending on the situation. + +\section{Arrays in C} + +At the surface level, C does support arrays. We can declare a +$n\times{}m$ array as +\begin{lstlisting} +double A[n][m]; +\end{lstlisting} +and then use the fairly straightforward \lstinline{A[i][j]} syntax to +read a given element. However, C's arrays are a second-class language +construct in many ways: + +\begin{itemize} +\item They decay to pointers in many situations. +\item They cannot be passed to a function without ``losing'' their + size. +\item They cannot be returned from a function at all. +\end{itemize} + +In practice, we tend to only use language-level arrays in very simple +cases, where the sizes are statically known, and they are not passed +to or from functions. For general-purpose usage, we instead build our +own representation of multi-dimensional arrays, using C's support for +\textit{pointers} and \textit{dynamic allocation}. Since actual +machine memory is essentially a single-dimensional array, working with +multi-dimensional arrays in C really just requires us to answer one +central question: +\begin{center} + \textbf{How do we map a multi-dimensional index to a + single-dimensional index?} +\end{center} +Or to put it another way, representing a $d$-dimensional array in C +requires us to define a bijective\footnote{A bijective function is a + function between two sets that maps each element of each set to a + distinct element of the other set.} \textit{index function} +\begin{equation} + I : \mathbb{N}^{d} \rightarrow \mathbb{N} +\end{equation} + +The index function maps from our (mathematical, conceptual) +multi-dimensional space to the one-dimensional memory space offered by +an actual computer. This is sometimes also called \textit{unranking}, +although this is strictly speaking a more general term from +combinatorics. 
+ +As an example, suppose we wish to represent the following $3\times{}4$ matrix in memory: + +\begin{equation} + \begin{pmatrix} + 11 & 12 & 13 & 14 \\ + 21 & 22 & 23 & 24 \\ + 31 & 32 & 33 & 34 + \end{pmatrix} +\label{eqn:matA} +\end{equation} + +We can do this in any baroque way we wish, but the two most common +representations are: + +\begin{description} +\item[\textbf{Row-major order},] where elements of each \textit{row} are + contiguous in memory: + + \begin{center} + \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|} + \hline + 11&12&13&14&21&22&23&24&31&32&33&34\\ + \hline + \end{tabular} + \end{center} + + with index function + \[ + (i,j) \mapsto i\times 4 + j + \] +\item[\textbf{Column-major order},] where elements of each \textit{column} + are contiguous in memory: + + \begin{center} + \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|} + \hline + 11&21&31&12&22&32&13&23&33&14&24&34\\ + \hline + \end{tabular} + \end{center} + + with index function + \[ + (i,j) \mapsto j\times 3 + i + \] +\end{description} + +The index functions are generalised on \cref{fig:indexfunctions}. +Note that the two representations contain the exact same values, so +they encode the same mathematical object, but in different ways. The +intuition for the row-major index function is that we first skip $i$ +rows ahead to get to the row of interest, then move $j$ columns into +the row. + +Row-major order is used by default in most programming languages and +libraries, but not universally so---the scientific language Fortran is +famously column-major. The NumPy library for Python uses row-major by +default (called \texttt{C} in Numpy), but one can explicitly ask for +arrays in column-major order (called \texttt{F}), which is sometimes +needed when exchanging data with systems that expect a different +representation. + +\begin{figure*} + \centering + + \begin{subfigure}[b]{0.45\textwidth} + \begin{equation} + (i,j) \mapsto i\times m + j \label{eqn:idx2row} + \end{equation} + \caption{Row-major indexing.} + \label{fig:rowmajor} + \end{subfigure} + \hfill + \begin{subfigure}[b]{0.45\textwidth} + \begin{equation} + (i,j) \mapsto j\times n + i \label{eqn:idx2col} + \end{equation} + \caption{Column-major indexing.} + \label{fig:colmajor} + \end{subfigure} + + \caption{Index functions for $n\times{}m$ arrays represented in + row-major and column-major order. For an example of why computer + scientists tend to prefer 0-indexing, try rewriting the above to + work with 1-index arrays instead.} + \label{fig:indexfunctions} +\end{figure*} + +\subsection{Implementation in C} + +Let's look at how to implement this in C. Let's say we wish to +represent the matrix from \cref{eqn:matA} in row-major order. Then we +would write the following (assuming \texttt{n=3}, \texttt{m=4}): + +\begin{lstlisting} +int *A = malloc(n*m*sizeof(int)); +A[0] = 11; +A[1] = 12; +... +A[11] = 34; +\end{lstlisting} + +Note that even though we \textit{conceptually} wish to represent a +two-dimensional array, the actual C type is technically a +single-dimensional array with 12 elements. If we when wish to index +at position $(i,j)$ we then use the expression \lstinline{A[i*4+j]}. + +Similarly, if we wished to use column-major order, we would program as +follows: + +\begin{lstlisting} +int *A = malloc(n*m*sizeof(int)); +A[0] = 11; +A[1] = 21; +... +A[11] = 34; +\end{lstlisting} + +To C there is no difference---and there is no indication in the types +what we intended. This makes it very easy to make mistakes. 
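+
+As a small illustration of the two layouts (the array names
+\texttt{A\_row} and \texttt{A\_col} are just for this sketch), the
+following code fills one array in row-major order and one in
+column-major order with the values from \cref{eqn:matA}, using the
+index functions directly instead of writing out every element by hand:
+
+\begin{lstlisting}
+int *A_row = malloc(n*m*sizeof(int));
+int *A_col = malloc(n*m*sizeof(int));
+for (int i = 0; i < n; i++) {
+  for (int j = 0; j < m; j++) {
+    int v = (i+1)*10 + (j+1); // 11, 12, ..., 34 as in the example matrix
+    A_row[i*m + j] = v;       // row-major index function
+    A_col[j*n + i] = v;       // column-major index function
+  }
+}
+\end{lstlisting}
+
+Both arrays contain exactly the same twelve values; only the order in
+which they are stored in memory differs.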
+
+Note also how it is on \textit{us} to keep track of the sizes of the
+array---C is no help. Don't make the mistake of thinking that
+\lstinline{sizeof(A)} will tell you how big this array is---while C
+will produce a number for you, it will indicate the size of a
+\textit{pointer} (probably 8 on your machine).
+
+Some C programmers like defining functions to help them generate the
+flat indexes when indexing arrays:
+
+\begin{lstlisting}
+int idx2_rowmajor(int n, int m, int i, int j) {
+  return i * m + j;
+}
+
+int idx2_colmajor(int n, int m, int i, int j) {
+  return j * n + i;
+}
+\end{lstlisting}
+
+Note how row-major indexing does not use the \texttt{n} parameter, and
+column-major indexing does not use \texttt{m}.
+
+However, these functions do not on their own fully prevent us from
+making mistakes. Consider indexing the \texttt{A} array from before
+with the expression
+
+\begin{center}
+\lstinline{A[idx2_rowmajor(n, m, 1, 5)]}.
+\end{center}
+
+Here we are trying to access index $(1,5)$ in a $3\times{}4$
+array---which is conceptually an out-of-bounds access, since the
+column index exceeds the number of columns. However, by the index
+function, this translates to the flat index $1\times{}4+5=9$, which is
+in-bounds for the 12-element array we use for our representation in C.
+This means that handy tools like \texttt{valgrind} will not even be
+able to detect our mistake---from C's point of view, we're doing
+nothing wrong! Things like this make scientific computing in C a
+risky endeavour. We can protect ourselves by using helper functions
+like those above, and augment them with \texttt{assert} statements
+that check for problems:
+
+\begin{lstlisting}
+int idx2_rowmajor(int n, int m, int i, int j) {
+  assert(i >= 0 && i < n);
+  assert(j >= 0 && j < m);
+  return i * m + j;
+}
+\end{lstlisting}
+
+We can still make mistakes, but at least now they will be noisy,
+rather than silently reading (or corrupting!) unintended data.
+
+\subsection{Size passing}
+\label{sec:sizepassing}
+
+With the previously discussed representation, a multidimensional array
+(e.g. a matrix) is just a pointer to the data, along with metadata
+about its size. The C language does not help us keep this metadata in
+sync with reality. When passing one of these arrays to a function, we
+must manually pass along the sizes, and we must get them right without
+much help from the compiler. For example, consider a function that
+sums each row of a (row-major) $n\times{}m$ array, saving the results
+to an $n$-element output array:
+
+\lstinputlisting[
+caption={Summing the rows of a matrix.},
+label={lst:sumrows.c},
+language=C,
+frame=single]
+{src/sumrows.c}
+
+C gives us the raw building blocks of efficient computation, but we
+must put together the pieces ourselves. We protect ourselves by
+carefully documenting the data layout expected of the various
+functions. For the \texttt{sumrows} function above, we would document
+that \texttt{matrix} is expected to be a row-major array of size
+$n\times{}m$.
+
+\subsection{Slicing}
+\label{sec:slicing}
+
+In high-level languages like Python, we can use notation such as
+\texttt{A[i:j]} to extract a \textit{slice} (a contiguous subsequence)
+of an array, in this case of the elements from index \texttt{i} to
+\texttt{j}. No such syntactical niceties are available in C, but by
+using our knowledge of how arrays are physically laid out in memory,
+we can obtain a similar effect in many cases.
+
+Suppose \texttt{V} is a vector of \texttt{n} elements, and we wish to
+obtain a slice of the elements from index \texttt{i} to \texttt{j}
+(the latter exclusive). In Python, we would merely write
+\texttt{V[i:j]}. In C, we compute the size of the slice as
+\begin{lstlisting}
+int m = j - i;
+\end{lstlisting}
+and then compute a pointer to the start of the slice:
+\begin{lstlisting}
+double *slice = &V[i];
+\end{lstlisting}
+Now we can treat \texttt{slice} as an \texttt{m}-element array that
+just happens to use the same underlying storage as \texttt{V}. This
+means that we must be careful not to deallocate \texttt{V} while
+\texttt{slice} is still in use.
+
+Similarly, if \texttt{A} represents a matrix of size \texttt{n} by
+\texttt{m} in row-major order, then we can produce a vector
+representing the \texttt{i}th row as follows:
+\begin{lstlisting}
+double *row = &A[i*m];
+\end{lstlisting}
+The restriction is that such slicing can only produce arrays whose
+elements are \emph{contiguous} in memory. For example, we cannot
+easily extract a column of a row-major array, because the elements of
+a column are not contiguous in memory. If we wish to extract a
+column, then we have to allocate space and copy element-by-element, in
+a loop\footnote{There are more sophisticated array representations
+  that use \textit{strides} to allow array elements that are not
+  contiguous in memory---NumPy uses these, but they are outside the
+  scope of our course.}.
+
+\subsection{Even higher dimensions}
+
+The examples so far have focused on the two-dimensional case.
+However, the notion of row-major and column-major order generalises
+just fine to higher dimensions. The key distinction is that in a
+row-major array, the \textit{last} dimension is contiguous, while for
+a column-major array, the \textit{first} dimension is contiguous. For
+a row-major array of shape $n_{0} \times{} \cdots \times{} n_{d-1}$,
+the index function, where $p$ is a $d$-dimensional index point, is
+
+\begin{equation}
+  p \mapsto \sum_{0 \leq i < d} p_{i} \prod_{i < j < d} n_{j}
+\end{equation}
+
+Where parallelisation is concerned, the main limiting factors are the
+true dependencies---which correspond to an algorithmic
+property---because the anti and output dependencies can typically be
+eliminated by various techniques, as we shall see.
+
+\subsection{Loop notation and lexicographic ordering of iterations in a loop nest}
+
+In the following we are concerned with loop nests that consist of
+\lstinline{for}-loops where the number of iterations (the \emph{trip
+  count}) can be determined when the loop is first entered. This is
+the case when:
+\begin{enumerate}
+\item The loop counter is not modified inside the loop.
+\item The loop condition is of the form $i < n$, where $i$ is the loop
+  counter and $n$ does not change during execution of the loop.
+\item The loop counter is increased by a constant for each loop
+  iteration.
+\end{enumerate}
+Most \lstinline{for}-loops you have written probably satisfy these
+conditions.
+
+In the following we will also assume that iterations in a loop
+nest are represented by a vector, in which iteration numbers
+are written down from the outermost to the innermost
+loop in the nest, and are ordered \emph{lexicographically}---i.e.,
+are ordered consistently with the order in which they are executed
+in the (sequential) program. This means that in the loop nest below:
+\begin{lstlisting}[mathescape=true]
+for (int i = 0; i < N; i++)
+  for (int j = 0; j < M; j++)
+    ... loop-nest body ...
+\end{lstlisting}
+iteration \texttt{$\vec{k}$=(i=2,j=4)} is smaller than iteration
+\texttt{$\vec{l}$=(i=3,j=3)} (i.e., $\vec{k} < \vec{l}$), because the
+second iteration of the outer loop is executed before the third
+iteration of the outer loop, no matter what the iteration numbers of
+the inner loop (of index \texttt{j}) are. In essence the iteration
+numbers of inner loops are only used to discriminate the order in the
+cases in which all the outer-loop iterations are equal, for example
+\texttt{$\vec{k}$=(i=3,j=3) $<$ $\vec{l}$=(i=3,j=4)}.
+
+\subsection{Dependency definition}
+The precise definition of a dependency between two statements
+located inside a loop nest is given below.
+
+\begin{definition}[Loop Dependency]\label{Loop-Dep}
+  There is a dependency from statement $S_1$ to statement $S_2$
+  in a loop nest {\em if and only if} there exist loop-nest
+  iterations $\vec{k}$, $\vec{l}$ such that $\vec{k} \leq \vec{l}$
+  and there exists an execution path from statement $S_{1}$ to
+  statement $S_{2}$ \emph{such that:}
+  \begin{description}
+  \item[1.] $S_{1}$ accesses some memory location $M$ in iteration $\vec{k}$, and
+  \item[2.] $S_{2}$ accesses the same memory location $M$ in iteration $\vec{l}$, and
+  \item[3.] one of these accesses is a write.
+  \end{description}
+  In such a case, we say that $S_1$ is the \emph{source} of the dependence,
+  and that $S_{2}$ is the \emph{sink} of the dependence, because $S_1$ is supposed
+  to execute before $S_2$ in the sequential program execution.\\
+  Dependencies can be visually depicted by arrows pointing from the source
+  to the sink of the dependence.
+\end{definition}
+
+The definition basically says that in order for a dependency to exist,
+there must be two statements that access \emph{the same memory
+  location} and one of the accesses must be a write---two read
+instructions to the same memory location do not cause a
+dependency. The nomenclature denotes the statement that executes first
+in the program order as \emph{the source} and the other as \emph{the
+  sink} of the dependency. We represent a dependency graphically with
+an arrow pointing from the source to the sink.
+
+Optimisations for instruction-level parallelism (ILP)---meaning
+eliminating as much as possible the stalls in the processor's pipeline
+execution\footnote{Outside the scope of HPPS.}---typically rely on
+intra-iteration analyses (i.e., $\vec{k}=\vec{l}$). Higher-level
+optimisations, such as detection of loop parallelism, are mostly
+concerned with analysing inter-iteration dependencies (i.e.,
+$\vec{k}\neq\vec{l}$). For example, the main aim could be to disprove
+the existence of inter-iteration dependencies, such that different
+iterations may be scheduled out of order (in parallel) on different
+cores, while the body of an iteration is executed sequentially on the
+same core. In such a context, intra-iteration dependencies are
+trivially satisfied, and so are not very interesting.
+
+\subsection{Aggregating dependencies with direction vectors}
+
+Assume the three loops presented in \cref{fig:data-dep-running-eg},
+which will be used as running examples to demonstrate data dependence
+analysis and related transformations.
We make the important +observation that the code is not in three-address code (TAC) form: a +statement such as \texttt{A[j][i] = A[j][i] + 3} would correspond to +three TAC or hardware instructions: one that loads from memory +\texttt{tmp1 = A[j][i]}, followed by one that performs the arithmetic +operation \texttt{tmp2 = tmp1 + 3}, followed by one that writes to +memory \texttt{A[j][i] = tmp2}. Automated analysis is for simplicity +usually carried out on programs in TAC form but, for brevity, our +analysis will be carried out at the statement level. +% ---however please remember that such a statement +% is actually a composition of three hardware instructions! +A human may start analysing dependencies: +\begin{itemize} +\item by depicting the iteration space in a rectangle in which the $x$ + axis and $y$ axis correspond to iteration numbers of the inner + loop $j$ and outer loop $i$, respectively, and +\item then by reasoning point-wise about what dependencies may + happen between two iterations. +\end{itemize} + +\begin{figure} + \centering + + \begin{subfigure}[b]{0.45\textwidth} +\begin{lstlisting}[language=C,mathescape=True] +for (int i = 0; i < N; i++) + for (int j = 0; j < N; j++) + $S_1:$ A[j][i] = A[j][i]... +\end{lstlisting} + \caption{} + \label{fig:data-dep-running-eg-a} + \end{subfigure} + + \begin{subfigure}[b]{0.45\textwidth} +\begin{lstlisting}[language=C,mathescape=True] +for (int i = 1; i < N; i++) + for (int j = 1; j < N; j++) { + $S_1:$ A[j][i] = A[j-1][i-1]... + $S_2:$ B[j][i] = B[j-1][i]... + } +\end{lstlisting} + \caption{} + \label{fig:data-dep-running-eg-b} + \end{subfigure} + + \begin{subfigure}[b]{0.45\textwidth} +\begin{lstlisting}[language=C,mathescape=True] +for (int i = 1; i < N; i++) + for (int j = 0; j < N; j++) + $S_1:$ A[i][j] = A[i-1][j+1]... +\end{lstlisting} + \caption{} + \label{fig:data-dep-running-eg-c} + \end{subfigure} + + \caption{Three simple running code examples that will be used to + demonstrate data-dependency analysis and related transformation. + Note that the statement labels $S_{i}$ are not part of the code as + such, but used to refer to specific statements in the text.} + \label{fig:data-dep-running-eg} +\end{figure} + +\begin{figure} + \includegraphics[width=\textwidth]{img/LoopDeps.pdf} + \caption{Graphical representation of the dependencies for the three running examples shown in \cref{fig:data-dep-running-eg}; the $x$ and $y$ axis correspond to the index of the inner and outer \lstinline{do} loop, respectively.} + \label{fig:dep-graph} +\end{figure} + +A graphical representation of the dependencies of the three running code +examples is shown in \cref{fig:dep-graph}. They can be intuitively inferred +as follows: +\begin{itemize} +\item For the loop in \cref{fig:data-dep-running-eg-a}, different + loop-nest iterations $(i_1,j_1)$ and $(i_2,j_2)$ necessarily read + and write different array elements \texttt{A[$j_1$][$i_1$]} and + \texttt{A[$j_2$][$i_2$]}. This is because our assumption is that + $(i_1,j_1) \neq (i_2,j_2)$, hence it cannot be that both $i_1 = i_2$ + and $j_1 = j_2$. As such, the representation of dependencies should + be a set of points (no arrows), meaning that all dependencies + actually occur inside the same iteration---in fact they are anti + intra-iteration dependencies (WAR) because \texttt{A[j][i]} is read + first and then \texttt{A[j][i]} is written inside the same + iteration. 
+\item For the loop in \cref{fig:data-dep-running-eg-b} we reason
+  individually for statements $S_1$ and $S_2$, because each statement
+  accesses a single array, \texttt{A} and \texttt{B} respectively:
+  \begin{itemize}
+  \item[$S_1:$] Let's take an iteration, say $(i_1=2, j_1=3)$,
+    which \emph{reads} element \texttt{A[$j_1$-1][$i_1$-1] = A[2][1]}.
+    Since an iteration $(i,j)$ always writes the element
+    \texttt{A[j][i]}, we can reason that iteration $(i_2=1, j_2=2)$
+    will \emph{write} the same element \texttt{A[2][1]}.
+    It follows that we have discovered a true (RAW) dependency,
+    depicted in the figure with an arrow, from the source
+    iteration $(i_2=1, j_2=2)$---which writes \texttt{A[2][1]}---to
+    the sink iteration $(i_1=2, j_1=3)$---which reads \texttt{A[2][1]}.
+    This is because iteration $(1,2) < (2,3)$ according to the
+    lexicographical ordering, and as such, the read happens after
+    the write (RAW) in program order. One can individually reason
+    for each point of the iteration space and fill it with oblique,
+    forward-pointing arrows denoting true dependencies between
+    different instances of statement $S_1$
+    (executing in different iterations).
+  \item[$S_2:$] Following a similar rationale, iteration $(i_1=2, j_1=3)$
+    \emph{reads} element \texttt{B[$j_1$-1][$i_1$] = B[2][2]}, and
+    iteration $(i_2=2, j_2=2)$ \emph{writes} element \texttt{B[2][2]}.
+    It follows that we have discovered a true (RAW) dependency
+    with source $(i_2=2, j_2=2)$ and sink $(i_1=2, j_1=3)$,
+    because $(2,2) < (2,3)$ in lexicographic ordering.
+    Since $i_1=i_2$ we depict the arrow parallel with the
+    horizontal axis (which depicts values of $j$). One can
+    fill in the rest of the iteration space with horizontal arrows.
+  \end{itemize}
+
+\item For the loop in \cref{fig:data-dep-running-eg-c} we reason in a
+  similar way: take iteration $(i_1=2, j_1=3)$ that \emph{reads}
+  element \texttt{A[2-1][3+1] = A[1][4]}. This element is
+  \emph{written} by iteration $(i_2=1,j_2=4)$. It follows that we have
+  discovered a true (RAW) dependency from source $(i_2=1,j_2=4)$ to
+  sink $(i_1=2, j_1=3)$---because the read happens in iteration $(2,3)$,
+  which comes after the write in iteration $(1,4)$, i.e.,
+  $(1,4) < (2,3)$. Thus, one can fill in the iteration space with
+  oblique, backward-pointing arrows, denoting true dependencies
+  between instances of $S_1$ executing in different iterations.
+\end{itemize}
+
+We have applied above a human type of reasoning and, as a result, we
+have a graphical representation of all dependencies. However, such
+reasoning is not suitable for automation because (i) the loop counts
+are statically unknown---they depend on the dataset---hence one cannot
+possibly represent an arbitrarily large iteration space, and, more
+importantly, (ii) even if the loop counts were statically known, it
+would still be inefficient to maintain and work with all this
+pointwise information.
+
+A representation that promotes automated reasoning should succinctly
+capture the repeated pattern in the figure. Intuitively and
+imprecisely, for \cref{fig:data-dep-running-eg-a} the pattern would
+correspond to a point, for \cref{fig:data-dep-running-eg-b} it would
+correspond to two arrows---one oblique and one horizontal, both
+forward-pointing---and for \cref{fig:data-dep-running-eg-c} it would
+correspond to an oblique, backward-pointing arrow.
+%
+These patterns are formalized by introducing the notion of
+\emph{direction vectors}.
+
+\begin{definition}[Dependency direction vector]\label{Dep-Dir-Vect}
+
+  Assume there exists a dependency from source $S_{1}$ in iteration $\vec{k}$
+  to sink $S_{2}$ in iteration $\vec{l}$ ($\vec{k}\leq\vec{l}$).
+  We denote by $m$ the depth of the loop nest, use $i$ to range
+  over $0,\ldots,m-1$, and denote by $x_i$ the $i^{th}$
+  element of some vector $\vec{x}$ of length $m$.
+
+  The \emph{direction vector} between the instance of statement $S_1$
+  executed in some source iteration $\vec{k}$ and statement $S_2$
+  executed in sink iteration $\vec{l}$ is denoted by
+  $\vec{D}(S_1\in\vec{k},S_{2}\in\vec{l})$, and corresponds to a vector
+  of length $m$, whose elements are defined as:
+
+  \begin{equation}
+    D_i(S_1\in\vec{k},S_2\in\vec{l})=
+    \begin{cases}
+      \mbox{\texttt{<}} \ \ \ \ \mbox{\textrm{if it is provable that}} ~~k_i ~<~ l_i,\\
+      \mbox{\texttt{=}} \ \ \ \ \mbox{\textrm{if it is provable that}} ~~k_i ~=~ l_i,\\
+      \mbox{\texttt{>}} \ \ \ \ \mbox{\textrm{if it is provable that}} ~~k_i ~>~ l_i,\\
+      \mbox{\texttt{{}*}} \ \ \ \ \mbox{\textrm{if}}~k_i ~\mbox{\textrm{{}and}}~ l_i~\mbox{are statically incomparable.}\\
+    \end{cases}
+    \label{eqn:dir-vec}
+  \end{equation}
+\end{definition}
+
+The first three cases of the definition above assume that the ordering
+relation between $k_i$ and $l_i$ can be statically derived in a generic
+fashion (for any source $k_i$ and $l_i$); if this is not possible, then
+we use the notation \texttt{*}, which \emph{conservatively} assumes that
+any direction may be possible---i.e., star should be understood as the
+simultaneous existence of all \texttt{<, =, >} directions. For example,
+the loop
+\begin{lstlisting}[mathescape=true]
+for (int i = 0; i < N; i++)
+  $S_1:$ A[ X[i] ] = ...
+\end{lstlisting}
+would result in direction vector \texttt{[*]} corresponding to a
+potential output dependency (WAW), because the write access to
+\texttt{A[ X[i] ]} is statically unanalysable---for example under the
+assumption that the index array \texttt{X} is part of the
+dataset---and, as such, all direction vectors may possibly hold
+between various pairs of instances of statement $S_1$ executed in
+different iterations.
+
+Note that the symbols \texttt{<, =, >} are \emph{not} connected at all
+to the type of the dependency, e.g., true (RAW) or anti (WAR)
+dependency. The type of the dependency is solely determined by the
+operation of the source and that of the sink: if the source is a write
+statement and the sink is a read, then we have a true (RAW) dependency;
+if the source is a read and the sink is a write, then we have an anti
+(WAR) dependency; if both source and sink are writes, then we have an
+output (WAW) dependency.
+
+The meaning of the symbol \texttt{>} at some position $i$ is that
+the source iteration at loop-level $i$ is greater than the sink
+iteration at loop-level $i$. This case is possible: for example,
+the code in \cref{fig:data-dep-running-eg-c} shows a dependency
+with source iteration $(1,4)$ and sink iteration $(2,3)$. At the
+level of the second loop, we have $4 > 3$, hence the direction
+is \texttt{>}, but the source iteration is still less than the sink
+iteration ($(1,4) < (2,3)$) because of the first loop level.
+This observation leads to the following corollary:
+
+\begin{corollary}[Direction vector legality]\label{Leg-Dir-Vect}
+
+  A direction vector is legal (well formed) if removing the
+  \texttt{=} entries does \emph{not} result in a leading \texttt{>}
+  symbol, as this would mean that an iteration depends on a future
+  iteration, and depending on a future event is considered impossible,
+  and as such illegal.
+\end{corollary}
+
+It remains to determine the sort of reasoning that can be applied to
+compute the direction vectors for the code examples in
+\cref{fig:data-dep-running-eg}:
+\begin{description}
+\item[For the loop in \cref{fig:data-dep-running-eg-a}:] dependencies can
+  occur only between instances of statement $S_1$, executed in
+  different (or the same) iterations. We recall that, by the
+  definition of dependency, the two (dependent) iterations must access
+  the same element of \texttt{A} and at least one iteration should
+  perform a write. Since statement $S_1$ performs a read and a write
+  to elements of array \texttt{A}, two kinds of dependencies may
+  occur:
+  \begin{description}
+  \item[WAW:] an output dependency may be caused by two write accesses
+    in two different iterations, denoted $(i_1,j_1)$ and
+    $(i_2,j_2)$. The written element is thus \texttt{A[j$_1$][i$_1$]},
+    which must be the same as \texttt{A[j$_2$][i$_2$]} for a
+    dependency to exist. By \cref{eqn:dir-vec} this results in the
+    system of equations
+    \[
+      \begin{cases}i_1 = i_2\\j_1 = j_2\end{cases}
+    \]
+    which leads to direction vector \texttt{[=,=]}. Hence, an output
+    dependency from $S_1$ to $S_1$ could only happen within the same
+    iteration, but statement $S_1$ executes only one write access in a
+    given iteration. The conclusion is that no output dependency can
+    occur, hence the direction vector is discarded.
+  \item[RAW or WAR:] a true or anti dependency---we do not know yet
+    which---will be caused by the read access from \texttt{A} and the
+    write access to \texttt{A} in different (or the same)
+    iterations. Remember that a statement such as \texttt{A[j][i] =
+      A[j][i] + 3} actually corresponds to three hardware
+    instructions, hence either a cross- or an intra-iteration
+    dependency will necessarily occur. Assume some iteration
+    $(i_1, j_1)$ reads from \texttt{A[j$_1$][i$_1$]} and iteration
+    $(i_2,j_2)$ writes to \texttt{A[j$_2$][i$_2$]}. In order for a
+    dependency to exist, the memory location of the read and write
+    must coincide; this results in the system of equations
+    \[
+      \begin{cases}i_1 = i_2\\j_1 = j_2\end{cases}
+    \]
+    from which we can derive the direction vector:
+    \texttt{[=,=]}. This implies that the dependency happens in the
+    same iteration, hence it is an intra-iteration dependency.
+    Furthermore, since the write follows the read in the instruction
+    order of an iteration, this is an anti dependency (WAR).
+  \end{description}
+
+\item[For the loop in \cref{fig:data-dep-running-eg-b}:] dependencies
+  may possibly occur between instances of statement $S_1$
+  and between instances of statement $S_2$. The case of
+  output dependencies is disproved by a treatment similar
+  to the bullet above. It remains to examine the dependencies
+  caused by a read and a write in different instances of
+  $S_1$, and in different instances of $S_2$:
+  \begin{description}
+  \item[$S_1$:] assume iterations $(i_1,j_1)$ and $(i_2,j_2)$
+    read from and write to the same element of \texttt{A},
+    respectively. Putting this in \cref{eqn:dir-vec} results in the
+    system
+    \[
+      \begin{cases}i_1-1 = i_2\\j_1-1 = j_2\end{cases}
+    \]
+    which necessarily means that $i_1 > i_2$ and $j_1 > j_2$.
+    However, we do not know yet which iteration is the source and
+    which is the sink. Assuming that $(i_1, j_1)$ is the source
+    results in the direction vector \texttt{[>,>]}, which is illegal
+    by \cref{Leg-Dir-Vect}, because a direction vector cannot start
+    with the \texttt{>} symbol. It follows that our assumption was
+    wrong: $(i_2, j_2)$ is the source and $(i_1, j_1)$ is the sink,
+    which means that this is a cross-iteration true dependency
+    (RAW)---because the sink iteration reads the element that was
+    previously written by the source iteration---and its direction
+    vector is \texttt{[<,<]}.
+  \item[$S_2$:] a similar rationale can be applied to
+    determine that two instances of $S_2$ generate
+    a true cross-iteration dependency (RAW), whose
+    direction vector is \texttt{[=,<]}. In short, using the
+    same notation results in the system of equations
+    \[
+      \begin{cases}i_1 = i_2\\j_1-1 = j_2\end{cases}
+    \]
+    hence the source must be $(i_2,j_2)$, the sink must be
+    $(i_1,j_1)$, and the direction vector is \texttt{[=,<]}.
+  \end{description}
+
+\item[For the loop in \cref{fig:data-dep-running-eg-c}:] dependencies
+  may possibly occur between instances of statement $S_1$. Assume
+  iterations $(i_1,j_1)$ and $(i_2,j_2)$ read from and write to the
+  same element of \texttt{A}, respectively. Putting this into
+  \cref{eqn:dir-vec} results in the system
+  \[
+    \begin{cases}i_1-1 = i_2\\j_1+1 = j_2\end{cases}
+  \]
+  which necessarily implies that $i_1 > i_2$ and $j_1 < j_2$. Choosing
+  $(i_1,j_1)$ as the source of the dependency results in direction
+  vector \texttt{[>,<]}, which is illegal because it has \texttt{>} as
+  the first non-\texttt{=} outermost symbol, as stated by
+  \cref{Leg-Dir-Vect}. It follows that $(i_1,j_1)$ must be the sink
+  and $(i_2,j_2)$ must be the source, which results in the direction
+  vector \texttt{[<,>]}, which is legal. Since the source writes and
+  the sink reads, we have a true dependency (RAW). Moreover, since the
+  direction vector indicates that the source iteration is strictly
+  less than the sink iteration, this is also a cross-iteration
+  dependency.
+\end{description}
+
+\begin{definition}[Dependency direction matrix]\label{Dep-Dir-Mat}
+  A direction matrix is obtained by stacking together the
+  direction vectors of all the intra- and cross-iteration
+  dependencies of a loop nest (i.e., between any possible
+  pair of instruction instances that access the same memory
+  location, at least one of them being a write).
+\end{definition}
+
+In conclusion, the direction matrices for the three running code
+examples are:
+
+\begin{description}
+\item[\Cref{fig:data-dep-running-eg-a}:]
+  $\begin{cases}\texttt{[=,=]}\end{cases}$
+\item[\Cref{fig:data-dep-running-eg-b}:]
+  $\begin{cases}\texttt{[<,<]}\\\texttt{[=,<]}\end{cases}$
+\item[\Cref{fig:data-dep-running-eg-c}:]
+  $\begin{cases}\texttt{[<,>]}\end{cases}$
+\end{description}
+
+The following sections will show how the legality of powerful code
+transformations can be reasoned about in a simple way in terms of
+direction vectors/matrices.
+
+\section{Determining loop parallelism}
+\label{sec:loop-par}
+
+A loop is said to be parallel if its execution does not cause any
+(true, anti, or output) dependencies between its iterations---the loop
+execution is assumed to be fixed in a specific iteration of a
+(potentially empty) enclosing loop context.
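+
+As a quick illustration---separate from the running examples in
+\cref{fig:data-dep-running-eg}---consider the two loops sketched
+below, assuming \texttt{A} and \texttt{B} are distinct arrays. The
+first loop is parallel: every iteration writes its own element of
+\texttt{A} and only reads from \texttt{B}, so no dependency between
+iterations can arise. The second loop is not parallel: iteration
+\texttt{i} reads the element written by iteration \texttt{i-1}, which
+is a cross-iteration true (RAW) dependency.
+
+\begin{lstlisting}[language=C]
+// Parallel: iterations access disjoint elements of A.
+for (int i = 0; i < N; i++)
+  A[i] = B[i] + 1;
+
+// Not parallel: iteration i reads A[i-1], written by iteration i-1.
+for (int i = 1; i < N; i++)
+  A[i] = A[i-1] + 1;
+\end{lstlisting}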
+
+The following theorem states that a sufficient condition for a loop to
+be parallel is that for all the elements in the loop's corresponding
+direction-matrix column, it holds that the element is either
+\texttt{=} or there exists an outer loop whose corresponding direction
+is \texttt{<} (on that row).  In the latter case we say that the outer
+loop \emph{carries} all the dependencies of the inner loop, i.e.,
+fixing an iteration of the outer loop (think executing the outer loop
+sequentially) would guarantee the absence of cross-iteration
+dependencies in the inner loop.
+
+\begin{theorem}[Parallel loop]\label{Loop-Par}
+
+  We assume a loop nest denoted by $\vec{L}$, whose direction matrix
+  is denoted by $M$ and consists of $m$ rows.  A sufficient condition
+  for a loop at depth $k$ in $\vec{L}$, denoted $L_k$, to be parallel
+  is that $\forall i\in\{0,\ldots,m-1\}$ either $M[i][k]$ is
+  \texttt{=} or there exists an outer loop at depth $q<k$ whose
+  corresponding direction $M[i][q]$ is \texttt{<}.
+\end{theorem}
+
+A direction matrix may also contain the \texttt{*} symbol, which is
+used when the analysis cannot determine the direction precisely,
+meaning that any of the directions \texttt{<}, \texttt{=} or \texttt{>} may be
+possible.  If \texttt{*} does not appear in the direction matrix, then
+the condition becomes \emph{necessary} as well as sufficient.  Let us
+analyse the parallelism of each loop in our running examples:
+
+\begin{description}
+\item[\Cref{fig:data-dep-running-eg-a}:] The direction matrix is
+  \texttt{[=,=]}, hence by \cref{Loop-Par}, both loops in the nest are
+  parallel because all the directions are \texttt{=}.
+
+\item[\Cref{fig:data-dep-running-eg-b}:] The direction matrix is
+  \[
+    M = \begin{cases}\texttt{[<,<]}\\\texttt{[=,<]}\end{cases}
+  \]
+  hence neither the outer nor the inner loop can be proven parallel by
+  \cref{Loop-Par}.  In the former case this is because $M[0][0]$ is
+  \texttt{<} and there is no other outer loop to carry the
+  dependencies.  In the latter case this is because $M[1][1]$ is
+  \texttt{<} and the outer loop for that row has direction \texttt{=}
+  (instead of \texttt{<}, which would have been necessary to carry the
+  dependencies of the inner loop).
+
+\item[\Cref{fig:data-dep-running-eg-c}:] The direction matrix is
+  \texttt{[<,>]}, which means that the outer loop is not
+  parallel---because it has a leading \texttt{<} direction---\emph{but
+    the inner loop is parallel} because the outer loop starts with
+  \texttt{<} on the only row of the direction matrix, and, as such, it
+  carries all the dependencies of the inner loop.  To understand what
+  this means, take a look again at the actual code in
+  \cref{fig:data-dep-running-eg-c}.  Suppose we fix the outer
+  iteration number to some value $i$.  Then the read accesses always
+  refer to row $i-1$ of matrix \texttt{A} and the write accesses
+  always refer to row $i$ of \texttt{A}; hence a cross-iteration
+  dependency cannot happen in the inner loop because no matter the
+  value of $j$, the read and write statement instances cannot possibly
+  refer to the same location of \texttt{A}.
+\end{description} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: "notes" +%%% End: diff --git a/img/Makefile b/img/Makefile new file mode 100644 index 0000000..212aa1b --- /dev/null +++ b/img/Makefile @@ -0,0 +1,4 @@ +all: amdahl.pdf gustafson.pdf + +%.pdf: %.gplot + gnuplot $^ diff --git a/img/amdahl.gplot b/img/amdahl.gplot new file mode 100644 index 0000000..75d9e0d --- /dev/null +++ b/img/amdahl.gplot @@ -0,0 +1,14 @@ +set output "amdahl.pdf" +set terminal pdf +set key left top +set xlabel "Number of processors" +set ylabel "Speedup" +amdahl(p,N) = 1/((1-p)+(p/N)) +plot [1:300] \ + amdahl(0.99,x) title "p=0.99" lw 3, \ + amdahl(0.98,x) title "p=0.98" lw 3, \ + amdahl(0.97,x) title "p=0.97" lw 3, \ + amdahl(0.96,x) title "p=0.96" lw 3, \ + amdahl(0.95,x) title "p=0.95" lw 3, \ + amdahl(0.90,x) title "p=0.90" lw 3, \ + amdahl(0.50,x) title "p=0.50" lw 3 diff --git a/img/gustafson.gplot b/img/gustafson.gplot new file mode 100644 index 0000000..048c0fe --- /dev/null +++ b/img/gustafson.gplot @@ -0,0 +1,14 @@ +set output "gustafson.pdf" +set terminal pdf +set key left top +set xlabel "Number of processors" +set ylabel "Speedup" +gustafson(p,N) = N + ((1-N)*(1-p)) +plot [1:300] \ + gustafson(0.99,x) title "p=0.99" lw 3, \ + gustafson(0.98,x) title "p=0.98" lw 3, \ + gustafson(0.97,x) title "p=0.97" lw 3, \ + gustafson(0.96,x) title "p=0.96" lw 3, \ + gustafson(0.95,x) title "p=0.95" lw 3, \ + gustafson(0.90,x) title "p=0.90" lw 3, \ + gustafson(0.50,x) title "p=0.50" lw 3 diff --git a/img/highleveltweet.png b/img/highleveltweet.png new file mode 100644 index 0000000..d80bf91 Binary files /dev/null and b/img/highleveltweet.png differ diff --git a/network.tex b/network.tex new file mode 100644 index 0000000..54fb25f --- /dev/null +++ b/network.tex @@ -0,0 +1,154 @@ +\chapter{Networks} + +Although there are many kinds of computer networks created, the vast majority of network equipment is based on the TCP/IP stack, which we will cover in these notes. The course book covers networks from a programmer’s perspective, inspecting details on how to use the C programming language to implement network applications. + +In most (scientific) applications this is far too low-level. The course notes focus instead on explaining how the network is implemented and the properties that arise from the implementation. The focus is on explaining the properties that are most likely to impact scientific applications. + +\section{OSI layers} +The OSI layers are a model that explains the implementation of the internet. Each of the layers in the OSI model relies on the previous layer and makes certain assumptions about it. Using the services provided by the underlying layer, each layer can add new services. + +The OSI model is intended to describe many kinds of networks, but in the text here we map the concepts to the implementation on global internet and consider this the only implementation. + +\section{Physical layer} +The physical layer is the lowest layer in the OSI model and is concerned with transmission of bits via various propagation media. The first networks were implemented as wired networks, propagating electrical signals through a shared copper wire. Modern implementations use radio signals, to implement services such as Wi-Fi and 3/4/5G. + +The physical layer provides the service of sending and receiving bits. It is important to note that due to properties of the propagation medium, the timings, bandwidth, and error correction properties are different. 
Each of the layers above must be designed to tolerate this, such that they work in the same way regardless of which physical layer is underneath.
+
+\subsection{Sharing a medium}
+The most significant work done by the physical layer is the ability to transmit messages on a shared medium, without any out-of-band control mechanism. The primary method for solving this is the use of a multiple-access protocol. For both (wired) Ethernet and Wi-Fi, the protocol is a variant of the carrier-sense multiple-access, CSMA, protocol. The protocol requires that each host can "sense" whether data is currently being sent, and refrains from sending while the medium is busy, as a collision would make the data impossible to read. Since there is no out-of-band communication, collisions will eventually happen when two hosts start sending at (nearly) the same time. The CSMA protocol uses a concept of exponential back-off with random starting values to ensure that collisions are eventually resolved.
+
+\subsection{Sharing with radio signals}
+For radio-based networks, it is possible to have a situation where the base station (the Wi-Fi access point) is placed between two hosts, such that neither can sense the other host's signals. In this case the CSMA protocol is extended to use collision avoidance (CSMA/CA), as detection is not always possible. The concept is explored in this animated video: \url{https://www.youtube.com/watch?v=iKn0GzF5-IU}.
+
+\subsection{Connecting: Hub}
+Since the physical layer is about transmitting signals, and a shared medium is supported, the physical layer allows multiple machines to be joined. For Ethernet (wired) networks, a hub can be used to physically connect multiple devices. With a network connected by a hub, the entire bandwidth of the network is shared among all hosts, resulting in poor scalability. If a network comprises 100 hosts and is using 100 Mbps Ethernet, each host will reach less than 1 Mbps if all hosts attempt to communicate at full speed.
+
+\section{Link layer}
+
+With a physical layer that is capable of transmitting bits, the link layer provides the ability to address a single host. The bits are grouped into \emph{frames}, so that a bounded number of bits can be sent as one unit. Since the physical layer is expected to be using a shared medium, the link layer assumes that all frames are broadcast to every host.
+
+\subsection{Addressing network cards: MAC}
+To solve the problem of sending a frame to a particular host, the link layer adds \emph{media access control}, MAC, addresses. Every network device has a globally unique address that is typically burned into the device but can be changed in some cases. When a host wishes to communicate with another host, a frame is transmitted on the physical medium, containing the MAC addresses of the recipient and the sender.
+
+A MAC address is written as six two-digit hexadecimal numbers, for instance: \texttt{01:23:45:67:89:ab}. The special address \texttt{ff:ff:ff:ff:ff:ff} is used for broadcasting frames to all recipients.
+
+\subsection{Connecting: Switch}
+To make larger networks more efficient, a switch can be used in place of a hub. Where the hub operates on the physical layer, a switch operates on the link layer and forwards frames. When a switch is powered on, it has no knowledge of the network. Each frame it receives is broadcast to all other ports with a cable plugged in. For each frame it receives, it also records the sender MAC address as well as the port it arrived on. Since each MAC address is globally unique, the switch can use this simple scheme to learn where to send a frame and can avoid broadcasting to the entire network. Note that this feature works without any configuration, and even supports networks of networks, as the switch will allow mapping multiple MAC addresses to a single port. If a host is moved to another port, there will be a short period where the switch will forward frames incorrectly, until it sees a frame from the host on its new port.
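+
+The learning scheme just described can be sketched as a small lookup table from MAC address to port. The code below is only an illustration of the idea, not how a real switch is implemented; the names (\texttt{handle\_frame}, \texttt{TABLE\_SIZE}) and the omitted forwarding calls are made up for the example.
+
+\begin{lstlisting}[language=C]
+#include <stdint.h>
+#include <string.h>
+
+#define TABLE_SIZE 1024   /* hypothetical capacity of the table */
+
+struct entry { uint8_t mac[6]; int port; };
+static struct entry table[TABLE_SIZE];
+static int entries = 0;
+
+/* Return the port a MAC address was last seen on, or -1 if unknown. */
+static int lookup(const uint8_t mac[6]) {
+  for (int i = 0; i < entries; i++)
+    if (memcmp(table[i].mac, mac, 6) == 0)
+      return table[i].port;
+  return -1;
+}
+
+/* Remember (or refresh) that a MAC address was just seen on a port;
+   refreshing is what handles a host that moves to another port. */
+static void learn(const uint8_t mac[6], int port) {
+  for (int i = 0; i < entries; i++)
+    if (memcmp(table[i].mac, mac, 6) == 0) { table[i].port = port; return; }
+  if (entries < TABLE_SIZE) {
+    memcpy(table[entries].mac, mac, 6);
+    table[entries].port = port;
+    entries++;
+  }
+}
+
+/* Called for every incoming frame. */
+void handle_frame(int in_port, const uint8_t src[6], const uint8_t dst[6]) {
+  learn(src, in_port);
+  int out = lookup(dst);
+  if (out >= 0 && out != in_port) {
+    /* destination known: forward only on port `out` */
+  } else {
+    /* unknown or broadcast destination: flood to all other ports */
+  }
+}
+\end{lstlisting}
+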
+
+\subsection{Switches can increase available bandwidth}
+Once the network has several levels, or just a large number of hosts, switches become a crucial component in ensuring the performance of the network. More expensive switches can allow multiple parallel full-duplex data paths, such that multiple pairs of ports may use the full bandwidth. This allows the network to scale better as hosts are added, but there may still be bottlenecks if there is cross-switch communication.
+
+\subsection{Transmission errors}
+Since the physical layer \emph{may} have transmission errors, the link frames typically include a simple checksum that the receiver verifies. If the checksum is incorrect, or any parts of the frame are invalid, the frame is \emph{dropped} with no notification to the sender.
+
+\section{Network layer}
+With a layer that supports sending frames, the network layer adds routing and global addressing. While we \emph{can} build larger networks with a switch, the idea of broadcasting becomes problematic for a global network.
+
+\subsection{Addressing hosts: IP address}
+Because the MAC address is fixed, it is not usable for global routing. Imagine if a core router on the internet had to keep a list of all MAC addresses for all machines in Europe. Building and maintaining such a list would be an almost impossible task.
+
+To simplify routing, each host on the network layer has at least one IP address, used to transmit \emph{packages}. The IP addresses are assigned in a hierarchical manner, such that a geographic region has a large range, which is then divided into smaller and smaller ranges. At the bottom is the internet service provider, ISP, who owns one or more ranges. From these ranges they assign one or more IP addresses to their customers.
+
+The hierarchy is important when considering global-scale routing. Instead of having a core router that keeps track of all IPs for Europe, it can just know that one of its ports "eventually leads to Europe" and then assign a few IP ranges to that port.
+
+\subsection{IP, CIDR, networks and masks}
+The IP addresses are 32-bit numbers, usually presented as four decimal numbers of 8 bits each, called the dotted-decimal notation: \texttt{123.211.8.111}. The routing works locally by comparing one IP address with another, using bitwise \texttt{XOR}. By \texttt{XOR}'ing two IP addresses, any bits that are the same will be zero. The subnet mask defines which bits belong to the network part and which should be ignored, so we can apply bitwise \texttt{AND} with the mask to the result. After these two operations, the result will be all zero bits only if the two IP addresses are on the same subnet.
+
+As an example, consider the two IP addresses \texttt{192.168.0.8} and \texttt{192.168.0.37}. Given a subnet mask of \texttt{255.255.255.192}, we can apply the \texttt{XOR} operation to the two numbers and get \texttt{0.0.0.45}. When we apply the \texttt{AND} operation with the mask to the result we get \texttt{0.0.0.0}, meaning that the two hosts are on the same subnet.
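+
+The same-subnet test just described is simple enough to express directly in C. The sketch below hard-codes the addresses from the example above as 32-bit integers; the function name is made up for the illustration.
+
+\begin{lstlisting}[language=C]
+#include <stdint.h>
+#include <stdio.h>
+
+/* Two addresses are on the same subnet when the bits covered by the
+   mask are identical: a XOR b has ones exactly where they differ,
+   and AND'ing with the mask discards the host part. */
+static int same_subnet(uint32_t a, uint32_t b, uint32_t mask) {
+  return ((a ^ b) & mask) == 0;
+}
+
+int main(void) {
+  uint32_t a    = (192u << 24) | (168u << 16) | (0u << 8) |   8u;   /* 192.168.0.8     */
+  uint32_t b    = (192u << 24) | (168u << 16) | (0u << 8) |  37u;   /* 192.168.0.37    */
+  uint32_t mask = (255u << 24) | (255u << 16) | (255u << 8) | 192u; /* 255.255.255.192 */
+
+  printf("%s\n", same_subnet(a, b, mask) ? "same subnet" : "different subnets");
+  return 0;
+}
+\end{lstlisting}
+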
+
+As the subnet masks are always constructed with leading ones (i.e. it is not valid to use \texttt{255.0.255.0}), we can simply count the number of leading one-bits in the mask. This gives the classless interdomain routing, CIDR, notation, where we supply the IP address and the number of network bits in a compact notation: \texttt{192.168.0.0/26}. This notation is equivalent to supplying an IP address and a subnet mask, as we can convert trivially between the two.
+
+\subsection{Connecting networks: Router}
+If we consider a package sent from a home user in Europe to a server in the USA, the package will initially be sent to the \emph{uplink} port of each router, until it reaches a router that can cross the Atlantic ocean. After it has crossed the Atlantic ocean, it will reach more and more specific IP ranges until it arrives at the destination. This system allows each router to know only general directions, simplified as "one port for everything far away, multiple ports for destinations closer by". As mentioned in the video linked below, routing is often done with "longest prefix matching", where multiple rules are stored as a tuple of: \texttt{(destination port, IP pattern in CIDR)}. The router can then sort the rules by the number of fixed bits (i.e. the \texttt{/x} part of the CIDR address), and use the first forwarding rule where the bits match. This approach allows the router to store broad "general rules" and then selectively change smaller ranges. As the only operations required are \texttt{XOR} + \texttt{AND}, the matching can be performed efficiently in hardware.
+
+\subsection{A router is a host}
+The router device itself works on the network layer, such that it can read the IP address. It can be considered a specialized computer; indeed, the first routers were simply ordinary computers with multiple network cards. Unlike the switch and hub, a router is visible on the network and is addressable with its own IP addresses.
+
+To function correctly, a router needs to know what ranges its ports are connected to, which requires manual configuration. As the network can also change, due to links being created and removed, as well as traffic changes and fluctuating transfer costs, the routers need to be dynamically updated.
+
+\subsection{Updating routing tables}
+The dynamic updates are handled inside each owner's network, and across the networks using various protocols, constantly measuring capabilities, traffic load, and costs. One of the protocols for communicating routes between network owners is called the Border Gateway Protocol and is unfortunately not secure yet. Bad actors can incorrectly advertise short routes, which causes the networks to start sending all traffic over a particular link, either for disruption or eavesdropping purposes. Occasionally this also happens due to human error, leaving parts of the internet unreachable for periods of time.
+
+For an overview of the layers until now, and the different components that connect them, there is an animated video here: \url{https://www.youtube.com/watch?v=1z0ULvg_pW8}.
+
+\section{Transport layer}
+With a network layer that is capable of transmitting a package from one host to another, the transport layer provides one protocol for doing just that: the User Datagram Protocol, UDP.
+
+\subsection{Adding ports}
+A datagram in UDP adds only a single feature to the service provided by the network layer: ports. To allow multiple processes on a given host to use the network, UDP adds a 16-bit port number. When describing an address on the transport layer, the port number is often added to the IP address or hostname after a colon: \texttt{192.168.0.1:456}. The operating system kernel will use the IP address and port number to deliver packages to a particular process. However, since UDP uses the network layer, there is no acknowledgement of receipt and no signal if a package is lost. Because the routers update dynamically, it is also possible for packages to reach the destination via different routes, causing packages to arrive in a different order than they were sent.
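+
+The UDP header itself is tiny: besides the two port numbers it only carries a length and a checksum. The sketch below shows the layout as a C struct; the field names are illustrative, and a real implementation must also handle network byte order.
+
+\begin{lstlisting}[language=C]
+#include <stdint.h>
+
+/* The four 16-bit fields of a UDP header (8 bytes in total). */
+struct udp_header {
+  uint16_t src_port;  /* the sending process                    */
+  uint16_t dst_port;  /* the receiving process, e.g. 53 for DNS */
+  uint16_t length;    /* header plus payload, in bytes          */
+  uint16_t checksum;  /* simple error detection                 */
+};
+\end{lstlisting}
+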
+
+UDP can be acceptable for some cases, for instance game updates or video streams, where we would rather lose a frame than have a stuttering video. But for many other cases, such as transferring a file or dataset, the UDP service is not adequate.
+
+\subsection{Transmission control protocol: TCP}
+The Transmission Control Protocol, TCP, is the most widely used protocol and is most often used with IP to form TCP/IP. The TCP protocol builds a reliable transfer stream on top of the unreliable delivery provided by the network layer. The protocol itself is very robust, with understandable mechanics, but it can require some thought to accept that it works in all cases.
+
+\subsection{Establishing a connection}
+Since the network layer does not tell us if a package has been delivered, the TCP protocol uses \emph{acknowledgements}, ACKs, to report receipt of a package. Before a connection is established, the client sends a special SYN message, awaits an ACK from the server, and then sends an ACK back. This exchange happens before any actual data is transmitted but allows for "piggybacking" data on the last ACK message. When designing an application, it is important to know that this adds an overhead for each established connection, and thus connections should preferably be reused.
+
+\subsection{A simple stop-and-go TCP-like protocol}
+Once the connection is established, data can flow, but due to the network layer's unreliability, we can receive packages out of order or lose them. For a simple protocol, we can just drop out-of-order packages, treating them as lost. The remaining problem has two cases: loss of a package and loss of an ACK. Since the sender cannot know which of these has occurred, it assumes the data was lost, and re-transmits it after a timeout has occurred while waiting for an ACK. If we hit a timeout prematurely, we will re-send a package that was already received, but this is the same situation that a lost ACK produces.
+
+The recipient can simply discard a package it has already received, so if we add a package number, called a sequence number\footnote{In the text and the video, we use a number per package. In the real TCP protocol, the sequence number counts bytes to support split packages.}, to the package, it is trivial for the recipient to know which packages are new and which are retransmits. It should be fairly simple to convince oneself that this works in all cases, in the sense that the recipient will eventually get the package, and the sender will eventually get an ACK.
+
+\subsection{Improving bandwidth with latency hiding}
+However, due to the communication delay, we spend some of the time waiting instead of communicating. For even moderate communication delays, this results in poor utilization. To work around this, the TCP protocol allows multiple packages to be "in-flight", so the communication can occur at full bandwidth. This requires the sender to keep track of which packages have gotten an ACK and which are still pending. We also need to keep copies of multiple packages, such that we can retransmit them if required. This does not change the way the simple stop-and-go protocol works; it simply adds a counter on the sender side. The choice of how many packages to keep in flight is made with a ramp-up process, increasing until no further improvements are seen, which further adds to the delay of new connections.
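+
+To make the stop-and-go scheme concrete, here is a minimal sketch of the sender side with a single package in flight. The primitives \texttt{send\_pkg}, \texttt{recv\_ack} and \texttt{TIMEOUT\_MS} are hypothetical placeholders, not a real networking API, and real TCP differs in many details.
+
+\begin{lstlisting}[language=C]
+#define TIMEOUT_MS 200                       /* hypothetical timeout          */
+
+void send_pkg(int seq, const char *payload); /* may be lost on the way        */
+int  recv_ack(int timeout_ms);               /* returns seq, or -1 on timeout */
+
+/* Send n packages reliably: transmit, wait for the matching ACK,
+   and retransmit whenever the timer runs out first. */
+void send_reliably(const char *data[], int n) {
+  for (int seq = 0; seq < n; seq++) {
+    int acked = 0;
+    while (!acked) {
+      send_pkg(seq, data[seq]);
+      int ack = recv_ack(TIMEOUT_MS);
+      if (ack == seq) {
+        acked = 1;      /* the receiver definitely has package seq */
+      }
+      /* Otherwise either the package or its ACK was lost -- we cannot
+         tell which -- so we simply send the same package again.  The
+         receiver uses the sequence number to discard duplicates. */
+    }
+  }
+}
+\end{lstlisting}
+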
+
+\subsection{Minimizing retransmission}
+We could stop there, and be happy that it works, but if packages arrive out of order we cannot send the ACK, and many in-flight packages need to be retransmitted. This is solved in TCP by keeping a receive buffer, allowing packages to arrive out of order. This does not change the protocol, except that the recipient should only send an ACK when there are no "holes" in the sequence of received packages. This is simplified a bit in TCP, where an ACK is interpreted as vouching for all data up until the sequence number. This means that lost ACK messages usually cause no disturbance, as a new one arrives shortly afterwards. But it also makes it easier for the recipient to just send an ACK for the full hole-free sequence.
+
+If we lose a package, the recipient will notice that it keeps getting packages with higher sequence numbers. Since it can only ACK the last one received in order, it will keep doing so. The sender can then notice getting multiple ACKs (specifically: 3) for the same package, and guess that it needs to re-transmit the next package. This improvement makes it possible to transmit at full speed even with some package loss.
+
+Hopefully you can convince yourself that the protocol works correctly in all cases, despite the performance enhancements.
+
+\subsection{Flow control}
+When designing and evaluating network performance, it is important to know the two complementary mechanisms that both end up throttling the sending of packages. The original throttling mechanism is called flow control and works by having the recipient include the remaining space in its receive buffer with each ACK message. The sender can monitor this value and reduce the sending rate if it notices that the remaining space is decreasing. Once the space is zero, the sender will wait for a timeout or an ACK with a non-zero receive buffer size. The timeout is a protection against the case where the ACK is lost.
+
+\subsection{Congestion control}
+The other mechanism is congestion control, which is a built-in protection against overflowing the network itself. While no single machine can hope to overwhelm the core routers in the network, many hosts working together can. What happened in the early days of the internet was that some routers were overwhelmed and started dropping packages. The hosts were using TCP and responded by retransmitting the lost packages, causing a build-up of lost packages, to the point where no connection was working. The TCP protocol is implemented in software, so the TCP implementations were gradually updated with congestion control additions. Unlike flow control, there is no simple way of reporting the current load of all routers in the path. Instead, congestion control monitors the responses from the other end, and makes a guess about the state of the network. Various implementations use different metrics, where the loss of packages was once seen as the right indicator. However, since package loss occurs \emph{after} the congestion issues have started, this was later changed to measuring the time between ACK messages. Once the routers start to get overloaded, the response time increases, and modern TCP implementations use this to throttle the sending speed.
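+
+A classic way for the sender to adjust how many packages it keeps in flight is \emph{additive increase, multiplicative decrease} (AIMD): grow the window slowly while everything is acknowledged, and cut it sharply at any sign of congestion. The sketch below shows only the shape of the idea; it is not the algorithm of any particular TCP implementation.
+
+\begin{lstlisting}[language=C]
+/* Number of packages the sender allows in flight at once. */
+static double window = 1.0;
+
+/* Called for every ACK: adds roughly one package per round trip. */
+void on_ack_received(void) {
+  window += 1.0 / window;
+}
+
+/* Called on a congestion signal (package loss or rising delay). */
+void on_congestion_signal(void) {
+  window /= 2.0;
+  if (window < 1.0)
+    window = 1.0;
+}
+\end{lstlisting}
+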
+
+\subsection{Closing a connection}
+The layers on top of TCP can assume that packages have arrived and are delivered in order. But how do we know if we have received all packages, and not lost the last one? The TCP implementation handles this by sending a package with the FIN bit set. Like other packages, the FIN package can be lost, so we need to get an ACK as well. Again, the ACK package may be lost, so we need another package, and so on. TCP has a pragmatic solution, where the side that wishes to close the connection will send a FIN, wait for an ACK and then send a final ACK. The final ACK \emph{may} be lost, in which case the other side cannot know whether it was received. If this happens, TCP will keep the connection in a "linger" state for a period before using a timeout and closing the connection. Even if this happens, both sides can be certain that all messages have been exchanged.
+
+\subsection{Network Address Translation}
+In the original vision of the internet, all hosts were publicly addressable with their own IP address. Later this turned out to be a bad idea, both for security reasons and for economic reasons. Each ISP has to purchase ranges of IP addresses and assign these to their customers. But some customers may have several devices, increasing the cost. Likewise, some customers and companies may want to have an internal network with printers and servers that is not exposed to the internet, while at the same time being able to access the internet.
+
+The technique that is employed relies on the router being a computer itself, with an internal and an external IP address. When a host sends a package destined for the external network, the router will pick a random unused port number and forward the package, using the router's external IP address and the randomly chosen port. This information is stored in a table inside the router, such that when a response package arrives that is destined for that particular port on the external IP, the router will forward it to the internal network, using the original IP and port.
+
+This operation is transparent to the hosts inside and outside the network, allowing ISP customers to have multiple internal hosts sharing a single external IP address.
+
+When compared to the original vision for the internet, the NAT approach breaks with the assumption that each host needs a unique IP. In practice this means that a NAT'ed host can only initiate connections; it cannot receive new connections (i.e., it cannot be the server, only the client).
+
+In some cases, it might be desirable to use NAT, but still have a host act as a server. For this situation, most NAT-capable routers allow pre-loading of static rules, for instance "external IP, port 80" should go to "10.0.0.4:1234". This is the same operation that happens automatically for outbound connections, just always active.
+
+Although NAT is not strictly a security feature, it does provide some protection against insecurely configured machines being directly accessible from the internet. That is, unless your NAT-capable router has Universal Plug-n-Play, UPNP, which allows any program on your machine to request the insertion of a preloaded NAT rule, exposing everything from printers to webcams without the owner's knowledge.
+
+\subsection{IPv6}
+In the above we have only considered IPv4, where the addresses are 32-bit. Despite the vision of giving every host its own IP address, the number of hosts in the world has long exceeded the number of available IPv4 addresses. Thanks mostly to NAT techniques, and a similar concept for ISPs called carrier-grade NAT, this has not yet stopped the growth of the internet.
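+
+To make the translation table described in the NAT subsection above concrete, here is a minimal sketch of what the router stores per connection. The structure and names are invented for the illustration; real routers also track the protocol, timeouts, and more.
+
+\begin{lstlisting}[language=C]
+#include <stdint.h>
+
+/* One NAT mapping: outbound traffic from (internal IP, internal port)
+   is rewritten to use the router's external IP and external_port, and
+   replies arriving on external_port are rewritten back again. */
+struct nat_entry {
+  uint32_t internal_ip;
+  uint16_t internal_port;
+  uint16_t external_port;   /* chosen by the router, previously unused */
+};
+
+#define MAX_ENTRIES 4096
+static struct nat_entry table[MAX_ENTRIES];
+static int entries = 0;
+
+/* Find where an inbound package for a given external port should go.
+   Returns 0 if no mapping exists -- such packages are dropped, which
+   is why a NAT'ed host cannot act as a server without a static rule. */
+struct nat_entry *lookup_inbound(uint16_t external_port) {
+  for (int i = 0; i < entries; i++)
+    if (table[i].external_port == external_port)
+      return &table[i];
+  return 0;
+}
+\end{lstlisting}
+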
+
+Before it became apparent that NAT would ease some of the growing pains, the IPv6 standard was accepted and ratified. With IPv6 there are 128 bits for an address, essentially allowing so many hosts that it is unnecessary to share addresses with NAT anywhere.
+
+Where the transport layer is implemented in software and easily updated, the network layer is embedded in devices with special-purpose chips. Changing these is costly, and the transition has so far dragged out for more than a decade.
+
+The number of IPv6-enabled hosts and routers continues to increase, but there are still many devices with a physical chip that cannot be upgraded to IPv6. Many of these are IoT devices, attached to an expensive TV, surveillance camera or refrigerator. As the devices work fine for their owners, there is virtually no incentive to replace them, leaving us with a hybrid IPv4 and IPv6 network for some time to come.
+
+New internet services are likely to keep using IPv4 addresses, as there are plenty of ISPs that only offer IPv4. If a company only has IPv6 servers, it would be inaccessible to a number of potential customers.
+
+Currently the most promising upgrade path seems to be, once again, relying on NAT to perform transparent IPv4 to IPv6 translations. This could allow the internet as a whole to switch to IPv6 while also remaining accessible to IPv4 users.
+
+\section{Application layers}
+With the transport layer providing in-order delivery guarantees for communication between two processes, it opens up for a multitude of applications. The most popular is the use of HTTP for serving web pages, but there are many other services, including the network time protocol, which keeps your computer clock running accurately even though it has low precision.
+
+\subsection{Domain Name System}
+One application that runs on top of the transport layer is the domain name system, DNS. It is essentially a global key-value store for organizing values belonging to a given domain name. In the simplest case it is responsible for mapping a name, such as \texttt{google.com}, to an IP address. This makes it easier for humans to remember, but also allows the owner of the domain name to change which host IPs are returned, without needing to contact the users.
+
+The design of the system is based on a hierarchical structure, where 13 logical servers replicate the root information. In practice these 13 servers are implemented on over 700 machines, geographically distributed over the entire globe.
+
+The root nodes store the IP addresses of the top-level domain servers, where a top-level domain could be \texttt{.com}. A top-level domain server, naturally replicated on multiple hosts, keeps a list of the name servers for each domain ending with the top-level domain (i.e. the top-level server for \texttt{.com} keeps track of \texttt{twitter.com}, \texttt{google.com}, etc).
+
+This means that each domain must run its own nameserver, but in practice, most nameservers are run by a set of registrars, where a small number of machines are responsible for thousands of domains.
+
+The domain name server is the final step, containing all information related to a given domain name. This server can be queried to obtain IP addresses for all subdomains (e.g., \texttt{www.google.com}, \texttt{docs.google.com}) as well as the domain itself (\texttt{google.com}).
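+
+In a C program the whole lookup chain is hidden behind a single library call. The sketch below uses the standard POSIX function \texttt{getaddrinfo} to resolve a name; which addresses are printed depends on the DNS records of the domain at the time the program is run.
+
+\begin{lstlisting}[language=C]
+#include <stdio.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+
+int main(void) {
+  struct addrinfo hints, *res, *p;
+  memset(&hints, 0, sizeof hints);
+  hints.ai_family = AF_UNSPEC;      /* ask for both IPv4 and IPv6 */
+  hints.ai_socktype = SOCK_STREAM;
+
+  /* Asks the resolver -- and thereby the DNS hierarchy -- for the
+     addresses currently registered for the name. */
+  if (getaddrinfo("google.com", NULL, &hints, &res) != 0) {
+    fprintf(stderr, "lookup failed\n");
+    return 1;
+  }
+
+  for (p = res; p != NULL; p = p->ai_next) {
+    char buf[INET6_ADDRSTRLEN];
+    const void *addr;
+    if (p->ai_family == AF_INET)
+      addr = &((struct sockaddr_in *) p->ai_addr)->sin_addr;
+    else
+      addr = &((struct sockaddr_in6 *) p->ai_addr)->sin6_addr;
+    printf("%s\n", inet_ntop(p->ai_family, addr, buf, sizeof buf));
+  }
+  freeaddrinfo(res);
+  return 0;
+}
+\end{lstlisting}
+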
+ +Apart from the IP addresses for hosts, the DNS records contain email servers, known as \texttt{MX} records, free text, called \texttt{TXT} records, and other domain related information. + +To keep the load on DNS servers down, each host in the DNS system will cache the values it receives for a period. The exact period is also stored in the DNS records with a time-to-live, \texttt{TTL}, value expressed in seconds. diff --git a/notes.bib b/notes.bib new file mode 100644 index 0000000..e8cd520 --- /dev/null +++ b/notes.bib @@ -0,0 +1,64 @@ +@inproceedings{CIVan, + author = {Oancea, Cosmin E. and Rauchwerger, Lawrence}, + title = {Scalable Conditional Induction Variables (CIV) Analysis}, + booktitle = {Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization}, + series = {CGO '15}, + year = {2015}, + isbn = {978-1-4799-8161-8}, + location = {San Francisco, California}, + pages = {213--224}, + numpages = {12}, + url = {http://dl.acm.org/citation.cfm?id=2738600.2738627}, + acmid = {2738627}, + publisher = {IEEE Computer Society}, + address = {Washington, DC, USA}, +} + +@article{kennedy2001optimizing, + title={Optimizing compilers for modern architectures: a dependence-based approach}, + author={Kennedy, Ken and Allen, John R}, + year={2001}, + publisher={Morgan Kaufmann Publishers Inc.} +} + +@inproceedings{ExtRed, + author = {B. Lu and J. Mellor-Crummey.}, + title = {Compiler {O}ptimization of {I}mplicit {R}eductions for {D}istributed {M}emory {M}ultiprocessors}, + booktitle = {Int. Par. Proc. Symp. (IPPS)}, + year = {1998} +} + +@article{10.1145/42411.42415, +author = {Gustafson, John L.}, +title = {Reevaluating {A}mdahl's Law}, +year = {1988}, +issue_date = {May 1988}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {31}, +number = {5}, +issn = {0001-0782}, +url = {https://doi.org/10.1145/42411.42415}, +doi = {10.1145/42411.42415}, +journal = {Commun. ACM}, +month = may, +pages = {532–533}, +numpages = {2} +} + +@inproceedings{10.1145/1465482.1465560, +author = {Amdahl, Gene M.}, +title = {Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities}, +year = {1967}, +isbn = {9781450378956}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/1465482.1465560}, +doi = {10.1145/1465482.1465560}, +abstract = {For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. 
Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.}, +booktitle = {Proceedings of the April 18-20, 1967, Spring Joint Computer Conference}, +pages = {483–485}, +numpages = {3}, +location = {Atlantic City, New Jersey}, +series = {AFIPS '67 (Spring)} +} diff --git a/notes.tex b/notes.tex new file mode 100644 index 0000000..861341c --- /dev/null +++ b/notes.tex @@ -0,0 +1,72 @@ +\documentclass[oneside]{memoir} + +\setsecnumdepth{subsection} + +\usepackage[utf8]{inputenc} +\usepackage{ntheorem} +\usepackage{amsmath}[ntheorem] +\usepackage{amsfonts} +\usepackage{caption} +\usepackage{subcaption} +\usepackage{graphicx} +\usepackage{todonotes} +\usepackage{url} +\usepackage{hyperref} + +\usepackage{cleveref} + +\usepackage{couriers} +\usepackage{listings} +\lstset{basicstyle=\ttfamily} + +% Teach cleveref about listings. +\crefname{lstlisting}{listing}{listings} +\Crefname{lstlisting}{Listing}{Listings} + +\title{Course notes for High Performance Programming and Systems} +\author{Troels Henriksen, David Gray Marchant, Kenneth Skovhede} \date{\today} + +\newtheorem{definition}{Definition} +\newtheorem{corollary}{Corollary} +\newtheorem{theorem}{Theorem} + +\begin{document} + +\maketitle + +\section{Introduction} + +These notes are written to supplement the textbooks in the course +\textit{High Performance Programming and Systems}. Consider them +terminally in-progress. These notes are not a textbook, do not cover +the entire curriculum, and might not be comprehensible if isolated +from the course and its other teaching activities. + +\newpage +\tableofcontents + +\input{compiled_interpreted.tex} + +\input{data_layout.tex} + +\input{network.tex} + +\input{openmp.tex} + +\input{scaling.tex} + +\input{dependencies.tex} + +\input{transformations.tex} + +\newpage + +\bibliographystyle{plain} +\bibliography{notes} + +\end{document} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% End: diff --git a/openmp.tex b/openmp.tex new file mode 100644 index 0000000..077615f --- /dev/null +++ b/openmp.tex @@ -0,0 +1,482 @@ +\lstset{language=C} + +\chapter{OpenMP} +\label{chap:openmp} + +Writing multi-threaded programs using the raw POSIX threads API is +tedious and error-prone. It is also not portable, as POSIX threads is +specific to Unix systems. Also, writing efficient multi-threaded code +is difficult, as thread creation is relatively expensive, so we should +ideally write our programs to have a fixed number of \emph{worker + threads} that are kept running in the background, and periodically +assigned work by a scheduler. In many cases, particularly within +scientific computing, we do not need the flexibility and low-level +control of POSIX threads. We mostly wish to parallelise loops with +iterations that are \emph{independent}, meaning that they can be +executed in any order without changing the result. For such programs +we can use \emph{OpenMP}. We will use only a small subset of OpenMP +in this course. + +\section{Basic use of OpenMP} + +OpenMP is an extension to the C programming language\footnote{OpenMP + is not C specific, and is also in wide use for Fortran.} that allows +the programmer to insert high-level \emph{directives} that indicate +when and how loops should be executed in parallel. The compiler and +runtime system then takes care of low-level thread management. 
OpenMP +uses the the \emph{fork-join} model of parallel execution: a program +starts with a single thread, the \emph{master thread}, which runs +sequentially until the first \emph{parallel region} (such as a +parallel loop) is encountered. At this point, the master thread +creates a group of threads (``fork''\footnote{Note an unfortunate + mix-up of nomenclature: this has nothing to do with the Unix notion + of \texttt{fork()}, which creates \emph{processes}, not + \emph{threads}}), which execute the loop. The master thread waits +for all of them to finish (``join''), and then continues on +sequentially. This is called an \textit{implicit barrier}: a point +where execution pauses until all threads reach it. For example: + +\begin{lstlisting}[mathescape=true] +#pragma omp parallel for +for (int i=0; i +#include + +int collatz(int n) { + int i = 0; + while (n != 1) { + if (n % 2 == 0) { + n = n / 2; + } else { + n = 3 * n + 1; + } + i++; + } + return i; +} + +int main(int argc, char** argv) { + int k = atoi(argv[1]); + for (int n = 1; n < k; n++) { + printf("%d %d\n", n, collatz(n)); + } +} diff --git a/src/openmp-dotprod.c b/src/openmp-dotprod.c new file mode 100644 index 0000000..e8a7671 --- /dev/null +++ b/src/openmp-dotprod.c @@ -0,0 +1,34 @@ +double dotprod(int n, double *x, double *y) { + double sum = 0; +#pragma omp parallel for reduction(+:sum) + for (int i = 0; i < n; i++) { + sum += x[i] * y[i]; + } + return sum; +} + +#include "timing.h" +#include +#include + +int main(int argc, char** argv) { + assert(argc==2); + int n = atoi(argv[1]); + double *x = malloc(n*sizeof(double)); + double *y = malloc(n*sizeof(double)); + + for (int i = 0; i < n; i++) { + x[i] = (double)i/n; + y[i] = (double)i*2/n; + } + + double bef = seconds(); + double result = dotprod(n, x, y); + double aft = seconds(); + + free(x); + free(y); + + printf("Result: %f\n", result); + printf("Time (s): %f\n", aft-bef); +} diff --git a/src/openmp-example.c b/src/openmp-example.c new file mode 100644 index 0000000..ebf1d54 --- /dev/null +++ b/src/openmp-example.c @@ -0,0 +1,15 @@ +#include +#include + +int main(void) { + int n = 100000000; + + int *arr = malloc(n*sizeof(int)); + +#pragma omp parallel for + for (int i = 0; i < n; i++) { + arr[i] = i; + } + + free(arr); +} diff --git a/src/openmp-matadd.c b/src/openmp-matadd.c new file mode 100644 index 0000000..2f7fdd6 --- /dev/null +++ b/src/openmp-matadd.c @@ -0,0 +1,13 @@ +void matadd(int n, int m, + const double *x, const double *y, + double *out) { +#pragma omp parallel for collapse(2) + for (int i = 0; i < n; i++) { + for (int j = 0; j < m; j++) { + out[i*n+j] = x[i*n+j] + y[i*n+j]; + } + } +} + +int main() { +} diff --git a/src/openmp-schedule.c b/src/openmp-schedule.c new file mode 100644 index 0000000..615b5e8 --- /dev/null +++ b/src/openmp-schedule.c @@ -0,0 +1,35 @@ +#include "timing.h" +#include + +int fib(int n) { + if (n <= 1) { + return 1; + } else { + return fib(n-1) + fib(n-2); + } +} + +int main() { + double bef, aft; + + int n = 45; + int *fibs = malloc(n * sizeof(int)); + + bef = seconds(); +#pragma omp parallel for schedule(static) + for (int i = 0; i < n; i++) { + fibs[i] = fib(i); + } + aft = seconds(); + printf("Static scheduling: %fs\n", aft-bef); + + bef = seconds(); +#pragma omp parallel for schedule(dynamic) + for (int i = 0; i < n; i++) { + fibs[i] = fib(i); + } + aft = seconds(); + printf("Dynamic scheduling: %fs\n", aft-bef); + + free(fibs); +} diff --git a/src/sumrows.c b/src/sumrows.c new file mode 100644 index 0000000..d775b31 --- 
/dev/null +++ b/src/sumrows.c @@ -0,0 +1,10 @@ +void sumrows(int n, int m, + const double *matrix, double *vector) { + for (int i = 0; i < n; i++) { + double sum = 0; + for (int j = 0; j < m; j++) { + sum += matrix[i*m+j]; + } + vector[i] = sum; + } +} diff --git a/tombstone.tex b/tombstone.tex new file mode 100644 index 0000000..d8b2dee --- /dev/null +++ b/tombstone.tex @@ -0,0 +1,76 @@ +% Improved version of +% http://blog.sterex.de/2013/04/latex-macro-for-tombstone-diagrams/ +\def\tcompiler(from #1 to #2 in #3){% + \begin{picture}(100,50) + + \put (0,50) {\line(1,0) {100}} + \put (0,25) {\line(0,1) {25}} + \put (100,50) {\line(0,-1) {25}} + + % __ __ T bottom elements + \put (100,25) {\line(-1,0) {35}} + \put (0, 25) {\line(1,0) {35}} + + % | | left and right + \put (35,25) {\line(0,-1) {25}} + \put (65,25) {\line(0,-1) {25}} + + % |__| bottom + \put (35,0) {\line(1,0) {30}} + + % From -> To + \put(10,37){\makebox(0,0) {#1}} + \put(50,37){\makebox(0,0) {$\rightarrow$}} + \put(80,37){\makebox(0,0) {#2}} + + % In + \put(50,10){\makebox(0,0) {#3}} + \end{picture} +} + +\def\tmachine(#1){% + \begin{picture}(30,25) + \put (0,25) {\line(1,0) {30}} + \put (0,25) {\line(2,-3) {15}} + \put (30,25) {\line(-2,-3) {15}} + + \put (15,20) {\makebox(0,0) {#1}} + \end{picture} +} + +\def\tprog(#1 in #2){% + \begin{picture}(50,45) + % Upper trapezoid + \put (10,25) {\line(-1,2) {10}} + \put (40,25) {\line(1,2) {10}} + \put (0,45) {\line(1,0) {50}} + \put (25,35) {\makebox(0,0) {#1}} + + % Lower box + \put (10,0) {\line(1,0) {30}} + \put (10,0) {\line(0,1) {25}} + \put (40,0) {\line(0,1) {25}} + \put (25,13) {\makebox(0,0) {#2}} + \end{picture} +} + +\def\tinter(#1 in #2){% + \begin{picture}(30,50) + % Upper box + \put (0,25) {\line(0,1) {25}} + \put (0,50) {\line(1,0) {30}} + \put (30,25) {\line(0,1) {25}} + \put (15,38) {\makebox(0,0) {#1}} + + % Lower box + \put (0,0) {\line(1,0) {30}} + \put (0,0) {\line(0,1) {25}} + \put (30,0) {\line(0,1) {25}} + \put (15,13) {\makebox(0,0) {#2}} + \end{picture} +} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: "notes" +%%% End: diff --git a/transformations.tex b/transformations.tex new file mode 100644 index 0000000..07edd0d --- /dev/null +++ b/transformations.tex @@ -0,0 +1,658 @@ +\lstset{language=C} + +\chapter{Loop Transformations} +\label{chap:loop-transformations} + +\begin{quote} + \emph{This chapter is adapted with permission from the PMPH Lecture + Notes written by Cosmin Oancea.} +\end{quote} + +This chapter is organised as follows: +\begin{itemize} +\item \Cref{sec:loop-interch} presents a simple theorem that gives + necessary conditions for the safety of the transformation that + interchanges two perfectly nested loops. +\item \Cref{sec:loop-distrib} discusses the legality and + the manner in which a loop can be distributed across the + statements in its body. +\item \Cref{sec:false-dep-elim} discusses techniques for + eliminating cross-iteration write-after-read and + write-after-write dependencies. +\item \Cref{sec:strip-tiling} introducing a simple + transformation, named stripmining, which is always valid, and + shows how block and register tiling can be derived as a + combination of stripmining, loop interchange and loop + distribution. +\end{itemize} + +\section{Loop interchange: legality and applications} +\label{sec:loop-interch} + +Direction vectors are not used only for proving the parallel nature of +loops, but can also enable powerful code restructuring techniques. 
For +example they can be straightforwardly applied to determine whether it +is safe to interchange two loops in a perfect loop nest\footnote{ A + perfect loop nest is a nest in which any two loops at consecutive + depth levels are not separated by any other statements; for example + all loop nests in \cref{fig:data-dep-running-eg} are perfectly + nested. }---which may result in better locality and even in +changing an inner loop nature from dependent (sequential) to +independent (parallel). + +The following theorem gives a sufficient condition for the +legality of loop interchange---i.e., for the transformation +to result in code that is semantically equivalent to the original one. + +\begin{theorem}[Legality of Loop Interchange]\label{Loop-Interch} + A sufficient condition for the legality of interchanging two loops + at depth levels $k$ and $l$ in a perfect nest is that interchanging + columns $k$ and $l$ in the direction matrix of the loop nest + \emph{does not result} in a (leading) \texttt{>} direction as the + leftmost non-\texttt{=} direction of any row. +\end{theorem} + +The theorem above shows that the legality of loop interchange +can be determined solely by inspecting the result of permuting +the direction matrix in the same way as the one desired for loops. +For the rationale related to why a row-leading \texttt{>} direction +is illegal, we refer the reader to \cref{Leg-Dir-Vect}: a non-\texttt{=} +leading \texttt{>} direction would correspond to depending on something +that happens in the future: this currently seems impossible in our +universe, and as such it signals an illegal transformation. +% +The following corollary can be easily derived from \cref{Loop-Interch}: + +\begin{corollary}[Interchanging a parallel loop inwards]\label{Par-Loop-Interch} + In a perfect loop nest, it is always safe to interchange a + {\em parallel loop} inwards one step at a time (i.e., if the + parallel loop is the $k^{th}$ loop in the nest then one can + always interchange it with loop $k+1$, then with loop $k+2$, etc.). +\end{corollary} + +The corollary says that if we somehow know the parallel nature +of a loop, then we can safely interchange it in the immediate inward +position, without even having to build the dependence-direction matrix. + +Let us analyse the legality of loop interchange for the three loop +nests of our running example: + +\begin{description} +\item[\Cref{fig:data-dep-running-eg-a}:] The direction matrix is + \texttt{[=,=]} and, as such, it is legal to interchange the two + loops, because it would result in direction matrix + \texttt{[=,=]}. Moreover applying loop interchange in this case is + highly beneficial because it \emph{optimises locality of reference}: + the loop of index \texttt{i} appears in the innermost position after + the interchange, which optimally exploits spatial locality for the + write and read accesses to \texttt{A[j][i]}. + +\item[\Cref{fig:data-dep-running-eg-b}:] The direction matrices are + \[ + M + = \begin{cases}\texttt{[<,<]}\\\texttt{[=,<]}\end{cases} + \] + and + \[ + M^{intchg} + = \begin{cases}\texttt{[<,<]}\\\texttt{[<,=]}\end{cases} + \] + before and after interchange, respectively. It follows that the + loop interchange is legal---because $M^{intchg}$ satisfies + \cref{Loop-Interch}---and it also optimises spatial locality (as + before). 
What is interesting about this example is that after the + interchange, \emph{the innermost loop has become parallel}, by + \cref{Loop-Par}, because the outer loop caries all + dependencies---the direction column corresponding to the outer loop + consists only of \texttt{<} directions. + +\item[\Cref{fig:data-dep-running-eg-c}:] The direction matrix is + \texttt{[<,>]} and \emph{interchanging the two loops is illegal} + because the direction matrix obtained after the interchange + \texttt{[>,<]} starts with a \texttt{>} direction; this would mean + that the current iteration depends on a future iteration, which is + impossible, hence the interchange is illegal. +\end{description} + +\section{Loop distribution: legality and applications} +\label{sec:loop-distrib} + +This section introduces a transformation, named loop distribution, +where a loop is distributed across its statements. Potential benefits +are: +\begin{itemize} +\item Loop distribution provides the bases for performing + vectorisation: the innermost loop is distributed across its + statements, and then the distributed loops are chunked (stripmined, + \cref{sec:strip-tiling}) by a factor that permits utilisation of + processor's vector instructions. +\item Loop distribution may enhance the degree of parallelism that can + be statically mapped to the hardware. As discussed in + \cref{sec:openmp-nesting}, OpenMP \texttt{collapse} clauses only + apply to perfect loop nests. Distribution lets us split apart + complex loop nests to create perfect nests of parallel constructs, + which can then be parallelised efficiently with OpenMP. +\end{itemize} + +Loop distribution requires the construction of a dependency graph, +which is defined below. + +\begin{definition}[Dependency graph]\label{Dep-Graph} + A dependency graph of a loop is a directed graph in which + the nodes correspond to the statements of the loop nest and the + edges correspond to dependencies. An edge is directed (points) + from the source to the sink of the dependency, and is annotated + with the direction corresponding to that dependence. + + In the case when the loop contains another inner loop, then + the inner loop is represented as a single statement that conservatively + summarises the behavior of all the statements of the inner loop. +\end{definition} + +The dependency graph of a loop can be used to characterise its +parallel behavior: +\begin{theorem}[Dependency cycle]\label{Dep-Cycle} + A loop is parallel \emph{if and only if} its dependency graph does + not have cycles. +\end{theorem} +If the loop contains a cycle of dependencies, then it necessarily +exhibits at least a cross iteration dependency (needed to form the +cycle), and thus the loop is not parallel. The following theorem +specifies how the transformation can be implemented: + +\begin{theorem}[Loop distribution]\label{Loop-Distrib} + Distributing a loop across its statements can be performed + in the following way: + \begin{enumerate} + \item The dependency graph corresponding to the target loop + is constructed. + \item The graph is decomposed into strongly-connected components + (SCCs)\footnote{ + A graph is said to be strongly connected if every vertex + is reachable from every other vertex, i.e., a cycle. + It is possible to find the strongly-connected components + of an arbitrary directed graph in linear time $\Theta(V+E)$, + where $V$ is the number of vertices and $E$ is the number of + edges. + }, and a new graph $G'$ is formed in which the SCCs are nodes. 
+ \item The loop can be safely distributed across its + strongly-connected components, in the graph order of $G'$. + Assuming a number $k$ of SCCs, this means that the result of the + transformation will be $k$ loops, each containing the statements + of the corresponding SCC. Inside an SCC, the statements remain in + program order, but the distributed loops are ordered according to + $G'$. + \item Array expansion (\cref{sec:array-expansion}) must be performed + for the variables that + \begin{itemize} + \item are either declared inside the loop or overwritten + in each iteration (output dependencies), \textbf{\em and} + \item are used in at least two strongly-connected components. + \end{itemize} + \end{enumerate} +\end{theorem} + +The theorem above says that the statements that are in a dependency +cycle must remain in (form) one loop (which is sequential by +\cref{Dep-Cycle}). As such, the loop can be distributed across groups +of statements corresponding to the strongly connected components (SCC) +of the dependency graph. If the graph has only one SCC then it cannot +be distributed. The resulting distributed loops are written in the +order dictated by the graph of SCCs. +% +We demonstrate \cref{Loop-Distrib} on the simple code example presented +below: +\begin{lstlisting}[mathescape=true] +for (int i = 2; i < N; i++) { + $S_1:$ A[i] = B[i-2] ... + $S_2:$ B[i] = B[i-1] ... +} +\end{lstlisting} + +The code has two dependencies: +\begin{description} +\item[$S_2\to S_1$:] In order for a dependency on \texttt{B} to exist + the read from \texttt{B} in iteration $i_1$ of $S_1$ and + the write to \texttt{B} in iteration $i_2$ of $S_2$ must refer + to the same location. Hence \texttt{i$_1$-2 = $i_2$}, which means + \texttt{i$_1$ > i$_2$}, hence $S_2$ is the source, $S_1$ is the + sink and the direction vector is \texttt{[<]}; +\item[$S2\to S2$:] similarly, there is a dependency between the + read from \texttt{B} in $S_2$ and the write to \texttt{B} in $S_2$ + of direction vector \texttt{[<]}. +\end{description} + +The dependency graph is thus: +\begin{center} +\includegraphics[height=15ex]{img/LoopDistr.pdf} +\end{center} + +\noindent and it exhibits two strongly-connected components: +one formed by statement $S_2$ and one formed by statement $S_1$. +Loop distribution results in the following restructured code: +\begin{lstlisting}[mathescape=true] +for (int i = 2; i < N; i++) + $S_2:$ B[i] = B[i-1] ... +for (int i = 2; i < N; i++) + $S_1:$ A[i] = B[i-2] ... +\end{lstlisting} +in which, according to the graph order, the loop corresponding to +statement $S_2$ appears before the one corresponding to statement +$S_1$. Note that this does not match the program order of statements +$S_1$ and $S_2$ in the original program. Also note that the first loop +is \emph{not} parallel because the SCC consisting of $S_2$ has a +(dependency) cycle, but the second loop is parallel because the SCC +corresponding to $S_1$ does not have cycles. + +If a loop is parallel then it can be straightforwardly distributed +across its statements in program order because: +\begin{itemize} +\item by \cref{Dep-Cycle}, the loop dependency graph has no cycles and + thereby each statement is a strongly connected component; +\item the program order naturally respects all dependencies. +\end{itemize} + +\begin{corollary}[Parallel loop distribution]\label{Par-Loop-Distr} + A parallel loop can be directly distributed across each one of its + statements. 
The resulting loops appear in the same order in which
+  their corresponding statements appear in the original loop.
+\end{corollary}
+
+\subsection{Array expansion}
+\label{sec:array-expansion}
+
+Finally, it remains to demonstrate array expansion, mentioned
+in the fourth bullet of \cref{Loop-Distrib}.  Assume the slightly
+modified code:
+\begin{lstlisting}[mathescape=true]
+float tmp;
+for (int i = 2; i < N; i++) {
+  $S_1:$ tmp = 2 * B[i-2];
+  $S_2:$ A[i] = tmp;
+  $S_3:$ B[i] = tmp + B[i-1];
+}
+\end{lstlisting}
+Statements $S_1$ and $S_3$ are in a dependency cycle, because
+there is a dependency $S_3\to S_1$ with direction \texttt{<} caused
+by the write to and the read from array \texttt{B}, and a dependency
+$S_1\to S_3$ with direction \texttt{=} caused by \texttt{tmp}.
+Statement $S_2$ is not in a dependency cycle, but there is a
+dependency $S_1\to S_2$, and hence its distributed loop
+should follow the distributed loop containing $S_1$ and $S_3$.
+If we do not perform array expansion, the distributed code:
+
+\pagebreak
+\begin{lstlisting}[mathescape=true]
+float tmp;
+for (int i = 2; i < N; i++) {
+  $S_1:$ tmp = 2 * B[i-2];
+  $S_3:$ B[i] = tmp + B[i-1];
+}
+for (int i = 2; i < N; i++) {
+  $S_2:$ A[i] = tmp;
+}
+\end{lstlisting}
+\noindent does not respect the semantics of the original program
+because the second loop uses the same value of \texttt{tmp}---the one
+set by the last iteration of the first loop---while the original loop
+writes and then reads a different value of \texttt{tmp} for each
+iteration.  We fix this by performing array expansion for \texttt{tmp},
+which means that we must expand it with an array dimension equal to
+the loop count and replace its uses with corresponding indexing
+expressions of the expanded array.  This results in the following
+\emph{correct} code:
+
+\begin{lstlisting}[mathescape=true]
+float tmp[N];
+for (int i = 2; i < N; i++) {
+  $S_1:$ tmp[i] = 2 * B[i-2];
+  $S_3:$ B[i] = tmp[i] + B[i-1];
+}
+for (int i = 2; i < N; i++) {
+  $S_2:$ A[i] = tmp[i];
+}
+\end{lstlisting}
+
+In general, array expansion is easiest to apply when the loop is
+first \emph{normalised}---rewritten such that its index starts from
+$0$ and increases by $1$ each iteration---so that the loop index can
+be used directly to index the expanded array.  Since our loop index
+already increases by $1$, we have kept the original bounds and simply
+made \texttt{tmp} large enough to be indexed with \texttt{i} directly
+(its first two elements are unused); alternatively we could have
+normalised the loop and indexed \texttt{tmp} with \texttt{i-2}.
+
+\section{Eliminating false dependencies (WAR and WAW)}
+\label{sec:false-dep-elim}
+
+Anti and output dependencies are often referred to as \emph{false}
+dependencies because they can be eliminated in most cases by copying
+or privatisation operations:
+\begin{itemize}
+\item Cross-iteration anti dependencies (WAR) typically
+  correspond to a read from some original element of
+  the array---whose value was set before the start of
+  the loop execution---followed by an update to that
+  element in a later iteration.  As such, this dependency
+  can be eliminated by copying (in parallel) the target
+  array before the loop and rewriting the offending
+  read access inside the loop such that it refers
+  to the copy of the array.
+
+\item Cross-iteration output dependencies (WAW) can be
+  eliminated by a technique named privatisation (or renaming),
+  whenever it can be determined that {\em every read access}
+  from a scalar or array location {\em is covered by an
+  update} to that scalar or memory location that was
+  previously performed {\em in the same iteration}.
+ Semantically, privatisation moves the declaration of the + offending variable inside the loop, because it has been + already determined that the read/used value was produced + earlier in the same iteration. + +\item Reasoning based on direction vectors is limited to + relatively simple loop nests; for example it is difficult + to reason about privatisation by means of direction vectors. +\end {itemize} + +\subsection{Eliminating WAR dependencies by copying} + +Consider the simple C code below which rotates an array in the right +dimension by one: + +\begin{lstlisting}[mathescape=true] +float tmp = A[0]; +for (int i=0; i