\documentclass[main]{subfiles}
\begin{document}
This thesis is about metaprogramming techniques for parallel code generation.
It studies the boundary between compile-time parsing of
\glspl{dsel} and high performance code generation in \cpp.
The main motivation is to provide tools, libraries, and guidelines for embedding
mathematical languages in \cpp, in the hope of building a
cohesive set of tools that makes generic \gls{hpc} more accessible
to non-expert audiences. This goal becomes ever more important as new computing
architectures emerge. Developing high performance programs requires
tuned implementations, which can be achieved either by specializing
implementations for the target platforms, or by using libraries that provide
specialized abstractions at various levels and for various domains.
\section{
Towards more parallelism
}
Years after its introduction, Moore's law is still alive and well:
even though computing power itself no longer evolves as quickly,
transistor count is still doubling every three years.
While transistor count has kept increasing almost exponentially,
the computing throughput of single \gls{cpu} cores has not.
This led to the conclusion that the ``free lunch'' is over: future
high performance applications must adapt to increasingly parallel architectures
to increase compute throughput
\cite{concurrency-revolution, doi:10.1142/S0129626404001829}.
\begin{center}
\begin{tabular}{p{0.6\textwidth} p{0.3\textwidth}}
\textbf{Vector registers and instructions}\footnotemark{} allow single
instructions to be executed on several values at a time. This is known as
the \textbf{\gls{simd}} execution model. It is featured across all major
instruction set architectures:
\textbf{AVX} and \textbf{SSE} on x86, \textbf{NEON} and \textbf{SVE} on ARM,
\textbf{VSX} on PowerPC, and \textbf{RVV} on RISC-V.
&
\raisebox{-.925\height}{\includesvg[width=0.3\textwidth]{images/SIMD}}
\end{tabular}
\end{center}
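As a minimal sketch of \gls{simd} programming through intrinsics (assuming an
AVX-capable x86 \gls{cpu}), the following function adds eight pairs of floats
using a single vector instruction for each load, addition, and store:
\begin{verbatim}
#include <immintrin.h>

// A scalar version would need 8 separate additions; here each
// intrinsic maps to a single AVX instruction.
void add8(const float *a, const float *b, float *out) {
  __m256 va = _mm256_loadu_ps(a);    // load 8 floats from a
  __m256 vb = _mm256_loadu_ps(b);    // load 8 floats from b
  __m256 vc = _mm256_add_ps(va, vb); // 8 additions at once
  _mm256_storeu_ps(out, vc);         // store 8 results
}
\end{verbatim}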
\begin{center}
\begin{tabular}{p{0.3\textwidth} p{0.6\textwidth}}
\raisebox{-.925\height}{\includesvg[width=0.3\textwidth]{images/MISD}}
&
\textbf{Multi-core architectures} enable multiple \gls{cpu} cores to execute
different tasks simultaneously. These cores may or may not share the same
memory pool. If they do, the architecture is qualified as \textbf{\gls{uma}}.
Otherwise, it is qualified as \textbf{\gls{numa}}.
\end{tabular}
\end{center}
\footnotetext{Source of the graphs:
\url{https://en.wikipedia.org/wiki/Flynn's_taxonomy}}
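As a minimal multi-core sketch using standard \cpp threads, the following
function splits the iteration space of a loop into disjoint chunks, one per
hardware thread, so no synchronization is needed beyond joining the threads:
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void parallel_scale(std::vector<float> &data, float factor) {
  unsigned n = std::max(1u, std::thread::hardware_concurrency());
  std::size_t chunk = data.size() / n;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = (t + 1 == n) ? data.size() : begin + chunk;
    // Each thread scales its own chunk of the vector.
    workers.emplace_back([&data, factor, begin, end] {
      for (std::size_t i = begin; i < end; ++i) data[i] *= factor;
    });
  }
  for (auto &w : workers) w.join(); // wait for all threads
}
\end{verbatim}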
\textbf{\glspl{gpu}} can be used for general purpose
computing. They execute ``warps'' in parallel. Warps are groups of threads
that must stay synchronous at all times for their execution to be carried
out simultaneously. When the threads of a warp diverge, the divergent paths
are executed separately until the threads become synchronous again.
This is known as the \gls{simt} execution model \cite{nvidia-tesla}.
\begin{figure}[h]
\fontsize{8}{10}\selectfont
\includesvg[width=\textwidth]{images/nvidia-fermi}
\caption{NVIDIA GeForce 8800 architecture (Source: NVIDIA)}
\label{fig:nvidia-fermi}
\end{figure}
\glspl{gpu} can be programmed via many \glspl{api},
including but not limited to:
\begin{itemize}
\item
\textbf{CUDA}: NVIDIA's proprietary \cpp-based language and library
ecosystem, which is the leading solution for \gls{gpu} computing.
While NVIDIA provides its own proprietary compilers, the Clang/LLVM
compiler can also compile CUDA programs.
\item
\textbf{OpenCL}: a C-based open standard for heterogeneous platforms.
\item
\textbf{HIP}: AMD's own open-source solution. Its \gls{api} is meant to
resemble CUDA's and is implemented through an open-source fork of the
Clang/LLVM compiler.
\item
\textbf{SYCL}: a \cpp-based open standard for \gls{gpu} and
accelerator programming. It is mostly led by Intel, but it supports
all major \gls{gpu} brands. The main SYCL implementation is provided by
Intel oneAPI, which also features an open-source fork of Clang/LLVM.
\end{itemize}
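As an illustration, here is a minimal sketch following the SYCL 2020 \gls{api}
(as implemented by Intel oneAPI, for instance) that doubles every element of
an array on whichever device the default queue selects:
\begin{verbatim}
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q; // selects a device, e.g. a GPU if one is available
  float *data = sycl::malloc_shared<float>(1024, q);
  for (int i = 0; i < 1024; ++i) data[i] = float(i);
  // One logical thread per element; the runtime maps them onto the
  // device's execution units (warps on NVIDIA GPUs).
  q.parallel_for(sycl::range<1>{1024},
                 [=](sycl::id<1> i) { data[i] *= 2.f; })
      .wait();
  sycl::free(data, q);
}
\end{verbatim}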
Note that \glspl{gpu} are essentially highly vectorized processors.
However, vector instructions on \glspl{cpu} are often used directly through
intrinsics, whereas \gls{gpu} computing \glspl{api} offer much richer constructs
to write compute kernels.
\section{
The need for new abstractions
}
To adapt to this variety of hardware and \glspl{api}, new abstractions
were developed in the form of software libraries, compiler extensions,
and even programming languages to help build portable high performance
applications.
Many of these abstractions use \textbf{metaprogramming} to combine
high performance, ease of use, and portability.
Metaprogramming is the ability to treat code as an input or output of a program.
It therefore allows the generation of programs specialized for
specific hardware architectures from high level, declarative \glspl{api}.
In \cpp, metaprogramming is mostly done through \gls{tmp}, a set of
techniques that rely on \cpp templates to represent values as types and
perform arbitrary computations at compile-time.
Class templates can even be used to represent expression trees,
which can then be transformed into code. This kind of application is
particularly useful for implementing \cpp-based \glspl{dsel} that build
compile-time representations of math expressions and transform them into
optimized code. Without metaprogramming, these abstractions would require
compiler plugins or source-to-source compilers in the
compilation toolchain, making it more complex.
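As a minimal, simplified sketch of this technique (a real \gls{dsel} would
constrain its operators to expression types), the type of an arithmetic
expression can encode the whole expression tree at compile-time:
\begin{verbatim}
#include <cstddef>

// The type of `a + b * c` below is Add<Var, Mul<Var, Var>>: the
// expression tree is known to the library before any code runs.
struct Var { const double *data; };
template <class L, class R> struct Add { L l; R r; };
template <class L, class R> struct Mul { L l; R r; };

template <class L, class R> Add<L, R> operator+(L l, R r) { return {l, r}; }
template <class L, class R> Mul<L, R> operator*(L l, R r) { return {l, r}; }

// Tree traversal is resolved at compile time and inlined into a
// single fused operation per element.
inline double eval(Var v, std::size_t i) { return v.data[i]; }
template <class L, class R> double eval(Add<L, R> e, std::size_t i) {
  return eval(e.l, i) + eval(e.r, i);
}
template <class L, class R> double eval(Mul<L, R> e, std::size_t i) {
  return eval(e.l, i) * eval(e.r, i);
}
\end{verbatim}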
However, \gls{tmp} remains an esoteric practice, since types are by definition
not meant to be used as values. For this reason, \cpp is evolving to provide
first class support for metaprogramming: regular \cpp code is now partially
allowed to be executed at compile-time to produce results that can be consumed
by templates. Meanwhile, more and more \cpp libraries rely on
metaprogramming. As such, \cpp metaprogramming is becoming more mainstream,
and the amount of compile-time computing is increasing, along with
the complexity of \cpp metaprograms.
\\
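As a small illustration of these new capabilities, a constexpr function can
compute a lookup table at compile-time, and its results can be consumed as
compile-time constants:
\begin{verbatim}
#include <array>

// Regular C++ executed by the compiler: an ordinary loop fills a
// table, and the result is available as a compile-time constant.
constexpr std::array<int, 10> squares() {
  std::array<int, 10> table{};
  for (int i = 0; i < 10; ++i) table[i] = i * i;
  return table;
}

static_assert(squares()[3] == 9); // checked during compilation
\end{verbatim}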
% Exposing the objectives + outlining the plan
\section{
Thesis outline
}
This trend, as well as new \cpp compile-time capabilities, raises new questions:
how can we leverage compile-time \cpp code execution to improve the quality
and performance of \cpp metaprograms, especially \gls{dsel} implementations
for \acrlong{hpc} code generation? Can we ultimately use it to parse
\glspl{dsel} directly instead of reusing the \cpp syntax?
And as the impact of these metaprograms on compilation times becomes an issue,
how can we study their impact in a comprehensive and reproducible way?
\\
In the first part of the thesis, we will review the state of the art of
metaprogramming, both across many languages and in \cpp.
We will then see how it can be used to generate high performance linear algebra
kernels, and compare the performance of the generated kernels against
hand-written assembly kernels.
In the second part, we will focus on \textbf{ctbench}, a tool for the
scientific study of the compile-time performance of \cpp metaprograms.
It implements a new benchmarking method that enables the study of metaprogram
performance at scale, through the lens of Clang's built-in profiler.
We will then use it to study novel \cpp metaprogramming techniques based on
compile-time \cpp execution. The study will be conducted on two embedded
languages that are parsed and transformed into \cpp code within the \cpp
compilation process itself.
The first language is Brainfuck, which is provided with several code generation
backends to compare different code generation techniques. The second one
is a simple math language called Tiny Math Language, which demonstrates
the usability of compile-time \cpp execution to implement a complete
compilation toolkit for math formulas within the \cpp compilation process.
\end{document}