%
% Submit a single PDF file on the HotCRP PLDI16 Posters submission
% system with the following contents (at least 3, at most 5 pages):
%
% * An abstract of at most two pages (preferably just one full page)
% describing the research you want to present including how it is
% relevant to PLDI
%
% * A one page draft of your poster (shrink to fit in a single
%   8.5in x 11in page) in portrait layout
%
% * A letter of support (at most two pages, preferably one) from your
% academic adviser or PhD supervisor for including your poster in the
% student poster session at PLDI16
%
% Contact the PLDI16 Poster Chair with questions.
%
\input{preamble}
\usepackage{pdfpages}
\begin{document}
\title{Autotuning OpenCL Workgroup Sizes}
\author{Chris Cummins, University of Edinburgh}
\maketitle
The physical limitations of microprocessor design have forced the
industry towards increasingly heterogeneous designs to extract
performance, with increasing pressure to offload traditionally
CPU-based workloads to the GPU. This trend has not been matched by
adequate software tools: the popular languages OpenCL and CUDA provide
a very low-level programming model with little abstraction above the
hardware. Programming at this level requires expert knowledge of both
the domain and the target hardware, and achieving good performance
requires laborious hand-tuning of each program. This has led to a
growing disparity between the availability of parallelism in modern
hardware and the ability of application developers to exploit it.
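
To illustrate this low level of abstraction, the sketch below shows
the host-side boilerplate that raw OpenCL requires to launch even a
trivial kernel. The example is ours, for illustration only; note in
particular the hard-coded \texttt{local\_work\_size} (the OpenCL
workgroup size), which the programmer must choose by hand for each
program and device:

\begin{verbatim}
// Minimal raw OpenCL host program (illustrative sketch only).
// Error handling is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

const char *kSource =
    "__kernel void scale(__global float *a) {"
    "  a[get_global_id(0)] *= 2.0f;"
    "}";

int main() {
  // Boilerplate: platform, device, context, queue, program, kernel.
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
  cl_context context =
      clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);
  cl_program program =
      clCreateProgramWithSource(context, 1, &kSource, nullptr, nullptr);
  clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(program, "scale", nullptr);

  // Data transfer.
  std::vector<float> host(1024, 1.0f);
  cl_mem buf = clCreateBuffer(context,
                              CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              host.size() * sizeof(float), host.data(),
                              nullptr);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

  // The workgroup size must be chosen by hand; a poor choice can
  // cost a large fraction of the attainable performance.
  size_t global_work_size = host.size();
  size_t local_work_size = 64;  // hand-tuned per device and program
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_work_size,
                         &local_work_size, 0, nullptr, nullptr);
  clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, host.size() * sizeof(float),
                      host.data(), 0, nullptr, nullptr);
  printf("%f\n", host[0]);  // 2.0

  // Cleanup.
  clReleaseMemObject(buf);
  clReleaseKernel(kernel);
  clReleaseProgram(program);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
  return 0;
}
\end{verbatim}
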
The goal of this work is to bring the performance of hand-tuned
heterogeneous code to high-level programming by incorporating
autotuning into \textit{Algorithmic Skeletons}. Algorithmic Skeletons
simplify parallel programming by providing reusable, high-level
patterns of computation. However, writing performant skeleton
implementations is a difficult task; skeleton authors must attempt to
anticipate and tune for a wide range of architectures and use
cases. This results in implementations that target the general case
and cannot provide the performance advantages that are gained from
tuning low-level optimization parameters for individual programs and
architectures. Autotuning combined with machine learning offers
promising performance benefits by tailoring parameter values to
individual cases, but the high cost of training and the ad hoc nature
of autotuning tools limit the practicality of autotuning for
real-world programming. We believe that performing autotuning at the
level of the skeleton library can overcome these issues.
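
To make the skeleton idea concrete, the sketch below shows the shape
of such an interface: the user supplies only the per-element
computation, while the library owns the parallel implementation and,
with autotuning, the choice of low-level parameters. The interface is
hypothetical (it is not SkelCL's actual API), and the sequential
bodies stand in for OpenCL implementations:

\begin{verbatim}
// Hypothetical skeleton interface (illustrative; not SkelCL's API).
#include <cstddef>
#include <iostream>
#include <vector>

// A "map" skeleton: applies f to every element. A real implementation
// would JIT an OpenCL kernel and pick tuning parameters (e.g. the
// workgroup size) internally, invisibly to the caller.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T> &in, F f) {
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = f(in[i]);
  return out;
}

// A 1D stencil skeleton: each output element is a function of a
// neighbourhood of the input. Border elements clamp to the nearest
// valid neighbour.
template <typename T, typename F>
std::vector<T> stencil_skeleton(const std::vector<T> &in, F f) {
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const T left = in[i > 0 ? i - 1 : i];
    const T right = in[i + 1 < in.size() ? i + 1 : i];
    out[i] = f(left, in[i], right);
  }
  return out;
}

int main() {
  std::vector<float> v{1, 2, 3, 4};
  auto doubled = map_skeleton(v, [](float x) { return 2 * x; });
  auto blurred = stencil_skeleton(
      v, [](float l, float c, float r) { return (l + c + r) / 3; });
  std::cout << doubled[1] << " " << blurred[1] << "\n";  // 4 2
  return 0;
}
\end{verbatim}
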
In this work, we present \textit{OmniTune} --- an extensible and
distributed framework for autotuning the optimization parameters of
algorithmic skeletons at runtime. OmniTune enables a collaborative
approach to performance tuning, in which machine learning training
data is shared across a network of cooperating systems, amortizing the
cost of exploring the optimization space. We demonstrate the
practicality of OmniTune by autotuning the OpenCL workgroup size of
stencil skeletons in SkelCL, an Algorithmic Skeleton framework which
abstracts the complexities of OpenCL programming, exposing a set of
data-parallel skeletons for high-level heterogeneous programming in
C++. Selecting an appropriate OpenCL workgroup size is critical for
performance, and requires knowledge of the underlying hardware, the
data being operated on, and the program implementation. Our autotuning
approach employs a novel application of linear regressors to the
classification of workgroup sizes, extracting 102 features at runtime
describing the program, device, and dataset, and predicting optimal
workgroup sizes from training data collected using synthetically
generated stencil benchmarks.
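
One concrete reading of this regression-based classification is
sketched below. We assume, for illustration only, that a separate
linear model predicts the runtime of each candidate workgroup size
from the extracted feature vector; selecting a workgroup size then
reduces to picking the candidate with the lowest predicted
runtime. The feature count matches the 102 features above; the model
structure, names, and weights are hypothetical:

\begin{verbatim}
// Hypothetical sketch of workgroup-size selection via linear
// regression. Assumes one linear runtime model per candidate
// workgroup size; the weights would come from training on stencil
// benchmark data.
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

constexpr std::size_t kNumFeatures = 102;  // per the abstract

struct Candidate {
  std::size_t wg_cols, wg_rows;  // candidate workgroup size
  std::vector<double> weights;   // kNumFeatures learned weights
  double bias;
};

// Predicted runtime: a dot product of features and learned weights.
double predict_runtime(const Candidate &c, const std::vector<double> &x) {
  double y = c.bias;
  for (std::size_t i = 0; i < kNumFeatures; ++i) y += c.weights[i] * x[i];
  return y;
}

// "Classification" by regression: pick the candidate whose predicted
// runtime is lowest.
const Candidate &select(const std::vector<Candidate> &cs,
                        const std::vector<double> &features) {
  const Candidate *best = &cs.front();
  double best_runtime = std::numeric_limits<double>::infinity();
  for (const auto &c : cs) {
    const double t = predict_runtime(c, features);
    if (t < best_runtime) { best_runtime = t; best = &c; }
  }
  return *best;
}

int main() {
  // Toy models: two candidates with made-up weights.
  std::vector<Candidate> cs{
      {32, 4, std::vector<double>(kNumFeatures, 0.01), 1.0},
      {64, 4, std::vector<double>(kNumFeatures, 0.02), 0.5}};
  std::vector<double> features(kNumFeatures, 1.0);  // runtime features
  const Candidate &best = select(cs, features);
  std::cout << best.wg_cols << "x" << best.wg_rows << "\n";  // 32x4
  return 0;
}
\end{verbatim}
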
In an empirical study of 429 combinations of programs, architectures,
and datasets, we find that OmniTune provides a median $3.79\times$
speedup over the best possible fixed workgroup size, achieving 94\% of
the maximum performance. Our results demonstrate that autotuning at
the skeletal level --- when combined with sophisticated machine
learning techniques --- can raise performance above that of human
experts, without requiring any effort from the user. By introducing
OmniTune and demonstrating its practical utility, we hope to
contribute to the increasing uptake of autotuning techniques in tools
and languages for high-level programming of heterogeneous systems.
\newpage
\includepdf{poster.pdf}
\newpage
\includepdf{hugh-letter.pdf}
\end{document}