Report
Program 1:
Looking at how the Mandelbrot set is actually computed, the number of inner-loop iterations differs from task to task: bright pixels need a large number of iterations, while dark pixels finish almost instantly.
After inserting timing code into each thread, we can see:
view 1
threads = 2, expected: the work splits roughly in half
Hello world from thread 0, [102.555] ms
Hello world from thread 1, [104.532] ms
[mandelbrot thread]: [102.241] ms
Wrote image file mandelbrot-thread.ppm
(1.97x speedup from 2 threads)
threads = 3, expected: thread 1 gets the largest share of the work
Hello world from thread 0, [40.951] ms
Hello world from thread 1, [125.207] ms
Hello world from thread 2, [41.255] ms
[mandelbrot thread]: [124.376] ms
Wrote image file mandelbrot-thread.ppm
(1.61x speedup from 3 threads)
threads = 4, expected: thread 0 ≈ thread 3 and thread 1 ≈ thread 2, with threads 1/2 carrying most of the work
Hello world from thread 0, [19.046] ms
Hello world from thread 1, [82.858] ms
Hello world from thread 2, [83.741] ms
Hello world from thread 3, [20.153] ms
[mandelbrot thread]: [83.813] ms
Wrote image file mandelbrot-thread.ppm
(2.40x speedup from 4 threads)
View 2, by contrast, has a fairly balanced workload, so its speedup is comparatively better.
The predictions match the observed times. The conclusion is that each thread is assigned a different amount of work, and this load imbalance keeps the speedup below what we would hope for.
TODO: find a decomposition such that both view 1 and view 2 reach a 3.5x speedup with threads = 4.
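One common way to balance the load (not the decomposition used above, just a sketch of the idea) is to interleave rows across threads so every thread gets a mix of cheap and expensive rows. A minimal sketch, assuming the handout's WorkerArgs fields and mandelbrotSerial's startRow/numRows parameters:

// Hypothetical sketch: interleaved (round-robin) row assignment, assuming the
// handout's WorkerArgs layout and mandelbrotSerial signature.
void workerThreadStart(WorkerArgs* const args) {
    // Thread t renders rows t, t + numThreads, t + 2*numThreads, ..., so the
    // expensive rows in the bright band are spread across all threads.
    for (unsigned int row = args->threadId; row < args->height; row += args->numThreads) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         row, 1,                       // just this one row
                         args->maxIterations, args->output);
    }
}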
Program 2:
1. What is wrong with the provided abs example?
When (N % VECTOR_WIDTH) != 0, the last iteration's _cmu418_vload_float(x, values+i, maskAll) reads past the end of the array; a sketch of the usual fix follows.
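A minimal sketch of the fix, assuming the CMU418intrin.h API mirrors the handout's example (in particular that _cmu418_init_ones(n) enables only the first n lanes); only the mask setup changes relative to the provided absVector:

// Sketch of a tail-safe absVector (assumes the CMU418intrin.h API used above).
void absVectorSafe(float* values, float* output, int N) {
    __cmu418_vec_float x, result;
    __cmu418_vec_float zero = _cmu418_vset_float(0.f);
    __cmu418_mask maskActive, maskIsNegative, maskIsNotNegative;

    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        // Enable only the lanes that map to valid elements; in the final
        // (possibly partial) iteration this keeps the masked load and store
        // from touching memory past the end of the arrays.
        int active = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
        maskActive = _cmu418_init_ones(active);
        maskIsNegative = _cmu418_init_ones(0);

        _cmu418_vload_float(x, values + i, maskActive);          // x = values[i ...]
        _cmu418_vlt_float(maskIsNegative, x, zero, maskActive);  // lanes with x < 0
        _cmu418_vsub_float(result, zero, x, maskIsNegative);     // result = 0 - x there
        maskIsNotNegative = _cmu418_mask_not(maskIsNegative);
        maskIsNotNegative = _cmu418_mask_and(maskIsNotNegative, maskActive);
        _cmu418_vload_float(result, values + i, maskIsNotNegative); // result = x elsewhere
        _cmu418_vstore_float(output + i, result, maskActive);    // store active lanes only
    }
}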
2. size = 10000
width = 2
Vector Width: 2
Total Vector Instructions: 167727
Vector Utilization: 77.616901%
Utilized Vector Lanes: 260369
Total Vector Lanes: 335454
width = 4
Vector Width: 4
Total Vector Instructions: 97075
Vector Utilization: 70.457121%
Utilized Vector Lanes: 273585
Total Vector Lanes: 388300
width = 8
Vector Width: 8
Total Vector Instructions: 52877
Vector Utilization: 66.728682%
Utilized Vector Lanes: 282273
Total Vector Lanes: 423016
width = 16
Vector Width: 16
Total Vector Instructions: 27592
Vector Utilization: 64.988267%
Utilized Vector Lanes: 286905
Total Vector Lanes: 441472
Vector Utilization clearly drops as the width grows. Looking at the code, the likely cause is the while loop: the number of iterations for a chunk is determined by the largest exponent y in that chunk, so lanes whose y reaches 0 early just occupy slots as extra cost. The wider the vector, the more lanes are likely to sit idle waiting on the single slowest lane.
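The effect can be reproduced with a toy model that ignores everything except that while loop: per chunk, the instruction count is proportional to the largest exponent in the chunk, while the useful lane-slots are the sum of the exponents. A self-contained sketch (drawing exponents uniformly from 0..9 is an assumption, not necessarily the harness's distribution):

// Toy model of the clampedExp while loop: a chunk of `width` elements keeps
// issuing vector instructions until the largest exponent in the chunk reaches
// zero, so lanes that finish early are wasted slots.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int N = 10000;
    std::vector<int> exponents(N);
    for (int i = 0; i < N; i++)
        exponents[i] = rand() % 10;          // assumption: exponents in 0..9

    const int widths[] = {2, 4, 8, 16};
    for (int width : widths) {
        long long usefulSlots = 0, totalSlots = 0;
        for (int i = 0; i < N; i += width) {
            int chunkMax = 0, chunkSum = 0;
            for (int j = i; j < std::min(i + width, N); j++) {
                chunkMax = std::max(chunkMax, exponents[j]);
                chunkSum += exponents[j];
            }
            // One while-loop iteration = one batch of instructions over all
            // `width` lanes; only lanes still multiplying do useful work.
            totalSlots  += (long long)chunkMax * width;
            usefulSlots += chunkSum;
        }
        printf("width %2d: modeled utilization %.1f%%\n",
               width, 100.0 * usefulSlots / totalSlots);
    }
    return 0;
}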
Extra credit: reduction by [0 1 2 3 4 5 6 7] --hadd--> [1 1 5 5 9 9 13 13] --interleave--> [1 5 9 13 1 5 9 13] --maskHalf--> hadd loop (repeat until the total sits in lane 0).
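A sketch of arraySumVector along these lines, assuming the CMU418intrin.h names mirror the handout and that N is a multiple of VECTOR_WIDTH; instead of the maskHalf step it simply repeats hadd + interleave log2(VECTOR_WIDTH) times, which leaves the total in lane 0:

// Sketch of arraySumVector (not necessarily the exact maskHalf variant above):
// accumulate per-lane partial sums, then fold them with hadd + interleave.
float arraySumVector(float* values, int N) {
    __cmu418_mask maskAll = _cmu418_init_ones();
    __cmu418_vec_float sum = _cmu418_vset_float(0.f);

    // Lane k accumulates values[k], values[k + VECTOR_WIDTH], ...
    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        __cmu418_vec_float x;
        _cmu418_vload_float(x, values + i, maskAll);
        _cmu418_vadd_float(sum, sum, x, maskAll);
    }

    // log2(VECTOR_WIDTH) rounds: hadd sums adjacent pairs
    // ([a b c d] -> [a+b a+b c+d c+d]) and interleave gathers the distinct
    // partial sums to the front ([a+b a+b c+d c+d] -> [a+b c+d a+b c+d]).
    __cmu418_vec_float tmp;
    for (int span = VECTOR_WIDTH; span > 1; span /= 2) {
        _cmu418_hadd_float(tmp, sum);
        _cmu418_interleave_float(sum, tmp);
    }

    float laneSums[VECTOR_WIDTH];
    _cmu418_vstore_float(laneSums, sum, maskAll);
    return laneSums[0];                      // total ends up in lane 0
}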
Program 3:
Part 1
[mandelbrot serial]: [203.058] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.666] ms
Wrote image file mandelbrot-ispc.ppm
(4.55x speedup from ISPC)
This reaches 4.55x on my 6-core / 12-thread Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz. The single-core theoretical speedup here should be 2 (2 hardware threads per core) * 4 (SSE width)? I am not sure how hyper-threading combines with ISPC.
The reason it falls short of the theoretical speedup should be the uneven work inside mandel: at white pixels count is large and there is plenty of arithmetic, so the vector width gets used fully, while at black pixels the instances spend their time in the if/for control flow that, as we saw in Program 2, requires heavy masking, so SIMD cannot be used to full effect.
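For reference, the per-pixel iteration (reproduced roughly from the handout's mandel(); details may differ) shows where the divergence comes from: within one gang, instances that escape early are masked off while the slowest instance keeps iterating.

// Per-pixel Mandelbrot iteration, roughly as in the handout's mandel().
// Pixels inside the set run close to maxIterations; pixels outside escape
// after a handful of iterations. When both kinds land in the same gang, the
// early-exit lanes are masked off for the remaining iterations and do no
// useful work, which is exactly the utilization loss described above.
static inline int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;                           // |z| > 2: this pixel has escaped
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}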
Part 2
1.
[mandelbrot serial]: [202.670] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.364] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [23.285] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.57x speedup from ISPC)
(8.70x speedup from task ISPC)
The tasks version of the code is twice as fast as the original version, amazing!
2. Splitting directly into height tasks, one row per task:
[mandelbrot serial]: [204.312] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.477] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [7.074] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.59x speedup from ISPC)
(28.88x speedup from task ISPC)
On my work PC this reaches a 28.88x speedup, roughly 6x the single-task number, so it seems single-core ISPC can also make good use of hyper-threading.
The concrete reason is as follows: the ISPC task documentation shows it uses a worker-pool model, launching min(number of tasks, number of processors) workers, and once a worker finishes one task it moves on to the next.
When the work can be divided perfectly evenly, launching as many tasks as there are cores is enough to reach the peak speedup; when it cannot, splitting into as many tasks as possible keeps as much processor time as possible busy.
3. The difference between 10000 threads and 10000 tasks
As noted above, 10000 tasks does not actually spawn 10000 threads; the ISPC task runtime settles on a reasonably sized worker pool to execute all the tasks, which keeps the cost of creating and tearing down threads small.
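A toy, self-contained illustration of that worker-pool idea (emphatically not ISPC's actual implementation, just the model described above): a fixed number of worker threads pull task indices from a shared atomic counter, so 10000 tasks cost roughly one atomic increment each rather than one thread creation each.

// Toy worker pool: numWorkers threads grab task indices from an atomic counter
// until all numTasks tasks are done. Launching 10000 tasks costs one atomic
// increment each, instead of creating and destroying 10000 threads.
#include <atomic>
#include <thread>
#include <vector>

void runTasks(int numTasks, int numWorkers, void (*taskFunc)(int)) {
    std::atomic<int> nextTask{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; w++) {
        workers.emplace_back([&]() {
            // Each worker keeps pulling the next unclaimed task index.
            for (int t = nextTask.fetch_add(1); t < numTasks;
                 t = nextTask.fetch_add(1)) {
                taskFunc(t);                 // e.g. render one row of the image
            }
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    // Many tiny tasks, but only as many threads as the hardware offers.
    unsigned numWorkers = std::thread::hardware_concurrency();
    if (numWorkers == 0) numWorkers = 4;
    runTasks(10000, (int)numWorkers, [](int) { /* work for one task */ });
    return 0;
}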
Program 4:
1.
[sqrt serial]: [744.294] ms
[sqrt ispc]: [271.795] ms
[sqrt task ispc]: [30.082] ms
(2.74x speedup from ISPC)
(24.74x speedup from task ISPC)
The 2.74x of the non-task ISPC version comes purely from SIMD; the 64-task ISPC version additionally gets the speedup of using all 6 cores. Part of the remaining gain presumably comes from the worker-pool model splitting the work at a finer granularity.
2.
Setting all values to 3.0f:
my machine simply hung...
Setting all values to 2.99f:
[sqrt serial]: [1371.420] ms
[sqrt ispc]: [402.022] ms
[sqrt task ispc]: [46.270] ms
(3.41x speedup from ISPC)
(29.64x speedup from task ISPC)
Setting all values to 2.9f:
[sqrt serial]: [762.937] ms
[sqrt ispc]: [246.095] ms
[sqrt task ispc]: [28.947] ms
(3.10x speedup from ISPC)
(26.36x speedup from task ISPC)
Setting all values to 2.0f:
[sqrt serial]: [258.432] ms
[sqrt ispc]: [90.933] ms
[sqrt task ispc]: [13.664] ms
(2.84x speedup from ISPC)
(18.91x speedup from task ISPC)
Setting all values to 1.0f:
[sqrt serial]: [20.795] ms
[sqrt ispc]: [11.096] ms
[sqrt task ispc]: [10.191] ms
(1.87x speedup from ISPC)
(2.04x speedup from task ISPC)
Looking at these results, values around 3 give the best speedup: that is where the per-element computation is largest, so the advantage of SIMD shows most clearly.
3.
1.0 is the worst case: there is essentially no data-parallel work left, so SIMD not only brings no benefit but even slows things down because of the masking; the 1.87x presumably comes from hyper-threading.
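Both extremes fall out of the Newton iteration that the handout's sqrtSerial uses (reproduced roughly below; constants may differ). With the initial guess of 1.0, an input of exactly 3.0 drives the guess to 0 on the first step and it never recovers, which would explain the hang; inputs just below 3 need the most iterations to climb back up, and an input of 1.0 passes the convergence test immediately.

// Newton's method for 1/sqrt(x), roughly as in the handout's sqrtSerial.
// The work per element depends entirely on how fast `guess` converges for x.
#include <cmath>

float sqrtNewton(float x, float initialGuess = 1.0f) {
    const float kThreshold = 0.00001f;
    float guess = initialGuess;
    float error = fabsf(guess * guess * x - 1.f);   // x == 1 gives error == 0: no iterations
    while (error > kThreshold) {
        // x == 3 with guess == 1 gives (3 - 3) / 2 = 0, and the iteration is
        // then stuck at 0 forever -> the observed hang.
        // x just below 3 starts very close to 0 and needs many iterations.
        guess = (3.f * guess - x * guess * guess * guess) * 0.5f;
        error = fabsf(guess * guess * x - 1.f);
    }
    return x * guess;                               // sqrt(x) = x * (1/sqrt(x))
}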
Program 5:
1.
[saxpy ispc]: [13.782] ms [21.624] GB/s [2.902] GFLOPS
[saxpy task ispc]: [13.766] ms [21.649] GB/s [2.906] GFLOPS
(1.00x speedup from use of tasks)
The without-task and with-task versions perform the same.
This should be the bandwidth limit discussed in Lecture 2: saxpy does only one multiply and one add per element, so it is not compute-intensive; the main bottleneck is memory bandwidth (the kernel sketch at the end of this program makes this concrete).
2. The SIMD instructions access memory at the full 4-wide granularity, regardless of how much data is actually there.
3. TODO: Improve the performance of saxpy.
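For reference, the saxpy kernel itself (roughly as in the handout) makes the bandwidth argument from question 1 concrete: each element costs 2 flops but moves on the order of 12 bytes of explicit traffic (two loads and one store of 4-byte floats), so memory bandwidth saturates long before the ALUs or SIMD lanes do.

// saxpy kernel, roughly as in the handout: 2 flops per element versus about
// 12 bytes of explicit traffic (load X[i], load Y[i], store result[i]), so
// memory bandwidth saturates long before the arithmetic units do.
void saxpySerial(int N, float scale, float X[], float Y[], float result[]) {
    for (int i = 0; i < N; i++) {
        result[i] = scale * X[i] + Y[i];
    }
}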