Report
Program 1:
Looking at how the Mandelbrot set is actually computed, the number of inner-loop iterations differs from task to task: bright pixels need a large number of iterations, while dark pixels finish almost instantly.
After inserting timing code into each thread, we can see:
view 1
threads = 2, expected: the work splits roughly in half
Hello world from thread 0, [102.555] ms
Hello world from thread 1, [104.532] ms
[mandelbrot thread]: [102.241] ms
Wrote image file mandelbrot-thread.ppm
(1.97x speedup from 2 threads)
threads = 3, expected: thread 1 gets the largest share of the work
Hello world from thread 0, [40.951] ms
Hello world from thread 1, [125.207] ms
Hello world from thread 2, [41.255] ms
[mandelbrot thread]: [124.376] ms
Wrote image file mandelbrot-thread.ppm
(1.61x speedup from 3 threads)
threads = 4, expected: thread 0 ≈ thread 3 and thread 1 ≈ thread 2, with threads 1/2 carrying most of the work
Hello world from thread 0, [19.046] ms
Hello world from thread 1, [82.858] ms
Hello world from thread 2, [83.741] ms
Hello world from thread 3, [20.153] ms
[mandelbrot thread]: [83.813] ms
Wrote image file mandelbrot-thread.ppm
(2.40x speedup from 4 threads)
View 2, by contrast, has a fairly balanced workload, so its speedup is comparatively better.
The predictions match the observed times. The conclusion is that each thread is assigned a different amount of work, and this load imbalance keeps the speedup below what we would hope for.
TODO: find a decomposition such that both view 1 and view 2 reach a 3.5x speedup with threads = 4.
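One common way to balance the load (not the decomposition used above, just a sketch of the idea) is to interleave rows across threads so every thread gets a mix of cheap and expensive rows. A minimal sketch, assuming the handout's WorkerArgs fields and mandelbrotSerial's startRow/numRows parameters:

// Hypothetical sketch: interleaved (round-robin) row assignment, assuming the
// handout's WorkerArgs layout and mandelbrotSerial signature.
void workerThreadStart(WorkerArgs* const args) {
    // Thread t renders rows t, t + numThreads, t + 2*numThreads, ..., so the
    // expensive rows in the bright band are spread across all threads.
    for (unsigned int row = args->threadId; row < args->height; row += args->numThreads) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         row, 1,                       // just this one row
                         args->maxIterations, args->output);
    }
}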
Program 2:
1. What is wrong with the provided abs example?
When (N % VECTOR_WIDTH) != 0, the last iteration's _cmu418_vload_float(x, values+i, maskAll) reads past the end of the array; a sketch of the usual fix follows.
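A minimal sketch of the fix, assuming the CMU418intrin.h API mirrors the handout's example (in particular that _cmu418_init_ones(n) enables only the first n lanes); only the mask setup changes relative to the provided absVector:

// Sketch of a tail-safe absVector (assumes the CMU418intrin.h API used above).
void absVectorSafe(float* values, float* output, int N) {
    __cmu418_vec_float x, result;
    __cmu418_vec_float zero = _cmu418_vset_float(0.f);
    __cmu418_mask maskActive, maskIsNegative, maskIsNotNegative;

    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        // Enable only the lanes that map to valid elements; in the final
        // (possibly partial) iteration this keeps the masked load and store
        // from touching memory past the end of the arrays.
        int active = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
        maskActive = _cmu418_init_ones(active);
        maskIsNegative = _cmu418_init_ones(0);

        _cmu418_vload_float(x, values + i, maskActive);          // x = values[i ...]
        _cmu418_vlt_float(maskIsNegative, x, zero, maskActive);  // lanes with x < 0
        _cmu418_vsub_float(result, zero, x, maskIsNegative);     // result = 0 - x there
        maskIsNotNegative = _cmu418_mask_not(maskIsNegative);
        maskIsNotNegative = _cmu418_mask_and(maskIsNotNegative, maskActive);
        _cmu418_vload_float(result, values + i, maskIsNotNegative); // result = x elsewhere
        _cmu418_vstore_float(output + i, result, maskActive);    // store active lanes only
    }
}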
2. size = 10000
width = 2
Vector Width: 2
Total Vector Instructions: 167727
Vector Utilization: 77.616901%
Utilized Vector Lanes: 260369
Total Vector Lanes: 335454
width = 4
Vector Width: 4
Total Vector Instructions: 97075
Vector Utilization: 70.457121%
Utilized Vector Lanes: 273585
Total Vector Lanes: 388300
width = 8
Vector Width: 8
Total Vector Instructions: 52877
Vector Utilization: 66.728682%
Utilized Vector Lanes: 282273
Total Vector Lanes: 423016
width = 16
Vector Width: 16
Total Vector Instructions: 27592
Vector Utilization: 64.988267%
Utilized Vector Lanes: 286905
Total Vector Lanes: 441472
Vector Utilization clearly drops as the width grows. Looking at the code, the likely cause is the while loop: the number of iterations for a chunk is determined by the largest exponent y in that chunk, so lanes whose y reaches 0 early just occupy slots as extra cost. The wider the vector, the more lanes are likely to sit idle waiting on the single slowest lane.
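The effect can be reproduced with a toy model that ignores everything except that while loop: per chunk, the instruction count is proportional to the largest exponent in the chunk, while the useful lane-slots are the sum of the exponents. A self-contained sketch (drawing exponents uniformly from 0..9 is an assumption, not necessarily the harness's distribution):

// Toy model of the clampedExp while loop: a chunk of `width` elements keeps
// issuing vector instructions until the largest exponent in the chunk reaches
// zero, so lanes that finish early are wasted slots.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int N = 10000;
    std::vector<int> exponents(N);
    for (int i = 0; i < N; i++)
        exponents[i] = rand() % 10;          // assumption: exponents in 0..9

    const int widths[] = {2, 4, 8, 16};
    for (int width : widths) {
        long long usefulSlots = 0, totalSlots = 0;
        for (int i = 0; i < N; i += width) {
            int chunkMax = 0, chunkSum = 0;
            for (int j = i; j < std::min(i + width, N); j++) {
                chunkMax = std::max(chunkMax, exponents[j]);
                chunkSum += exponents[j];
            }
            // One while-loop iteration = one batch of instructions over all
            // `width` lanes; only lanes still multiplying do useful work.
            totalSlots  += (long long)chunkMax * width;
            usefulSlots += chunkSum;
        }
        printf("width %2d: modeled utilization %.1f%%\n",
               width, 100.0 * usefulSlots / totalSlots);
    }
    return 0;
}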
Extra credit: reduction by [0 1 2 3 4 5 6 7] --hadd--> [1 1 5 5 9 9 13 13] --interleave--> [1 5 9 13 1 5 9 13] --maskHalf--> hadd loop (repeat until the total sits in lane 0).
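A sketch of arraySumVector along these lines, assuming the CMU418intrin.h names mirror the handout and that N is a multiple of VECTOR_WIDTH; instead of the maskHalf step it simply repeats hadd + interleave log2(VECTOR_WIDTH) times, which leaves the total in lane 0:

// Sketch of arraySumVector (not necessarily the exact maskHalf variant above):
// accumulate per-lane partial sums, then fold them with hadd + interleave.
float arraySumVector(float* values, int N) {
    __cmu418_mask maskAll = _cmu418_init_ones();
    __cmu418_vec_float sum = _cmu418_vset_float(0.f);

    // Lane k accumulates values[k], values[k + VECTOR_WIDTH], ...
    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        __cmu418_vec_float x;
        _cmu418_vload_float(x, values + i, maskAll);
        _cmu418_vadd_float(sum, sum, x, maskAll);
    }

    // log2(VECTOR_WIDTH) rounds: hadd sums adjacent pairs
    // ([a b c d] -> [a+b a+b c+d c+d]) and interleave gathers the distinct
    // partial sums to the front ([a+b a+b c+d c+d] -> [a+b c+d a+b c+d]).
    __cmu418_vec_float tmp;
    for (int span = VECTOR_WIDTH; span > 1; span /= 2) {
        _cmu418_hadd_float(tmp, sum);
        _cmu418_interleave_float(sum, tmp);
    }

    float laneSums[VECTOR_WIDTH];
    _cmu418_vstore_float(laneSums, sum, maskAll);
    return laneSums[0];                      // total ends up in lane 0
}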
Program 3:
Part 1
[mandelbrot serial]: [203.058] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.666] ms
Wrote image file mandelbrot-ispc.ppm
(4.55x speedup from ISPC)
This reaches 4.55x on my 6-core / 12-thread Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz. The single-core theoretical speedup here should be 2 (2 hardware threads per core) * 4 (SSE width)? I am not sure how hyper-threading combines with ISPC.
The reason it falls short of the theoretical speedup should be the uneven work inside mandel: at white pixels count is large and there is plenty of arithmetic, so the vector width gets used fully, while at black pixels the instances spend their time in the if/for control flow that, as we saw in Program 2, requires heavy masking, so SIMD cannot be used to full effect.
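For reference, the per-pixel iteration (reproduced roughly from the handout's mandel(); details may differ) shows where the divergence comes from: within one gang, instances that escape early are masked off while the slowest instance keeps iterating.

// Per-pixel Mandelbrot iteration, roughly as in the handout's mandel().
// Pixels inside the set run close to maxIterations; pixels outside escape
// after a handful of iterations. When both kinds land in the same gang, the
// early-exit lanes are masked off for the remaining iterations and do no
// useful work, which is exactly the utilization loss described above.
static inline int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;                           // |z| > 2: this pixel has escaped
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}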
Part 2
1.
[mandelbrot serial]: [202.670] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.364] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [23.285] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.57x speedup from ISPC)
(8.70x speedup from task ISPC)
The tasks version of the code is twice as fast as the original version, amazing!
2. Splitting directly into height tasks, one row per task:
[mandelbrot serial]: [204.312] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [44.477] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [7.074] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.59x speedup from ISPC)
(28.88x speedup from task ISPC)
On my work PC this reaches a 28.88x speedup, roughly 6x the single-task number, so it seems single-core ISPC can also make good use of hyper-threading.
The concrete reason is as follows: the ISPC task documentation shows it uses a worker-pool model, launching min(number of tasks, number of processors) workers, and once a worker finishes one task it moves on to the next.
When the work can be divided perfectly evenly, launching as many tasks as there are cores is enough to reach the peak speedup; when it cannot, splitting into as many tasks as possible keeps as much processor time as possible busy.
3. The difference between 10000 threads and 10000 tasks
As noted above, 10000 tasks does not actually spawn 10000 threads; the ISPC task runtime settles on a reasonably sized worker pool to execute all the tasks, which keeps the cost of creating and tearing down threads small.
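A toy, self-contained illustration of that worker-pool idea (emphatically not ISPC's actual implementation, just the model described above): a fixed number of worker threads pull task indices from a shared atomic counter, so 10000 tasks cost roughly one atomic increment each rather than one thread creation each.

// Toy worker pool: numWorkers threads grab task indices from an atomic counter
// until all numTasks tasks are done. Launching 10000 tasks costs one atomic
// increment each, instead of creating and destroying 10000 threads.
#include <atomic>
#include <thread>
#include <vector>

void runTasks(int numTasks, int numWorkers, void (*taskFunc)(int)) {
    std::atomic<int> nextTask{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; w++) {
        workers.emplace_back([&]() {
            // Each worker keeps pulling the next unclaimed task index.
            for (int t = nextTask.fetch_add(1); t < numTasks;
                 t = nextTask.fetch_add(1)) {
                taskFunc(t);                 // e.g. render one row of the image
            }
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    // Many tiny tasks, but only as many threads as the hardware offers.
    unsigned numWorkers = std::thread::hardware_concurrency();
    if (numWorkers == 0) numWorkers = 4;
    runTasks(10000, (int)numWorkers, [](int) { /* work for one task */ });
    return 0;
}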
Program 4:
1.
[sqrt serial]: [744.294] ms
[sqrt ispc]: [271.795] ms
[sqrt task ispc]: [30.082] ms
(2.74x speedup from ISPC)
(24.74x speedup from task ISPC)
The 2.74x of the non-task ISPC version comes purely from SIMD; the 64-task ISPC version additionally gets the speedup of using all 6 cores. Part of the remaining gain presumably comes from the worker-pool model splitting the work at a finer granularity.
2.
Setting all values to 3.0f:
my machine simply hung...
Setting all values to 2.99f:
[sqrt serial]: [1371.420] ms
[sqrt ispc]: [402.022] ms
[sqrt task ispc]: [46.270] ms
(3.41x speedup from ISPC)
(29.64x speedup from task ISPC)
Setting all values to 2.9f:
[sqrt serial]: [762.937] ms
[sqrt ispc]: [246.095] ms
[sqrt task ispc]: [28.947] ms
(3.10x speedup from ISPC)
(26.36x speedup from task ISPC)
Setting all values to 2.0f:
[sqrt serial]: [258.432] ms
[sqrt ispc]: [90.933] ms
[sqrt task ispc]: [13.664] ms
(2.84x speedup from ISPC)
(18.91x speedup from task ISPC)
Setting all values to 1.0f:
[sqrt serial]: [20.795] ms
[sqrt ispc]: [11.096] ms
[sqrt task ispc]: [10.191] ms
(1.87x speedup from ISPC)
(2.04x speedup from task ISPC)
Looking at these results, values around 3 give the best speedup: that is where the per-element computation is largest, so the advantage of SIMD shows most clearly.
3.
1.0 is the worst case: there is essentially no data-parallel work left, so SIMD not only brings no benefit but even slows things down because of the masking; the 1.87x presumably comes from hyper-threading.
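Both extremes fall out of the Newton iteration that the handout's sqrtSerial uses (reproduced roughly below; constants may differ). With the initial guess of 1.0, an input of exactly 3.0 drives the guess to 0 on the first step and it never recovers, which would explain the hang; inputs just below 3 need the most iterations to climb back up, and an input of 1.0 passes the convergence test immediately.

// Newton's method for 1/sqrt(x), roughly as in the handout's sqrtSerial.
// The work per element depends entirely on how fast `guess` converges for x.
#include <cmath>

float sqrtNewton(float x, float initialGuess = 1.0f) {
    const float kThreshold = 0.00001f;
    float guess = initialGuess;
    float error = fabsf(guess * guess * x - 1.f);   // x == 1 gives error == 0: no iterations
    while (error > kThreshold) {
        // x == 3 with guess == 1 gives (3 - 3) / 2 = 0, and the iteration is
        // then stuck at 0 forever -> the observed hang.
        // x just below 3 starts very close to 0 and needs many iterations.
        guess = (3.f * guess - x * guess * guess * guess) * 0.5f;
        error = fabsf(guess * guess * x - 1.f);
    }
    return x * guess;                               // sqrt(x) = x * (1/sqrt(x))
}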
Program 5:
1.
[saxpy ispc]: [13.782] ms [21.624] GB/s [2.902] GFLOPS
[saxpy task ispc]: [13.766] ms [21.649] GB/s [2.906] GFLOPS
(1.00x speedup from use of tasks)
The without-task and with-task versions perform the same.
This should be the bandwidth limit discussed in Lecture 2: saxpy does only one multiply and one add per element, so it is not compute-intensive; the main bottleneck is memory bandwidth (the kernel sketch at the end of this program makes this concrete).
2. The SIMD instructions access memory at the full 4-wide granularity, regardless of how much data is actually there.
3. TODO: Improve the performance of saxpy.
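For reference, the saxpy kernel itself (roughly as in the handout) makes the bandwidth argument from question 1 concrete: each element costs 2 flops but moves on the order of 12 bytes of explicit traffic (two loads and one store of 4-byte floats), so memory bandwidth saturates long before the ALUs or SIMD lanes do.

// saxpy kernel, roughly as in the handout: 2 flops per element versus about
// 12 bytes of explicit traffic (load X[i], load Y[i], store result[i]), so
// memory bandwidth saturates long before the arithmetic units do.
void saxpySerial(int N, float scale, float X[], float Y[], float result[]) {
    for (int i = 0; i < N; i++) {
        result[i] = scale * X[i] + Y[i];
    }
}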