-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME
273 lines (208 loc) · 10.6 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
WHAT IS THIS?
=============
This is a real-time Mandelbrot fractal zoomer.
COMPILE/INSTALL/RUN
===================
Windows
-------
Windows users can download and run a pre-compiled Windows binary
[here](https://github.com/ttsiodras/MandelbrotSSE/releases/download/2.11/mandelSSE-win32-2.11.zip).
After decompressing, you can simply execute either one of the two .bat
files. The 'autopilot' one zooms in a specific location, while the other
one allows you to zoom interactively using your mouse (left-click/hold zooms in,
right-click/hold zooms out).
For those of you that want to build from source code, there are
cross-compilation instructions later in this document.
For Linux/BSD/OSX users
-----------------------
Make sure you have libSDL2 installed. In Debian and its derivatives,
like Ubuntu, just `sudo apt install libsdl2-dev`.
Then, build the code - with...
$ ./configure
$ make
Usage
-----
You can then try these:
$ src/mandelSSE
(Runs in autopilot mode, in a 1024x768 window)
$ src/mandelSSE -m 1280 720
(Runs in mouse-driven mode, in a 1280x720 window)
(left-click/hold zooms-in, right-click/hold zooms out)
Option `-h` gives you additional information about how to control
the Mandelbrot zoomer:
$ ./src/mandelSSE -h
Usage: ./src/mandelSSE [-a|-m] [-h] [-b] [-v|-s|-d] [-i iter] [-p pct] [-f rate] [WIDTH HEIGHT]
Where:
-h Show this help message
-m Run in mouse-driven mode
-a Run in autopilot mode (default)
-b Run in benchmark mode (implies autopilot)
-v Force use of AVX
-s Force use of SSE
-d Force use of non-AVX, non-SSE code
-i iter The maximum number of iterations of the Mandelbrot loop (default: 2048)
-p pct The percentage of pixels computed per frame (default: 0.75)
(the rest are copied from the previous frame)
-f fps Enforce upper bound of frames per second (default: 60)
(use 0 to run at full possible speed)
If WIDTH and HEIGHT are not provided, they default to: 1024 768
For ultimate rendering speed, you can disable the frame limiter (option `-f`).
By default, you are limited to 60fps:
$ src/mandelSSE -m -f 0 1280 720
The benchmarking mode (-b) does this automatically.
If you want to benchmark your CPU only (and not display anything)
tell SDL you don't care about displaying the fractal:
$ SDL_VIDEODRIVER=dummy src/mandelSSE -b 512 384
Be mindful of your CPU's thermal throttling if you are benchmarking :-)
Note that you can force AVX (-v), SSE (-s) or dumb floating point (-d)
to see the speed impact made by our usage of special Intel instructions.
You can also control:
- the percentage of pixels actually computed per frame, with option `-p`.
If you e.g. pass `-p 0.5`, then 100-0.5 = 99.5% of the pixels will be
copied from the previous frame, and only 0.5% will be actually derived
through the Mandelbrot computations. Amazingly, this is enough for
a decent quality fly-through zoom in the fractal.
By default, this is set to 0.75.
- the number of Mandelbrot iterations (option `-i`). By default this is
set to 2048 to allow for decent zoom levels, but if you want to see
insane speeds, set this to something low, like 128; and disable the
frame limiter; i.e. use `-f 0 -i 128`.
WHAT IS THIS, AGAIN?
====================
Long story.
When I got my hands on an SSE enabled processor (an Athlon-XP, back in 2002),
I wanted to try out SSE programming... And over the better part of a weekend,
I created a simple implementation of a Mandelbrot zoomer in SSE assembly.
I was glad to see that my code was almost 3 times faster than pure C.
But that was just the beginning.
Over the last two decades, I kept coming back to this, enhancing it.
- I learned how to use the GNU autotools, and made it work on most Intel
platforms: checked with Linux, Windows (MinGW) and OpenBSD.
A decade later, I also tested it on Raspbian and Armbian; it works
fine in ARM machines as well. Autotools also allow me to cross-compile
for Windows (more on that below).
- After getting acquainted with OpenMP, in Nov 2009 I added OpenMP #pragmas
to run both the C and the SSE code in all cores/CPUs. The SSE code had to
be moved from a separate assembly file into inlined code - but the effort
was worth it. The resulting frame rate - on a tiny Atom 330 running Arch
Linux - sped up from 58 to 160 frames per second.
- I then coded it in CUDA - a 75$ GPU card gave me almost two orders of
magnitude of speedup!
- Then in May 2011, I made the code switch automatically from single precision
floating point to double precision - when one zooms "deep enough".
- Around 2012 I added a significant optimization: avoiding fully calculating
the Mandelbrot lake areas (black color) by drawing at 1/16 resolution and
skipping black areas in the full resolution render.
- I learned enough VHDL in 2018 to [code the algorithm inside a Spartan3
FPGA](https://www.youtube.com/watch?v=yFIbjiOWYFY). That was quite a
[learning experience](https://github.com/ttsiodras/MandelbrotInVHDL).
- In September 2020 I [ported a fixed-point arithmetic](
https://github.com/ttsiodras/Blue_Pill_Mandelbrot/) version of the
algorithm [inside a 1.4$ microcontroller](
https://www.youtube.com/watch?v=5875JOnFDLg).
- In October 2020, I implemented what I understood to be the XaoS algorithm;
that is, re-using pixels from the previous frame to optimally update
the next one. Especially in deep-dives and large windows, this delivered
amazing speedups; between 2 and 3 orders of magnitude.
- In July 2022, I optimised further with AVX instructions (+80% speed
in CoreLoopDouble). I also ported the code to libSDL2, which stopped
video tearing.
FOR CODERS ONLY
===============
My SSE code
-----------
This used to be my main loop, right after I ported to SSE back in 2002:
; x' = x^2 - y^2 + a
; y' = 2xy + b
;
mov ecx, 0
movaps xmm5, [fours] ; 4. 4. 4. 4. ; xmm5
movaps xmm6, [re] ; a0 a1 a2 a3 ; xmm6
movaps xmm7, [im] ; b0 b1 b2 b3 ; xmm7
xorps xmm0, xmm0 ; 0. 0. 0. 0.
xorps xmm1, xmm1 ; 0. 0. 0. 0.
xorps xmm3, xmm3 ; 0. 0. 0. 0. ; xmm3
loop1:
movaps xmm2, xmm0 ; x0 x1 x2 x3 ; xmm2
mulps xmm2, xmm1 ; x0*y0 x1*y1 x2*y2 x3*y3 ; xmm2
mulps xmm0, xmm0 ; x0^2 x1^2 x2^2 x3^2 ; xmm0
mulps xmm1, xmm1 ; y0^2 y1^2 y2^2 y3^2 ; xmm1
movaps xmm4, xmm0
addps xmm4, xmm1 ; x0^2+y0^2 x1... ; xmm4
subps xmm0, xmm1 ; x0^2-y0^2 x1... ; xmm0
addps xmm0, xmm6 ; x0' x1' x2' x3' ; xmm0
movaps xmm1, xmm2 ; x0*y0 x1*y1 x2*y2 x3*y3 ; xmm1
addps xmm1, xmm1 ; 2x0*y0 2x1*y1 2x2*y2 2x3*y3 ; xmm1
addps xmm1, xmm7 ; y0' y1' y2' y3' ; xmm1
cmpltps xmm4, xmm5 ; <4 <4 <4 <4 ? ; xmm2
movaps xmm2, xmm4
; at this point, xmm2 has all 1s in the non-overflowed pixels
movmskps eax, xmm4 ; (lower 4 bits reflect comparisons)
andps xmm4, [ones] ; so, prepare to increase the non-over
addps xmm3, xmm4 ; by updating the 4 bailout counters
or eax, eax ; have all 4 pixels overflowed ?
jz short nomore ; yes, we're done
inc ecx
cmp ecx, ITERATIONS
jnz short loop1
The new AVX code (inside CoreLoopDoubleAVX) follows the same motif;
except that it also includes periodicity checking, and uses the YMM
registers.
The comments should help you follow what's happening... Basically,
we compute 4 pixels at a time.
XaoS
----
The idea behind the XaoS algorithm is simple: don't redraw the pixels;
instead re-use as many as you can from the previous frame.
The devil, as ever, is in the details.
The way I implemented this is as follows: the topmost scaline goes
from X coordinate `xld` to `xru` - in `xstep` steps (see code
for details). I store these computed coordinates in array `xcoord`;
and in the next frame, I compare the new coordinates with the old
ones. For each pixel, I basically find the closest X coordinate match.
I do the same for the Y coordinates. In both cases, we are talking
about a 1-dimensional array, of MAXX or MAXY length.
After I have the matches, I sort them - based on distance to the
coordinates of the previous frame. The `mandel` function then forces
a redraw for the worst N columns/rows - where N comes as a percentage
parameter in the function call. Simply put, if the pixel's
X and Y coordinates fall on "slots" that are close enough to the
old frame's `xcoord` and `ycoord`, the pixel color is taken
from the previous frame without doing the expensive Mandelbrot
calculation.
This works perfectly - the zoom becomes nice and smooth, and is
also improved with a full Mandelbrot render the moment the user
stops zoooming.
The code has a lot of comments explaining the inner-workings in detail.
Have a look!
Cross compiling for Windows via MinGW
-------------------------------------
After decompressing the SDL 2.0.22 tarball, install MinGW:
$ sudo apt install gcc-mingw-w64
Then download the source code of libSDL and compile it as follows:
$ cd SDL-2.0.22
$ ./configure --host=x86_64-w64-mingw32 \
--disable-video-x11 --disable-x11-shared \
--prefix=/usr/local/packages/SDL-2.0.22-win32
$ make
$ sudo make install
Finally, come back to this source folder, and configure it like this:
$ ./configure --host=x86_64-w64-mingw32 \
--with-sdl-prefix=/usr/local/packages/SDL-2.0.22-win32 \
--disable-sdltest
$ make
$ cp src/mandelSSE.exe \
/usr/local/packages/SDL-2.0.22-win32/bin/SDL2.dll \
/some/path/for/Windows/
You can also get the "ingredients" (DLLs for SDL2, OpenMP, libstd++, etc)
from the packaged release
[here](https://github.com/ttsiodras/MandelbrotSSE/releases/download/2.11/mandelSSE-win32-2.11.zip).
MISC
====
Since it reports frame rate at the end (option `-b`), you can use this as
a benchmark for AVX instructions - it puts the AVX registers under quite a load.
I've also coded a
[CUDA version](https://www.thanassis.space/mandelcuda-1.0.tar.bz2),
which you can play with, if you have an NVIDIA card.
Some details about it, in the blog post I wrote back in 2009 about
it [here](https://www.thanassis.space/mandelSSE.html).