<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>GeNN (Software Developer Blog)</title><link>http://genn-team.github.io/</link><description>Welcome to GeNN</description><atom:link href="http://genn-team.github.io/rss.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2024 <a href="mailto:[email protected]">GeNN Team</a> </copyright><lastBuildDate>Thu, 07 Nov 2024 17:17:52 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Software Developer Blog: Running away</title><link>http://genn-team.github.io/posts/running-away.html</link><dc:creator>GeNN Team</dc:creator><description><p>After spending a long time updating GeNN's code generator to generate more efficient CUDA kernels which have the side benefit of compiling much more quickly, there remained something of a dirty secret.
The <code>runner.cc</code> file which contains the helper functions generated by GeNN for allocating memory and copying variables between GPU and CPU could still easily grow to the point that compilation would take an extremely long time and consume all available memory.
For our <a href="https://github.com/neworderofjamie/multi-area-model/">multi-area model implementation</a>, I added various options which turn off the generation of empty functions and, as everything in this model was generated on the GPU anyway, I also turned off the generation of host copies of almost all variables.
This resulted in a paltry 40 mbyte <code>runner.cc</code> that compiled in a couple of minutes which, for a model this size, is just about acceptable.
However, as users have started making bigger models and not always wanting to generate everything on the GPU, this issue has kept reappearing.</p>
<h2>Jinjaly investigating</h2>
<p>To investigate this in a slightly simpler way than just building larger and larger GeNN models until things break, I used <a href="https://jinja.palletsprojects.com">Jinja</a> to build a template that could generate fake <code>runner.cc</code> files containing varying numbers of arrays, representing the state variables in a real model.
The heart of this template looked something like this:</p>
<div class="code"><pre class="code literal-block"><span class="c1">// Push and pull functions</span>
<span class="p">{</span><span class="o">%</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">array</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">arrays</span><span class="w"> </span><span class="o">%</span><span class="p">}</span>
<span class="kt">void</span><span class="w"> </span><span class="n">push</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}}</span><span class="n">ToDevice</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">CHECK_CUDA_ERRORS</span><span class="p">(</span><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">d_</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span>
<span class="w"> </span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">size</span><span class="p">}}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span><span class="w"> </span>
<span class="w"> </span><span class="n">cudaMemcpyHostToDevice</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">void</span><span class="w"> </span><span class="n">pull</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}}</span><span class="n">FromDevice</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">CHECK_CUDA_ERRORS</span><span class="p">(</span><span class="n">cudaMemcpy</span><span class="p">({{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span><span class="n">d_</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span>
<span class="w"> </span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">size</span><span class="p">}}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span><span class="w"> </span>
<span class="w"> </span><span class="n">cudaMemcpyDeviceToHost</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">{</span><span class="o">%</span><span class="w"> </span><span class="n">endfor</span><span class="w"> </span><span class="o">%</span><span class="p">}</span>
<span class="kt">void</span><span class="w"> </span><span class="n">allocateMem</span><span class="p">()</span><span class="w"> </span>
<span class="p">{</span>
<span class="w"> </span><span class="n">CHECK_CUDA_ERRORS</span><span class="p">(</span><span class="n">cudaSetDevice</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="w"> </span><span class="p">{</span><span class="o">%</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">array</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">arrays</span><span class="w"> </span><span class="o">%</span><span class="p">}</span>
<span class="w"> </span><span class="n">CHECK_CUDA_ERRORS</span><span class="p">(</span><span class="n">cudaHostAlloc</span><span class="p">(</span><span class="o">&amp;</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span>
<span class="w"> </span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">size</span><span class="p">}}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span><span class="w"> </span>
<span class="w"> </span><span class="n">cudaHostAllocPortable</span><span class="p">));</span>
<span class="w"> </span><span class="n">CHECK_CUDA_ERRORS</span><span class="p">(</span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">d_</span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">name</span><span class="p">}},</span><span class="w"> </span>
<span class="w"> </span><span class="p">{{</span><span class="n">array</span><span class="p">.</span><span class="n">size</span><span class="p">}}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)));</span>
<span class="w"> </span><span class="p">{</span><span class="o">%</span><span class="w"> </span><span class="n">endfor</span><span class="w"> </span><span class="o">%</span><span class="p">}</span><span class="w"> </span>
<span class="p">}</span>
</pre></div>
<p>This template (saved in <code>runner.cc.template</code>) could then be used to generate C++ and print it to stdout like:</p>
<div class="code"><pre class="code literal-block"><span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Template</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"runner.cc.template"</span><span class="p">,</span> <span class="s2">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
    <span class="n">template</span> <span class="o">=</span> <span class="n">Template</span><span class="p">(</span><span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">num_arrays</span> <span class="o">=</span> <span class="mi">10000</span>  <span class="c1"># example value; varied between runs</span>
<span class="n">arrays</span> <span class="o">=</span> <span class="p">[{</span><span class="s2">"name"</span><span class="p">:</span> <span class="sa">f</span><span class="s2">"array_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s2">"</span><span class="p">,</span> <span class="s2">"size"</span><span class="p">:</span> <span class="mi">1000</span><span class="p">}</span>
          <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_arrays</span><span class="p">)]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">template</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">arrays</span><span class="o">=</span><span class="n">arrays</span><span class="p">))</span>
</pre></div>
<p>On Linux, the C++ could then be built using the same command line used by GeNN itself (some options omitted for brevity) and timed using <code>/usr/bin/time</code> (I recently discovered that <code>command time</code> can be used as an alternative way of disambiguating this from the bash builtin <code>time</code>):</p>
<div class="code"><pre class="code literal-block">/usr/bin/time<span class="w"> </span>-v<span class="w"> </span>nvcc<span class="w"> </span>-c<span class="w"> </span>-x<span class="w"> </span>cu<span class="w"> </span>-arch<span class="w"> </span>sm_86<span class="w"> </span>-std<span class="o">=</span>c++11<span class="w"> </span>test.cc
</pre></div>
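<p>The two numbers of interest can be pulled out of the report that <code>/usr/bin/time -v</code> writes to stderr. The following is a minimal sketch rather than the script actually used for the sweeps: the function name <code>parse_time_report</code> is hypothetical, the field labels are assumed to match those printed by GNU time, and the sample numbers are made up for illustration.</p>

```python
def parse_time_report(report):
    """Return (wall clock seconds, peak resident set size in kB)."""
    wall_seconds = None
    max_rss_kb = None
    for line in report.splitlines():
        # Each field is printed as "label: value"; split on the last ": "
        label, _, value = line.strip().rpartition(": ")
        if label.startswith("Elapsed (wall clock) time"):
            # Value is h:mm:ss or m:ss, so weight fields by powers of 60
            fields = [float(f) for f in value.split(":")]
            wall_seconds = sum(f * 60 ** i
                               for i, f in enumerate(reversed(fields)))
        elif label.startswith("Maximum resident set size"):
            max_rss_kb = int(value)
    return wall_seconds, max_rss_kb


# Sample text in the assumed format (numbers are illustrative only)
sample = ("\tElapsed (wall clock) time (h:mm:ss or m:ss): 4:05.50\n"
          "\tMaximum resident set size (kbytes): 8000000\n")
print(parse_time_report(sample))
```

<p>Running the compiler through <code>subprocess.run</code> with <code>capture_output=True</code> and passing the captured stderr to this function would be one way to collect a point for each array count.</p>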
<p>Initial experiments showed that, while both the wall clock time and maximum resident set size (roughly analogous to peak physical memory usage) grow approximately linearly with the number of arrays (much to my relief after reading way too many <a href="https://randomascii.wordpress.com/category/quadratic/">horror stories</a> on Bruce Dawson's <a href="https://randomascii.wordpress.com">excellent blog</a>), they still grew extremely rapidly:</p>
<p><img alt="Relationship between runner size; and compile time and memory usage" src="http://genn-team.github.io/images/blog_running_away/fig_1_linearity.png"></p>
<p>Therefore, a model with 10000 arrays will take over 4 minutes and around 8 gbyte of memory to compile, neither of which is really acceptable.
To put this in perspective, if you split a model up into about 100 populations and connect most of the permutations together (this is an all-too-reasonable assumption in many areas of the mammalian brain), you could easily reach this many variables.</p>
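<p>As a rough sanity check on that estimate, here is a small sketch of the arithmetic. It assumes one synapse group per ordered pair of populations (self-connections included); the function name and the <code>connection_fraction</code> parameter are hypothetical.</p>

```python
def num_synapse_groups(num_populations, connection_fraction=1.0):
    # One synapse group per ordered (pre, post) pair, self-connections included
    return round(connection_fraction * num_populations ** 2)


print(num_synapse_groups(100))  # 10000
print(num_synapse_groups(254))  # 64516
```

<p>The second figure equals the 64516 synapse groups quoted for the multi-area model under "Real models", so fully connected models of a few hundred populations already exceed the 10000 array test case.</p>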
<p>So....what is NVCC doing with all this time and memory?
<code>runner.cc</code> only contains host code (NVCC is just used to ensure the same compiler/options across execution units and to deal with setting up the CUDA linker/include paths) but, when you pass a 5 mbyte <code>runner.cc</code> file to NVCC, the file that is passed on to the host compiler (GCC) has grown to 15 mbyte!
However, this turned out to be simply because NVCC is in charge of running the preprocessor so that 10 mbyte is 'just' the result of expanding macros and including C++ standard library header files!</p>
<h2>Profiling</h2>
<p>Around this point, I remembered reading a blog post about <a href="https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/">profiling compiler times</a> on yet another <a href="https://aras-p.info/blog/">excellent blog</a> and turned on the <code>-ftime-report</code> GCC option.
As the blog promised, this generates a gargantuan report which starts by splitting the compilation time of this 10000 array model into 'phases':</p>
<div class="code"><pre class="code literal-block">phase setup : ... 0.00 ( 0%) wall 1384 kB ( 0%)
phase parsing : ... 20.81 ( 9%) wall 1794944 kB (25%)
phase lang. deferred : ... 0.02 ( 0%) wall 2426 kB ( 0%)
phase opt and generate : ... 214.14 (91%) wall 5412439 kB (75%)
phase finalize : ... 0.54 ( 0%) wall 0 kB ( 0%)
</pre></div>
<p>Somewhat surprisingly (as, after all, we're throwing a massive source file at GCC), the vast majority of time is spent in "opt and generate" (code-generation and optimisation) rather than in parsing.
Looking a little further down, where the report contains a seemingly unsorted list of processes within phases, the only other 'hot' line is:</p>
<div class="code"><pre class="code literal-block">expand vars : ... 101.93 (43%) wall 50597 kB ( 1%)
</pre></div>
<p>but, as a non-GCC developer, this doesn't help me a great deal....back to pursuing random hunches!</p>
<h2>Smoking gun</h2>
<p>As there are no virtual functions in this code, my C++ prejudices suggest that only exceptions could <em>possibly</em> be to blame and, as each of those <code>CHECK_CUDA_ERRORS</code> macros hides a <code>throw std::runtime_error</code>, maybe that's not unreasonable.
Generating all that zero-cost abstraction must involve <em>expanding</em> a lot of <em>variables</em>....right!?
How about if we replace our current implementation of <code>CHECK_CUDA_ERRORS</code>:</p>
<div class="code"><pre class="code literal-block"><span class="cp">#define CHECK_CUDA_ERRORS(call) {\</span>
<span class="cp"> cudaError_t error = call;\</span>
<span class="cp"> if(error != cudaSuccess) {\</span>
<span class="cp"> throw std::runtime_error(__FILE__": " \</span>
<span class="cp"> + std::to_string(__LINE__) \</span>
<span class="cp"> + ": cuda error " \</span>
<span class="cp"> + std::to_string(error) \</span>
<span class="cp"> + ": " + cudaGetErrorString(error));\</span>
<span class="cp"> }\</span>
<span class="cp">}</span>
</pre></div>
<p>with:</p>
<div class="code"><pre class="code literal-block"><span class="cp">#define CHECK_CUDA_ERRORS(call) {\</span>
<span class="cp"> cudaError_t error = call;\</span>
<span class="cp"> assert(error == cudaSuccess); \</span>
<span class="cp">}</span>
</pre></div>
<p>or even:</p>
<div class="code"><pre class="code literal-block"><span class="cp">#define CHECK_CUDA_ERRORS(call) {\</span>
<span class="cp"> cudaError_t error = call;\</span>
<span class="cp"> if(error != cudaSuccess) {\</span>
<span class="cp"> std::abort();\</span>
<span class="cp"> }\</span>
<span class="cp">}</span>
</pre></div>
<p>Some template-meddling and sweeping later we can produce:</p>
<p><img alt="Relationship between number of arrays and error handling approach; and compile time and memory usage" src="http://genn-team.github.io/images/blog_running_away/fig_2_alternate_error.png"></p>
<p>Seems like this actually works! Our 10000 array model now only takes 30 seconds and less than 1 gbyte of memory to compile which is much more reasonable!</p>
<p>However, are the exceptions <em>really</em> to blame?
Compile times seem much better when using the single-threaded CPU backend and that allocates memory with <code>new[]</code> (obviously, modern C++ rules don't apply in generated code...) which throws <code>std::bad_alloc</code> to signal failure.
Admittedly, because there's no need to copy data when everything's on the CPU, this backend generates empty 'push' and 'pull' functions so there's less code to compile overall but, if generating exception handling code was the problem, you would expect issues here too.
Maybe expanding all that message-generating code is the real issue...
How about we hack the following additional variants into the template:</p>
<div class="code"><pre class="code literal-block"><span class="cp">#define CHECK_CUDA_ERRORS(call) {\</span>
<span class="cp"> cudaError_t error = call;\</span>
<span class="cp"> if(error != cudaSuccess) {\</span>
<span class="cp"> throw std::runtime_error("");\</span>
<span class="cp"> }\</span>
<span class="cp">}</span>
</pre></div>
<p>and</p>
<div class="code"><pre class="code literal-block"><span class="cp">#define CHECK_CUDA_ERRORS(call) {\</span>
<span class="cp"> cudaError_t error = call;\</span>
<span class="cp"> if(error != cudaSuccess) {\</span>
<span class="cp"> std::cerr &lt;&lt; __FILE__ &lt;&lt; ": " &lt;&lt; __LINE__;\</span>
<span class="cp"> std::cerr &lt;&lt; ": cuda error " &lt;&lt; error &lt;&lt; ": ";\</span>
<span class="cp"> std::cerr &lt;&lt; cudaGetErrorString(error) &lt;&lt; std::endl;\</span>
<span class="cp"> std::abort();\</span>
<span class="cp"> }\</span>
<span class="cp">}</span>
</pre></div>
<p>and sweep:</p>
<p><img alt="Relationship between number of arrays and error handling approach; and compile time and memory usage" src="http://genn-team.github.io/images/blog_running_away/fig_3_more_alternate_error.png"></p>
<p><code>std::abort</code> is definitely easier on the compiler than throwing exceptions but compiling the message generation code also seems to make a large difference.</p>
<h2>MSVC</h2>
<p>The other compiler I often use with GeNN is Microsoft Visual C++.
I can't quite face repeating this whole process again but initial tests suggest that this optimisation is even more valuable here.
Using the simplest <code>std::abort</code> raising <code>CHECK_CUDA_ERRORS</code> macro, the 10000 array model can be compiled in around 19 seconds whereas, using the original exception-throwing macro...I gave up waiting after around 1 hour!</p>
<h2>Implementing a workaround</h2>
<p>GeNN is approaching the end of the 4.X release cycle so, for now, I have added a simple but slightly hacky workaround for these issues by adding a <code>generateSimpleErrorHandling</code> flag to GeNN's CUDA backend to switch from generating code with the previous full-fat <code>CHECK_CUDA_ERRORS</code> macro to the simplest version which simply calls <code>std::abort</code> without generating a message. This can be turned on from C++ like:</p>
<div class="code"><pre class="code literal-block"><span class="kt">void</span><span class="w"> </span><span class="nf">modelDefinition</span><span class="p">(</span><span class="n">NNmodel</span><span class="w"> </span><span class="o">&amp;</span><span class="n">model</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="p">...</span>
<span class="w"> </span><span class="n">GENN_PREFERENCES</span><span class="p">.</span><span class="n">generateSimpleErrorHandling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">true</span><span class="p">;</span>
<span class="w"> </span><span class="p">...</span>
<span class="p">}</span>
</pre></div>
<p>or from Python like:</p>
<div class="code"><pre class="code literal-block"><span class="n">model</span> <span class="o">=</span> <span class="n">GeNNModel</span><span class="p">(</span><span class="s2">"float"</span><span class="p">,</span> <span class="s2">"my_model"</span><span class="p">,</span> <span class="n">generateSimpleErrorHandling</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
<h2>Real models</h2>
<p>The largest model we currently have to play with is the <a href="https://github.com/neworderofjamie/multi-area-model/">multi-area cortical model</a>.
Although it has 64516 synapse groups, due to its use of procedural connectivity (where all synaptic connectivity, weights and delays are generated on the fly), it doesn't actually have any per-synapse group variables with push and pull functions.
Nonetheless, using the new simple error handling reduces the compilation time of the <code>runner.cc</code> from 155 to 129 seconds.
Finally, although it is not possible to <em>run</em> the model in this way as no single GPU has enough memory, we can generate a <code>runner.cc</code> from this model with standard, in-memory sparse connectivity and push and pull functions for each variable.
This results in a nightmarish, 114 mbyte <code>runner.cc</code> which, using the original <code>CHECK_CUDA_ERRORS</code> macro, would definitely be impossible to compile on any reasonable machine.
However, using the new simplified macro, the runner can be compiled in just over 20 minutes and requires just over 14 gbyte of memory — still pretty unusable but definitely progress!</p>
<h2>Long-term solutions</h2>
<p>The majority of the time, the errors which the <code>CHECK_CUDA_ERRORS</code> macro is aiming to catch are out-of-memory errors in the <code>allocateMem</code> function and errors that occurred during (asynchronous) kernel launches that are only caught at the next push or pull call (which are typically the main synchronisation points). Perhaps, in future, we could adopt a more targeted error-handling approach which provides a balance between sufficient debugging information and compilation time.</p>
<p>However, while the error handling changes discussed here allow the current approach to generating <code>runner.cc</code> files to scale a bit further, the code we are generating is still pretty pathological, not least because <a href="https://github.com/genn-team/genn/issues/408">the Windows PE executable format has a limit of 65535 symbols</a> which you can hit quite easily with a large model.
Early this year, I made an attempt at re-writing the code generator to apply the same merging strategy GeNN uses elsewhere to <code>runner.cc</code>.
This means that all the variables associated with neuron and synapse populations with the same types of state variable can be allocated using one piece of shared generated code.
While this works, it adds yet more complexity to GeNN and fundamentally breaks the 'classic' way of using GeNN from C++, where you link some C++ simulation code against your generated code and can access state variables directly by name.
However, based on this investigation, maybe that project needs resurrecting!</p>
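<p>The merging idea can be illustrated with a toy sketch. This is not GeNN's code generator: the dictionaries and <code>merge_by_signature</code> are hypothetical stand-ins, and the sketch assumes populations can share allocation code whenever their state variables have identical names and types.</p>

```python
from collections import defaultdict


def merge_by_signature(populations):
    """Group population names by their (variable name, type) signature."""
    merged = defaultdict(list)
    for name, variables in populations.items():
        # Sort so that declaration order does not affect the signature
        signature = tuple(sorted(variables.items()))
        merged[signature].append(name)
    return dict(merged)


# Hypothetical populations and state variables, for illustration only
populations = {
    "exc": {"V": "float", "U": "float"},
    "inh": {"U": "float", "V": "float"},
    "poisson": {"rate": "float"},
}
for signature, names in merge_by_signature(populations).items():
    print(signature, names)
```

<p>With this grouping, one piece of generated allocation code per signature would be needed rather than one per population, which is where the saving in <code>runner.cc</code> size would come from.</p>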
<p>All the code I've developed to explore this problem is available from <a href="https://github.com/neworderofjamie/nvcc_breaker">my GitHub</a>.</p></description><guid>http://genn-team.github.io/posts/running-away.html</guid><pubDate>Tue, 01 Nov 2022 13:35:07 GMT</pubDate></item><item><title>Software Developer Blog: How to do convolutions with doubly blocked Toeplitz matrices</title><link>http://genn-team.github.io/posts/sw_blog_toeplitz.html</link><dc:creator>GeNN Team</dc:creator><description><h2>How to do convolutions with doubly blocked Toeplitz matrices</h2>
<p>A few weeks ago, Jamie (@neworderofjamie) asked me on the chat whether I knew what doubly blocked Toeplitz matrices are and how they implement convolutions. I had no clue. Since then we have implemented convolutions using doubly blocked Toeplitz matrices in GeNN and found them to be extremely useful and efficient.
In this software blog I will give a brief overview of why and how convolutions relate to doubly blocked Toeplitz matrices. My blog is based on Ali Salehi's tutorial <a href="https://raw.githubusercontent.com/alisaaalehi/convolution_as_multiplication/master/ConvAsMulExplained.pdf">Convolution as Matrix Multiplication</a>, but updated to use machine-learning rather than signal-processing conventions, and I try to avoid too many unusual ways of re-arranging rows and columns.</p>
<h3>The why</h3>
<p>Let us consider the convolution of a \(2\times 2\) kernel with a \(3\times 3\) layer. We denote the kernel as
\[
K= \left(\matrix{
k_{11} &amp; k_{12} \cr
k_{21} &amp; k_{22}}\right)
\]
and the layer as
\[
I= \left(\matrix{
i_{11} &amp; i_{12} &amp; i_{13} \cr
i_{21} &amp; i_{22} &amp; i_{23} \cr
i_{31} &amp; i_{32} &amp; i_{33}
} \right).
\]
Convolution, in the machine-learning sense of the term, then means calculating the cross-correlation of the kernel "moving across" the layer as illustrated below. The layer \(I\) is in blue, the kernel \(K\) in grey and the result \(R\) in green.</p>
<table>
<thead>
<tr>
<th style="text-align: center;"><sup id="fnref:1"><a class="footnote-ref" href="http://genn-team.github.io/posts/sw_blog_toeplitz.html#fn:1">1</a></sup><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_00.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_01.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_02.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_03.png"></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">\(r_{11}\)</td>
<td style="text-align: center;">\(r_{12}\)</td>
<td style="text-align: center;">\(r_{13}\)</td>
<td style="text-align: center;">\(r_{14}\)</td>
</tr>
</tbody>
</table>
<p>For the first non-zero entry at \((1,1)\) of the result matrix \(R\), we therefore have \(r_{11} = k_{22} i_{11}\).
Then the kernel moves one over and \(r_{12} = k_{21}i_{11} + k_{22} i_{12}\). Then, \(r_{13} = k_{21}i_{12} + k_{22} i_{13}\) and \(r_{14} = k_{21}i_{13} \).</p>
<table>
<thead>
<tr>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_04.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_05.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_06.png"></th>
<th style="text-align: center;"><img alt="Illustration of convolution step" src="http://genn-team.github.io/images/blog_07.png"></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">\(r_{21}\)</td>
<td style="text-align: center;">\(r_{22}\)</td>
<td style="text-align: center;">\(r_{23}\)</td>
<td style="text-align: center;">\(r_{24}\)</td>
</tr>
</tbody>
</table>
<p>So, for the second row,
\(r_{21} = k_{12} i_{11} + k_{22} i_{21} \), move one over,
\(r_{22} = k_{11} i_{11} + k_{12} i_{12} + k_{21} i_{21} + k_{22} i_{22} \), one more to the right,
\(r_{23} = k_{11}i_{12} + k_{12} i_{13} + k_{21} i_{22} + k_{22} i_{23} \), and finally
\(r_{24} = k_{11}i_{13} + k_{21} i_{23} \).</p>
<p>It works similarly for the remaining two rows.</p>
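<p>The entries written out above can be reproduced with a direct, unoptimised loop. This is a minimal pure-Python sketch, not GeNN code; the function name <code>full_cross_correlation</code> and the numeric test values are assumptions for illustration, with the layer treated as zero outside its bounds.</p>

```python
def full_cross_correlation(kernel, layer):
    """Slide the kernel over the zero-padded layer, one output per offset."""
    k_h, k_w = len(kernel), len(kernel[0])
    i_h, i_w = len(layer), len(layer[0])
    out_h, out_w = i_h + k_h - 1, i_w + k_w - 1
    result = [[0] * out_w for _ in range(out_h)]
    for a in range(out_h):
        for b in range(out_w):
            for m in range(k_h):
                for n in range(k_w):
                    # Layer entry currently underneath kernel entry (m, n)
                    y = a - (k_h - 1) + m
                    x = b - (k_w - 1) + n
                    # Entries outside the layer count as zero
                    if y in range(i_h) and x in range(i_w):
                        result[a][b] += kernel[m][n] * layer[y][x]
    return result


# k11..k22 = 1..4 and i11..i33 = 1..9, chosen only for illustration
K = [[1, 2],
     [3, 4]]
I = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
for row in full_cross_correlation(K, I):
    print(row)
```

<p>With these values, the top-left entry is \(k_{22} i_{11} = 4\) and the entry at \((2,2)\) is \(k_{11} i_{11} + k_{12} i_{12} + k_{21} i_{21} + k_{22} i_{22} = 37\), matching the expressions above.</p>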
<p>If we unroll the layer \(I\) row-wise into a column vector \(I_\text{col}\),
\[
I_\text{col} =
\left(
\matrix{
i_{11} \cr
i_{12} \cr
i_{13} \cr
i_{21} \cr
i_{22} \cr
i_{23} \cr
i_{31} \cr
i_{32} \cr
i_{33}
}
\right),
\]
then we can express this as a matrix-vector multiplication of a matrix formed from the entries of the kernel \(K\) and the vector \(I_\text{col}\),
\[
\left(\matrix{
k_{22} &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
k_{21} &amp; k_{22} &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; k_{21} &amp; k_{22} &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; k_{21} &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
k_{12} &amp; 0 &amp; 0 &amp; k_{22} &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
k_{11} &amp; k_{12} &amp; 0 &amp; k_{21} &amp; k_{22} &amp; 0 &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; k_{11} &amp; k_{12} &amp; 0 &amp; k_{21} &amp; k_{22} &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; k_{11} &amp; 0 &amp; 0 &amp; k_{21} &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; k_{12} &amp; 0 &amp; 0 &amp; k_{22} &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; k_{11} &amp; k_{12} &amp; 0 &amp; k_{21} &amp; k_{22} &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{11} &amp; k_{12} &amp; 0 &amp; k_{21} &amp; k_{22} \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{11} &amp; 0 &amp; 0 &amp; k_{21} \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{12} &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{11} &amp; k_{12} &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{11} &amp; k_{12} \cr
0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; k_{11} }\right)
\cdot
\left(\matrix{
i_{11} \cr
i_{12} \cr
i_{13} \cr
i_{21} \cr
i_{22} \cr
i_{23} \cr
i_{31} \cr
i_{32} \cr
i_{33}}
\right)
\]</p>
<p>Now one can already see that the matrix formed from the kernel entries has a very peculiar shape: that of a doubly blocked Toeplitz matrix.</p>
<h3>Doubly blocked Toeplitz matrix</h3>
<p>A Toeplitz matrix is a matrix where the values along all diagonals are constant, i.e.</p>
<p>\[
\left(
\matrix{
a_{0} &amp; a_{-1} &amp; a_{-2} &amp; \cdots &amp; \cdots &amp; \cdots &amp; a_{-(N-1)} \cr
a_{1} &amp; a_{0} &amp; a_{-1} &amp; a_{-2} &amp; &amp; &amp; \vdots \cr
a_{2} &amp; a_{1} &amp; a_{0} &amp; a_{-1} &amp; &amp; &amp; \vdots \cr
\vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots \cr
\vdots &amp; &amp; &amp; \ddots &amp; a_{0} &amp; a_{-1} &amp; a_{-2} \cr
\vdots &amp; &amp; &amp; &amp; a_{1} &amp; a_{0} &amp; a_{-1} \cr
a_{M-1} &amp; \cdots &amp; \cdots &amp; \cdots &amp; a_{2} &amp; a_{1} &amp; a_{0} }
\right) .
\]</p>
<p>Furthermore, if we build a matrix \(A\) out of Toeplitz sub-matrices \(A_{k}\) <em>and</em> the structure of \(A\) with respect to these submatrices is also Toeplitz:</p>
<p>\[
A = \left(
\matrix{
A_{0} &amp; A_{-1} &amp; \cdots &amp; A_{-(L-1)} \cr
A_{1} &amp; A_{0} &amp; \cdots &amp; A_{-(L-2)} \cr
\vdots &amp; \vdots &amp; \ddots &amp; \vdots \cr
A_{K} &amp; A_{K-1} &amp; \cdots &amp; A_{K-(L-1)}}
\right),
\]</p>
<p>then this matrix is called a doubly blocked Toeplitz matrix. A standard way to generate a Toeplitz matrix from a vector \(v\) is to use \(v\) as the first column, then shift it cyclically down by one entry and use the result as the second column, and so on.</p>
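<p>The cyclic-shift construction can be sketched as follows. The vector and the number of columns are assumptions made for this example. The vector ends in zeros, so that the entries wrapped around to the top by each shift are zeros; the result then coincides with what <code>scipy.linalg.toeplitz</code> produces from the same first column and a first row of the form \((v_0, 0, 0)\).</p>

```python
import numpy as np
from scipy.linalg import toeplitz

# zero-padded vector, used as the first column (example values)
v = np.array([40, 30, 0, 0])
n_cols = 3  # number of columns to generate (an assumption for this example)

# column j is v cyclically shifted down by j positions
T_cyclic = np.column_stack([np.roll(v, j) for j in range(n_cols)])

# same matrix from scipy: first column v, first row (v[0], 0, 0)
T_scipy = toeplitz(v, np.r_[v[0], np.zeros(n_cols - 1, dtype=v.dtype)])

print(T_cyclic)
assert np.array_equal(T_cyclic, T_scipy)
```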
<h3>The method</h3>
<p>As we have seen in the example above, 2D convolution operations can be expressed as multiplication by a doubly blocked Toeplitz matrix. As a general method, applied to the example above,
to convolve \(K\) with \(I\), we first flip \(K\) across the horizontal and vertical axes and pad it to the output size \((I_\text{height} + K_\text{height} - 1) \times (I_\text{width} + K_\text{width} - 1)\) of the convolution.
For instance, convolving the \(3 \times 3\) layer \(I\) with the \(2 \times 2\) kernel \(K\) from above leads to an output of size \(4 \times 4\).
Depending on the padding mode used by the convolution, typically only part of this output is actually required.
The flipped and padded kernel \(K\) from above is
\[
K_\text{pad}=
\left(
\matrix{
k_{22} &amp; k_{21} &amp; 0 &amp; 0 \cr
k_{12} &amp; k_{11} &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 &amp; 0 }
\right)
\]</p>
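<p>A minimal sketch of the flip-and-pad step, assuming a \(2 \times 2\) kernel and a \(3 \times 3\) layer as in the example. To make the result easy to compare with \(K_\text{pad}\), each kernel entry \(k_{ij}\) is encoded as the number "ij"; this encoding is only a device for this illustration.</p>

```python
import numpy as np

# k_ij encoded as the number "ij" (illustrative values only)
K = np.array([[11, 12],
              [21, 22]])
I_shape = (3, 3)

# full output size: (I_height + K_height - 1, I_width + K_width - 1)
R_shape = (I_shape[0] + K.shape[0] - 1, I_shape[1] + K.shape[1] - 1)

# flip along both axes, then pad with zeros at the bottom and on the right
K_flipped = np.flip(K)
K_pad = np.pad(K_flipped, ((0, R_shape[0] - K.shape[0]),
                           (0, R_shape[1] - K.shape[1])))
print(K_pad)
```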
<p>We then convert each <em>row vector</em> of this matrix into a Toeplitz matrix \(F_i\) with \(I_\text{width}\) columns, as described above:
\[
F_0=
\left(
\matrix{
k_{22} &amp; 0 &amp; 0 \cr
k_{21} &amp; k_{22} &amp; 0 \cr
0 &amp; k_{21} &amp; k_{22} \cr
0 &amp; 0 &amp; k_{21}}
\right)
\quad
F_1=
\left(
\matrix{
k_{12} &amp; 0 &amp; 0 \cr
k_{11} &amp; k_{12} &amp; 0 \cr
0 &amp; k_{11} &amp; k_{12} \cr
0 &amp; 0 &amp; k_{11}}
\right)
\]
\[
F_2=
\left(
\matrix{
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0}
\right)
\quad
F_3=
\left(
\matrix{
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0 \cr
0 &amp; 0 &amp; 0}
\right)
\]
and, finally, assemble these into a doubly blocked Toeplitz matrix \(F\):</p>
<p>\[
F=
\left(
\matrix{
F_0 &amp; F_3 &amp; F_2 \cr
F_1 &amp; F_0 &amp; F_3 \cr
F_2 &amp; F_1 &amp; F_0 \cr
F_3 &amp; F_2 &amp; F_1
}
\right)
\]</p>
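<p>The block pattern above follows a simple rule: the block in block-row \(i\) and block-column \(j\) is \(F_{(i-j) \bmod 4}\). The sketch below assembles \(F\) this way with <code>np.block</code>, again encoding \(k_{ij}\) as the number "ij". It is an alternative, compact formulation written for illustration, with variable names chosen here.</p>

```python
import numpy as np
from scipy.linalg import toeplitz

K = np.array([[11, 12], [21, 22]])  # k_ij encoded as the number "ij"
I_rows, I_cols = 3, 3
R_rows, R_cols = I_rows + K.shape[0] - 1, I_cols + K.shape[1] - 1

# flipped and zero-padded kernel
K_pad = np.pad(np.flip(K), ((0, R_rows - K.shape[0]), (0, R_cols - K.shape[1])))

# one Toeplitz block F_i per row of the padded kernel
blocks = [toeplitz(row, np.r_[row[0], np.zeros(I_cols - 1, dtype=row.dtype)])
          for row in K_pad]

# block (i, j) of F is F_{(i - j) mod R_rows}
F = np.block([[blocks[(i - j) % R_rows] for j in range(I_rows)]
              for i in range(R_rows)])
print(F.shape)  # (16, 9)
```

<p>The resulting \(16 \times 9\) matrix has the same entries as the matrix written out explicitly in the example above.</p>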
<p>The convolution of \(K\) with \(I\)
is then given by multiplying \(F\) from the left onto \(I_\text{col}\) as defined above,
\[
R_{\text{col}} = F \cdot I_\text{col} \quad
\Leftrightarrow \quad R_{\text{col},j}= \sum_i F_{ji}I_{\text{col},i}
\]</p>
<p>Finally, \(R_{\text{col}}\) can be reinterpreted as the output matrix \(R\) by arranging its entries row-wise in a \(4\times 4\) matrix.</p>
<p>There we have it - convolution (in the machine learning sense, i.e. cross-correlation) of a kernel \(K\) with a layer \(I\) expressed as the product of a doubly blocked Toeplitz matrix derived from \(K\) with the column vector of the row-wise unrolled entries from \(I\).</p>
<p>The following Python function is a simple implementation of this method:</p>
<div class="code"><pre class="code literal-block"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy.linalg</span> <span class="kn">import</span> <span class="n">toeplitz</span>
<span class="k">def</span> <span class="nf">convolution</span><span class="p">(</span><span class="n">I</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span> <span class="kc">False</span><span class="p">):</span>
    <span class="c1"># flip the kernel</span>
    <span class="n">K</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">fliplr</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">flipud</span><span class="p">(</span><span class="n">K</span><span class="p">))</span>
    <span class="c1"># calculate sizes</span>
    <span class="n">K_row_num</span><span class="p">,</span> <span class="n">K_col_num</span><span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">I_row_num</span><span class="p">,</span> <span class="n">I_col_num</span><span class="o">=</span> <span class="n">I</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">R_row_num</span><span class="o">=</span> <span class="n">K_row_num</span><span class="o">+</span><span class="n">I_row_num</span><span class="o">-</span><span class="mi">1</span>
    <span class="n">R_col_num</span><span class="o">=</span> <span class="n">K_col_num</span><span class="o">+</span><span class="n">I_col_num</span><span class="o">-</span><span class="mi">1</span>
    <span class="c1"># pad the kernel</span>
    <span class="n">K_pad</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="p">((</span><span class="mi">0</span><span class="p">,</span><span class="n">R_row_num</span> <span class="o">-</span> <span class="n">K_row_num</span><span class="p">),</span>
                      <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">R_col_num</span> <span class="o">-</span> <span class="n">K_col_num</span><span class="p">)),</span>
                  <span class="s1">'constant'</span><span class="p">,</span> <span class="n">constant_values</span><span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"padded kernel= </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="n">K_pad</span><span class="p">)</span>
    <span class="c1"># Assemble the list of Toeplitz matrices F_i</span>
    <span class="n">toeplitz_list</span><span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">R_row_num</span><span class="p">):</span>
        <span class="n">c</span><span class="o">=</span> <span class="n">K_pad</span><span class="p">[</span><span class="n">i</span><span class="p">,:]</span>
        <span class="n">r</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">r_</span><span class="p">[</span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">I_col_num</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>
        <span class="n">toeplitz_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">toeplitz</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">r</span><span class="p">)</span><span class="o">.</span><span class="n">copy</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Toeplitz list= </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="n">toeplitz_list</span><span class="p">)</span>
    <span class="c1"># make a matrix with the indices of the block F_i </span>
    <span class="c1"># of the doubly blocked Toeplitz matrix</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">R_row_num</span><span class="p">))</span>
    <span class="c1"># one index per block column, so that kernels with a single row also work</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">r_</span><span class="p">[</span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]][:</span><span class="n">I_row_num</span><span class="p">]</span>
    <span class="n">doubly_indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">toeplitz</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">r</span><span class="p">)</span><span class="o">.</span><span class="n">copy</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"doubly_indices= </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="n">doubly_indices</span><span class="p">)</span>
    <span class="c1"># assemble the doubly blocked toeplitz matrix</span>
    <span class="n">toeplitz_m</span><span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">R_row_num</span><span class="p">):</span>
        <span class="n">row</span><span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">I_row_num</span><span class="p">):</span>
            <span class="n">row</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">toeplitz_list</span><span class="p">[</span><span class="n">doubly_indices</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]])</span>
        <span class="n">row</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
        <span class="n">toeplitz_m</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
    <span class="n">toeplitz_m</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">(</span><span class="n">toeplitz_m</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Toeplitz matrix= </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span><span class="n">toeplitz_m</span><span class="p">)</span>
    <span class="c1"># make layer into column vector</span>
    <span class="n">I_col</span><span class="o">=</span> <span class="n">I</span><span class="o">.</span><span class="n">flatten</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"I_col= "</span><span class="p">,</span> <span class="n">I_col</span><span class="p">)</span>
    <span class="n">R</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">toeplitz_m</span><span class="p">,</span> <span class="n">I_col</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s1">'R as vector= </span><span class="se">\n</span><span class="s1">'</span><span class="p">,</span> <span class="n">R</span><span class="p">)</span>
    <span class="n">R</span><span class="o">=</span> <span class="n">R</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">R_row_num</span><span class="p">,</span> <span class="n">R_col_num</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">verbose</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s1">'R as matrix= </span><span class="se">\n</span><span class="s1">'</span><span class="p">,</span> <span class="n">R</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">R</span>
</pre></div>
<p>To test, one can, for instance, use</p>
<div class="code"><pre class="code literal-block"><span class="c1"># kernel</span>
<span class="n">K</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">],[</span><span class="mi">30</span><span class="p">,</span><span class="mi">40</span><span class="p">]])</span>
<span class="c1"># layer</span>
<span class="n">I</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],[</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span>
<span class="n">R</span><span class="o">=</span> <span class="n">convolution</span><span class="p">(</span><span class="n">I</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
</pre></div>
<p>The output would then be</p>
<div class="code"><pre class="code literal-block">padded kernel=
 [[40 30  0  0]
 [20 10  0  0]
 [ 0  0  0  0]]
Toeplitz list=
 [array([[40.,  0.,  0.],
       [30., 40.,  0.],
       [ 0., 30., 40.],
       [ 0.,  0., 30.]]), array([[20.,  0.,  0.],
       [10., 20.,  0.],
       [ 0., 10., 20.],
       [ 0.,  0., 10.]]), array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])]
doubly_indices=
 [[0 2]
 [1 0]
 [2 1]]
Toeplitz matrix=
 [[40.  0.  0.  0.  0.  0.]
 [30. 40.  0.  0.  0.  0.]
 [ 0. 30. 40.  0.  0.  0.]
 [ 0.  0. 30.  0.  0.  0.]
 [20.  0.  0. 40.  0.  0.]
 [10. 20.  0. 30. 40.  0.]
 [ 0. 10. 20.  0. 30. 40.]
 [ 0.  0. 10.  0.  0. 30.]
 [ 0.  0.  0. 20.  0.  0.]
 [ 0.  0.  0. 10. 20.  0.]
 [ 0.  0.  0.  0. 10. 20.]
 [ 0.  0.  0.  0.  0. 10.]]
I_col=  [1 2 3 4 5 6]
R as vector=
 [ 40. 110. 180.  90. 180. 370. 470. 210.  80. 140. 170.  60.]
R as matrix=
 [[ 40. 110. 180.  90.]
 [180. 370. 470. 210.]
 [ 80. 140. 170.  60.]]
</pre></div>
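<p>As an independent cross-check, the same numbers can be obtained from <code>scipy.signal.correlate2d</code>, which is not used in the implementation above and is brought in here only for comparison. The sketch also illustrates the earlier remark about padding modes: the "valid" result is the central part of the full output.</p>

```python
import numpy as np
from scipy.signal import correlate2d

K = np.array([[10, 20], [30, 40]])
I = np.array([[1, 2, 3], [4, 5, 6]])

# full cross-correlation: every position where K overlaps I at least partly
R_full = correlate2d(I, K, mode="full")
print(R_full)

# "valid" keeps only the positions where K lies completely inside I;
# it equals the full result with K_height - 1 rows and K_width - 1 columns
# cropped from each border
R_valid = correlate2d(I, K, mode="valid")
kh, kw = K.shape
assert np.array_equal(R_valid, R_full[kh - 1:R_full.shape[0] - kh + 1,
                                      kw - 1:R_full.shape[1] - kw + 1])
```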
<p>Note that this example is inspired by <a href="https://raw.githubusercontent.com/alisaaalehi/convolution_as_multiplication/master/ConvAsMulExplained.pdf">Salehi's tutorial</a>, but because we are calculating the machine learning convolution (cross-correlation) whereas Salehi calculates the mathematical convolution as used in signal processing, the results are not the same. To generate identical results, one can use the doubly flipped kernel,</p>
<div class="code"><pre class="code literal-block"><span class="c1"># kernel</span>
<span class="n">K</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">40</span><span class="p">,</span><span class="mi">30</span><span class="p">],[</span><span class="mi">20</span><span class="p">,</span><span class="mi">10</span><span class="p">]])</span>
<span class="c1"># layer</span>
<span class="n">I</span><span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],[</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span>
<span class="n">R</span><span class="o">=</span> <span class="n">convolution</span><span class="p">(</span><span class="n">I</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span> <span class="kc">False</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"R= </span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="n">R</span><span class="p">)</span>
</pre></div>
<p>and obtain</p>
<div class="code"><pre class="code literal-block"> <span class="n">R</span><span class="o">=</span>
 <span class="p">[[</span> <span class="mf">10.</span>  <span class="mf">40.</span>  <span class="mf">70.</span>  <span class="mf">60.</span><span class="p">]</span>
 <span class="p">[</span> <span class="mf">70.</span> <span class="mf">230.</span> <span class="mf">330.</span> <span class="mf">240.</span><span class="p">]</span>
 <span class="p">[</span><span class="mf">120.</span> <span class="mf">310.</span> <span class="mf">380.</span> <span class="mf">240.</span><span class="p">]]</span>
</pre></div>
<p>which is exactly Salehi's result.</p>
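<p>The relation between the two conventions can also be sketched with SciPy's own routines, <code>scipy.signal.convolve2d</code> and <code>scipy.signal.correlate2d</code> (used here purely for illustration): the mathematical convolution with a kernel equals the cross-correlation with the doubly flipped kernel.</p>

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

K = np.array([[10, 20], [30, 40]])
I = np.array([[1, 2, 3], [4, 5, 6]])

# mathematical (signal-processing) convolution of I with K
R_conv = convolve2d(I, K, mode="full")

# cross-correlation with the doubly flipped kernel [[40, 30], [20, 10]]
R_corr = correlate2d(I, np.flip(K), mode="full")

print(R_conv)
assert np.array_equal(R_conv, R_corr)
```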
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Convolution images created with software from:
Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning (2016) ArXiv e-prints 1603.07285; <a href="https://github.com/vdumoulin/conv_arithmetic">Software on github</a> <a class="footnote-backref" href="http://genn-team.github.io/posts/sw_blog_toeplitz.html#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div></description><category>convolution</category><category>math</category><category>toeplitz</category><guid>http://genn-team.github.io/posts/sw_blog_toeplitz.html</guid><pubDate>Tue, 21 Dec 2021 14:39:44 GMT</pubDate></item></channel></rss>