
MLP Forward Propagation on DFE: Results and Optimization


Goal

The structure of the DFE implementation of the deep learning inference is detailed on the following page: MLP Forward Propagation on DFE: Structure. This design is not optimized and has a very long execution time, far from both the CPU and the best achievable FPGA performance. These results are shown in the next section, "Basic design".

This page presents the different approaches used to optimize the application and improve the execution time. Each section explains the new approach and the intermediate results obtained.

Basic design

The table below shows the time in seconds to execute the forward propagation using the basic adaptation, for different numbers of images.

The times can be predicted very accurately. The number of ticks is known and has to be set in the manager (see Engine Interface). The FPGAs on Jumax run at 100 MHz by default, so the time to execute a program is simply nbTicks / 100 MHz. The theoretical time for the maximal number of images (60'000) is shown in the result tables.
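
As a minimal sketch of this prediction (the tick count below is an illustrative placeholder derived from the table, not the value actually generated by the manager):

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative tick count for 60'000 images; the real value is set in the manager.
    const uint64_t nbTicks = 36587000000ULL;
    const double frequencyHz = 100e6;   // default Jumax DFE clock: 100 MHz

    // Predicted execution time = nbTicks / frequency
    printf("theoretical time: %.2f s\n", nbTicks / frequencyHz);   // ~365.87 s
    return 0;
}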

| Result ID | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|
| 0 | 60.28s | 179.31s | 364.64s | 365.87s |

We can observe that the value obtained in practice is very close to the theoretical one; in fact, it is even a bit smaller. This is because the theoretical calculation is naive and sums the execution of each layer sequentially. In reality, the blocks in the manager work asynchronously: output values from the first kernel are buffered inside FIFO memory blocks until they can be used by the second one, and as soon as a complete vector is available, the second kernel computes its first dot product. The measured time is therefore a bit lower than the one computed in the table.

The table below shows the percentage of logic used out of the total available on the engine.

| Result ID | LookUp Tables (LUTs) | FlipFlops (FFs) | Digital Signal Processors (DSPs) | Block memory (BRAM) |
|---|---|---|---|---|
| 0 | 1.9% | 1.8% | 0.58% | 4.14% |

The logic usage is one of the key points when optimizing a DFE application because, together with the I/O transfer speed, it is a limiting factor. Once the FPGA is fully used, the design cannot be optimized further, or the parallelization has to be done in a lighter way (by using different optimization techniques, for example).

Here the logic used is very low. LookUp Tables (LUTs) are used for logic operations like OR, XOR and equality, but also for additions. 1 LUT is required to operate on 1 bit, which means that adding two 32-bit fixed point numbers takes 32 LUTs.

FlipFlops (FFs) are used to form registers that store a value in the pipeline. This is useful when a value must be used after some delay, for example while it waits for another operation to complete. Storing a 32-bit value requires 32 FFs, i.e. 1 FF per bit.

Digital Signal Processors (DSPs) are microprocessor units used to perform multiplications. Multiplying two 32-bit fixed point numbers uses 4 DSPs.

Block memory (BRAM) consists of 36 Kbit memory units on the FPGA. They are used for the Fast Memory (FMEM) declared in the kernels and for FIFO blocks in the manager to buffer data. Each memory declared in the kernel has two read and two write ports. If more ports are needed than are available at a given clock cycle, the memory is duplicated.

For the basic implementation, the logic usage is very low, which means that there is room for parallelization.

Input vectors

A first, very intuitive parallelization is to use a vector type as input. MaxCompiler provides multiple composite types such as vectors and complex numbers. Vectors are wrappers around data that allow multiple input values to be read at each clock cycle.

The original multiplication between one input and one weight now becomes a dot product between two vectors. The dot product is simply the element-wise multiplication of both vectors followed by a recursive sum over the elements of the resulting vector. This operation can be performed in a single tick, at the cost of more logic usage. The global product detailed in the basic application is now split into multiple partial dot products, each computed in a single tick.

The inputs are now of a vector type defined with the following piece of code. The vector size must be a divisor of the input size of one sample.

DFEVectorType<DFEVar> parallelVec = new DFEVectorType<DFEVar>(dfeFloat(8,24), vecSize);
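
One MNIST sample has 784 pixels (28 x 28), so vecSize for the first layer has to divide 784. A small host-side sketch (not part of the project code) makes the valid sizes explicit:

#include <cassert>
#include <cstdio>

int main() {
    const int inputSize = 784;   // pixels per MNIST image (28 x 28)
    const int vecSize = 28;      // candidate input vector size for the first layer

    // The vector size must evenly divide the per-sample input size.
    assert(inputSize % vecSize == 0);

    // List all valid vector sizes for this layer.
    printf("valid vecSize values for %d inputs:", inputSize);
    for (int v = 1; v <= inputSize; v++)
        if (inputSize % v == 0) printf(" %d", v);
    printf("\n");   // 1 2 4 7 8 14 16 28 49 56 98 112 196 392 784
    return 0;
}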

The dot product is achieved using an external Kernel component:

DotProductKernel dp = new DotProductKernel(getKernel(), vecSize, dfeFloat(8,24));
dp.setInputs(inPort, weight);
mul = dp.getOutput();

The kernel code of the dot product is available here: DotProductKernel.maxj.
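
For reference, a plain C++ equivalent of what the dot product kernel computes could be sketched as follows (the function names are illustrative; on the DFE the multiplications happen in parallel and the recursive sum becomes an adder tree evaluated within one tick):

#include <vector>

// Pairwise (recursive) sum over the element-wise products, mirroring the adder
// tree that the kernel unrolls in hardware.
static float pairwiseSum(const std::vector<float>& v, size_t lo, size_t hi) {
    if (hi == lo) return 0.0f;
    if (hi - lo == 1) return v[lo];
    size_t mid = lo + (hi - lo) / 2;
    return pairwiseSum(v, lo, mid) + pairwiseSum(v, mid, hi);
}

float dotProduct(const std::vector<float>& in, const std::vector<float>& weight) {
    std::vector<float> prod(in.size());
    for (size_t i = 0; i < in.size(); i++)
        prod[i] = in[i] * weight[i];           // vecSize multiplications, parallel on the DFE
    return pairwiseSum(prod, 0, prod.size());  // adder tree of depth log2(vecSize)
}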

The table below shows the results for different input vector sizes for both layers, compared to the previous basic implementation (first line of the table). The input vector size can be set to a different value for each layer.

| Result ID | Input Dim1 | Input Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 60.28s | 179.31s | 364.64s | 365.87s |
| 1 | 8 | 8 | 7.95s | 22.45s | 45.47s | 45.73s |
| 2 | 16 | 16 | 3.8s | 11.35s | 22.65s | 22.86s |

The result is clear: by parallelizing the input data by 8, the runtime is divided by a factor of 8. With vectors of size 16, the execution time is 16 times smaller. This is because the computation time is much larger than, and dominates, the I/O transfer and memory access times. As seen before, the computation time can be derived from the number of ticks; in this case the number of ticks is simply divided by the input vector size.

Output vectors

With the previous optimization, multiple multiplications are performed each clock cycle, but only for one single output at a time. The allocated memory is implemented to reuse the inputs to compute the next output a few ticks later (see FMEM for multiple dot products). Another parallelization idea is to take advantage of the fact that an input vector entering the pipeline from memory can be used to execute several dot products with different weights, producing several outputs at each clock cycle. This behavior can be achieved using a for-loop and accessing the correct weights in memory.

for (int r = 0; r < outVecSize; r++) {
  address = wcount + r*nbVecI.cast(dfeUInt(32));
    	
  // Read correct weight chunk in memory
  weight = wMem.read(address.cast(dfeUInt(MathUtils.bitsToAddress(nbWeightVec))));
    	
  // Perform dot product between input and weights vector
  DotProductKernel dp = new DotProductKernel(getKernel(), inVecSize, dfeFloat(8,24));
  dp.setInputs(inPort, weight);
  tmp = dp.getOutput();
    	
  // Add all corresponding dot products together
  DFEVar carriedSum = dfeFloat(8,24).newInstance(this);
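  // Feedback loop: carriedSum takes the value of newSum produced loopLatency ticks
  // earlier (stream.offset with a negative offset) and is reset to 0 when w == 0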
  DFEVar sum = (w.eq(0)) ? constant.var(0).cast(dfeFloat(8,24)) : carriedSum;
  newSum = sum + tmp;	
  carriedSum.connect(stream.offset(newSum, -loopLatency));
    	
  // Add corresponding bias and apply activation function
  s[r] <== newSum + biases[r];
  x[r] <== tanh(s[r]);	
}

The outVecSize variable represents the number of simultaneous dot products performed with the same input. The address variable jumps to the correct memory chunk for the current input. MaxCompiler allows vector elements to be accessed with the [] operator.

The following table shows the results of applying the parallelization implemented above to multiple input configurations. The first three lines are taken from the previous results table and serve as a comparison with the first implementations.

| Result ID | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 60.28s | 179.31s | 364.64s | 365.87s |
| 1 | 8 | 8 | 1 | 1 | 7.95s | 22.45s | 45.47s | 45.73s |
| 2 | 16 | 16 | 1 | 1 | 3.8s | 11.35s | 22.65s | 22.86s |
| 3 | 16 | 16 | 2 | 1 | 1.93s | 5.7s | 11.37s | 11.58s |
| 4 | 28 | 32 | 2 | 1 | 1.12s | 3.29s | 6.53s | 6.59s |
| 5 | 28 | 32 | 4 | 1 | 0.58s | 1.69s | 3.3s | 3.37s |

Again, the results are very predictable. Since the number of simultaneous dot products is known, the number of ticks decreases accordingly: the execution time is simply inversely proportional to the output vector size.

The kernel and manager codes implementing output and input vectors can be found here: FLinearLayerFloatKernel.maxj and ForwardPropFloatMAX5CManager.maxj. The associated CPU file is here: float-forward-test.cpp.

Loop tiling

Before further increasing the size of the input and output vectors, one of the main issues causing slower execution should be considered, namely the latency offset. As seen before, operations on floating point numbers take 12 ticks or more. There exist solutions to counter this latency and compute a dot product every tick.

The first solution consists in loop tiling. This process accesses the weights in a tiled manner: instead of computing dot products one after another, the latency counter is replaced by a larger counter representing the size of the tile (the tile offset) being accessed. To facilitate the access on the kernel side, the weights are already reorganized on the CPU side (see the C++ code here: tile-forward-test.cpp).

Tile Access
Basic schema of the weights memory access using loop-tiling.
The grey arrows show the previous linear access and the red ones the new reading direction in a tile.
This figure shows 3 different blocks of size tile offset accessed sequentially.

This way the same input is used to directly compute multiple dot products, similarly to the previous optimization with output vectors, with the difference that they are not done in the same tick but one after another. This technique replaces the latency counter with a way of performing one dot product each tick instead of waiting for the operation latency. It is also important to choose a tile offset bigger than the floating point latency: when the next input enters the pipeline to accumulate the next dot product, the previous one must be finished, which would not be the case with a smaller tile offset. This also means that some padding has to be added to the last layer, as its output size is smaller than the original latency.
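
A minimal host-side sketch of this kind of weight reordering, assuming the weights of one layer are stored row-major as weights[out][in] and that the dimensions are multiples of the tile and vector sizes (the actual code, including the padding handling, is in tile-forward-test.cpp):

#include <vector>

// Reorder a row-major weight matrix (nbOut x nbIn) into tiles of `tileSize`
// consecutive outputs, so that the kernel can reuse one input chunk of
// `vecSize` values for `tileSize` dot products on consecutive ticks.
std::vector<float> tileWeights(const std::vector<float>& w,
                               int nbOut, int nbIn,
                               int tileSize, int vecSize) {
    std::vector<float> tiled;
    tiled.reserve(w.size());
    for (int t = 0; t < nbOut; t += tileSize)         // one tile of outputs
        for (int c = 0; c < nbIn; c += vecSize)       // one input chunk
            for (int r = t; r < t + tileSize; r++)    // outputs inside the tile
                for (int i = c; i < c + vecSize; i++) // weights of this chunk
                    tiled.push_back(w[r * nbIn + i]);
    return tiled;
}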

This solution removes the latency, but the output parallelization seen in the previous section is very limited with loop tiling and considerably more complex to implement because of the non-intuitive layout of the weights in memory and the additional padding. The next section explains another way to remove the latency without making the system more complex.

The following table shows the results obtained using loop tiling. Previous results are up to ID #5 and are used for comparison.

| Result ID | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 60.28s | 179.31s | 364.64s | 365.87s |
| 4 | 28 | 32 | 2 | 1 | 1.12s | 3.29s | 6.53s | 6.59s |
| 5 | 28 | 32 | 4 | 1 | 0.58s | 1.69s | 3.3s | 3.37s |
| 6 | 28 | 32 | 16 | 12 | 0.22s | 0.6s | 1.15s | 1.09s |

We observe that the practical time is now higher than the theoretical one. The error due to the asynchronous execution of the kernels is now negligible, but the impact of transferring parameters to the engine is more important and not taken into account in the calculation. The times obtained by loop tiling are about 3 times better than the previous ones but, as described above, leave limited room for further improvement.

The implementation of loop tiling can be found in the following kernel and managers: FLinearLayerTileKernel.maxj and ForwardPropTileMAX5CManager.maxj.

Fixed point implementation

As already discussed several times, the latency between each new dot product comes from operations on numbers in floating point representation: one operation takes 12 ticks. One way to remove this latency is to change the number representation, since operations on fixed-point numbers take only one clock cycle.

We can simply take the original floating point code that included vector parallelization on input and output (see the previous sections Input vectors and Output vectors) and replace the type of the streams with a fixed point representation. A signed 32-bit type with 16 bits for the fractional part can be declared the following way:

DFEFix fixedPtType = dfeFix(16, 16, SignMode.TWOSCOMPLEMENT);

As discussed in the CPU section on fixed point representation, using 16 bits for the fractional part is enough to reduce the loss of precision so that the results are equivalent to those in floating point.

The manager is also adapted to transmit streams of INT32 type instead of FLOAT:

private final CPUTypes cpuT = CPUTypes.INT32;
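
On the CPU side, the normalized values then have to be converted to this 16.16 format before being streamed; a minimal conversion sketch (helper names are illustrative, not taken from the project code):

#include <cmath>
#include <cstdint>

// Convert a float to signed 16.16 fixed point (16 integer bits, 16 fractional bits),
// i.e. the dfeFix(16, 16, TWOSCOMPLEMENT) type used in the kernel.
int32_t toFixed16_16(float x) {
    return static_cast<int32_t>(std::lround(x * 65536.0));   // x * 2^16
}

// And back to float, e.g. to check the outputs returned by the DFE.
float fromFixed16_16(int32_t fx) {
    return static_cast<float>(fx) / 65536.0f;
}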

The result with ID #7 shows the execution time with the fixed point representation. The other results were already presented and serve as a comparison.

| Result ID | Type | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | float | 1 | 1 | 1 | 1 | 60.28s | 179.31s | 364.64s | 365.87s |
| 5 | float | 28 | 32 | 4 | 1 | 0.58s | 1.69s | 3.3s | 3.37s |
| 6 | float | 28 | 32 | 16 | 12 | 0.22s | 0.6s | 1.15s | 1.09s |
| 7 | fixed | 28 | 32 | 4 | 1 | 0.094s | 0.21s | 0.35s | 0.315s |

Note that from ID #7 onward, the theoretical time takes the offset of transferring data to the FPGA into account and neglects the second kernel's computation, as it is executed in parallel with the first one and is significantly smaller. Since the execution times are now on the order of milliseconds, these factors have the same order of magnitude as the computation itself and must be taken into account to make accurate time predictions.

The data transfer can be accurately predicted: it is simply the size of the data divided by the bandwidth of the PCIe connection between the CPU and the FPGA. Here, we transmit 60'000 images of 784 int32 numbers each; the weights and biases are negligible compared to the image data. The image data is coded on 32-bit integers and the PCIe connection has a bandwidth of 32 Gb/s, so the transfer of all images takes ~47 ms.
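
This estimate can be reproduced in a few lines:

#include <cstdio>

int main() {
    const double nbImages   = 60000;
    const double pixels     = 784;    // values per image
    const double bitsPerVal = 32;     // int32 representation
    const double pcieBps    = 32e9;   // PCIe bandwidth in bits per second

    double seconds = nbImages * pixels * bitsPerVal / pcieBps;
    printf("transfer time: %.1f ms\n", seconds * 1e3);   // ~47 ms
    return 0;
}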

We can observe that the time is approximately divided by a factor of 12 between the floating point implementation and the equivalent fixed point one, which is consistent with the latency duration. There is still an offset of ~30 ms between the theory and the measured time, which is discussed in a later section: Initialization offset.

Higher DFE clock frequency

The execution time can also be improved by simply increasing the clock frequency of the FPGA. The default clock frequency is 100 MHz. For the MAX5C engine, the frequency can be set up to 300 MHz for a simple design that uses less than 80% of the chip.

The following table shows the improvement in the results with higher clock frequencies.

| Result ID | Frequency | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 100 MHz | 28 | 32 | 4 | 1 | 0.094s | 0.21s | 0.35s | 0.315s |
| 8 | 200 MHz | 28 | 32 | 4 | 1 | 0.071s | 0.13s | 0.21s | 0.18s |
| 9 | 300 MHz | 28 | 32 | 4 | 1 | 0.070s | 0.118s | 0.17s | 0.13s |
| 10 | 300 MHz | 28 | 32 | 8 | 1 | 0.062s | 0.096s | 0.144s | 0.091s |

Only the computation time is divided by a factor of 2 at 200 MHz and by 3 at 300 MHz, so the total execution time (which includes the data transfer and the remaining offset discussed in the Initialization offset section) is not proportional to the frequency. The last result of the table (ID #10) is executed with a bigger output vector size.

Fixed point type size

The execution time is becoming shorter and shorter, and the I/O transfer is slowly turning into the main bottleneck. The current implementation normalizes the images on the CPU side and sends them in fixed point representation to the DFE. As seen in the previous sections, the integers are stored on 32 bits and this transfer takes approximately 50 ms.

Originally, the images are represented by 8-bit pixels. A way to decrease the transfer offset is to keep this storage size and carry out the normalization on the FPGA. This way, the data is sent as 8-bit integers, decreasing the transfer time by 75%.
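
Repeating the transfer estimate from the previous section with 8-bit values gives the expected gain:

#include <cstdio>

int main() {
    const double bits32  = 60000 * 784 * 32.0;   // int32 pixels
    const double bits8   = 60000 * 784 * 8.0;    // raw uint8 pixels
    const double pcieBps = 32e9;                 // PCIe bandwidth in bits per second

    printf("int32: %.1f ms, uint8: %.1f ms\n",
           bits32 / pcieBps * 1e3, bits8 / pcieBps * 1e3);   // ~47.0 ms vs ~11.8 ms
    return 0;
}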

A few modifications must be made to the code to allow this, particularly because the data should be in 8 bits for the first layer but not for the next ones. An option would have been to create a specific normalization layer performing only this operation, but in this case the application is quite simple and the modifications can be made directly in the existing kernels.

In the manager, the type of the input stream can be changed to UINT8 and the mean and standard deviation are sent as parameters:

InterfaceParam MU = ei.addParam("MU", CPUTypes.FLOAT);  // mean
InterfaceParam STD = ei.addParam("STD", CPUTypes.FLOAT); // standard deviation

ei.setStream("input", CPUTypes.UINT8, CPUTypes.UINT8.sizeInBytes() * SIZE_LAYER_0 * BS);

On the kernel side, a Java condition chooses between the two different input types. The normalization is performed in the pipeline.

this.inVec = new DFEVectorType<DFEVar>(dfeUInt(8), vecSize);
this.parallelVec = new DFEVectorType<DFEVar>(fixedPtType, vecSize);

// Types differ in function of layer
if(firstLayer == 1){
  input = io.input(IN_NAME, inVec, ~readingW & h.eq(0) & l.eq(0));
  [...] }
else{
  input = io.input(IN_NAME, parallelVec, ~readingW & h.eq(0) & l.eq(0));
  [...] }

[...]

// Normalize input
DFEVector<DFEVar> normalInPort = (inPort.cast(parallelVec) - mu) / std;

The following table shows the gain in time between the previous best implementation and the new transfer type.

| Result ID | Input Type | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|---|
| 10 | INT32 | 28 | 32 | 8 | 1 | 0.062s | 0.096s | 0.144s | 0.091s |
| 11 | INT8 | 28 | 32 | 8 | 1 | 0.057s | 0.082s | 0.112s | 0.056s |

We observe that the execution time is ~35 ms shorter and that the transfer now only takes ~11 ms for 60'000 images. As mentioned previously, the large remaining offset between the observed and theoretical times is discussed in the Initialization offset section.

Initialization offset

In the last few sections, a quite large difference between the practical and theoretical values can be observed. This offset is linked neither to the I/O transfer nor to the computation, as these are already taken into account. It turns out that there is an initialization offset on the engines. This latency is the same whatever the size of the input stream and is equal to 40-45 ms. It can easily be observed by running the DFE several times without unloading its .max file: only the first execution has this offset, the following ones are very close to the theoretical value.

max_engine_t *max_engine;
max_engine = max_load(max_file, "*");

FixedForwardProp_run(max_engine, &actions); // Exec time = I/O transfer + computation + init
FixedForwardProp_run(max_engine, &actions); // Exec time = I/O transfer + computation
FixedForwardProp_run(max_engine, &actions); // Exec time = I/O transfer + computation

max_unload(max_engine);
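
The offset can be observed directly by timing each call on the CPU side, for example with std::chrono (a fragment continuing the snippet above; it assumes max_engine and actions are already set up and requires <chrono> and <cstdio>):

for (int i = 0; i < 3; i++) {
    auto t0 = std::chrono::steady_clock::now();
    FixedForwardProp_run(max_engine, &actions);
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("run %d: %.1f ms\n", i, ms);   // only the first run pays the ~40-45 ms init offset
}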

Result #11a shows the execution time of the first FixedForwardProp_run() call. Result #11b shows the execution time of the second (and subsequent) calls. The initialization latency is no longer visible. There is still a small overhead caused by the interface between the CPU and the engine; this offset is related to the size of the input.

| Result ID | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|
| 11a | 28 | 32 | 8 | 1 | 0.057s | 0.082s | 0.112s | 0.056s |
| 11b | 28 | 32 | 8 | 1 | 0.014s | 0.036s | 0.061s | 0.056s |

The big initialization offset is usually not a problem for DFE applications. In general, applications are larger and the offset becomes negligible compared to the execution time of the computation part. In the context of this work, it is not a problem either, because the future goal is to implement forward propagation in a real-time simulation. This means that the input evaluation is done for each image individually and sequentially, so the initialization offset will only be present for the first image when the robot is launched.

Stream replication

All the previous implementations parallelize the operations in a single pipeline, taking advantage of the available logic to perform multiple simultaneous operations on a single input vector during one clock cycle. This idea can be extended even further by replicating the pipeline multiple times to use more of the available logic on the engine. The complete pipeline can be replicated to perform all the previously described operations on other inputs at the same time.

Stream Replication
Schema of the input splitting and the parallel processing of each part of the stream using replication of the operation pipeline.

In this case, this solution is used to perform the classification of multiple input images at a time. For convolutional networks it could be used, for example, to compute multiple output channels simultaneously.

To implement stream replication, Java for-loops can be used in the kernels. The memories allocated for biases and weights don't have to be replicated, so they stay outside the loop. Input and output streams are named according to the loop index (see code below); each iteration represents a replicated stream. The number of pipelines P is passed as an argument to the kernel from the manager.

// Stream Replication loop
for(int p = 0; p< P; p++){
	    
  // Types differ in function of layer
  if(firstLayer == 1){
    input = io.input(IN_NAME+p, inVec, ~readingW & h.eq(0) & l.eq(0));
    [...] }
  else{
    input = io.input(IN_NAME+p, parallelVec, ~readingW & h.eq(0) & l.eq(0));
    [...] }
        
  [...]       
		
  // Compute oVecSize dot products at each clock cycle
  for (int r = 0; r < oVecSize; r++) {
    [...]   
  }

  // Only enable output when all operations have been done for 1 input vector
  DFEVar outEnable = w.eq(nbVecI - 1) & l.eq(loopLatencyVal - 1);	
		
  io.output(S_NAME + p, outVec, outEnable).connect(s);
  io.output(X_NAME + p, outVec, outEnable).connect(x);
}

A for-loop is also implemented in the manager to handle all inputs and outputs.

// Kernel definitions
for (int p = 0; p < P; p++) {
  kernelBlock1.getInput(FLinearLayerFixed8Kernel.IN_NAME + p) <== addStreamFromCPU("input" + p);
		
  addStreamToCPU("s1") <== kernelBlock1.getOutput(FLinearLayerFixed8Kernel.S_NAME + p);
  Fanout x1Fanout = fanout("x1Fanout"+p); //Redirect output to CPU and 2nd kernel with a fanout
  x1Fanout.getInput() <== kernelBlock1.getOutput(FLinearLayerFixed8Kernel.X_NAME + p);
			
  addStreamToCPU("x1") <== x1Fanout.addOutput("x11"+p);
			
  kernelBlock2.getInput(FLinearLayerFixed8Kernel.IN_NAME + p) <== x1Fanout.addOutput("x12"+p); 
			
  addStreamToCPU("s2") <== kernelBlock2.getOutput(FLinearLayerFixed8Kernel.S_NAME + p);
  addStreamToCPU("x2") <== kernelBlock2.getOutput(FLinearLayerFixed8Kernel.X_NAME + p);
}

// Engine interface definition
for(int p = 0; p < P; p++){
  // BFS is a parameter equal to the size of an input part (number of images per stream)
  ei.setStream("input" + p, CPUTypes.UINT8, CPUTypes.UINT8.sizeInBytes() * LS0 * BFS);
			 
  ei.setStream("s1"+p, cpuT, cpuT.sizeInBytes() * LS1 * BFS);
  ei.setStream("x1"+p, cpuT, cpuT.sizeInBytes() * LS1 * BFS);
  ei.setStream("s2"+p, cpuT, cpuT.sizeInBytes() * LS2 * BFS);
  ei.setStream("x2"+p, cpuT, cpuT.sizeInBytes() * LS2 * BFS);
}
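
On the CPU side, the image batch then has to be split into P contiguous chunks, one per replicated input stream. The helper below is only an illustration of the index arithmetic; the actual binding of each chunk to its stream depends on the generated SLiC interface:

#include <cstddef>
#include <cstdint>

// Split a batch of `nbImages` images (LS0 = 784 uint8 pixels each) into P equal
// parts, one per replicated stream "input0" ... "input{P-1}".
// Returns the pointer to the chunk handled by stream p.
const uint8_t *streamChunk(const uint8_t *images, size_t nbImages,
                           size_t LS0, int P, int p) {
    size_t BFS = nbImages / P;   // images per stream (the BFS parameter)
    return images + static_cast<size_t>(p) * BFS * LS0;
}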

Replicating the streams uses a lot of logic, as it physically reproduces the calculation pipeline in parallel on the engine. In fact, in this particular application, there is no advantage in using this solution over a larger input or output vector size. The overall resource usage is higher with stream replication than with an equivalent optimization using a higher output vector size (see Resources usage). This solution is therefore less efficient and complicates the code considerably, making the compilation time longer than for an equivalent optimization using larger vectors.

In addition, it adds a large number of input and output streams to handle from the CPU, which are limited to 16 anyway. This solution is more suitable for applications like convolutions, where the results of each pipeline are combined and directly reused on the DFE instead of being sent back to the CPU.

The following table compares both approaches: result ID #12 uses a higher output vector size and result ID #13 uses 2 replicated streams.

| Result ID | Nb streams | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images | Theoretical time |
|---|---|---|---|---|---|---|---|---|---|
| 11 | 1 | 28 | 32 | 8 | 1 | 0.014s | 0.036s | 0.061s | 0.056s |
| 12 | 1 | 28 | 32 | 16 | 1 | 0.013s | 0.032s | 0.052s | 0.045s |
| 13 | 2 | 28 | 32 | 8 | 1 | 0.015s | 0.03s | 0.053s | 0.045s |

The two implementations lead to the same execution time. The complexity of stream replication should therefore be avoided for this design, as a better and lighter solution exists. Logic usage is shown in the Resources usage section.

Limitations and final optimization

A large number of solutions to optimize the DFE design have been described in the previous sections. Unfortunately, these optimizations, including the size of the input and output vectors, are limited by several factors and cannot simply be extended indefinitely.

The limiting factors discussed so far are the data transfer between the CPU and the FPGA chip, and the overhead involved in interfacing the two components. The number of available resources also limits the design optimization; this aspect is discussed in more detail in the Resources usage section.

Another important factor to take into account is timing. In the Higher DFE clock frequency section, the frequency was increased to 300 MHz. A high clock rate combined with a complex design can lead to a timing failure in the pipeline: if two registers storing the values of two consecutive pipeline operations are too far apart, the time between two clock cycles may be too short (3.333 ns at 300 MHz, for example), since logical operations and interconnections induce delay. The frequency must be adapted accordingly to avoid timing failures.

Taking all these factors into account, the most optimized design found and compiled for this work is result ID #14, shown in the following table. It is compared to the best CPU results obtained.

| Result ID | Clock Frequency | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images |
|---|---|---|---|---|---|---|---|---|
| 14 | 180 MHz | 49 | 32 | 16 | 1 | 0.011s | 0.025s | 0.041s |
| CPU | - | - | - | - | - | 0.043s | 0.075s | 0.110s |
| Gain | - | - | - | - | - | 3.9x | 3x | 2.68x |

It uses vectors of size 49 and 32 as input and 16 and 1 as output, respectively. The clock frequency is set to 180 MHz. The gain factors are between 2.5 and 4. For a smaller number of images the gain is a bit larger, probably because the times are so low that the CPU threads don't work at full efficiency for long enough. For a larger number of images, the threads run longer, which leads to higher efficiency and therefore better CPU performance.

Multiple DFEs

Eight DFEs are available on Jumax. Spreading the inputs over several DFEs still allows a fair comparison since both CPUs were used for the CPU results.

Using multiple engines decreases the execution time, but not significantly. For such a simple design, the overhead of data transmission and CPU interfacing is too high to make a larger difference, because the InfiniBand transfer bus is shared between the DFEs. Two DFEs are sufficient to obtain the optimal result.

| Result ID | Nb DFEs | Clock Frequency | Input Dim1 | Input Dim2 | Output Dim1 | Output Dim2 | 10'000 images | 30'000 images | 60'000 images |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 1 | 180 MHz | 49 | 32 | 16 | 1 | 0.011s | 0.025s | 0.041s |
| 15 | 2-8 | 180 MHz | 49 | 32 | 16 | 1 | 0.009s | 0.017s | 0.033s |
| CPU | - | - | - | - | - | - | 0.043s | 0.075s | 0.110s |
| Gain | - | - | - | - | - | - | 4.78x | 4.41x | 3.33x |

With multiple engines, the gains are between 3.3 and 4.8, depending on the number of images, for the same reason as explained in the previous section.

Max files can be loaded on multiple engines using a for-loop:

max_file_t *max_file;
max_engine_t *max_engines[numEngines];
max_file = FixedForwardProp_init();
for(int i = 0; i < numEngines; i++){
  max_engines[i] = max_load(max_file, "*");
}

A non-blocking run command exists to execute actions on the engine without waiting for the result. This way all engines run simultaneously.

max_run_t *runs[numEngines];

for(int i = 0; i < numEngines; i++){
  runs[i] = ForwardProp_run_nonblock(max_engines[i], &actions[i]);
}
for(int i = 0; i < numEngines; i++){
  max_wait(runs[i]);
}

Finally to unload all engines:

for(int i = 0; i < numEngines; i++){
  max_unload(max_engines[i]);
}

The kernel and manager codes implementing results #7 to #15 can be found here: FLinearLayerFixedKernel.maxj and ForwardPropFixedMAX5CManager.maxj. The associated CPU file is here: fixed-forward-test.cpp.

Resources usage

In the Basic design section, a first overview of the resources used was presented, as well as the role of each resource type on the FPGA. The following graph shows the resource usage for each previously presented optimization result. The corresponding optimization solution is labelled under the graph for clarity.

Resources usage
Graph of the resources usage per result in percentage of the available resources on a Jumax DFE.

The first five results show a linear scaling of logic blocks, DSPs and RAM when doubling the input and output vectors. This result is very intuitive, as the compiler duplicates the logic to perform parallelized operations. DSPs are used for the multiplications in the dot product computation, so their number grows linearly with the number of dot products done at each clock cycle. Memory also grows, because increasing the output vector implies adding multiple read ports to the memory, which the compiler handles by duplicating it. Memory also includes the FIFOs and the storage of the input vectors, so it doesn't scale as fast as the DSPs but still has an affine relation with the design and computation size.

The loop tiling optimization gets rid of the latency but doesn't perform any simultaneous dot products; therefore the number of DSPs drops for this solution.

Increasing the frequency doesn't impact the resources usage.

Result ID #11 corresponds to the implementation of 8-bit data for the I/O transfer. DSPs and memory are not impacted, but LookUp Tables and FlipFlops are. This behavior comes from the normalization that is now done on the FPGA: casting the mean and standard deviation values to fixed-point representation and performing a large number of divisions are two operations that are very expensive in logic.

Results ID #12 and #13 compare stream replication with a larger vector size. As explained in the Stream replication section, the resource usage is significantly higher for the same execution time. Stream replication is therefore not an interesting solution for this application.

The final resources usage is around 20% of logic blocks, 50% of DSPs and 33% of RAM.

Conclusion

The following graph summarizes the results obtained through the complete optimization process. It shows the execution time for 60'000 images on the DFEs. The orange dashed line corresponds to the best CPU results (see CPU results for more details). Vector sizes and clock frequency are also displayed.

Execution times
Execution times for each presented result on 60'000 MNIST images along with corresponding clock frequency and vector sizes. The corresponding CPU best result is shown with the orange dashed line.

The following graph shows the same curve but for the three different quantities of images (10'000, 30'000 and 60'000). The dashed lines show the corresponding best CPU results.

Execution times
Execution times for each presented result on different MNIST images numbers with the corresponding CPU best result (dashed lines).

As explained, CPU multithreading is more efficient with a larger problem size, which explains why 60'000 images give a smaller gain than 10'000 inputs. At a very high number of inputs, however, this would be compensated by memory growth and slower access to it. The overhead of using multiple threads would increase, as would the general CPU efficiency. In general, the CPU does not scale linearly with the problem size.

As mentioned in a previous section, the execution time is not only limited by the computation part but also by the data transfer and by some interfacing between the CPU and the FPGA. The toy example presented in this chapter is a rather simple design; larger networks would lead to a larger gap with the CPU, because the computation part would far outweigh the data transfer.

The final design uses only 20% of the logic blocks and 50% of the available DSPs. But the design can only be scaled up in two ways: increasing the input and the output vector sizes. However, these increases are problematic. The majority of the operations are done at the dot product level of the first layer; doubling the size of one of the vectors will double the number of DSPs, approaching 100% usage. Using a resource completely is not a good idea, because it becomes very difficult to avoid timing failures when there is no margin for the placement and routing of components. The second problem is that increasing the size of the output vectors causes a high fanout in the dot product of the first layer, and too much fanout also leads to timing problems. There are solutions to these problems, but they are very specific to this toy example and the impact of further optimization would be quite small. Therefore, the current design is probably close to the best achievable performance.
