Convolutional Neural Network Inference On CPU
In Autonomous Car Simulation, a car was trained to follow a track using the PyTorch framework. The project does not focus on the accuracy of the network but on its execution time, and more particularly on its impact on the simulation speed. The goal is to show that the speed of the simulation can be improved by using FPGAs instead of optimized CPU code. This page explains the key points of adapting the inference into CPU C++ code. The optimized car controller code is located here.
The basic structure is taken from the MLP framework presented in the first part of the project. In addition to the Network, LinearLayer, and Neuron classes, three new ones are defined, namely ConvLayer, Kernel, and MaxPoolLayer (a sketch of possible declarations follows the list):
- ConvLayer: A ConvLayer object represents a convolutional layer of the network. It holds a nb_kernels member giving the number of kernels (output channels) in the layer, a kernel_size member giving the width/height of the kernels, and a vector of Kernel objects.
- Kernel: A Kernel object represents one kernel of the convolutional layer it belongs to. It stores the weights, the biases, and the number of channels of the kernel, which corresponds to the number of input channels of the layer.
- MaxPoolLayer: A MaxPoolLayer object represents a max-pooling layer of the network. It stores information about the size of the input, as well as the filter size.
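A minimal sketch of how these classes could be declared (field types, storage layout, and anything beyond the member names quoted above are assumptions, not the project's exact code):

```cpp
#include <vector>

// Sketch of the three new classes. Fields follow the description above;
// the exact types and layouts are assumptions.
class Kernel {
public:
    int nb_channels;                          // matches the layer's input channels
    std::vector<std::vector<float>> weights;  // [channel][kernel_size * kernel_size]
    float bias;                               // assumed: one bias per kernel
};

class ConvLayer {
public:
    int nb_kernels;               // number of kernels = output channels
    int kernel_size;              // width/height of the (square) kernels
    std::vector<Kernel> kernels;  // one Kernel per output channel
};

class MaxPoolLayer {
public:
    int input_width, input_height, input_channels;  // size of the input
    int filter_size;                                // pooling window width/height
};
```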
The Network object is initialized with the architecture provided in the autonomous car paper. The parameters (weights and biases) are loaded from the respective text files into the Neuron and Kernel objects.
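For illustration, a loader for such text files could look like this (the whitespace-separated one-value-per-line format and the function name are assumptions about the exported files):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read whitespace-separated floating-point parameters exported by the
// PyTorch training script into a flat vector.
std::vector<float> load_parameters(const std::string &path) {
    std::ifstream file(path);
    std::vector<float> values;
    float v;
    while (file >> v)
        values.push_back(v);
    return values;
}
```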
At each time step of the simulation, the front camera image is read, normalized using the mean and standard deviation values computed by the training script, and passed through the forward propagation to compute the velocity and direction. These output values are applied to the car for driving.
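As an example of the normalization step, a self-contained sketch (the planar channel layout and the parameter names are assumptions):

```cpp
#include <cstddef>
#include <vector>

// Normalize each channel of the camera image with the per-channel mean and
// standard deviation computed by the training script.
void normalize(std::vector<float> &image, std::size_t channels,
               const std::vector<float> &mean, const std::vector<float> &std_dev) {
    const std::size_t pixels = image.size() / channels;
    for (std::size_t c = 0; c < channels; ++c)
        for (std::size_t i = 0; i < pixels; ++i) {
            float &v = image[c * pixels + i];
            v = (v - mean[c]) / std_dev[c];
        }
}
```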
The floating-point controller code is located here.
As explained in the MLP optimization section, code execution can be optimized using fixed-point number representation. With 16 fractional bits, the precision is high enough to still drive the car accurately on the track.
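A minimal sketch of such a representation with 16 fractional bits (the type and helper names are assumptions; the linked controller may implement it differently):

```cpp
#include <cstdint>

// Q15.16 fixed point: a 32-bit integer with 16 fractional bits.
using fixed_t = std::int32_t;
constexpr int FRAC_BITS = 16;

inline fixed_t to_fixed(float x)   { return static_cast<fixed_t>(x * (1 << FRAC_BITS)); }
inline float   to_float(fixed_t x) { return static_cast<float>(x) / (1 << FRAC_BITS); }

// Multiplication widens to 64 bits, then shifts back to 16 fractional bits.
inline fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) >> FRAC_BITS);
}
// Addition and subtraction operate directly on the underlying integers.
```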
Multithreading is a bit trickier than for the MLP example, as there is only a single image going through the forward propagation during a step of the simulation. Some parts of the network can still be parallelized.
For convolutional layers, the operations on different output channels are independent and can be done in parallel. For linear layers, the operations can be parallelized over output neurons, but the gain is very limited as the threads try to access the same resources at the same time. Even on Jumax, using more than 6 threads becomes sub-optimal because of the memory access overhead.
It is also important to be careful about the order of the outputs. OpenMP makes it possible to recover the correct order of a parallel loop's outputs by indexing them with the loop/thread index values.
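For instance, the loop over the output channels of a convolutional layer could be parallelized along these lines (a sketch under assumed names; the apply() helper is hypothetical). Since each thread writes only the slot selected by the loop index k, the output channels stay in the correct order regardless of which thread computes them:

```cpp
#include <vector>

// Output channels are independent, so the loop over kernels can be shared
// among threads with OpenMP. Indexing the output by k preserves the order.
void conv_forward(const std::vector<std::vector<float>> &input,  // [channel][pixel]
                  const ConvLayer &layer,
                  std::vector<std::vector<float>> &output) {     // [kernel][pixel]
    #pragma omp parallel for
    for (int k = 0; k < layer.nb_kernels; ++k)
        output[k] = layer.kernels[k].apply(input);  // hypothetical per-kernel convolution
}
```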
The optimized controller code containing fixed-point representation and multithreading is located here.
The following table shows the results obtained by running the autonomous car for 60 seconds of simulated time at multiple optimization levels. Float CNN refers to the neural network with no optimization and floating-point representation. Fixed CNN refers to the fixed-point representation implementation. Multithreading CNN is the adaptation to multiple parallel threads. Finally, No CNN corresponds to driving the car in open loop to remove the impact of the CNN computation and identify the other limiting factors. Experiments are conducted on Jumax and on a local machine. The dual AMD CPU belongs to Jumax, while the Intel® one belongs to the local machine. Webots runs headless on Jumax.
| | 2 × AMD EPYC 7601 @ 2.7 GHz × 64T | | | | Intel® Core™ i7-6700 CPU @ 3.40GHz × 8T | | | |
|---|---|---|---|---|---|---|---|---|
| | real time [s] | ratio | avg/ts [ms] | Limitation | real time [s] | ratio | avg/ts [ms] | Limitation |
| Float CNN | 202.41 | 0.296 | 100.9 | CNN | 211.83 | 0.286 | 105.8 | CNN |
| Fixed CNN | 191.75 | 0.313 | 95.4 | CNN | 157.78 | 0.381 | 78.5 | CNN |
| Multithreading CNN | 58.75 | 1.021 | 29.3 | CNN | 54.81 | 1.095 | 27.4 | CNN |
| No CNN | 45.7 | 1.31 | 22.8 | Camera | 2.82 | 21.29 | 1.4 | Controller |
Jumax doesn't have any GPU installed, which makes the simulation process slower. The No CNN row shows that the main bottleneck (apart from the CNN) is the camera rendering, taking more than 20[ms] per timestep. Its impact can be reduced by running Webots operations (including camera rendering) and the CNN computation of the car controller in parallel. This is achieved by splitting the Webots step function into two parts: sending data and receiving data. All operations performed in between run in parallel with the Webots step processing. In the most optimized version of the code, ~22[ms] is needed for camera rendering and ~29[ms] for the forward propagation computation. In the best case, with a large reduction of the CNN execution time, the simulation speed would thus be limited to a ratio of ~1.3.
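Schematically, the controller main loop could overlap the two as follows, assuming a Webots release that exposes the split step as wb_robot_step_begin()/wb_robot_step_end(); the actual controller may structure this differently, and run_forward_propagation() is a placeholder:

```cpp
#include <webots/robot.h>

#define TIME_STEP 64  // assumed control period in milliseconds

extern void run_forward_propagation();  // placeholder for the CNN computation

int main() {
  wb_robot_init();
  while (true) {
    wb_robot_step_begin(TIME_STEP);  // send data, let Webots render the camera
    run_forward_propagation();       // runs in parallel with the Webots step
    if (wb_robot_step_end() == -1)   // receive data once the step completes
      break;
  }
  wb_robot_cleanup();
  return 0;
}
```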
On the local machine with a GPU, this is not the case. The camera processing takes about 1[ms] and, even without forward propagation, is not the bottleneck. Indeed, other controller functions (such as image normalization) are even slower (1.4[ms] according to the above table).
For the comparison with the FPGA results, head to Performance Comparison CPU | FPGA.