Deep Learning on FPGA: From Model Training to Inference

The integration of deep learning models with FPGA (Field-Programmable Gate Array) platforms, such as the PYNQ Z2, offers advantages in adaptability, latency, and power efficiency. This article walks through the technical steps of this integration, from the initial model-training phase to achieving high inference speeds on the FPGA, presenting a complete end-to-end workflow.

Model Training with Brevitas


Brevitas, a PyTorch extension, provides quantization of neural networks, a step necessary for FPGA deployment. Quantization, the process of reducing neural network parameters to lower precision, optimizes models for the limited computational resources of FPGAs. Models such as LeNet and ResNet are trained and quantized on datasets such as CIFAR10 or SVHN while ensuring FPGA compatibility. These models are chosen for an architectural complexity appropriate to classification tasks, and their convolutional layers map well onto the FPGA workflow. The quantization process for an example precision reduction (32-bit to 8-bit) is shown in Figure 1.
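As a rough illustration of the idea behind Figure 1, affine quantization of a 32-bit float to an 8-bit integer can be sketched as follows. This is a minimal sketch of the general technique with hand-picked scale and zero-point values; Brevitas handles this internally, with scale factors learned during training:

```python
def quantize(x, scale, zero_point, bits=8):
    """Map a float value to a signed integer of the given bit width."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the representable range

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float value."""
    return (q - zero_point) * scale

# Example: quantize a weight with an (assumed) scale of 0.02, zero point 0
w = 0.4173
q = quantize(w, scale=0.02, zero_point=0)        # -> 21
w_hat = dequantize(q, scale=0.02, zero_point=0)  # -> 0.42
```

The reconstruction error (0.4173 vs. 0.42) is the price paid for storing the weight in 8 bits instead of 32, which is what makes the model small and fast enough for the FPGA fabric.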

Figure 1. Quantization process in Brevitas


ONNX: The Bridge to FPGA Compatibility

After training and quantization, models are converted to the ONNX (Open Neural Network Exchange) format. We used FINN-ONNX, a variant of ONNX designed specifically for the FINN environment; the export is performed directly by Brevitas. This conversion must happen before the model enters FINN, as it lays the groundwork for the subsequent FPGA optimization.

Figure 2. ONNX conversion process in Brevitas-FINN


FPGA Optimization via FINN

To prepare deep learning models for FPGA deployment, the FINN framework is used. FINN refines ONNX models so that they align with the target architecture, in this case the PYNQ Z2 (Zynq 7020). The framework handles memory partitioning and data movement within the FPGA, ensuring that data is placed close to the computational units that consume it, which reduces latency and increases throughput. Configurable Logic Blocks (CLBs) and I/O Blocks are integral to both logic operations and external signal interfacing.

Interconnects manage continuous dataflow while clock management synchronizes operations. Memory elements offer efficient on-chip storage, and DSP slices accelerate arithmetic operations. Custom JSON files in FINN enable precise definition of parameters such as the number of Processing Elements (PE), Single Instruction Multiple Data lanes (SIMD), the number of DSP slices, and custom instruction sets, optimizing each layer and operation for efficient model-FPGA interaction. Depending on the settings chosen for the FPGA and the deep learning model, the compilation steps include streamlining, conversion to HLS code, and parallelization. As shown in Figure 3 below, these processing steps are required to generate an IP core that performs the computational tasks of image classification through the neural network. The end result is a deployment package that can be run directly on the FPGA.
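A per-layer folding configuration of the kind described above might look like the following sketch. The layer names and PE/SIMD values here are illustrative assumptions, not taken from the actual build; the exact node names depend on the compiled model and the FINN version:

```python
import json

# Hypothetical folding configuration in the style FINN consumes:
# PE and SIMD control how many output channels / input lanes each layer
# processes in parallel, trading FPGA resources for throughput.
folding_config = {
    "Defaults": {},
    "MatrixVectorActivation_0": {"PE": 16, "SIMD": 16, "ram_style": "auto"},
    "MatrixVectorActivation_1": {"PE": 8, "SIMD": 8, "ram_style": "block"},
}

# Serialize to the JSON file that would be handed to the build flow
print(json.dumps(folding_config, indent=2))
```

Raising PE/SIMD on a bottleneck layer consumes more CLBs and DSP slices but shortens that layer's processing time; balancing these values across layers is what keeps the dataflow pipeline fully utilized.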

Figure 3. Workflow of model compilation in FINN


Inference Performance on PYNQ Z2 FPGA


Inference speeds exceeding 1000 FPS (frames per second) highlight the efficiency gains achieved when deep learning models are optimized for FPGA platforms, showing that inference time drops significantly when the arithmetic is offloaded to dedicated hardware.
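The relationship between per-frame latency and throughput is direct. Using the measured latencies from Table 1, a quick calculation confirms the >1000 FPS figure (assuming one frame per inference and no pipeline overlap):

```python
def fps(latency_ms: float) -> float:
    """Frames per second for a given single-frame inference latency."""
    return 1000.0 / latency_ms

print(fps(0.8))   # LeNet5 on the PYNQ Z2  -> 1250.0 FPS
print(fps(0.95))  # ResNet18 on the PYNQ Z2 -> ~1052.6 FPS
```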

The graph in Figure 4 illustrates the difference in inference times across the various platforms. It is evident that the FPGA (PYNQ Z2) offers a significant speed advantage.

However, speed isn’t the only metric where the FPGA excels. In terms of power efficiency, the PYNQ Z2 demonstrates clear superiority over traditional GPU and CPU setups. As shown in Table 1 below, the FPGA’s power consumption is considerably lower than that of the other hardware platforms, highlighting its efficient performance.

Table 1. Performance metrics


| Model    | Platform     | Inference Time (ms) | Power Consumption (W) |
|----------|--------------|---------------------|-----------------------|
| LeNet5   | FPGA PYNQ Z2 | 0.8                 | 2.2                   |
| LeNet5   | Jetson Nano  | 33.4                | 34.2                  |
| LeNet5   | CPU (ARM)    | 321                 | 4                     |
| ResNet18 | FPGA PYNQ Z2 | 0.95                | 2.5                   |
| ResNet18 | Jetson Nano  | 38.2                | 34.5                  |
| ResNet18 | CPU (ARM)    | 332                 | 4                     |


Comparative Analysis with CPU and Jetson

When compared to the Jetson Nano and the CPU (ARM), the FPGA’s performance becomes even more impressive. The Jetson Nano has inference times of 33.4 ms and 38.2 ms and power consumption of 34.2 W and 34.5 W for LeNet5 and ResNet18, respectively. Similarly, the CPU (ARM), with inference times of 321 ms and 332 ms and a consistent power consumption of around 4 W for both models, lags behind the FPGA in both speed and efficiency.
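From the numbers in Table 1 one can also derive the energy spent per inference (power multiplied by latency), which makes the efficiency gap even clearer. A quick sketch using the LeNet5 figures:

```python
def energy_mj(power_w: float, latency_ms: float) -> float:
    """Energy per inference in millijoules: W * ms = mJ."""
    return power_w * latency_ms

# LeNet5 numbers from Table 1
fpga   = energy_mj(2.2, 0.8)     # -> 1.76 mJ per inference
jetson = energy_mj(34.2, 33.4)   # -> ~1142 mJ per inference
cpu    = energy_mj(4.0, 321.0)   # -> 1284 mJ per inference

print(f"FPGA speedup vs CPU: {321.0 / 0.8:.0f}x")        # ~401x
print(f"FPGA energy advantage vs CPU: {cpu / fpga:.0f}x") # ~730x
```

In other words, the FPGA is not only two to three orders of magnitude faster per frame, it also spends orders of magnitude less energy per classified image.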

The comparative analysis shows the FPGA’s balanced performance: it offers both speed and power efficiency, making it a preferred choice for deep learning tasks such as image classification. Figure 4 provides a visual representation, clearly delineating the performance and efficiency metrics across the FPGA, Jetson, and CPU platforms for both ResNet18 and LeNet5.

Figure 4. Comparison of inference time and power consumption across different hardware platforms


Use case examples

The obvious benefits of using FPGAs for model inference are highlighted in real-time applications where time-sensitive operations must be carried out, and power-constrained applications where power consumption should be minimized.

An example of a time-sensitive use case is quality control in the manufacturing industry, performed directly on the production line. Defective products must be ejected from the line by the quality control system. Such a system has to keep pace with fast-moving conveyors, analyzing each product and reacting in a timely fashion. Given the inference performance results above, such a use case would benefit from FPGA-accelerated analysis.
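To make the timing budget concrete, consider a hypothetical conveyor. The throughput, capture, and actuation numbers below are illustrative assumptions, not measurements; only the 0.8 ms inference time comes from Table 1:

```python
# Hypothetical production line: 20 products per second pass the camera,
# so each product allows a 50 ms budget for capture + inference + actuation.
products_per_second = 20
budget_ms = 1000 / products_per_second  # 50 ms per product

capture_ms = 5.0     # assumed camera readout time
inference_ms = 0.8   # LeNet5 on the PYNQ Z2, from Table 1
actuation_ms = 10.0  # assumed ejector response time

total_ms = capture_ms + inference_ms + actuation_ms
print(total_ms, "ms used of", budget_ms, "ms budget")
```

Under these assumptions inference is a negligible fraction of the budget; on the CPU (ARM), the 321 ms inference time alone would already overrun it several times over.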

Figure 5. Example of a quality control system on production line

An example of a power-constrained application is on-board processing on a satellite platform. Such applications usually have limited power budgets and require precise power-consumption allocations to function properly over the system’s predicted lifetime. A side benefit of using FPGAs in such systems is that processed data usually requires less storage space than raw, unprocessed data. Given the strict restrictions on satellite platforms, the freed storage space enables more flexibility in data gathering and downlink operations.

Figure 6. Example of on-board satellite data processing (car detection)


Conclusion: Potential of FPGA in Deep Learning

The journey from model training to real-time inference demonstrates the FPGA’s role in deep learning. With tools like Brevitas and FINN, the PYNQ Z2 delivers a combination of speed and efficiency, showcasing reduced inference times and minimal power consumption.

Our comparative analysis with the Jetson Nano and CPU (ARM) highlights the FPGA’s performance. It’s not just about faster computations, but about achieving these speeds with greater power efficiency, a critical factor in real-time applications.

As we advance, the fusion of deep learning and FPGA technology promises to unlock unprecedented levels of performance and efficiency.

To explore, collaborate, or join this unfolding journey, connect with us!