High-Level Synthesis: The Last Piece of the Puzzle to Unlock FPGAs’ Broad Applications

Why do we need high-level synthesis

High-level synthesis, or HLS for short, refers to the process of automatically converting a logic design described in a high-level language into a circuit model described in a low-level language. The high-level languages in question, including C, C++, and SystemC, have a high degree of abstraction and usually have no notion of clocks or timing. In contrast, low-level languages such as Verilog, VHDL, and SystemVerilog are used to describe cycle-accurate register-transfer level (RTL) circuit models, which is the most common way of describing circuits and FPGA designs today.

HLS technology has gained a lot of attention and developed rapidly over the past decade, especially in the FPGA field. At the major FPGA academic conferences of recent years, HLS has consistently been one of the most active research areas for both academia and industry. The main reasons are as follows.

First, modeling circuits at a higher level of abstraction is an inevitable step in the evolution of integrated circuit design. As Moore’s Law has progressed, the complexity of integrated circuits has grown beyond what humans can manage by hand. For example, the A13 chip in the Apple iPhone 11 has about 8.5 billion transistors.

(Image via Stephen Shankland/CNET)

According to research published by NEC in 2004, however, a chip design with 1 million logic gates typically requires 300,000 lines of RTL code. Designing contemporary chips entirely at the RTL level of abstraction is therefore unrealistic, and it puts enormous pressure on every stage of the flow, including design, verification, and integration.

In contrast, modeling the system in a high-level language such as C or C++ can reduce the amount of code by a factor of 7 to 10, which greatly eases design complexity.

Second, high-level languages can make IP reuse more efficient. Traditional RTL-based IP often has to define a fixed architecture and interface standard, and reusing it takes a lot of time for system interconnection and interface verification. In contrast, high-level languages hide these requirements and leave them to the HLS tool to implement.

Modern FPGAs also contain a large number of mature IP blocks, such as embedded memories, arithmetic units, embedded processors, and more recently AI accelerators and network-on-chip systems. These FPGA IP blocks have fixed functions and locations, so HLS tools can take full advantage of them, which simplifies the synthesis algorithms and improves the performance of the synthesized circuits while also improving IP reuse.

Third, HLS can help software and algorithm engineers participate in, or even lead, chip or FPGA design. HLS tools encapsulate and hide the implementation details of the hardware, allowing software and algorithm engineers to focus on implementing the upper-level algorithms. For hardware engineers, HLS also supports rapid design iteration and lets them concentrate on optimizing the modules and subsystems that are sensitive to performance, area, or power.

The past and present of FPGA high-level synthesis

With the rapid increase in the complexity of integrated circuits, chip design methodologies are constantly evolving. Long before the advent of FPGAs, people had begun trying to move away from design methods that relied on manual inspection of the chip layout, exploring instead the use of high-level languages to describe circuit behavior and automated tools to convert those models into actual circuit designs.

In the 1980s and 1990s, HLS tools for integrated circuit design were a hot research topic in academia. Representative works include the CMU-DA (design automation) tool from Carnegie Mellon University and the force-directed scheduling algorithm proposed at Carleton University in Canada.

Looking back, these works laid the foundation for today’s circuit synthesis algorithms and provided valuable experience and reference points for later HLS research. However, the HLS work of this period failed to make the transition into industrial practice. One of the main reasons was “meeting the right person at the wrong time”.

At the time, Moore’s Law was booming and integrated circuit design was going through the biggest change in its history. In the back end, automatic placement and routing was becoming mainstream; in the front end, RTL synthesis was emerging. Circuit design engineers had begun to adopt RTL-based modeling in place of traditional schematic- and layout-based design, which drove the rapid development of RTL synthesis tools. In contrast, HLS research at this stage often used special-purpose programming languages, such as the language called “ISPS” adopted by CMU-DA, so it was hard for it to win over engineers who were in their “honeymoon period” with RTL.

After a quiet period, HLS began to attract attention from academia and industry again after 2000. The better-known tools include Bluespec and AutoPilot. The main reason for this change is that HLS tools began to use C/C++ as their main input language, which was gradually accepted by the many system and algorithm engineers who did not know RTL. At the same time, the quality of the results generated by HLS tools improved greatly, and in some application fields they can even match the performance of handwritten RTL.

In addition, the gradual rise of FPGAs has played an important role in boosting the development of HLS. Unlike ASIC designs, FPGAs have a fixed amount of on-chip logic resources. The HLS tool therefore does not need to dwell on the absolute optimization of area, performance, and power consumption as in ASIC design; it only needs to map the design sensibly onto the fixed architecture of the FPGA. In this way, HLS becomes an excellent way to quickly implement target algorithms on FPGAs.

Today, high-level synthesis technology has developed further. Large FPGA companies have launched their own HLS tools, such as Xilinx’s Vivado HLS and Intel’s HLS Compiler and OpenCL SDK. Academia has also produced many results, such as LegUp from the University of Toronto.

Next, Lao Shi will use the HLS tool AutoPilot as an example to briefly introduce how high-level synthesis works.

The main working principle of high-level synthesis

AutoESL’s AutoPilot tool is arguably the most successful case of turning academic results into a product in the field of HLS. AutoPilot originated from the xPilot project led by Professor Cong Jingsheng of UCLA. He later founded AutoESL with Zhang Zhiru (currently an associate professor at Cornell University), the doctoral student who was in charge of the project at the time. AutoESL was acquired by Xilinx in 2011, and AutoPilot became the basis of Vivado HLS.

The workflow diagram of AutoPilot is shown in the following figure. On the front end, it uses an LLVM-based compiler architecture capable of handling models written in synthesizable ANSI C, C++, and OSCI SystemC. The front-end compiler, named llvm-gcc, converts the high-level language model into an intermediate representation (IR) and performs a series of code optimizations targeting code complexity, redundancy, parallelism, and more. Then, according to the specific hardware platform, the tool synthesizes the RTL code together with the verification and simulation environment and the necessary timing and layout constraints.

(Figure: AutoPilot workflow diagram)

AutoPilot’s success lies in the fact that, in some application areas, its HLS results outperform manually optimized RTL. For example, for a Sphere Decoder IP used in a wireless MIMO system, AutoPilot synthesized a 4000-line C algorithm onto a Virtex5 FPGA running at 225MHz and used fewer logic resources than the Xilinx Sphere Decoder IP; see the figure below. This result is striking, and it is good evidence that HLS has the potential to achieve better results than RTL IP.

(Figure: logic resource usage of the AutoPilot-generated design compared with the Xilinx Sphere Decoder IP)

For more technical details and papers about AutoPilot, reply “HLS” or “High-level Synthesis” to the Laoshitanxin official account.

Commonly used optimization methods for high-level synthesis tools

Traditional compilers for processors often have one main goal: maximizing performance. In contrast, high-level synthesis tools need to consider several key metrics of circuit design, such as performance, power consumption, and area, and must also take into account the performance of the tool itself, such as its resource usage and running time. HLS tool developers therefore have to consider and adopt a wider range of optimization methods, and these methods are also the focus of current HLS research in academia and industry. In general, the mainstream optimization methods of HLS tools are as follows.

01

Word length analysis and optimization

One of the most important features of FPGAs is that data paths and operations of arbitrary word length can be used. The FPGA HLS tool is therefore not bound to fixed-length representations (such as the common 32-bit or 64-bit), but can perform global or local word-length optimization on the design, achieving both a performance improvement and an area reduction.

However, word length analysis and optimization requires users of HLS to have a deep understanding of the algorithms and datasets to be synthesized, which is one of the main factors limiting the widespread use of this optimization method.
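As a rough illustration of what word-length analysis computes, the sketch below derives the narrowest unsigned width that can hold a known value range, and from that the width of an accumulator for a sum of products. The function names `bits_for_unsigned` and `accumulator_bits` are hypothetical helpers invented for this example, not part of any HLS tool, and the sketch assumes the operand ranges are already known, which is exactly the knowledge of the algorithm and dataset the paragraph above refers to.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of bits that can represent every unsigned value
// in the range [0, max_value]. Hypothetical helper for illustration.
unsigned bits_for_unsigned(std::uint64_t max_value) {
    unsigned bits = 1;  // even the value 0 occupies one bit
    while (bits < 64 && (max_value >> bits) != 0) {
        ++bits;
    }
    return bits;
}

// Width needed to accumulate `terms` products of two unsigned operands
// that are a_bits and b_bits wide, without overflow.
// The limits below are assumptions that keep the 64-bit arithmetic
// in this sketch itself from overflowing.
unsigned accumulator_bits(unsigned a_bits, unsigned b_bits, std::uint64_t terms) {
    assert(a_bits >= 1 && b_bits >= 1 && a_bits + b_bits <= 32);
    assert(terms >= 1 && terms <= (std::uint64_t{1} << 31));
    const std::uint64_t max_a = (std::uint64_t{1} << a_bits) - 1;
    const std::uint64_t max_b = (std::uint64_t{1} << b_bits) - 1;
    // Worst case: every one of the `terms` products equals max_a * max_b.
    return bits_for_unsigned(max_a * max_b * terms);
}
```

For example, `accumulator_bits(8, 8, 16)` returns 20: a dot product of sixteen pairs of 8-bit unsigned values fits in a 20-bit accumulator rather than a default 32-bit one. In a tool such as Vivado HLS, a width like this would typically be expressed with an arbitrary-precision type such as `ap_uint<20>`.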

02

Loop optimization

Loop optimization has always been a focus of research on HLS optimization methods, because it is the key step in mapping loops that software executes sequentially onto the parallel execution of a hardware architecture.

The ultimate goal of loop optimization is to start successive loop iterations with the smallest possible delay between them. Ideally, adjacent iterations can be executed completely in parallel. In practice, however, it is difficult to fully unroll loops because of hardware resource constraints, and even more so because of loop nesting and the dependencies between iterations. How to optimize different kinds of loops to arrive at the best hardware structure has become one of the questions academia and industry care about most.
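The effect of dependencies and resource limits on unrolling can be seen in a plain C++ sketch, shown here without any vendor-specific directives. In the first function every iteration depends on the `sum` written by the previous one; in the second, the loop is unrolled by a factor of four with independent partial sums, so the four additions in the body do not depend on each other and hardware could perform them at the same time, at the cost of four adders. The function names and the unroll factor are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: each iteration reads the `sum` produced by the previous one,
// so the additions form a single dependency chain.
std::int64_t sum_sequential(const std::vector<std::int32_t>& data) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
    }
    return sum;
}

// Unrolled by 4 with independent partial sums. The four additions in the
// main loop body touch different accumulators, so they carry no dependency
// on one another within an iteration.
std::int64_t sum_unrolled4(const std::vector<std::int32_t>& data) {
    std::int64_t partial[4] = {0, 0, 0, 0};
    const std::size_t n = data.size();
    const std::size_t main_end = n - (n % 4);
    assert(main_end % 4 == 0);
    for (std::size_t i = 0; i < main_end; i += 4) {
        partial[0] += data[i];
        partial[1] += data[i + 1];
        partial[2] += data[i + 2];
        partial[3] += data[i + 3];
    }
    // Leftover elements when the length is not a multiple of 4.
    for (std::size_t i = main_end; i < n; ++i) {
        partial[0] += data[i];
    }
    // Combine the partial sums as a small balanced tree.
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

Both functions return the same value for any input. The unrolled version trades area (more adders, more registers) for a shorter dependency chain, which is the kind of trade-off an HLS tool has to weigh.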

A popular loop optimization method is the so-called polyhedral model, which is widely used. In HLS, it is mainly used to represent the iterations of a loop nest as a polyhedron in space (see the figure below); the statements are then rescheduled through geometric operations, subject to the boundary constraints and dependencies, in order to transform the loop.

(Figure: a loop nest represented as a polyhedron)

This article will not go into the details of the polyhedral model. Interested readers can reply “HLS” or “High-level Synthesis” to the official account to obtain more related material. It should be pointed out that the polyhedral model has achieved considerable success in FPGA HLS. Many studies have shown that it can help optimize performance and area, and that it can also improve the efficiency of FPGA on-chip memory usage.
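One classic transformation that the polyhedral view makes easy to reason about is loop skewing into wavefronts. The sketch below is a hand-written illustration, not the output of any polyhedral tool, and all names in it are hypothetical. Each point (i, j) of a two-dimensional iteration space reads its upper and left neighbours, so the dependence vectors are (1, 0) and (0, 1). Rescheduling the same points by wavefront t = i + j keeps both dependences satisfied, and all points on one wavefront become independent of each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Grid = std::vector<std::vector<std::int64_t>>;

// All cells start at 1; row 0 and column 0 act as fixed boundary values.
Grid make_grid(int rows, int cols) {
    assert(rows >= 1 && cols >= 1);
    return Grid(rows, std::vector<std::int64_t>(cols, 1));
}

// Original schedule: lexicographic order of (i, j).
void fill_row_major(Grid& g) {
    const int rows = static_cast<int>(g.size());
    const int cols = static_cast<int>(g[0].size());
    for (int i = 1; i < rows; ++i) {
        for (int j = 1; j < cols; ++j) {
            g[i][j] = g[i - 1][j] + g[i][j - 1];
        }
    }
}

// Skewed schedule: the outer loop walks wavefronts t = i + j.
// Both neighbours of (i, j) lie on wavefront t - 1 (or on the boundary),
// so the iterations of the inner loop do not depend on each other.
void fill_wavefront(Grid& g) {
    const int rows = static_cast<int>(g.size());
    const int cols = static_cast<int>(g[0].size());
    for (int t = 2; t <= rows + cols - 2; ++t) {
        const int i_lo = std::max(1, t - (cols - 1));  // keeps j <= cols - 1
        const int i_hi = std::min(rows - 1, t - 1);    // keeps j >= 1
        for (int i = i_lo; i <= i_hi; ++i) {
            const int j = t - i;
            g[i][j] = g[i - 1][j] + g[i][j - 1];
        }
    }
}
```

Filling two identical grids with `fill_row_major` and `fill_wavefront` gives the same contents. The difference is only in the order of execution: the inner loop of the wavefront version has no loop-carried dependence, which is what allows it to be unrolled or pipelined in hardware.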

03

Support for software parallelism

A major difference between C/C++ and RTL is that programs written in the former are designed to be executed sequentially on a processor, while the latter can process tasks in parallel by directly instantiating multiple arithmetic units. As processors gradually added support for parallelism and non-CPU chips such as GPUs rose to prominence, C/C++ gradually gained support for parallelism as well. For example, multi-threaded parallel programming methods such as pthreads and OpenMP have emerged, as have C language extensions such as OpenCL for parallel programming of heterogeneous systems such as GPUs.

HLS tools therefore inevitably need to add support for these forms of software parallelism. LegUp, for example, integrates support for pthreads and OpenMP, enabling task-level and data-level parallelism.
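To make the idea of software-level data parallelism concrete, here is a minimal OpenMP sketch in standard C++. It only shows how a programmer marks a loop as parallel in source code; whether a particular HLS tool accepts this exact form is not something this sketch claims. The function name `dot_product` is chosen for illustration. When compiled without OpenMP support the pragma is ignored and the loop simply runs sequentially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Data-parallel dot product. With OpenMP enabled (for example -fopenmp on
// GCC or Clang), iterations are divided among threads and the per-thread
// partial sums are combined by the reduction clause.
std::int64_t dot_product(const std::vector<std::int32_t>& a,
                         const std::vector<std::int32_t>& b) {
    assert(a.size() == b.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    std::int64_t sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += static_cast<std::int64_t>(a[i]) * b[i];
    }
    return sum;
}
```

The annotation states that the iterations are independent apart from the reduction, which is the information a compiler, or an HLS tool that supports such annotations, needs in order to instantiate several multiply-accumulate units side by side.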

For more technical details and papers about LegUp, reply “HLS” or “High-level Synthesis” to the Laoshitanxin official account.

In addition, Altera launched its OpenCL SDK before being acquired by Intel. It performs high-level synthesis of OpenCL and generates both the FPGA circuit logic and the CPU code, enabling rapid development of FPGAs as hardware acceleration modules.

Prospects for high-level synthesis

After more than ten years of development, and although there are successful cases of commercial FPGA HLS such as AutoPilot and the OpenCL SDK, HLS still has a long way to go before it can completely replace manual RTL modeling.

For example, the memory bottleneck has always been an important factor limiting FPGA system performance. In addition to the various on-chip BRAMs, there are also various off-chip memories, such as DDR, QDR, and HBM, which has emerged in recent years. The effective use of on-chip and off-chip memory has therefore always been an active research topic in HLS.

In addition, the simulation and debugging of HLS designs needs further exploration. On the one hand, formal methods are required to prove that the RTL code generated by HLS is equivalent to the high-level code. On the other hand, debugging methods are needed: when a problem shows up in the hardware, tracing it back to the defect in the software source also requires methodological support.

In recent years, more and more research has begun to focus on domain-specific programming languages and the corresponding HLS. For example, P4, which was introduced in a previous article, is a high-level programming language for network packet processing. With the development of artificial intelligence, Python HLS for AI applications has also appeared. With domain-specific HLS, tools can be further optimized for a particular domain, which can greatly improve system performance and reduce area and power consumption.

Concluding remarks

The industry generally believes that the extraordinary success of GPUs in the era of artificial intelligence is largely due to a programming language and environment that are friendly to software and algorithm engineers. In contrast, although FPGAs are constantly expanding their range of applications and have clear advantages over GPUs in performance and power consumption, their programming model is still mainly RTL development by hardware engineers.

Lao Shi believes that high-level synthesis for FPGAs is an inevitable trend in the development of the industry. As the open problems in the HLS field continue to be solved, efficient programming of FPGAs in high-level languages will become a reality, and this will eventually be the last piece of the puzzle for the wider application of FPGAs.

Author: Yoyokuo