Towards democratized IC design and custom computers

Integrated circuit (IC) design is often considered a “black art”, restricted to only those with higher education or years of training in electrical engineering. As the semiconductor industry struggles to expand its workforce, IC design needs to be made more accessible.

The advantage of custom computing
General-purpose computers are widely used, but their performance improvement has slowed significantly, from more than 50% per year in the 1990s to just a few percent per year recently, due to challenges in power, heat dissipation, space, and cost.

Instead, the research community and industry have moved towards custom computing, which achieves better performance by tailoring the architecture to the workloads of a particular application domain. A good example is the Tensor Processing Unit (TPU), announced by Google in 2017 for accelerating deep learning workloads. Designed in 28nm CMOS technology as an application-specific integrated circuit (ASIC), the TPU demonstrated a nearly 200x performance-per-watt advantage over the general-purpose Haswell central processing unit (CPU), a leading server-class CPU at the time of publication. Such custom accelerators, or domain-specific accelerators (DSAs), achieve efficiency through custom data types and operations, custom memory accesses, massive parallelism, and greatly reduced instruction and control overhead.

However, such customization entails high costs (approaching $300M at 7nm, according to McKinsey) that few can afford. Field-programmable gate arrays (FPGAs) offer an attractive, cost-effective alternative for DSA implementation. With its programmable logic, programmable interconnect, and customizable building blocks, such as block random access memory (BRAM) and digital signal processing (DSP) blocks, an FPGA can be customized to implement a DSA without going through a lengthy manufacturing process, and it can be reconfigured for a new DSA in seconds. In addition, FPGAs have become available in public clouds, such as Amazon AWS F1 and Nimbix. One can create a DSA on an FPGA in these clouds and use it at a rate of $1-$2/hour to accelerate the desired applications, even if no FPGAs are available in the local computing facility. My research lab has developed efficient FPGA-based accelerators for multiple applications, such as data compression, sorting, and genomic sequencing, with 10x-100x gains in performance and energy efficiency over state-of-the-art CPUs.

The barrier to custom computing
However, creating DSAs in ASICs or FPGAs is considered hardware design, which typically uses register transfer level (RTL) hardware description languages such as Verilog or VHDL that most software programmers are unfamiliar with. According to 2020 data from the US Bureau of Labor Statistics, there were more than 1.8 million software developers in the United States, but fewer than 70,000 hardware engineers.

Recent advances in high-level synthesis (HLS) show promise in making circuit design more accessible, as HLS tools can automatically compile computational kernels written in C, C++, or OpenCL into an RTL description for ASIC or FPGA designs. However, the quality of the circuits generated by existing HLS tools depends heavily on the structure of the input C/C++ code and on the hardware implementation hints (called “pragmas”) provided by designers. For example, for the simple 7-line code of a single-layer convolutional neural network (CNN), widely used in deep learning and shown in Figure 1, an existing commercial HLS tool generates an FPGA-based accelerator that is 108x slower than a single-core CPU. After properly restructuring the input C code (to tile the computation, for example) and inserting 28 pragmas, however, the final FPGA accelerator is 89x faster than a single-core CPU (more than 9,000x faster than the initial non-optimized HLS solution).

Fig. 1: Simple 7-line code of a single-layer convolutional neural network (CNN), widely used in deep learning

These pragmas (hardware design hints) tell the HLS tool where to parallelize and pipeline the computation, how to partition the data arrays to map them onto on-chip memory blocks, and so on. However, most software programmers don’t know how to apply these hardware-specific optimizations.
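To make the idea of pragmas concrete, here is a minimal sketch in C. It is not the code from Figure 1 and not the 28-pragma design described above: the layer sizes, function names, and the particular hints are assumptions chosen for illustration, and the pragma spelling follows the Vitis HLS style, which can differ between tool versions. An ordinary C compiler ignores the hints, so both versions compute the same result in software.

```c
#include <assert.h>

/* Hypothetical layer sizes, chosen only to keep the example small. */
#define NUM_OUT 4                  /* output feature maps     */
#define NUM_IN  3                  /* input feature maps      */
#define OUT_DIM 6                  /* output rows and columns */
#define K       3                  /* kernel rows and columns */
#define IN_DIM  (OUT_DIM + K - 1)  /* input rows and columns  */

/* Plain software version: the layer as a six-deep loop nest. */
void cnn_plain(float in[NUM_IN][IN_DIM][IN_DIM],
               float w[NUM_OUT][NUM_IN][K][K],
               float out[NUM_OUT][OUT_DIM][OUT_DIM])
{
    for (int o = 0; o < NUM_OUT; o++)
        for (int r = 0; r < OUT_DIM; r++)
            for (int c = 0; c < OUT_DIM; c++) {
                float acc = 0.0f;
                for (int i = 0; i < NUM_IN; i++)
                    for (int p = 0; p < K; p++)
                        for (int q = 0; q < K; q++)
                            acc += w[o][i][p][q] * in[i][r + p][c + q];
                out[o][r][c] = acc;
            }
}

/* Same loop nest with three kinds of illustrative (untuned) hints. */
void cnn_hls(float in[NUM_IN][IN_DIM][IN_DIM],
             float w[NUM_OUT][NUM_IN][K][K],
             float out[NUM_OUT][OUT_DIM][OUT_DIM])
{
    /* Data partitioning: split the K x K kernel window into separate
       storage elements so its weights can be read in parallel. */
#pragma HLS array_partition variable=w complete dim=3
#pragma HLS array_partition variable=w complete dim=4
    for (int o = 0; o < NUM_OUT; o++)
        for (int r = 0; r < OUT_DIM; r++)
            for (int c = 0; c < OUT_DIM; c++) {
                /* Pipelining: request that a new output pixel start
                   every cycle (the tool may not achieve this). */
#pragma HLS pipeline II=1
                float acc = 0.0f;
                for (int i = 0; i < NUM_IN; i++)
                    for (int p = 0; p < K; p++)
                        for (int q = 0; q < K; q++) {
                            /* Parallelization: replicate the multiply-add
                               of the innermost loop in hardware. */
#pragma HLS unroll
                            acc += w[o][i][p][q] * in[i][r + p][c + q];
                        }
                out[o][r][c] = acc;
            }
}

/* Runs both versions on the same small integer-valued data, asserts that
   they agree element by element, and returns the sum of all outputs. */
float cnn_self_check(void)
{
    static float in[NUM_IN][IN_DIM][IN_DIM];
    static float w[NUM_OUT][NUM_IN][K][K];
    static float a[NUM_OUT][OUT_DIM][OUT_DIM], b[NUM_OUT][OUT_DIM][OUT_DIM];
    float sum = 0.0f;

    for (int i = 0; i < NUM_IN; i++)
        for (int y = 0; y < IN_DIM; y++)
            for (int x = 0; x < IN_DIM; x++)
                in[i][y][x] = (float)((i + y + x) % 5);
    for (int o = 0; o < NUM_OUT; o++)
        for (int i = 0; i < NUM_IN; i++)
            for (int p = 0; p < K; p++)
                for (int q = 0; q < K; q++)
                    w[o][i][p][q] = (float)((o + i + p + q) % 3);

    cnn_plain(in, w, a);
    cnn_hls(in, w, b);
    for (int o = 0; o < NUM_OUT; o++)
        for (int r = 0; r < OUT_DIM; r++)
            for (int c = 0; c < OUT_DIM; c++) {
                assert(a[o][r][c] == b[o][r][c]);
                sum += a[o][r][c];
            }
    return sum;
}
```

Calling `cnn_self_check()` from a test program confirms that the hints change only how the hardware would be built, not what is computed. Choosing which loops to pipeline or unroll and which arrays to partition is exactly the hardware-specific knowledge the text refers to.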

Our solutions
To enable more people to design DSAs based on software-programming-friendly code, we take a three-pronged approach:

• Architecture-driven optimization
• Automated code transformation and pragma insertion
• High-level domain-specific language (DSL) support

A good example of architecture-driven optimization is the automatic generation of systolic arrays (SAs), an efficient architecture that uses only local communication between adjacent processing elements. It is used by the TPU and many other deep learning accelerators, but it is not easy to design. A 2017 Intel study showed that 4-18 months are needed to design a high-quality SA, even with HLS tools. Our recent work, AutoSA, offers a fully automated solution. Once a programmer marks a portion of C or C++ code for implementation in the SA architecture, AutoSA can generate an array of processing elements and the associated data communication network, optimizing computational throughput. For the CNN example, AutoSA generates an optimized SA with over 9,600 lines of C code, including pragmas, achieving a speedup of over 200x over a single-core CPU.
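The "local communication only" property of a systolic array can be illustrated with a small software simulation. The sketch below is not AutoSA output; it is a hand-written, cycle-by-cycle model of one common textbook arrangement (an output-stationary array for matrix multiplication), and the array size and all function names are assumptions made for this example.

```c
#include <assert.h>

#define N 4  /* hypothetical array size: N x N processing elements (PEs) */

/* Cycle-by-cycle model of an output-stationary systolic array computing
   C = A * B. PE (i, j) keeps C[i][j] locally, receives one value of A from
   its left neighbor and one value of B from its upper neighbor per cycle,
   multiplies and accumulates them, and forwards both values onward. */
void systolic_matmul(int A[N][N], int B[N][N], int C[N][N])
{
    int a_reg[N][N] = {{0}};  /* value of A held in each PE */
    int b_reg[N][N] = {{0}};  /* value of B held in each PE */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;

    /* The last product reaches PE (N-1, N-1) in cycle 3N-3. */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* Visit PEs from bottom-right to top-left so that each PE still
           sees the value its neighbor held in the previous cycle. */
        for (int i = N - 1; i >= 0; i--)
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i;  /* row i of A enters skewed by i cycles */
                int kb = t - j;  /* column j of B enters skewed by j cycles */
                int a_in = (j == 0)
                    ? ((ka >= 0 && ka < N) ? A[i][ka] : 0)
                    : a_reg[i][j - 1];
                int b_in = (i == 0)
                    ? ((kb >= 0 && kb < N) ? B[kb][j] : 0)
                    : b_reg[i - 1][j];
                C[i][j] += a_in * b_in;
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
            }
    }
}

/* Straightforward triple loop used as the reference result. */
void reference_matmul(int A[N][N], int B[N][N], int C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Compares the two on fixed test data; returns 1 if every entry matches. */
int systolic_self_check(void)
{
    int A[N][N], B[N][N], C[N][N], R[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 2 * j + 1;
            B[i][j] = (i * j) % 3 - 1;
        }
    systolic_matmul(A, B, C);
    reference_matmul(A, B, R);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(C[i][j] == R[i][j]);
    return 1;
}
```

Only PEs on the left and top edges read from the input matrices; every other PE reads from an adjacent PE. The input skewing and the data-movement schedule are the parts that are tedious to get right by hand, which helps explain why designing a high-quality SA takes so long.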

For programs that don’t easily fit into common computational patterns (such as SA or stencil computation, for which we have good solutions using architecture-driven optimization), our second approach is to perform automated code transformation and pragma insertion to repeatedly parallelize or pipeline the computation, based on bottleneck analysis or guided by graph-based deep learning. Building on AMD/Xilinx’s open-source Merlin Compiler (originally developed by Falcon Computing Solutions), our tool, called AutoDSE, can eliminate most, if not all, of the pragmas inserted by expert hardware designers and achieve comparable or even better performance (as demonstrated on Xilinx’s Vitis HLS vision acceleration library).
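The general idea of bottleneck-guided exploration can be sketched with a toy model. The code below is not AutoDSE's algorithm: the three-stage design, the cost model (cycles shrink in proportion to the parallelization factor, area grows with it), and all names are assumptions made purely for illustration. It repeatedly finds the slowest stage and doubles its parallelization factor as long as an area budget allows.

```c
#include <assert.h>

#define NUM_STAGES 3  /* hypothetical design with three sequential loops */

/* A design point: one parallelization (unroll) factor per stage. */
typedef struct {
    int factor[NUM_STAGES];
} design_t;

/* Toy latency model (an assumption): cycles = ceil(work / factor). */
static int stage_cycles(const int work[NUM_STAGES], const design_t *d, int s)
{
    return (work[s] + d->factor[s] - 1) / d->factor[s];
}

/* Stages run one after another, so latencies add up. */
int total_cycles(const int work[NUM_STAGES], const design_t *d)
{
    int sum = 0;
    for (int s = 0; s < NUM_STAGES; s++)
        sum += stage_cycles(work, d, s);
    return sum;
}

/* Toy area model (an assumption): area grows linearly with the factor. */
int total_area(const int unit_area[NUM_STAGES], const design_t *d)
{
    int sum = 0;
    for (int s = 0; s < NUM_STAGES; s++)
        sum += unit_area[s] * d->factor[s];
    return sum;
}

/* Greedy bottleneck-driven search: parallelize the slowest stage further
   until no stage can be improved within the area budget. */
design_t explore(const int work[NUM_STAGES],
                 const int unit_area[NUM_STAGES], int area_budget)
{
    design_t d;
    int frozen[NUM_STAGES] = {0};  /* stages that cannot be improved */

    for (int s = 0; s < NUM_STAGES; s++)
        d.factor[s] = 1;
    assert(total_area(unit_area, &d) <= area_budget);

    for (;;) {
        int worst = -1;
        for (int s = 0; s < NUM_STAGES; s++) {
            if (frozen[s])
                continue;
            if (worst < 0 ||
                stage_cycles(work, &d, s) > stage_cycles(work, &d, worst))
                worst = s;
        }
        if (worst < 0)
            break;  /* every stage is frozen */

        design_t trial = d;
        trial.factor[worst] *= 2;
        if (trial.factor[worst] <= work[worst] &&
            total_area(unit_area, &trial) <= area_budget)
            d = trial;          /* accept the faster design */
        else
            frozen[worst] = 1;  /* over budget or nothing left to unroll */
    }
    return d;
}
```

With `work = {1000, 100, 10}`, `unit_area = {4, 2, 1}`, and a budget of 64, this toy search settles on a factor of 8 for every stage, cutting the modeled latency from 1,110 to 140 cycles. A real tool has to evaluate each candidate with the HLS compiler rather than a closed-form model, which is why a well-guided search matters.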

Our third approach is to further raise the level of design abstraction by supporting DSLs, so that software developers in certain application domains can easily create DSAs. For example, based on the open-source HeteroCL intermediate representation, we can support Halide, a widely used image processing DSL with the advantageous property of decoupling algorithm specification from performance optimization. For the blur filter example written in 8 lines of Halide code, our tool can generate 1,455 lines of optimized HLS C code with 439 lines of pragmas, achieving a 3.9x speedup over a 28-core CPU.

Together, these efforts create a programming environment and compilation flow that is friendly to software programmers, allowing them to create DSAs efficiently and affordably (especially on FPGAs). This is critical to democratizing custom computing.

Broaden participation
In their 2018 Turing Award lecture, “A New Golden Age for Computer Architecture,” John L. Hennessy and David A. Patterson concluded: “The next decade will see a Cambrian explosion of new computer architectures, which means exciting times for computer architects in academia and industry.” We want to extend participation in this exciting journey to performance-oriented software programmers, enabling them to create their own custom architectures and accelerators on FPGAs or even ASICs to achieve significant performance and energy efficiency improvements.

This article is based on Jason Cong’s recent vision address at the 35th International Conference on VLSI Design. The whole talk can be found here.

Jason Cong

Jingsheng Jason Cong is an IEEE Fellow, a member of the US National Academy of Engineering, and the Volgenau Chair for Engineering Excellence in the Computer Science Department at the University of California, Los Angeles. He is the recipient of the 2022 IEEE Robert N. Noyce Medal “for fundamental contributions to electronic design automation and FPGA design methodology.” Cong’s contributions cover three major areas of EDA tool development: logic synthesis algorithms for FPGAs, interconnect optimization algorithms for physical design, and high-level synthesis from programming languages that are friendly to software programmers.