Optimizing AI algorithms without sacrificing accuracy

The ultimate measure of success for AI is how much it increases productivity in our daily lives. However, the industry faces enormous challenges in evaluating progress. The landscape of AI applications is constantly evolving, and so are the tasks that come with it: finding the right algorithm, optimizing that algorithm, and choosing the right tools. In addition, the underlying hardware evolves quickly, with many different system architectures to consider.

Recent History of the AI Hardware Conundrum

A 2019 Stanford report states that AI compute demand is growing faster than hardware development. “Prior to 2012, AI results closely tracked Moore’s Law, with compute doubling every two years. […] Post-2012, compute has been doubling every 3.4 months.”

Since 2015, when an AI algorithm first surpassed human-level accuracy in object identification, major investments in AI hardware have driven semiconductor IP to accelerate next-generation processing, along with higher-bandwidth memories and interfaces to keep pace. Figure 1 shows how rapidly an AI image-classification competition progressed once modern, backpropagation-trained neural networks were combined with Nvidia’s compute-heavy GPU engines in 2012.

Fig. 1: After the introduction of modern neural networks in 2012, classification errors decreased rapidly and soon dropped below the human error rate.

AI algorithms

AI algorithms are too large and demanding to run as-is on SoCs designed for consumer products, which require low power consumption, a small footprint, and low cost. Therefore, AI algorithms are compressed using techniques such as pruning and quantization. These techniques reduce the memory and computing power the system requires, but they also affect accuracy. The technical challenge is to apply compression without degrading accuracy beyond what the application can tolerate.
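
To make the idea concrete, here is a minimal sketch of pruning and quantization applied to a single hypothetical weight matrix using NumPy; the layer shape, pruning ratio, and symmetric int8 scheme are illustrative assumptions rather than a production compression flow.

```python
import numpy as np

# Hypothetical layer weights standing in for one layer of a trained network.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)

# --- Magnitude pruning: zero out the smallest 70% of weights (assumed ratio). ---
threshold = np.quantile(np.abs(weights), 0.70)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

# --- Symmetric 8-bit quantization: map float32 weights to int8. ---
scale = np.abs(pruned).max() / 127.0
quantized = np.clip(np.round(pruned / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

# Memory drops 4x just from int8 storage (more if sparsity is exploited);
# the reconstruction error is a rough proxy for the accuracy cost.
print(f"float32 size: {weights.nbytes / 1024:.0f} KiB")
print(f"int8 size:    {quantized.nbytes / 1024:.0f} KiB")
print(f"sparsity:     {np.mean(quantized == 0):.0%}")
print(f"mean abs reconstruction error: {np.mean(np.abs(weights - dequantized)):.5f}")
```

The tradeoff is exactly the one described above: more aggressive pruning ratios or coarser quantization shrink the footprint further, but the reconstruction error, and ultimately the application accuracy, grows with it.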

In addition to the increasing complexity of AI algorithms, the amount of data required for inference has also grown dramatically because the input data itself has grown. Figure 2 shows the memory and computing power required for an optimized vision algorithm designed for a relatively small footprint of 6MB of memory (the memory requirement for SSD-MobileNet-V1). As the figure shows, the bigger challenge in this particular example is not the size of the AI algorithm but the size of the data input: as image resolution and color depth increase, the memory requirement grows from 5MB to over 400MB for the latest image captures.

Today, the latest CMOS image sensor cameras from Samsung support mobile phones up to 108MP. These cameras could theoretically require 40 tera operations per second (TOPS) at 30 fps and more than 1.3GB of memory. Techniques in the ISPs, along with restricting AI algorithms to specific regions of interest, keep requirements below these extremes, and 40 TOPS is not yet achievable in mobile phones. Still, this example highlights the complexity and challenges of edge devices, and it also drives sensor interface IP: MIPI CSI-2 specifically targets this with features such as region-of-interest capabilities, and MIPI C/D-PHYs continue to increase bandwidth to handle the latest CMOS image sensors as data sizes drift toward hundreds of megapixels.

Fig. 2: Requirements for SSD-MobileNet-V1, designed for 6MB of memory, as input pixel size increases (benchmark results).
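
As a rough back-of-the-envelope sketch of the arithmetic behind these numbers, the snippet below estimates the input-buffer size and a crude compute figure for a 108MP sensor at 30 fps; the bytes-per-pixel and operations-per-pixel constants are illustrative assumptions, not measurements of any specific sensor, ISP, or network.

```python
# Rough sizing sketch for a 108MP sensor feeding a vision pipeline at 30 fps.
# All constants below are illustrative assumptions for order-of-magnitude math.

MEGAPIXELS = 108e6        # 108MP sensor
BYTES_PER_PIXEL = 12      # assumed bytes per pixel (channels x bit depth)
FPS = 30
OPS_PER_PIXEL = 12_500    # assumed operations per input pixel for a detection network

frame_bytes = MEGAPIXELS * BYTES_PER_PIXEL
ops_per_second = MEGAPIXELS * OPS_PER_PIXEL * FPS

print(f"Single-frame buffer: {frame_bytes / 1e9:.2f} GB")       # ~1.3 GB
print(f"Compute estimate:    {ops_per_second / 1e12:.0f} TOPS")  # ~40 TOPS
```

Even with generous assumptions, the input data dominates, which is why region-of-interest processing and ISP preprocessing matter so much at the edge.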

Solutions today compress the AI algorithms, compress the images, and target regions of interest. This makes hardware optimization extremely complex, especially for SoCs with limited memory, limited processing, and small power budgets.

Many customers benchmark their AI solutions, and existing SoCs are benchmarked using various methods. Tera operations per second (TOPS) is a leading performance indicator, but additional performance and power measurements, such as the types and precision of operations a chip can handle, provide a clearer picture of the chip’s capabilities. Inferences per second is also a leading indicator, but it needs context regarding frequency and other parameters. Therefore, additional benchmarks have been developed for evaluating AI hardware.
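
To illustrate why a raw TOPS figure needs context, the short sketch below converts a peak TOPS rating into an inferences-per-second estimate using an assumed per-inference operation count and utilization factor; all three numbers are hypothetical.

```python
# Why peak TOPS alone is not enough: effective throughput depends on how many
# operations one inference needs and how well the workload utilizes the chip.
# All values are hypothetical, for illustration only.

peak_tops = 10.0            # advertised peak, in tera-operations per second
ops_per_inference = 2.3e9   # assumed operation count for a MobileNet-class model
utilization = 0.35          # assumed fraction of peak sustained on this workload

effective_ops = peak_tops * 1e12 * utilization
inferences_per_second = effective_ops / ops_per_inference

print(f"Effective throughput: {inferences_per_second:,.0f} inferences/s")
```

Two chips with the same peak TOPS can land at very different inferences-per-second figures once real utilization, operation types, and precision are taken into account.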

There are standardized benchmarks, such as those from MLPerf/MLCommons and ai.benchmark.com. MLCommons provides metrics for accuracy, speed, and efficiency, which are very important for understanding how well hardware can handle different AI algorithms. As mentioned, compression techniques can squeeze AI into very small footprints, but without clear accuracy goals the tradeoff between accuracy and compression goes unmanaged. MLCommons also provides common datasets and best practices.

The Computer Vision Lab in Zurich, Switzerland also provides benchmarks for mobile processors and publishes its results and hardware requirements, along with other information that enables reuse. Its suite includes 78 tests and over 180 performance aspects.

An interesting benchmark from Stanford called DAWNBench has since lent its support to MLCommons’ efforts, but its tests covered not only an AI performance score but also the total time processors take to perform both training and inference for AI algorithms. This addresses one of the most important hardware design engineering goals: reducing the total cost of ownership (TCO). The time to process AI determines whether renting cloud-based AI or owning edge-computing hardware is more feasible for an organization’s overall AI hardware strategy.
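
As a hedged illustration of that cloud-versus-edge tradeoff, the sketch below compares the cumulative cost of renting cloud inference time against buying edge hardware; every price, runtime, and time horizon is a hypothetical placeholder.

```python
# Cloud rental vs. edge ownership: a toy break-even calculation.
# Every number here is a hypothetical placeholder, not a real price or runtime.

cloud_cost_per_hour = 3.00       # assumed cloud instance rate, USD/hour
inference_hours_per_month = 400  # assumed monthly AI processing time
edge_hardware_cost = 20_000      # assumed one-time edge device cost, USD
edge_power_cost_per_month = 50   # assumed operating cost, USD/month

for month in range(1, 25):
    cloud_total = cloud_cost_per_hour * inference_hours_per_month * month
    edge_total = edge_hardware_cost + edge_power_cost_per_month * month
    if cloud_total >= edge_total:
        print(f"Edge ownership breaks even around month {month}")
        break
else:
    print("Cloud remains cheaper over the 24-month horizon")
```

The faster a processor completes the workload, the fewer rented hours or owned devices are needed, which is why total processing time feeds directly into the TCO comparison.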

Another popular benchmarking method is the use of common open-source graphs and models such as ResNet-50. There are three problems with some of these models. First, the ResNet-50 dataset uses 256×256 images, which is not necessarily the resolution used in the end application. Second, the model is older and has fewer layers than many newer models. Third, the model may have been hand-optimized by the processor IP vendor and may not represent how the system will perform with other models. That said, a large number of open-source models beyond ResNet-50 are available; they are likely to be more representative of the latest advances in the field and provide good performance indicators.
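
For teams that do start from an open-source model, a minimal latency-benchmarking loop like the one below (sketched with PyTorch and torchvision, assuming both are installed) is a common way to get a first performance indication; the batch size, input resolution, and iteration counts are arbitrary choices.

```python
import time
import torch
import torchvision

# Minimal CPU latency sketch for an open-source model (ResNet-50).
# Batch size, input resolution, and iteration counts are arbitrary choices.
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # fixed resolution; may not match the end application

with torch.no_grad():
    for _ in range(5):                      # warm-up runs
        model(dummy_input)

    timings = []
    for _ in range(20):                     # measured runs
        start = time.perf_counter()
        model(dummy_input)
        timings.append(time.perf_counter() - start)

latency_ms = 1000 * sum(timings) / len(timings)
print(f"Mean latency: {latency_ms:.1f} ms  ({1000 / latency_ms:.1f} inferences/s)")
```

Numbers gathered this way only characterize this one graph at this one resolution, which is exactly why a single open-source model is a weak proxy for the end application.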

Finally, custom graphs and models for specific applications are becoming more common. This is the best-case scenario for benchmarking AI hardware, ensuring that optimizations can be made effectively to reduce power and improve performance.

SoC developers have very different goals: some SoCs aim to provide a platform for high-performance AI, others for lower performance; some target a wide range of functions, others very specific applications. For SoC teams unsure which AI model to optimize for, a healthy mix of custom and freely available models provides a good indication of performance and power, and this mix is the most common approach in the current market. However, the newer benchmarking standards described above appear to be gaining relevance for comparisons made after SoCs reach the market.

Pre-Silicon Evaluations

Due to the complexity of edge optimizations, teams building AI solutions today must design the software and hardware together. To do this, they need proper benchmarking techniques, as described earlier. They also need tools that allow designers to accurately explore various optimizations of the system, SoC, or semiconductor IP, examining the process node, memories, processors, interfaces, and more.

Synopsys provides effective tools to simulate, prototype, and benchmark the IP, the SoC, and, in certain cases, the wider system.

The Synopsys HAPS prototyping solution is often used to demonstrate the capabilities of different processor configurations and their tradeoffs. In particular, Synopsys has shown where bandwidths in the wider AI system, beyond the processor, start to become a bottleneck, and when bandwidth from the sensor input (via MIPI) or memory access (via LPDDR) may not be optimal for a given processing task.

For power, vendor estimates can vary widely, and emulation has proven superior to simulation and/or static analysis for AI workloads. This is where the Synopsys ZeBu emulation system can play an important role.

Finally, system-level views of the SoC design can be explored with Platform Architect. Initially used for memory and processing performance and power exploration, Platform Architect has increasingly been used to understand system-level performance and power for AI. Sensitivity analysis can be performed to identify optimal design parameters, using prebuilt models of Synopsys IP such as LPDDR, ARC processors for AI, memories, and more.
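
As a generic illustration of the kind of sensitivity analysis described here (not the Platform Architect tool itself), the sketch below sweeps two hypothetical design parameters and reports the combination that meets an assumed frame-rate target at the lowest assumed power; the performance and power models are simple placeholder formulas.

```python
# Generic design-space sensitivity sweep (illustrative only, not Platform Architect).
# The performance and power models below are placeholder formulas, not calibrated data.

import itertools

memory_bandwidths_gb_s = [25, 50, 100]  # hypothetical LPDDR configurations, GB/s
compute_tops = [2, 4, 8]                # hypothetical AI accelerator sizes, TOPS
target_fps = 30

best = None
for bw, tops in itertools.product(memory_bandwidths_gb_s, compute_tops):
    # Achievable fps is limited by whichever resource saturates first (toy model).
    fps = min(bw * 0.6, tops * 5.0)
    power_w = 0.02 * bw + 0.5 * tops    # placeholder power model
    if fps >= target_fps and (best is None or power_w < best[2]):
        best = (bw, tops, power_w, fps)

if best:
    bw, tops, power_w, fps = best
    print(f"Lowest-power config meeting {target_fps} fps: "
          f"{bw} GB/s, {tops} TOPS -> {fps:.0f} fps at {power_w:.1f} W")
```

The value of a real system-level tool lies in replacing these placeholder formulas with accurate models of the IP, interconnect, and workload, so the sweep reflects the actual SoC.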

Overview

AI algorithms are driving constant change in hardware, and as these techniques move from the cloud to the edge, the optimization problems become more complex. To ensure competitive success, pre-silicon evaluations are becoming more important. Co-design of hardware and software has become a reality, and the right tools and expertise are critical.

Synopsys has a proven portfolio of IP used in many AI SoC designs. Synopsys has an experienced team developing AI processing solutions, from ASIP Designer to the ARC processors. A portfolio of proven Foundation IP, including memory compilers, has been widely adopted for AI SoCs. Interface IP for AI applications ranges from sensor inputs via I3C and MIPI to chip-to-chip connections via CXL, PCIe, and die-to-die solutions, and networking via Ethernet.

Finally, Synopsys tools provide a method to leverage the expertise, services, and proven IP in an environment best suited to optimize your AI hardware in this ever-changing landscape.
