In March, Nvidia introduced the GH100, the first GPU based on its new “Hopper” architecture, which targets both HPC and AI workloads and, especially for the latter, supports an eight-bit FP8 floating point format. Two months later, rival Intel rolled out Gaudi2, the second generation of its AI training chip, which also has an FP8 format.
The FP8 format is important for a number of reasons, not least that until now there has been a split in precision: AI inference is done at low precision in integer formats (usually INT8, sometimes INT4), AI training is performed at FP16, FP32, or FP64 precision, and HPC runs at FP32 or FP64 precision. Nvidia and Intel both argue that FP8 can be used not only for inference but, in some cases, for AI training as well, radically increasing the effective throughput of their accelerators.
This is important because switching back and forth between floating point and integer formats is a pain in the neck, and it is much easier to keep everything in floating point. In addition, at some point in the future, moving inference to the 8-bit FP8 format, and possibly even a 4-bit FP4 format, means that valuable chip real estate dedicated to integer units can be freed up and used for something else.
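The bookkeeping that format switching entails can be seen in a minimal sketch of symmetric INT8 quantization (illustrative only, not any vendor's implementation): every integer tensor must carry a separately stored scale factor, and that scale must be recomputed whenever value ranges shift, whereas a floating point value carries its own exponent.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the float range onto [-127, 127].
    scale = np.abs(x).max() / 127.0          # the scale must be stored with the tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # The scale has to travel alongside every INT8 tensor to recover floats.
    return q.astype(np.float32) * scale

x = np.array([0.1, -1.5, 3.2, 0.7], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
# x_hat approximates x to within half a quantization step
```

Keeping activations, weights, and gradients all in floating point formats removes this scale-management layer entirely.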
In a post-Moore’s Law world, every transistor is sacred and every clock cycle should be cherished. Companies are looking for more efficient ways to perform AI tasks at a time when processing speeds are no longer increasing as quickly as they once did. Organizations need to figure out how to improve processing capability, especially for training, within the power currently available. Lower precision data formats can help.
AI chip makers see the benefits. In June, Graphcore released a 30-page study demonstrating not only the superior performance of low-precision floating point formats over similarly sized scaled integers, but also the long-term power-consumption benefits for training efforts whose model sizes are growing rapidly.
“Low-precision numerical formats could be an important part of large machine learning models that provide state-of-the-art accuracy while reducing their environmental impact,” the researchers wrote. “In particular, by using 8-bit floating point arithmetic, energy efficiency can be increased up to 4× with respect to float-16 arithmetic and up to 16× with respect to float-32 arithmetic.”
Now Graphcore is banging the drum for the IEEE to adopt the vendor’s FP8 format, designed for AI, as a standard anyone can work with. The company made its pitch this week, with Graphcore co-founder and chief technology officer Simon Knowles saying the “arrival of 8-bit floating point brings huge performance and efficiency benefits to AI computing. It’s also an opportunity for the industry to settle on a single open standard, rather than adopt a confusing mix of competing formats.”
AMD and Qualcomm also support Graphcore’s initiative, with John Kehrli, senior director of product management at Qualcomm, saying the proposal “has emerged as an attractive format for 8-bit floating point compute, offering significant performance and efficiency gains for inference, and can help reduce training and inference costs for cloud and edge.”
AMD is expected to support the FP8 format in the upcoming Instinct MI300A APU, which squeezes an AMD GPU and an Epyc 7004 processor into one package. We expect that there will also be regular MI300 discrete GPUs, and that they will support FP8 data and processing as well.
A standard would also benefit a range of AI chip makers, including SambaNova, Cerebras, and Groq.
Graphcore argues that using lower and mixed precision formats, such as 16-bit and 32-bit together, is common in AI and strikes a good balance between accuracy and efficiency at a time when Moore’s Law and Dennard scaling are slowing down.
FP8 gives the AI industry an opportunity to embrace an “AI-native” standard and interoperability between systems for both inference and training. Graphcore is also offering its specification to others in the industry until the IEEE formalizes a standard.
“With the continued increase in the complexity of deep learning applications, the scalability of machine learning systems has also become indispensable,” the Graphcore researchers wrote in their paper. “Training large distributed models presents a number of challenges, which depend on the effective use of the available compute, memory, and network resources shared between the different nodes, constrained by the available energy budget. In this context, the use of efficient numerical formats is critical, as it allows increased energy efficiency through both improved computational efficiency and improved communication efficiency in the exchange of data between processing units.”
Chipmakers have been evaluating lower precision formats for some time now. In 2019, IBM Research unveiled a four-core AI chip, built on 7 nanometer EUV technology, that supported both FP16 and a hybrid FP8 format for training and inference.
“This new hybrid method of training fully preserves model accuracy across a wider spectrum of deep learning models,” experts from IBM Research wrote in a blog post. “The hybrid FP8 format also overcomes the accuracy loss of earlier 8-bit training on models such as MobileNet (vision) and Transformer (NLP), which are more susceptible to information loss through quantization. To address this challenge, the hybrid FP8 scheme uses one FP8 bit format in the forward path for higher resolution and a different FP8 bit format for gradients in the backward path for greater range.”
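IBM’s split, one FP8 layout for the forward pass and a wider-range layout for gradients in the backward pass, can be made concrete with a small decoder. The 1-4-3 and 1-5-2 exponent/mantissa splits and plain IEEE-style biases below are assumptions for illustration; real FP8 proposals differ in their bias choices and in how all-ones exponents (infinity/NaN) are treated, which this sketch deliberately ignores:

```python
def fp8_decode(bits, exp_bits, man_bits, bias):
    """Decode an 8-bit float with a given exponent/mantissa split.
    IEEE-style normals and subnormals; all-ones exponents are treated
    as ordinary values here (real FP8 specs reserve them for inf/NaN)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# 1-4-3 (forward path): finer mantissa steps, narrower exponent range
# 1-5-2 (gradients):    coarser steps, much wider exponent range
max_143 = fp8_decode(0b0_1111_111, 4, 3, bias=7)    # largest 1-4-3 value under these assumptions
max_152 = fp8_decode(0b0_11111_11, 5, 2, bias=15)   # largest 1-5-2 value
```

Under these illustrative biases, the 1-4-3 layout tops out in the hundreds while the 1-5-2 layout reaches beyond a hundred thousand, which is why the wider-range format suits gradients, whose magnitudes vary far more during training.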
Two years later, IBM presented a test chip at the 2021 ISSCC event that supported 16-bit and 8-bit training as well as 4-bit and 2-bit inference.
“The sophistication and adoption of AI models is expanding rapidly and is now being used for drug discovery, modernizing legacy IT applications, and writing code for new applications,” IBM researchers wrote at the time. “But the rapid evolution of the complexity of AI models is also increasing the energy consumption of the technology, and a major problem has been creating advanced AI models without increasing the carbon footprint. Historically, the field has simply accepted that if the computational need is great, so will be the power needed to fuel it.”
Now the ball is in the IEEE’s court to bring everyone together and create a standard.