Exclusive look inside the first American Exascale Supercomputer

HPCwire takes you to the Frontier data center at DOE’s Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee, for an interview with Frontier Project Director Justin Whitt. Frontier, the first supercomputer to surpass 1 exaflops on the Linpack benchmark, took the number one spot on the Top500 in May and broke new ground in energy efficiency. The HPE/AMD system delivers 1.102 Linpack exaflops of computing power in a power envelope of 21.1 megawatts, an efficiency of 52.23 gigaflops per watt.
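As a quick sanity check on that efficiency figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the inputs are the rounded values quoted here (1.102 exaflops and 21.1 megawatts), and the function and variable names are made up for the example, not taken from any Top500 or ORNL tooling.

```python
# Rounded figures quoted in the article; all names here are illustrative.
LINPACK_EXAFLOPS = 1.102   # sustained Linpack performance
POWER_MEGAWATTS = 21.1     # power draw during the benchmark run


def gigaflops_per_watt(exaflops: float, megawatts: float) -> float:
    """Convert exaflops and megawatts into gigaflops per watt."""
    flops = exaflops * 1e18   # 1 exaflops = 10^18 floating-point ops per second
    watts = megawatts * 1e6   # 1 megawatt = 10^6 watts
    return flops / watts / 1e9  # 1 gigaflops = 10^9 flops


efficiency = gigaflops_per_watt(LINPACK_EXAFLOPS, POWER_MEGAWATTS)
print(f"{efficiency:.2f} gigaflops per watt")
```

With these rounded inputs the result agrees with the published 52.23 gigaflops per watt to two decimal places.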

Whitt describes what it was like to stand up America’s first exascale supercomputer, delving into the system details, the power and cooling requirements, the first applications running on the system, and what’s next for the leading computing facility.

Transcript (lightly edited):

Tiffany Trader: Hi, I’m here with Justin Whitt. We are in front of the Frontier supercomputer, the HPE/AMD system that recently became the first to cross the Linpack exaflops milestone. Justin is Frontier’s project director. You and your team must be feeling pretty good about this.

Justin Whitt: We are very excited. It was quite an achievement. The team has worked hard. Our industry partners, HPE and AMD, have worked extremely hard to make this possible. And we couldn’t be happier. It’s just great.

Trader: Well, congratulations. So tell us about the system. We are in front of some cabinets, can you tell us what’s inside?

Frontier sheet with diagram on display (yes, they all have diagrams)

Whitt: Sure. These are HPE Cray EX systems. We have 74 of these cabinets, 9,408 nodes in all. Each node has one CPU and four GPUs. The GPUs are [AMD] MI250Xs. The CPUs are AMD Epyc CPUs. It’s all connected by the high-speed Cray interconnect called Slingshot. And it’s a water-cooled system. Last October we started purchasing hardware. We built the system, tested it and have been using it for a few months now.

Trader: So I understand the process of benchmarking it in time for the Top500 came down to the wire. Would you like to share something about that and what that experience was like?

Whitt: It was right down to the wire. The funny thing about these systems is that they are so big that you can’t build them for the first time until all the hardware is there. So when the hardware arrived, we started putting things together, and it takes a while. Once we had it, you know, and had all the hardware working, then you start tuning the system. And we’ve been in that mode for a few months now, where we make adjustments during the day, fine-tune, and check our work at night by running benchmarks on it and seeing how we’ve done. And we were running out of time, you know, the May list was coming up. And we were still running in early, maybe mid-May, always at night, with us and the engineers across the country looking at the power profiles from home and saying, “Oh, this looks like a good run,” or, “Hey, let’s kill it and start it over.” And literally a few hours before the deadline, we were able to get a run that broke the exascale barrier.

Trader: That was 1.1 exaflops on the High Performance Linpack benchmark. And then the system also got a very impressive number two on the Green500. And its companion, the smaller test and development system, the Frontier TDS (Borg, I think you call it), was number one with a pretty impressive energy efficiency rating.

Whitt: Yes, over 60 gigaflops per watt for a single cabinet. So very impressive. And actually, I think the top four spots on the Green500 were the same Frontier architecture.

Trader: And tell us a little more about the cooling. I know you’ve done a lot of facility upgrades for the power and cooling, and the computer is fully liquid cooled?

Whitt: It is, yes it is. So this is the data center where we used to have the Titan supercomputer. So we removed that supercomputer and spruced up this data center. We knew we needed more power and we needed more cooling. So we brought 40 megawatts of power to the data center. And we have 40 megawatts of cooling available. Frontier uses only about 29 megawatts of that at its peak. And so there was a lot of construction work to get that done and get the cooling in place before the system.

Trader: And does that liquid cooling dynamically adapt to the workloads?

Whitt: Yes, it does. These are incredibly instrumented machines right now, where even down to the individual components on the individual node boards, there are sensors monitoring the temperature so we can adjust the cooling levels up and down to make sure the system is running at a safe temperature.

Trader: And what would you say about the volume level in the room? We’re using a microphone here, but it’s really not too loud as far as data centers go.

Whitt: Correct. You probably visited during the Titan days, when we would have worn earmuffs and couldn’t have had this conversation. Summit was a lot quieter than that. And this is even a little quieter than Summit, so they get quieter because they’re liquid cooled. We have no fans. We don’t have rear doors where we exchange heat with the room.

Trader: So it’s 100 percent liquid cooled, and the [fan] noise we hear actually comes from the storage systems, which are also HPE and are air cooled.

Whitt: Yeah, they’re a little louder, so they’re on the other side of the room, and you can… they’re pretty loud.

Trader: I understand you are coming up on the acceptance process. How is that going?

Whitt: We’re actually getting to the point where we’re going to start the acceptance process. So basically we’ve done a lot of testing and tweaking with the pre-production software so far. And so we need to get all the production software on the system, you know, from the networking software to the programming environments, everything we’re going to use when we actually have researchers on the system. Once we’ve done that and everything is checked out, we’ll start the acceptance process on the machine.

Trader: So what’s running on Frontier now?

Whitt: So at this point we’re still doing some benchmark testing. And we also do a lot of checks on these new software packages that we use. So we set things up, we run benchmarks, we run real-world applications on the system to make sure that since we’ve upgraded the software, we haven’t introduced any new bugs into the system.

Trader: Is there a dashboard that you can pull up where you can see exactly what’s running on it?

Whitt: Correct. Correct.

Trader: That is cool.

Whitt: And you know, I mentioned all the instrumentation and sensors. On the same dashboard, we can look at temperatures down to the individual GPUs, to see how hot the GPUs are running, to see, you know, what the flow rates are through the system. It’s really impressive.

Trader: And what will be some of the very first workloads when it goes into early science?

Whitt: Here at OLCF [Oak Ridge Leadership Computing Facility], we have the Center for Accelerated Application Readiness, we call it CAAR. We jokingly say it’s our application readiness vehicle. That group supports eight apps for the OLCF and twelve apps for the Exascale Computing Project. So the plan is that on day one of the system, we’ll have more than 20 apps ready to do science.

Trader: Exascale day-one readiness is the slogan, as they say. And given the long procurement cycles for these massive instruments, you’re already planning for the next supercomputer after Frontier, which you’ll call OLCF-6. So how do you prepare for that system, and where is it going?

Whitt: Yes, in project terms, you know, Frontier was OLCF-5, the next system will be OLCF-6. And we’re really just at the conceptual thinking stage about it right now. That system will probably go in this room, we have room for that system, both from a space and from a power and cooling perspective.

Trader: Partly because these [Frontier machines] are so compact that you needed fewer cabinets.

Whitt: That’s exactly right. Yes.

Trader: And then here’s Summit, a previous Top500 number one system, an IBM/Nvidia machine. What are your plans for Summit once you have Frontier in full production?

Whitt: Summit is still a great system. It is currently widely used. Even right now it’s, you know, probably 95 percent or maybe more full, with researchers running code on that system. And so it’s still a great system right now. Normally we like to overlap systems for at least a year so that we can make sure that Frontier is stable and give people time to transfer their data and their applications to the new system. But Summit is a really good system, so we’ll have to wait and see, but at least we’ll let it run for a year and overlap with Frontier.

Trader: And then a very important question. We talked a little bit about it, but maybe from a more personal point of view, looking at the science that Frontier and exascale make possible, what are you most excited about?

Whitt: So I’m excited about a lot of different science. You know, really, with the scale of these systems, you’re going to be able to approach problems that we’ve never been able to approach before. I am a CFD person by training. So I always have a soft spot for the CFD codes. But some of the most exciting things are the work in artificial intelligence and those workloads. You know, you have researchers looking at how to develop better treatments for different diseases, how to improve the effectiveness of treatments, and these systems are capable of processing incredible amounts of data. Think lab reports or pathology reports, thousands of them, and they can draw conclusions from these reports that no human ever could, but that a supercomputer can. And some of those are really exciting to me.

Trader: Speaking of CFD, do you use computational fluid dynamics to model water flow in the cooling system?

Whitt: We do. Yes, we do. That’s a recent effort.

Trader: That’s pretty neat. Okay. Thank you very much, we appreciate the tour.

Whitt: You are always welcome.

Trader: Congratulations.

Whitt: Thank you.
