Intel recently launched its Xeon Scalable processor family based on the 14nm Skylake-SP microarchitecture. At the high end, the Xeon Platinum 81xx series processors have up to 28 cores and frequencies of up to 3.6GHz. But at 200+ watts (initially!) and with socket flexibility enabling 2, 4, and 8+ socket configurations, cooling server nodes and racks presents immediate challenges for HPC clusters and node designs.
It is “immediate” for HPC because HPC applications require sustained computing throughput. Unlike most enterprise computing today, HPC is characterized by clusters and their nodes running at 100% utilization for sustained periods. These applications are compute limited, and true HPC requires the highest performance versions of the latest CPUs and GPUs. As such, the highest frequency offerings of Intel’s Xeon Phi 72xx MIC processors (Knights Landing), Nvidia’s P100 (300 watts today) and Intel’s Platinum Xeon Scalable Processors (Skylake) are required, along with their corresponding heat.
Unless you reduce node and rack density, the wattages of CPUs and GPUs are simply no longer addressable with air cooling alone. For HPC in the near term (and for enterprise computing in the longer term), an inflection point has been reached in the relationship between server density, the wattage of key silicon components and heat rejection.
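The trade-off between density and wattage can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names, the 72 nodes per rack and the 20 kW air-cooling budget are hypothetical values chosen for the example, not figures from Intel or Asetek. Only the 200 watt processor figure comes from the text above.

```python
def rack_heat_load_kw(cpu_watts, sockets_per_node, nodes_per_rack,
                      other_watts_per_node=0.0):
    """Total heat a rack must reject, in kilowatts."""
    node_watts = cpu_watts * sockets_per_node + other_watts_per_node
    return node_watts * nodes_per_rack / 1000.0


def max_nodes_within_budget(cpu_watts, sockets_per_node, rack_budget_kw,
                            other_watts_per_node=0.0):
    """Largest whole number of nodes whose heat fits a rack cooling budget."""
    node_watts = cpu_watts * sockets_per_node + other_watts_per_node
    return int(rack_budget_kw * 1000 // node_watts)


if __name__ == "__main__":
    # Hypothetical dense rack: 72 two-socket nodes with 200 W processors.
    print(rack_heat_load_kw(200, 2, 72))
    # Hypothetical 20 kW air-cooling budget per rack (an assumed limit).
    print(max_nodes_within_budget(200, 2, 20))  # prints 50
```

Under these assumed numbers, the processors alone exceed the air-cooling budget, so the rack would have to give up nodes to stay within it. That is the density reduction described above.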
At the 2017 International Supercomputing Conference in Frankfurt, Germany, the exhibit floor was awash with liquid cooling approaches for x86-based clusters. Many might be framed as aspirational “science projects” but there were also burgeoning approaches such as direct-to-chip liquid cooling (which has nine systems in the Top500 and Green500). In addition, Fujitsu announced its plan to build what will be the largest supercomputer in Taiwan at the National Center for High-Performance Computing (NCHC) using direct-to-chip hot water liquid cooling from Asetek.
Fortunately, proven and highly flexible direct-to-chip liquid cooling provides a low risk and high reliability path for cooling the new generation Xeons, thanks to its track record with Knights Landing. Intel’s Xeon Phi 72xx MIC processors (Knights Landing) have been out for over a year with wattages ranging from 215 to 275 watts. They have been successfully cooled at numerous HPC sites using direct-to-chip liquid cooling, such as the Top500’s Regensburg QPACE3, one of the first Intel Xeon Phi based HPC clusters in Europe.
The largest Xeon Phi direct-to-chip cooled system today is the Oakforest-PACS system, located in the Information Technology Center on the University of Tokyo’s Kashiwa Campus. It is managed jointly by the University of Tokyo and the University of Tsukuba. The system is made up of 8,208 computational nodes using Asetek Direct-to-Chip liquid cooled Intel Xeon Phi high performance processors with Knights Landing architecture. It is the highest performing system in Japan and #7 on the Top500 list.
While reliability is a given, implementing liquid cooling during a technology transition (as we are seeing with the latest processor wattages) requires an architecture flexible enough to adapt to a variety of heat rejection scenarios without being cost prohibitive. At the same time, it must adapt quickly to the latest server designs to allow a smooth transition that is manageable for both OEMs and HPC users. This allows HPC installations to move from air cooling to liquid cooling incrementally.
The accelerating success of Asetek’s Direct-to-Chip approach allows for this flexibility thanks to its distributed liquid cooling architecture. The distributed cooling architecture addresses the full range of heat rejection scenarios. It is based on low pressure, redundant pumps and closed loop liquid cooling within each server node. This approach allows for a high level of flexibility.
Asetek’s distributed pumping approach is based on placing coolers (integrated pumps/cold plates) within the server and blade nodes themselves. These coolers replace the CPU/GPU heat sinks in the server nodes to remove heat with hot water rather than much less efficient air. Asetek has over 4 million of these types of coolers deployed worldwide and, as of this writing, the MTTF of these pumps at production HPC sites is in excess of 37,000 years.
Unlike centralized pumping systems, distributed pumping isolates the pumping function within each server node, allowing for very low pressures to be used (4psi typical).
This mitigates failure risk and avoids the complexity, expense and high pressures required in centralized pumping systems. In most cases, a node contains multiple CPUs or GPUs and therefore multiple pumps. Because a single pump is sufficient to do the cooling, this provides redundancy at the individual server level.
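The redundancy argument can be sketched in a few lines. This is a simplified illustration, not a model of Asetek's hardware: the function names are hypothetical, and the probability calculation assumes that pumps fail independently of one another, which real deployments may not satisfy.

```python
def node_cooling_ok(pump_running, pumps_required=1):
    """True if enough pumps in the node's loop are still running.

    pump_running: one boolean per cooler (integrated pump/cold plate).
    pumps_required: assumed to be 1, since a single pump is sufficient.
    """
    return sum(1 for running in pump_running if running) >= pumps_required


def prob_all_pumps_failed(p_single_failure, pumps_per_node):
    """Probability that every pump in a node has failed.

    Assumes independent failures with the same probability per pump.
    """
    return p_single_failure ** pumps_per_node


if __name__ == "__main__":
    # Two-socket node with one pump down: the loop is still circulating.
    print(node_cooling_ok([True, False]))
    # Both pumps down: cooling is lost.
    print(node_cooling_ok([False, False]))
    # Hypothetical 1% per-pump failure probability over some interval.
    print(prob_all_pumps_failed(0.01, 2))
```

Under the independence assumption, each additional pump in the loop multiplies the chance of losing cooling by the per-pump failure probability. This is why a multi-socket node tolerates the loss of a single pump.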
Because of the low pressure, Asetek is able to use non-rigid tubing within the server, allowing it to be adapted quickly to OEM server designs. In addition, air cooled designs can be liquid cooled to support higher wattage CPUs/GPUs without needing entirely new layouts.
This flexibility also means the liquid cooling circuit in the server can easily incorporate memory, VRs and other high wattage components into the low PSI redundant pumping circuit.
Distributed pumping at the server, rack, cluster and site levels delivers flexibility in the areas of heat capture, coolant distribution and heat rejection that centralized pumping cannot.
Beyond HPC, the requirement for liquid cooling is emerging. Enterprise and hyperscale sites can be expected to take advantage of other Xeon Scalable Processors, in particular the Gold series. Xeon Gold 61xx and 51xx series processors feature up to 22 cores and target high-performance but more mainstream applications than the higher-end Platinum series. Yet these new processors are also available at up to 3.4GHz and support both two and four socket configurations. High performance enterprise and hyperscale systems are about to confront the same issues that are making liquid cooling a must-have for HPC. And direct-to-chip liquid cooling is ready and able to meet this need.