AI Server Cooling Challenges: A Battle Between Temperature and Performance

2025-02-10

AI servers pack highly parallel compute into densely populated nodes, so they draw a great deal of power and generate significant heat within a confined space. Under heavy load, that heat is hard to dissipate, which leads to reduced hardware performance (for example, through thermal throttling) or even hardware damage. Efficient cooling is therefore one of the major challenges facing AI servers.

Traditional Cooling Solutions: Challenges with High-Power AI Chips

Traditional server cooling solutions resemble those used in general computing and focus primarily on the chips that consume the most power. Typically, heat is transferred from the chips via heat pipes and heat spreaders to multi-fin heat sinks, and then actively dissipated by fans.

However, this air-cooling approach has proven inadequate for modern AI servers. The reason is that the power consumption of high-performance AI chips rises sharply as their computing power increases, and with it the amount of heat that must be removed.

What is the Cooling Limit of Air Cooling?

Research reports suggest that air cooling tops out at around 250 W in a 2U server space, and at roughly 400 W to 600 W in a 4U space.

For context, "U" (rack unit) is a standardized measurement defined by the Electronic Industries Alliance (EIA). One "U" is equivalent to a height of 4.445 cm (1.75 inches), and a standard server rack is typically 42U tall. In practice, however, the number of 1U servers a rack can hold is usually limited by cooling constraints rather than by the available height.

For example, a server built around NVIDIA H100 chips needs a 4U chassis when it uses an air-cooling module.
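The rack-unit arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names, the rack power budget, and the per-server power draw are hypothetical values chosen for the example, and only the 1.75-inch rack unit and the 42U rack height come from the text.

```python
# Rack-unit arithmetic. Only INCHES_PER_U and the 42U rack height come from
# the article; the power figures used in the example are hypothetical.
INCHES_PER_U = 1.75
CM_PER_INCH = 2.54


def rack_units_to_cm(units: int) -> float:
    """Convert a height in rack units (U) to centimetres."""
    return units * INCHES_PER_U * CM_PER_INCH


def servers_per_rack(rack_units: int, server_units: int,
                     rack_budget_w: float, server_power_w: float) -> int:
    """Servers that fit in a rack, limited by whichever runs out first:
    physical height or the rack's power/cooling budget."""
    if server_units <= 0 or server_power_w <= 0:
        raise ValueError("server_units and server_power_w must be positive")
    by_space = rack_units // server_units
    by_budget = int(rack_budget_w // server_power_w)
    return min(by_space, by_budget)


if __name__ == "__main__":
    print(f"1U  = {rack_units_to_cm(1):.3f} cm")
    print(f"42U = {rack_units_to_cm(42):.2f} cm")
    # Hypothetical rack: 15 kW cooling budget, 1U servers drawing 500 W each.
    print("1U servers that fit:", servers_per_rack(42, 1, 15_000, 500))
```

With these assumed numbers the rack has room for 42 servers, but the cooling budget caps it at 30, which is the sense in which cooling rather than height sets the limit.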

Cooling Solutions: Liquid Cooling and Immersion Cooling

To tackle these cooling challenges, liquid cooling and immersion cooling have emerged as the two leading solutions, especially in high-density environments. Once power per rack exceeds 30 kW, hotspots become more prominent and call for more advanced strategies such as liquid cooling. At 60 kW to 80 kW per rack, direct chip-level liquid cooling becomes increasingly common.
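As a rough illustration, the rack-density thresholds quoted above can be written as a simple lookup. The function name and the returned labels are hypothetical, and the cut-offs are the article's rules of thumb rather than engineering limits; a real design decision would also weigh airflow, facility water, and cost.

```python
def suggest_cooling(rack_power_kw: float) -> str:
    """Map rack power density to the cooling approach the article associates
    with it. Thresholds (30 kW, 60 kW) are rules of thumb from the text."""
    if rack_power_kw < 0:
        raise ValueError("rack_power_kw must be non-negative")
    if rack_power_kw >= 60:
        # "60 kW to 80 kW per rack": direct chip-level liquid cooling
        return "direct chip-level liquid cooling"
    if rack_power_kw > 30:
        # "exceeds 30 kW": hotspots call for liquid cooling
        return "liquid cooling"
    return "air cooling"


if __name__ == "__main__":
    for kw in (15, 45, 70):
        print(f"{kw} kW per rack -> {suggest_cooling(kw)}")
```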

Liquid Cooling: Efficient and Effective for High-Density AI Workloads

Liquid cooling works by circulating a cooling liquid (such as water, 3M Novec, or Fluorinert) through a cold plate that directly contacts components like CPUs or GPUs. The heat is absorbed by the liquid coolant and then transferred via a heat exchanger or radiator to the surrounding air. The cooled liquid is then recirculated, ensuring continuous cooling.

Compared to traditional air cooling, liquid cooling removes heat far more efficiently, which makes it especially effective for AI workloads. A liquid can carry much more heat than air: water, for instance, absorbs on the order of 3,500 times more heat per unit volume for the same temperature rise. The liquid absorbs heat from the internal hardware and transports it to an external medium, such as air, for dissipation, which is why liquid cooling systems are often used to handle the massive heat generated in confined spaces.
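The per-unit-volume comparison follows from the steady-state heat balance Q = ρ · V̇ · c_p · ΔT, where V̇ is the volumetric flow of coolant. The sketch below works through it for water and air. The fluid properties are approximate room-temperature textbook values, and the 1 kW heat load and 10 K coolant temperature rise are assumptions chosen for the example, not figures from the article.

```python
# Approximate properties near room temperature (textbook values).
WATER = {"density_kg_m3": 997.0, "cp_j_per_kg_k": 4180.0}
AIR = {"density_kg_m3": 1.2, "cp_j_per_kg_k": 1005.0}


def volumetric_heat_capacity(fluid: dict) -> float:
    """Heat absorbed per cubic metre per kelvin of temperature rise, J/(m^3*K)."""
    return fluid["density_kg_m3"] * fluid["cp_j_per_kg_k"]


def required_flow_m3_per_s(heat_w: float, delta_t_k: float, fluid: dict) -> float:
    """Coolant flow needed to carry away heat_w watts with a temperature rise
    of delta_t_k, from Q = rho * V_dot * cp * dT."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_w / (volumetric_heat_capacity(fluid) * delta_t_k)


if __name__ == "__main__":
    heat_w, delta_t_k = 1000.0, 10.0  # assumed: 1 kW load, 10 K coolant rise
    water_flow = required_flow_m3_per_s(heat_w, delta_t_k, WATER)
    air_flow = required_flow_m3_per_s(heat_w, delta_t_k, AIR)
    ratio = volumetric_heat_capacity(WATER) / volumetric_heat_capacity(AIR)
    print(f"water: {water_flow * 1000 * 60:.2f} L/min")
    print(f"air:   {air_flow * 1000:.1f} L/s")
    print(f"water carries about {ratio:.0f}x more heat per unit volume")
```

Under these assumptions, removing 1 kW takes about 1.4 L/min of water but about 83 L/s of air. Dielectric coolants have a lower volumetric heat capacity than water, so their advantage over air is smaller, though still large.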

Liquid cooling systems excel in high power-density environments. Note, however, that liquid cooling typically targets only the CPUs or GPUs. The rest of the system still produces heat, so additional air conditioning may be required to cool the other components in the room.

Immersion Cooling: A Revolutionary Approach to Cooling

Immersion cooling involves submerging electronic components in a non-conductive cooling liquid, such as 3M Novec or Fluorinert. The liquid absorbs the heat generated by the components and is circulated to a heat exchanger, where it is cooled before being recirculated.

Immersion cooling has gained significant attention in high-performance computing (HPC) data centers because it supports higher power densities and can achieve a lower power usage effectiveness (PUE), a metric for which lower is better. One of its major advantages is that it cools not only CPUs but also other components, such as printed circuit boards (PCBs) and motherboards, which are typically challenging to cool with traditional methods.
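For readers unfamiliar with the metric, PUE is the ratio of total facility power to the power delivered to IT equipment, so a value of 1.0 would mean no overhead for cooling or power distribution. The sketch below shows the calculation. The function name and the two sets of power figures are hypothetical and are not measurements of any particular cooling technology.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


if __name__ == "__main__":
    # Hypothetical facilities, each with 1,000 kW of IT load.
    print(f"500 kW overhead: PUE = {pue(1500, 1000):.2f}")
    print(f"100 kW overhead: PUE = {pue(1100, 1000):.2f}")
```

The less power a facility spends on cooling per watt of IT load, the closer its PUE gets to 1.0.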

Conclusion: Finding the Right Cooling Solution for AI Servers

As AI servers continue to push the boundaries of computing power, traditional air cooling struggles to keep up. Liquid cooling and immersion cooling offer promising alternatives, delivering higher cooling efficiency and supporting the growing power needs of modern AI workloads. Each has its trade-offs and use cases: liquid cooling suits high-density environments where CPUs and GPUs dominate the heat load, while immersion cooling suits deployments that need every component cooled. Ultimately, selecting the right cooling solution is crucial for maintaining AI server performance and reliability in the face of escalating power demands.