Summary
The recent DCD Connect London event focused on helping future-proof data centers. The atmosphere was optimistic, and most talks ran at full capacity. A panel on how data centers can manage the computationally challenging workloads of generative artificial intelligence (AI) stood out by proposing solutions to the high temperatures generated by AI’s silicon components.
Optimizing data center performance for generative AI workloads
On the second day of DCD Connect, the panel discussions focused mainly on AI, examining its use cases and evaluating its lasting effect on the data center landscape. Among the morning panels, a discussion titled “Does generative AI run on thin air?” offered valuable insight into how computationally demanding AI workloads affect data centers. The speakers were Tor Björn Minde, director, and Jon Summers, research lead in data centers, both from the Research Institutes of Sweden (RISE).
The session opened by highlighting that generative AI models are becoming increasingly complex and computationally challenging, while the generative AI user base is growing faster than Instagram’s did. Tor Björn Minde explained that the number of parameters in generative AI models increases over time, making the models more sophisticated. He stated: “GPT-4, for instance, uses 1.7 trillion parameters and 13 tera-tokens, which equates to 2.15×10¹⁰ PFLOPs of compute, requiring 25,000 A100 GPUs and 100 days to process, costing $100 million.” Training AI models also consumes significant energy; as a result, data centers require effective cooling systems to moderate the temperature of their processors.
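The quoted compute figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes, purely for illustration, a sustained throughput of roughly 100 TFLOP/s per A100 (a fraction of its peak, accounting for utilization); that assumption is not from the talk.

```python
# Back-of-envelope check of the quoted GPT-4 training-compute figure:
# 25,000 A100 GPUs running for 100 days.
GPUS = 25_000
DAYS = 100
SUSTAINED_FLOPS_PER_GPU = 100e12  # 100 TFLOP/s, an assumed effective rate

seconds = DAYS * 24 * 3600
total_flops = GPUS * seconds * SUSTAINED_FLOPS_PER_GPU
total_pflops = total_flops / 1e15

print(f"Total compute: ~{total_pflops:.2e} PFLOPs")
```

Under that assumption the result lands close to the 2.15×10¹⁰ PFLOPs cited in the session, which suggests the headline numbers are mutually consistent.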
Co-speaker Jon Summers highlighted that there is currently a significant inherent inefficiency in processors. Historically, around 65% of the transistors on a processor cannot be powered at the same time, a phenomenon known as “dark silicon,” and as technology advances, companies are reducing the share of dark silicon present. One example is NVIDIA’s move from the A100 to the H100 GPU, which cut dark silicon from 65% to 51%.
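The significance of that shift is easier to see as simple arithmetic: the share of transistors that can be active at once rises from 35% to 49%, about 40% more concurrently powered silicon, which is precisely where the extra heat discussed next comes from.

```python
def active_fraction(dark_silicon_share: float) -> float:
    """Fraction of transistors that can be powered simultaneously."""
    return 1.0 - dark_silicon_share

a100_active = active_fraction(0.65)  # A100: 35% of transistors active
h100_active = active_fraction(0.51)  # H100: 49% of transistors active

# Relative increase in concurrently powered silicon, A100 -> H100
increase = h100_active / a100_active - 1.0
print(f"~{increase:.0%} more transistors powered at once")
```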
Jon Summers delved deeper into the science behind reducing the dark silicon percentage. Reducing dark silicon causes thermal design power (TDP) to rise, and a higher-TDP component generates more heat, which, at the current limit of AI processors, is starting to exceed what air cooling can handle. The H100 has a maximum operating temperature of 86°C, and thermodynamics dictates that the cooling air would need to be supplied at 16°C, which is not feasible in most situations; hence, liquid would be the most effective medium for tackling the problem. He argued that the research points toward direct-to-chip cooling as the most effective solution because it offers reduced energy consumption, increased processing capacity, reduced space usage, improved uptime, and lower weight compared with immersion cooling.
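A minimal physics sketch helps illustrate why air struggles at these power densities. It applies the heat-transfer relation Q = ṁ·cp·ΔT, assuming (for illustration only, not figures from the talk) roughly 700 W of heat per accelerator and air that warms by a 16 K margin as it passes over the chip.

```python
# Airflow required to carry away one high-power GPU's heat.
# Assumptions (illustrative): ~700 W heat load, 16 K air temperature rise.
P_WATTS = 700.0   # assumed heat dissipated per GPU
DELTA_T = 16.0    # K, assumed air temperature rise across the chip
CP_AIR = 1005.0   # J/(kg*K), specific heat of air
RHO_AIR = 1.2     # kg/m^3, air density at room conditions

mass_flow = P_WATTS / (CP_AIR * DELTA_T)  # kg/s of air needed
vol_flow = mass_flow / RHO_AIR            # m^3/s
cfm = vol_flow * 2118.88                  # 1 m^3/s ~= 2118.88 CFM

print(f"~{cfm:.0f} CFM of air per GPU")
```

Tens of CFM per GPU, multiplied across tens of thousands of accelerators, shows the scale of airflow an all-air design would demand; water's far higher volumetric heat capacity is what makes direct-to-chip liquid cooling attractive.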
Omdia view
Optimizing data center performance for generative AI workloads is critical because of the significant impact these workloads have on both technological advancement and operational efficiency. Generative AI models such as GPT-4, Bard, and Bing AI have demonstrated the ability to create human-like text, images, and code substantially faster than the most proficient human, making them invaluable for a wide range of applications. Additionally, the number of AI servers is growing rapidly: Omdia predicts high double-digit growth over the next few years, so it is in data centers’ best interests to prepare accordingly for future workloads.
Appendix
Author
Aaron Lewis, Analyst, Cloud and Data Centre