

Why AI Needs a New Digital Infrastructure

Rita Waite, Partner | Esube Bekele, VP Technology | Abi Sivananthan, VP Technology

Image generated using ChatGPT.

Generative AI is fundamentally transforming industries at an unprecedented pace. As models like OpenAI's GPT-4 and beyond become more widely adopted, their demand for robust and scalable digital infrastructure is becoming impossible to ignore. Existing infrastructure, largely built to handle traditional computing workloads, is no longer sufficient to meet the requirements of these advanced systems. That infrastructure includes computing resources optimized for traditional enterprise software-as-a-service (SaaS) applications and supercomputers designed for high-performance computing (HPC) workloads.

The rapid evolution of AI is placing immense pressure on cloud providers, data centers, and national governments to rethink their approach to computing infrastructure. Without a significant upgrade, we risk hitting bottlenecks that could stall the immense progress AI is poised to deliver. In fact, recent reports suggest these bottlenecks have already begun to emerge, with signs that AI performance improvements are slowing and potentially plateauing.

Generative AI's Unique Infrastructure Demands

Several attributes make generative AI models meaningfully different from traditional workloads. Most commonly discussed is the sheer scale. The immense size of these new models means their need for high-performance distributed computing clusters, advanced networking, massive memory pools, and highly specialized storage systems is staggering.

Generative AI models require specialized hardware—accelerators such as GPUs, TPUs, or custom ASICs working alongside CPUs—combined with specialized AI software stacks to handle complex tasks such as natural language processing and image generation. These models process vast datasets and require ultra-low latency to deliver real-time results. This is very different from what infrastructure in most traditional data centers was designed to manage.

Moreover, generative AI workloads are notorious for consuming enormous amounts of energy to run the massive-scale compute, networking, and associated cooling elements they require. For instance, NVIDIA’s latest Blackwell GPUs feature power ratings of up to 1 kilowatt (kW) per chip, and companies are building clusters using hundreds of thousands of these GPUs. As data centers scale to support these demands, they face power-related challenges that far exceed the capacities envisioned in their original designs. (These challenges are explained in more detail below.)
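A quick back-of-envelope calculation makes the scale concrete. This sketch assumes the 1 kW-per-GPU figure above, a hypothetical cluster of 100,000 GPUs, and an assumed power usage effectiveness (PUE) of 1.3 to account for cooling and other facility overhead:

```python
# Back-of-envelope: total facility power for a large GPU cluster.
# The ~1 kW/GPU figure is from the text; the cluster size and PUE
# of 1.3 (cooling and other overhead) are illustrative assumptions.
gpus = 100_000
watts_per_gpu = 1_000                        # ~1 kW per Blackwell-class GPU
pue = 1.3                                    # assumed power usage effectiveness

it_power_mw = gpus * watts_per_gpu / 1e6     # IT load in megawatts
facility_power_mw = it_power_mw * pue        # including cooling overhead
print(f"IT load: {it_power_mw:.0f} MW, facility draw: {facility_power_mw:.0f} MW")
```

At roughly 130 MW of continuous draw, such a cluster is comparable to the load of a small city—well beyond what most data centers were designed to deliver.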

In addition to placing immense stress on national power grids, these energy demands require advanced thermal management systems to address cooling and other challenges. Compounding the issue are network bottlenecks. Generative AI models typically cannot run on a single processor; they require many processors working in parallel, necessitating advanced networking solutions to transfer data efficiently without introducing latency, which could severely hinder performance.

Why Existing Infrastructure Is Outdated

Today’s data centers are reaching their physical limits in terms of power and cooling. They were primarily built for traditional IT workloads—handling relatively predictable tasks like processing transactions or storing data. Generative AI, by contrast, requires distributed computing clusters that must communicate rapidly, handle huge volumes of data, and ensure ultra-fast processing times.

Running AI models at scale comes with immense electricity demands—up to ten times that of standard compute operations. Maintaining a 100,000-GPU cluster for training large AI models, for example, can cost over $130 million annually in electricity alone. Current systems simply cannot sustain this level of resource consumption without radical redesign.
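The electricity figure above can be sanity-checked with another rough estimate. The per-GPU draw comes from the earlier discussion; the PUE and the industrial electricity rate here are assumptions, not actual operator numbers:

```python
# Rough sanity check of the annual electricity bill for a 100,000-GPU
# training cluster. The ~1 kW/GPU draw is from the text; the PUE (1.3)
# and electricity price ($0.12/kWh) are assumed, illustrative values.
gpus = 100_000
kw_per_gpu = 1.0
pue = 1.3                        # assumed facility overhead factor
hours_per_year = 8_760
price_per_kwh = 0.12             # assumed industrial rate, USD

annual_kwh = gpus * kw_per_gpu * pue * hours_per_year
annual_cost_musd = annual_kwh * price_per_kwh / 1e6
print(f"~${annual_cost_musd:.0f}M per year")  # on the order of the $130M cited
```

Even with conservative assumptions, the estimate lands in the same range as the figure cited above, underscoring why power has become a first-order design constraint.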

Infrastructure software also faces challenges. First, existing cloud and data center infrastructure software is not designed to handle the complex movement of data required for AI across a distributed data center architecture. Second, current software is primarily designed for CPU-dominated data centers and struggles to manage GPUs, TPUs, and other chips optimized for AI. Finally, generative AI infrastructure requires orchestration and optimization of a large number of CPU, GPU, and AI-optimized ASIC cluster nodes distributed across wide geographic areas. Traditional data center software cannot provide this capability.

The Global Importance of AI Infrastructure

The pressure to upgrade infrastructure is not just being felt by individual corporations or cloud providers. At a national level, countries are recognizing AI infrastructure as a critical strategic asset. In leading markets such as the U.S., AI infrastructure is supported by a strong ecosystem of hyperscalers and innovative startups, positioning the U.S. as a global leader in AI. However, maintaining this leadership requires significant additional investment across America’s industrial and research complex.

Other nations, such as France and the UAE, are also stepping up, investing in sovereign AI capabilities to reduce reliance on external technology providers. As AI infrastructure increasingly becomes synonymous with national security and economic independence, governments are allocating resources to ensure their competitiveness in this space.

The Call for New AI Infrastructure

To keep pace with the rapid growth of generative AI, we need to rethink how we approach infrastructure. Retrofitting legacy systems or patching existing software is no longer sufficient. Instead, we must develop purpose-built hardware and software designed specifically to support AI’s unique demands.

Moreover, hardware and software must be treated as integrated components of the AI infrastructure stack, requiring simultaneous optimization. This contrasts with traditional IT, where these elements were often treated as independent layers to be optimized sequentially.

We are going to need specialized hardware, efficient networking, improved power solutions, optimized storage, and distributed infrastructure software capable of massive expansion to enable AI models to scale efficiently. The AI infrastructure of the future must be energy-efficient, able to handle highly distributed workloads, and flexible enough to adapt to the next generation of AI models.

Leadership in AI Requires Infrastructure Investment Now

The future of AI depends on our ability to solve the infrastructure challenges that we face today as quickly as possible. The demand for generative AI is only going to increase—and without a new foundation, the digital infrastructure that powers this transformative technology will struggle to keep up.  

For businesses, governments, and other organizations, now is the time to invest in building such new infrastructure to ensure sustainable growth and competitive strength in the global AI landscape. Those who do so will gain a significant advantage in the global AI race, while those who delay risk falling behind.  

Stay tuned as we continue this discussion on the blog and explore solving AI’s bottlenecks and the critical role of physical and software layers.