This article is part of VentureBeat’s special issue, “AI at Scale: From Vision to Prosperity.” Read more from this special issue here.
If you traveled 60 years back to Stevenson, Alabama, you’d find the Widows Creek Fossil Plant, a 1.6-gigawatt generating station with one of the tallest chimneys in the world. Today, a Google data center stands where the Widows Creek plant once did. Instead of running on coal, the facility’s old transmission lines carry renewable energy to power the company’s online services.
That metamorphosis, from a carbon-burning facility to a digital factory, is emblematic of a global shift in digital infrastructure. And we’re about to see manufacturing intelligence kick into high gear thanks to AI factories.
These data centers are decision-making machines that consume computing, networking and storage resources as they transform information into insights. Data centers full of such hardware have grown in record time to satisfy the insatiable demand for artificial intelligence.
The infrastructure that supports AI inherits many of the same challenges that defined industrial factories, from power to scalability and reliability, demanding modern solutions to century-old problems.
The new labor force: Compute power
In the age of steam and steel, labor meant thousands of workers operating machinery around the clock. In today’s AI factories, output is determined by compute power. Training large AI models requires vast processing resources. According to Aparna Ramani, VP of engineering at Meta, training compute for these models is growing by roughly a factor of four per year across the industry.
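To put that rate in perspective, here is a minimal back-of-the-envelope sketch (ours, not Meta’s; the baseline of 1.0 is an arbitrary unit, not a real figure) showing how a fourfold annual increase compounds:

```python
# Back-of-the-envelope sketch of how "4x per year" compounds.
# The baseline of 1.0 is an arbitrary unit, not a real figure.
GROWTH_FACTOR = 4  # annual growth in training compute, per Ramani's estimate

compute = 1.0
for year in range(1, 6):
    compute *= GROWTH_FACTOR
    print(f"Year {year}: {compute:,.0f}x baseline training compute")

# Year 1: 4x ... Year 5: 1,024x -- a thousandfold increase in five
# years, far outpacing the efficiency gains discussed below.
```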
That level of scaling is on track to create some of the same bottlenecks that exist in the industrial world. There are supply chain constraints, to begin with: GPUs, the engines of the AI revolution, come from a handful of manufacturers, are extremely complex to produce and are in enormous demand. It is not surprising that they are subject to cost volatility.
In an effort to circumvent some of the supply constraints, big names like AWS, Google, IBM, Intel and Meta are designing their own custom silicon. These chips are optimized for power, performance and cost, with features tailored to their respective workloads.
This shift isn’t just about hardware, though. There is also concern about how AI technologies will impact the job market. Research published by Columbia Business School studied the investment management industry and found that the adoption of AI led to a 5% decline in the share of labor income, mirroring the shifts seen during the Industrial Revolution.
“AI is likely to be transformative for most, perhaps all, sectors of the economy,” said Professor Laura Veldkamp, one of the authors of the paper. “I am very optimistic that we will find useful work for many people. But there are switching costs.”
Where will we find the power to scale?
Cost and availability aside, the GPUs that serve as AI factory workers are notoriously power hungry. When the xAI team brought its Colossus supercomputer cluster online in September 2024, it reportedly had access to somewhere between seven and eight megawatts from the Tennessee Valley Authority. But a cluster of 100,000 H100 GPUs requires far more than that, so xAI brought in VoltaGrid mobile generators to make up the difference temporarily. In early November, Memphis Light, Gas & Water reached a more permanent agreement with the TVA to provide xAI with an additional 150 megawatts of capacity. But critics counter that consumption at the site burdens the city’s grid and contributes to poor air quality. And Elon Musk already has plans for another 100,000 H100/H200 GPUs under the same roof.
According to McKinsey, the power needs of data centers are expected to grow to approximately three times current capacity by the end of the decade. At the same time, the rate at which processors double their performance efficiency is slowing. Performance per watt is still improving, but at a slower pace, and certainly not fast enough to keep up with the demand for computing horsepower.
So, what does it take to match the feverish adoption of AI technologies? A report from Goldman Sachs suggests that US utilities need to invest about $50 billion in new generation capacity just to support data centers. Analysts also expect data center power consumption to drive about 3.3 billion cubic feet per day of new natural gas demand by 2030.
Scaling is more difficult as AI factories get bigger
Training the models that make AI factories accurate and efficient can take tens of thousands of GPUs, all working in parallel, for months at a time. If a GPU fails during training, the run must be stopped, rolled back to a recent checkpoint and resumed. As the complexity of AI factories increases, so does the likelihood of failure. Ramani addressed this concern during an AI Infra @ Scale presentation:
“Stopping and restarting is a pain. But it’s made worse by the fact that, as the number of GPUs increases, so does the probability of failure. And at some point, the number of failures could become so overwhelming that we lose too much time mitigating them and barely complete a training run.”
According to Ramani, Meta is working on near-term ways to detect failures faster and to get back up and running more quickly. Further out on the horizon, research into asynchronous training may improve fault tolerance while simultaneously improving GPU utilization and distributing training runs across multiple data centers.
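To see why failures compound with cluster size, consider a simple independent-failure model; this is an illustrative assumption of ours, not Meta’s actual reliability math. If each GPU has a small probability p of failing in a given hour, the chance that at least one of N GPUs fails is 1 − (1 − p)^N:

```python
# Illustrative sketch assuming independent GPU failures -- a
# simplification for intuition, not Meta's reliability model.

def prob_any_failure(num_gpus: int, p_per_gpu: float) -> float:
    """Probability that at least one GPU fails in a one-hour window."""
    return 1.0 - (1.0 - p_per_gpu) ** num_gpus

p = 1e-6  # hypothetical per-GPU hourly failure rate
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} GPUs: {prob_any_failure(n, p):.1%} chance of an interruption per hour")

# At 100,000 GPUs, even a one-in-a-million hourly failure rate gives
# roughly a 10% chance of rolling back to a checkpoint every hour,
# which is why faster detection and restarts matter so much.
```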
The ever-present AI will change the way we do business
Just as the factories of the past relied on new technologies and organizational models to scale the production of goods, AI factories consume computing power, networking infrastructure and storage to create tokens, the smallest units of information used by an AI model.
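For readers unfamiliar with tokens, here is a minimal sketch using the open-source tiktoken tokenizer; the tokenizer choice is ours for illustration, not something the article specifies:

```python
# Minimal illustration of what a "token" is.  Requires: pip install tiktoken
# (the tokenizer choice is ours; the article names no specific one).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "AI factories transform information into insights."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens for {len(text)} characters")
print([enc.decode([t]) for t in token_ids])  # the sub-word pieces
```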
“This AI factory produces, creates something of great value, a new product,” Nvidia CEO Jensen Huang said during his Computex 2024 keynote. “It’s completely applicable to almost every industry. And that’s why it’s a new Industrial Revolution.”
McKinsey estimates that generative AI has the potential to add the equivalent of $2.6 trillion to $4.4 trillion in annual economic benefit across 63 different use cases. In every application, whether the AI factory is hosted in the cloud or deployed and managed in-house, the same infrastructure challenges must be overcome, just as in an industrial factory. According to the same McKinsey report, achieving even a quarter of that growth by the end of the decade will require another 50 to 60 gigawatts of data center capacity, to begin with.
But the result of this growth is poised to change the IT industry forever. Huang explained that AI factories will make it possible for the IT industry to generate intelligence for a $100 trillion industry. “It will be a manufacturing industry. Not an industry of manufacturing computers, but the use of computers in manufacturing. This has never happened before. It’s a unique thing.”