Important developments from NVIDIA, the world's leading AI systems technology company.
JENSEN HUANG. (46:29) So this is now 720 petaflops, almost an exaflop, for training, and the world's first one-exaflop machine in one rack. Just so you know, there are only a couple, two or three, exaflop machines on the planet as we speak. And so this is an exaflop AI system in one single rack. Well, let's take a look at the back of it. So this is what makes it possible. That's the back, that's the DGX NVLink spine. 130 terabytes per second goes through the back of that chassis. That is more than the aggregate bandwidth of the internet, so we could basically send everything to everybody within a second. And so we have 5,000 cables, 5,000 NVLink cables, in total two miles. Now this is the amazing thing: if we had to use optics, we would have had to use transceivers and retimers, and those transceivers and retimers alone would have cost 20,000 watts, 20 kilowatts, just transceivers alone, just to drive the NVLink spine. As a result, we did it completely for free over NVLink switch, and we were able to save the 20 kilowatts for computation. This entire rack is 120 kilowatts, so that 20 kilowatts makes a huge difference. It's liquid cooled. What goes in is 25°C, about room temperature. What comes out is 45°C, about your jacuzzi. Room temperature goes in, jacuzzi comes out, 2 liters per second. We could sell a peripheral. 600,000 parts. Somebody used to say, you know, you guys make GPUs, and we do, but this is what a GPU looks like to me. When somebody says GPU, I see this. Two years ago when I saw a GPU, it was the HGX. It was 70 pounds, 35,000 parts. Our GPUs now are 600,000 parts and 3,000 pounds. 3,000 pounds.
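Those cooling figures can be sanity-checked with a quick back-of-the-envelope calculation. The Python sketch below estimates the heat a 2-liter-per-second loop carries across the stated 20°C rise; the density and specific-heat values are standard assumptions for water, not figures from the keynote.

```python
# Rough sanity check of the NVL72 liquid-cooling figures quoted above.
# Assumption: the coolant behaves like water (density ~1 kg/L,
# specific heat ~4186 J/(kg*K)); the actual coolant may differ.

FLOW_L_PER_S = 2.0        # stated flow rate: 2 liters per second
T_IN_C = 25.0             # stated inlet temperature
T_OUT_C = 45.0            # stated outlet temperature
DENSITY_KG_PER_L = 1.0    # assumed: water
SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # assumed: water

mass_flow_kg_s = FLOW_L_PER_S * DENSITY_KG_PER_L
heat_removed_w = mass_flow_kg_s * SPECIFIC_HEAT_J_PER_KG_K * (T_OUT_C - T_IN_C)

print(f"Heat carried by coolant: {heat_removed_w / 1000:.0f} kW")
# ~167 kW, comfortably above the stated 120 kW rack power,
# so the quoted flow rate and temperatures are mutually consistent.
```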
NVIDIA GB200 NVL72 Delivers Trillion-Parameter LLM Training and Real-Time Inference
By Ivan Goldwasser, Harry Petty, Pradyumna Desale and Kirthi Devleker
Mar 18, 2024
What is the interest in trillion-parameter models? We know many of the use cases today, and interest is growing due to the promise of increased capacity for:
- Natural language processing tasks like translation, question answering, abstraction, and fluency.
- Holding longer-term context and conversational ability.
- Multimodal applications combining language, vision, and speech.
- Creative applications like storytelling, poetry generation, and code generation.
- Scientific applications, such as protein folding predictions and drug discovery.
- Personalization, with the ability to develop a consistent personality and remember user context.
The benefits are great, but training and deploying large models can be computationally expensive and resource-intensive. Computationally efficient, cost-effective, and energy-efficient systems architected to deliver real-time inference will be critical for widespread deployment. The new NVIDIA GB200 NVL72 is one such system, built for the task.
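To give a sense of scale, here is a rough, illustrative calculation of the memory required just to hold the weights of a trillion-parameter model at common precisions. The bytes-per-parameter figures are standard, but weights alone understate the real footprint, which also includes KV cache, activations, and (for training) optimizer state.

```python
# Hypothetical back-of-the-envelope: memory to hold 1T parameters' weights.
# Real deployments also need KV cache, activations, and optimizer state,
# so actual requirements are considerably larger than shown here.

PARAMS = 1_000_000_000_000  # one trillion parameters

bytes_per_param = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

for precision, nbytes in bytes_per_param.items():
    total_tb = PARAMS * nbytes / 1e12
    print(f"{precision:>10}: {total_tb:.1f} TB of weights")
# Even at FP4 the weights alone are ~0.5 TB, far beyond a single GPU's
# memory, which is why a large coherent NVLink domain matters.
```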
To illustrate, consider Mixture of Experts (MoE) models. These models distribute the computational load across multiple experts and train across thousands of GPUs using model parallelism and pipeline parallelism, making the system more efficient.
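As an illustration of how expert routing divides work, the following is a minimal top-2 gated MoE layer in PyTorch. It is a sketch of the general technique only, not NVIDIA's or any particular model's implementation, and every name in it is invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-2 gated Mixture of Experts layer (illustrative only)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router scores tokens vs. experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Pick the top-k experts for each token.
        logits = self.gate(x)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the chosen experts

        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, so compute
        # (and, at scale, GPUs) is divided among experts rather than shared.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyMoELayer(d_model=64, d_hidden=256, num_experts=8)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

At production scale, the experts would be sharded across GPUs with expert parallelism, which is where the fast interconnect described in this post becomes the limiting factor.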
However, it takes a new level of parallel compute, high-speed memory, and high-performance communication to make this technical challenge tractable for GPU clusters. The NVIDIA GB200 NVL72 rack-scale architecture achieves this goal, as we detail in this post.