There’s no doubt that AMD’s graphics business kept the company afloat while its CPU business stunk. More than once I saw quarterly numbers showing that all of the profitability was coming from the GPU side of the business.
The split between Nvidia and AMD graphics cards is about 2:1 in Nvidia’s favor, according to Steam analytics. Nvidia just has tremendous momentum and hasn’t lost it. That momentum allowed the company to branch out into artificial intelligence (AI) so thoroughly that gaming has almost become secondary for the firm. Not that it is leaving gamers hanging; they just aren’t its top priority anymore.
With its CPU business on the upswing, AMD has finally decided to stop ceding the entire data center to Nvidia. This week it introduced two new GPUs built with data center HPC and AI workloads in mind.
The Radeon Instinct MI60 and MI50 are based on the company’s current Vega architecture and built on TSMC’s 7nm process. They are aimed not at gamers but at machine learning, high-performance computing (HPC), and rendering applications.
The MI60 will come with 32GB of ECC HBM2 (High-Bandwidth Memory), while the MI50 gets 16GB. Both cards offer memory bandwidth of up to 1 TB/sec, which is vital for AI and HPC applications.
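Where does a figure like 1 TB/sec come from? Peak memory bandwidth is just bus width times per-pin data rate. The sketch below assumes a configuration that is not stated in the article: four HBM2 stacks, each with a 1,024-bit interface, running at 2.0 Gbit/s per pin. The helper name is hypothetical.

```python
# Peak bandwidth = bus width (bits) x per-pin data rate (Gbit/s) / 8 bits per byte.
# Assumed configuration (not from the article): four HBM2 stacks,
# each with a 1,024-bit interface, at 2.0 Gbit/s per pin.
STACKS = 4
BITS_PER_STACK = 1024
GBIT_PER_PIN = 2.0


def peak_bandwidth_gb_s(stacks: int, bits_per_stack: int, gbit_per_pin: float) -> float:
    """Return peak memory bandwidth in GB/s (decimal gigabytes)."""
    bus_width_bits = stacks * bits_per_stack  # total memory bus width in bits
    return bus_width_bits * gbit_per_pin / 8  # Gbit/s -> GB/s


bw = peak_bandwidth_gb_s(STACKS, BITS_PER_STACK, GBIT_PER_PIN)
print(f"{bw:.0f} GB/s")  # 1024 GB/s, i.e. roughly 1 TB/sec
```

Under those assumptions the wide, stacked memory bus, rather than an exotic clock speed, is what gets these cards to the 1 TB/sec mark.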
The cards also support PCIe 4.0, which has twice the transfer rate of the more widespread PCIe 3.0, as well as direct GPU-to-GPU links over AMD’s Infinity Fabric, the same interconnect that links the CPU cores in the Ryzen and Epyc chips. This fabric offers up to 200 GB/sec of bandwidth among as many as four GPUs, roughly three times faster than PCIe 4.0.
The cards come with native support for virtualization, allowing one card to be securely shared between multiple virtual machines. This is an important feature, as virtually all of the public cloud operators offer GPU-accelerated virtual machines.
AMD claims that the MI60 will be the fastest double-precision accelerator, with performance of up to 7.4 TFLOPS, while the MI50 delivers up to 6.7 TFLOPS.
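Peak TFLOPS numbers like these are usually derived rather than measured: shader count times two operations per fused multiply-add times clock speed, scaled by the double-precision rate. The sketch below reproduces AMD's figures under assumptions that do not appear in the article: 4,096 stream processors at a 1.8 GHz peak clock for the MI60, 3,840 at 1.746 GHz for the MI50, and FP64 running at half the FP32 rate. Treat it as an illustration of the arithmetic, not an official spec sheet.

```python
# Assumed specs (not from the article): shader count and peak engine clock.
CARDS = {
    "MI60": {"stream_processors": 4096, "peak_clock_ghz": 1.800},
    "MI50": {"stream_processors": 3840, "peak_clock_ghz": 1.746},
}
FLOPS_PER_FMA = 2  # a fused multiply-add counts as two floating-point operations
FP64_RATE = 0.5    # assumed: double precision runs at half the single-precision rate


def peak_fp64_tflops(stream_processors: int, peak_clock_ghz: float) -> float:
    """Theoretical peak double-precision throughput in TFLOPS."""
    gflops = stream_processors * FLOPS_PER_FMA * peak_clock_ghz * FP64_RATE
    return gflops / 1000


for name, spec in CARDS.items():
    print(f"{name}: {peak_fp64_tflops(**spec):.1f} TFLOPS FP64")
# MI60: 7.4 TFLOPS FP64
# MI50: 6.7 TFLOPS FP64
```

These are theoretical ceilings; real HPC codes rarely sustain them, which is one more reason the memory bandwidth figures above matter.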
AMD plays catch-up with software
Along with the new GPUs, AMD announced a new version of its open-source ROCm runtime, which provides libraries and framework support for GPU-accelerated HPC workloads on Linux.
This is where AMD is really behind the eight ball vs. Nvidia. All the great hardware in the world is worthless without software, and Nvidia has a 10-year lead when it comes to software for HPC and AI. Nvidia released CUDA in 2007 to take advantage of the parallel nature of GPUs, building on work by some very, very smart Stanford researchers, and it has done a bang-up job of getting CUDA into universities around the world.
Thousands of developers know CUDA, and Nvidia is not slowing down its educational efforts. CUDA has pretty much become the de facto standard for GPU computing, which is why AMD is effectively shut out of the HPC and AI worlds.
Short of porting CUDA to Radeon (which isn’t going to happen), AMD is in a tough spot. Developers generally don’t like supporting a slew of languages, and if they already know CUDA, they are going to stick with it. AMD’s real challenge is getting developers to adopt ROCm over CUDA, and that will be a tough sell.