Footnotes are very important. They can reveal information that is vital to interpreting the metrics on display, and sometimes they reveal caveats hidden in plain sight. AMD recently launched the world’s first 7nm GPU, the Radeon Instinct MI60, and it is a milestone in the ongoing transformation of AMD’s professional GPU business. The specifications are great and the performance spectacular, but the efforts put in by the engineers might be overshadowed by something hidden in the footnotes: NVIDIA’s Tesla V100 GPU was gimped in the ResNet 50 benchmark AMD used for its comparison.
NVIDIA’s Tesla V100 ResNet 50 AI benchmark at AMD’s Next Horizons event was running at roughly one-third of peak performance because of FP32 mode
See, the company had claimed inference performance comparable to NVIDIA’s flagship Tesla V100 GPU. I remembered seeing ResNet 50 results before and could distinctly recall them being in the 1000s of images per second, so I looked through the footnotes and found the cause: the test was conducted in FP32 mode. The Tesla V100 has significantly more die space to work with (the GCN architecture is hard-limited to 4096 stream processors) and spends part of it on Tensor cores, which can accelerate inference and training performance severalfold. In fact, if you use Tensor mode, the performance of the V100 is roughly three and a half times that of the Radeon Instinct MI60.
I did not have an NVIDIA Tesla V100 lying around, so I reached out to NVIDIA and they quickly sent me the data for that particular benchmark running in Tensor mode (the usual advisory about not trusting first-party benchmarks applies here too, but in this case the result can be, and has been, replicated by third parties). According to AMD’s own testing, the Radeon Instinct MI60 yields about 334 images per second, while the NVIDIA Tesla V100 yields a maximum of 1189 images per second, roughly 3.6x the throughput. This figure is for the PCIe card, by the way: going to SXM2 results in an even wider gap.
That’s not all: NVIDIA’s Tesla T4 can yield 395 images per second in Tensor mode as well. NVIDIA had the following to say about the issue:
“The 70W Tesla T4 with Turing Tensor Cores delivers more training performance than 300W Radeon Instinct MI60. And Tesla V100 can deliver 3.7x more training performance using Tensor Cores and mixed precision (FP16 compute / FP32 accumulate), allowing faster time to solution while converging neural networks to required levels of accuracy.” – NVIDIA
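For anyone who wants to check the arithmetic, here is a minimal sketch that works out the ratios from the figures quoted in this article. The throughput numbers are the ones given above, and the board power values are the 70W and 300W mentioned in NVIDIA’s statement; no power figure for the V100 appears here, so it is left out of the per-watt comparison. The dictionary keys and helper names are just illustrative labels.

```python
# ResNet 50 throughput in images per second, as quoted in this article.
throughput = {
    "Radeon Instinct MI60 (FP32)": 334,
    "Tesla T4 (Tensor)": 395,
    "Tesla V100 PCIe (Tensor)": 1189,
}

# Board power in watts, taken from NVIDIA's statement.
# The article gives no figure for the V100, so it is omitted on purpose.
board_power_w = {
    "Radeon Instinct MI60 (FP32)": 300,
    "Tesla T4 (Tensor)": 70,
}

BASELINE = "Radeon Instinct MI60 (FP32)"


def speedup(name: str) -> float:
    """Throughput of `name` relative to the MI60 baseline."""
    return throughput[name] / throughput[BASELINE]


def images_per_watt(name: str) -> float:
    """Images per second per watt of board power."""
    return throughput[name] / board_power_w[name]


for name, rate in throughput.items():
    print(f"{name}: {rate} img/s, {speedup(name):.2f}x the MI60")

for name in board_power_w:
    print(f"{name}: {images_per_watt(name):.2f} img/s per watt")
```

With these inputs the V100 comes out at about 3.56x the MI60 and the T4 at about 1.18x. The per-watt comparison is only as good as the quoted power figures, which are rated board power and not measured draw during the benchmark.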
GPUs take a long time to design and develop, and it is clear that AMD got blindsided in the Tensor department. That said, while Tensor cores can and do speed up certain calculations, they do not help in every case, and FP32 is still a very important metric of performance. So yes, the MI60 has performance comparable to the Tesla V100, but only in FP32 mode. Overall training performance is vastly superior on the V100. If you are someone who uses Tensor cores to accelerate inference, then the T4 is going to be more of a competitor to the MI60 than the V100 is.
AMD’s point of view
Now, I reached out to AMD as well to give them a chance to reply and they had the following to say about it:
“Regarding the comparison – our footnotes for that slide clearly noted the modes so no issues there. Rationale is that FP32 training is used in most cases for FaceID to have 99.99%+ accuracy, for example in banking and other instances that require high levels of accuracy.” – AMD
I have to admit I am not familiar with FaceID and other mission-critical training sets, so I will not go into a detailed deconstruction of this statement. It is possible that the use of FP16 inputs makes a difference to the final result that I’m not aware of. I’m willing to give AMD the benefit of the doubt on this unless my better-informed peers prove otherwise. Even if AMD is right, though, the fact remains that this was an instance of cherry-picked benchmarks, which is somewhat of a disappointment coming from a company that usually holds the moral high ground in these things.
No one expects marketing material to be perfect, and that is something I am painfully aware of considering the recent spate of bad press that seems to plague the PC triumvirate. It is also worth noting that AMD’s statement does not seem to agree with what NVIDIA says. We know that Tensor cores are essentially mixed precision (FP16 multiply / FP32 accumulate), and NVIDIA claims you should be able to get to the “required level of accuracy” using those anyway.
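To make the “FP16 multiply / FP32 accumulate” idea a little more concrete, here is a small CPU-side sketch of why the precision of the accumulator matters. It is a simulation under stated assumptions and not a model of how Tensor cores are actually implemented: it uses Python’s standard struct module to round values to half and single precision, and the helper names (to_fp16, to_fp32, dot_fp16_inputs) are made up for this illustration. It also says nothing about whether a given network reaches a given accuracy, which is the actual point of disagreement between the two companies.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_inputs(a, b, round_acc):
    """Dot product with FP16 inputs.

    `round_acc` decides the accumulator precision: every product and every
    partial sum is rounded with it (to_fp16 for a pure FP16 pipeline,
    to_fp32 for a simulated FP16-multiply / FP32-accumulate pipeline).
    """
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)
        acc = round_acc(acc + round_acc(product))
    return acc


n = 4096
ones = [1.0] * n

fp16_result = dot_fp16_inputs(ones, ones, to_fp16)  # FP16 accumulator
fp32_result = dot_fp16_inputs(ones, ones, to_fp32)  # FP32 accumulator

print("exact answer:    ", float(n))
print("FP16 accumulate: ", fp16_result)
print("FP32 accumulate: ", fp32_result)
```

The FP16 accumulator stalls at 2048, because above that point half precision can no longer represent odd integers and adding 1 rounds back down, while the FP32 accumulator reaches the exact answer of 4096. That is the argument for keeping the accumulation in FP32 even when the inputs are FP16; whether that is enough for a particular workload is a separate question that a toy example like this cannot answer.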