Introduction
The second new NVIDIA RTX30 series card, the GeForce RTX3090, has been released.
The RTX3090 is loaded with 24GB of memory making it a good replacement for the RTX Titan… at significantly less cost! The performance for Machine Learning and Molecular Dynamics on the RTX3090 is quite good, as expected.
This post is a follow-on to last week's post on the RTX3080:
RTX3080 TensorFlow and NAMD Performance on Linux (Preliminary)
Testing with the RTX3090 went smoother than with the RTX3080, which had been uncomfortably rushed and problematic.
I was able to use my favorite container platform, NVIDIA Enroot. This is a wonderful user-space tool for running Docker (and other) containers in a user-owned "sandbox" environment. Last week I had some difficulties that were related to an incomplete installation of all driver components. Expect to see a series of posts soon introducing and describing usage of Enroot!
The HPCG (High Performance Conjugate Gradient) benchmark was added for this testing.
The RTX3090 showed the same failures as the RTX3080:
- TensorFlow 2 failed to run properly with a fatal error in BLAS calls
- My usual LSTM benchmark failed with mysterious memory allocation errors
- The ptxas assembler failed to run. This left PTX compilation to the driver, which caused slow start-up times for TensorFlow (a few minutes). See the output below:
2020-09-22 11:42:03.984823: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312]
Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'
Relying on driver to perform ptx compilation. This message will be only logged once.
The "sm_86" in that message refers to the compute capability, 8.6, of the GA102 chip. The Ampere GA100 chip has compute capability 8.0, i.e. sm_80.
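As a small illustration of that naming convention, the sketch below pulls the rejected architecture tag out of a ptxas message like the one quoted above and splits it into a compute capability. The two function names are hypothetical helpers, not part of TensorFlow or CUDA, and the regular expression assumes the exact message wording shown in this post.

```python
import re
from typing import Optional, Tuple


def arch_from_ptxas_error(message: str) -> Optional[str]:
    """Return the architecture tag that ptxas rejected, or None if absent.

    Assumes the wording seen in this post:
    "Value 'sm_86' is not defined for option 'gpu-name'".
    """
    match = re.search(r"Value '(sm_\d+)' is not defined for option 'gpu-name'", message)
    return match.group(1) if match else None


def sm_to_compute_capability(arch: str) -> Tuple[int, int]:
    """Split a tag such as 'sm_86' into (major, minor), here (8, 6).

    The last digit is treated as the minor version and everything
    before it as the major version.
    """
    match = re.fullmatch(r"sm_(\d+)(\d)", arch)
    if match is None:
        raise ValueError(f"unrecognized architecture tag: {arch!r}")
    return int(match.group(1)), int(match.group(2))


log_line = "ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'"
arch = arch_from_ptxas_error(log_line)
print(arch, sm_to_compute_capability(arch))  # prints: sm_86 (8, 6)
```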
I used containers from NVIDIA NGC for TensorFlow 1.15, NAMD 2.13 and CUDA for HPCG. All of these applications were built with CUDA 11.
The current CUDA 11.0 does not have full support for the GA102 chips used in the RTX 3090 and RTX3080 (sm_86).
The results in this post are not optimal for RTX30 series. These are preliminary results that will likely improve with an update to CUDA and the driver.
IMPORTANT NOTICE
Shortly after finishing this post I was notified that CUDA has been updated to version 11.1 with support for RTX30 GPUs (GA102 and GA104, i.e. compute capability 8.6, sm_86). This is great news! As soon as the needed containers on NVIDIA NGC are updated I will be doing a fresh performance analysis. … Including multi-GPU…
This post may be short lived, and that's a good thing!
Test system
Hardware
- Intel Xeon 3265W: 24-cores (4.4/3.4 GHz)
- Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
- Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
- NVIDIA RTX3090, RTX3080, RTX TITAN and RTX2080Ti
Software
- Ubuntu 20.04 Linux
- Enroot 3.3.1
- NVIDIA Driver Version: 455.23.04
- nvidia-container-toolkit 1.3.0-1
- NVIDIA NGC containers
- nvcr.io/nvidia/tensorflow:20.08-tf1-py3
- nvcr.io/hpc/namd:2.13-singlenode
- nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)
Test Jobs
- TensorFlow-1.15: ResNet50 v1, fp32 and fp16
- NAMD-2.13: apoa1, stmv
- HPCG (High Performance Conjugate Gradient) "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"
Example Command Lines
- docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.08-tf1-py3
- docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
- python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=96 --precision=fp32
- python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16
- namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
- OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80
Note: I listed Docker command lines above for reference. I actually ran the containers with Enroot.
Job run info
- The batch size used for TensorFlow 1.15 ResNet50 v1 was 96 at fp32 and 192 at fp16 for all GPUs except for the RTX3090 which used 192 for both fp32 and fp16 (using batch_size 384 gave worse results!)
- The HPCG benchmark used defaults with the problem dimensions 256x256x256
HPCG output for RTX3090,
1x1x1 process grid
256x256x256 local domain
SpMV = 132.1 GF ( 832.1 GB/s Effective) 132.1 GF_per ( 832.1 GB/s Effective)
SymGS = 162.5 GF (1254.3 GB/s Effective) 162.5 GF_per (1254.3 GB/s Effective)
total = 153.8 GF (1166.5 GB/s Effective) 153.8 GF_per (1166.5 GB/s Effective)
final = 145.9 GF (1106.4 GB/s Effective) 145.9 GF_per (1106.4 GB/s Effective)
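For anyone collecting these numbers across several runs, here is a minimal sketch of how the summary lines above could be parsed. The function name and regular expression are my own assumptions based only on the four lines shown here; other HPCG builds may format their output differently.

```python
import re
from typing import Dict, Tuple

# Summary lines copied from the RTX3090 run above.
HPCG_OUTPUT = """\
SpMV = 132.1 GF ( 832.1 GB/s Effective) 132.1 GF_per ( 832.1 GB/s Effective)
SymGS = 162.5 GF (1254.3 GB/s Effective) 162.5 GF_per (1254.3 GB/s Effective)
total = 153.8 GF (1166.5 GB/s Effective) 153.8 GF_per (1166.5 GB/s Effective)
final = 145.9 GF (1106.4 GB/s Effective) 145.9 GF_per (1106.4 GB/s Effective)
"""

# Matches "<name> = <GF> GF ( <GB/s> GB/s Effective)" at the start of a line.
LINE_RE = re.compile(
    r"^\s*(\w+)\s*=\s*([\d.]+)\s*GF\s*\(\s*([\d.]+)\s*GB/s Effective\)"
)


def parse_hpcg_summary(text: str) -> Dict[str, Tuple[float, float]]:
    """Map each kernel name to (GFLOPS, effective GB/s)."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            name, gflops, bandwidth = match.groups()
            results[name] = (float(gflops), float(bandwidth))
    return results


summary = parse_hpcg_summary(HPCG_OUTPUT)
print(summary["final"])  # prints: (145.9, 1106.4)
```

The "final" entry is the figure reported as the HPCG result for the RTX3090 in this post.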
Results
These results were run on the system, software and GPUs listed above.
Benchmark Job | RTX3090 | RTX3080 | RTX Titan | RTX 2080Ti |
---|---|---|---|---|
TensorFlow 1.15, ResNet50 FP32 | 561 images/sec | 462 images/sec | 373 images/sec | 343 images/sec |
TensorFlow 1.15, ResNet50 FP16 | 1163 images/sec | 1023 images/sec | 1082 images/sec | 932 images/sec |
NAMD 2.13, Apoa1 | 0.0264 day/ns (37.9 ns/day) | 0.0285 day/ns (35.1 ns/day) | 0.0306 day/ns (32.7 ns/day) | 0.0315 day/ns (31.7 ns/day) |
NAMD 2.13, STMV | 0.3398 day/ns (2.94 ns/day) | 0.3400 day/ns (2.94 ns/day) | 0.3496 day/ns (2.86 ns/day) | 0.3528 day/ns (2.83 ns/day) |
HPCG Benchmark 3.1 | 145.9 GFLOPS | 119.3 GFLOPS | Not run | 93.4 GFLOPS |
Note that the results using TensorFlow 1.15 are much improved for the older RTX20 series GPUs compared to past testing that I have done using earlier versions of the NGC TensorFlow 1.13 container. This is especially true for the fp16 results. I feel there is a possibility of significantly better results for RTX30 after they have become fully supported.
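To put the table in relative terms, the sketch below computes how far the RTX3090 is ahead of the RTX 2080Ti on each benchmark. The values are copied from the table; the dictionary layout and the `speedup` helper are hypothetical. NAMD is reported in day/ns, where lower is better, so that ratio is inverted.

```python
# (RTX3090 value, RTX 2080Ti value, lower_is_better), copied from the table above.
RESULTS = {
    "ResNet50 FP32 (images/sec)": (561, 343, False),
    "ResNet50 FP16 (images/sec)": (1163, 932, False),
    "NAMD Apoa1 (day/ns)": (0.0264, 0.0315, True),
    "NAMD STMV (day/ns)": (0.3398, 0.3528, True),
    "HPCG (GFLOPS)": (145.9, 93.4, False),
}


def speedup(new: float, old: float, lower_is_better: bool) -> float:
    """Return how many times faster `new` is than `old`."""
    return old / new if lower_is_better else new / old


SPEEDUPS = {name: round(speedup(*row), 2) for name, row in RESULTS.items()}

for name, ratio in SPEEDUPS.items():
    print(f"{name:28s} {ratio:.2f}x")
```

With these numbers the RTX3090 lead ranges from about 1.04x on STMV to about 1.64x on ResNet50 FP32.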
Performance Charts
Results from past GPU testing are not included since they are not strictly comparable because of improvements in CUDA and TensorFlow
TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.08-tf1-py3
The FP32 results show a good performance increase for the RTX30 GPUs, and I expect performance to improve when they are more fully supported.
I feel that the FP16 results should be much higher for the RTX30 GPUs, since this should be a strong point. I expect improvement with a CUDA update. The surprising result was how much better the RTX20 GPUs performed with CUDA 11 and TensorFlow 1.15. My older results with CUDA 10 and TensorFlow 1.13 were 653 img/s for the RTX Titan and 532 img/s for the 2080Ti!
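One way to see why the FP16 numbers look low for the RTX30 cards is to compare how much each GPU gains when moving from FP32 to FP16. This is a rough sketch using the ResNet50 numbers from the results table; the variable names are my own. Keep in mind that the RTX3090 used batch size 192 for both precisions while the other cards used 96 for FP32, so the ratios are not a perfectly like-for-like comparison.

```python
# (FP32 images/sec, FP16 images/sec) for ResNet50, copied from the results table.
IMAGES_PER_SEC = {
    "RTX3090": (561, 1163),
    "RTX3080": (462, 1023),
    "RTX Titan": (373, 1082),
    "RTX 2080Ti": (343, 932),
}

# Throughput gain from switching FP32 to FP16 on the same GPU.
FP16_GAIN = {
    gpu: round(fp16 / fp32, 2) for gpu, (fp32, fp16) in IMAGES_PER_SEC.items()
}

for gpu, gain in FP16_GAIN.items():
    print(f"{gpu:12s} FP16 is {gain:.2f}x FP32")
```

In these runs the RTX20 cards gain roughly 2.7x to 2.9x from FP16, while the RTX30 cards gain about 2.1x to 2.2x, which fits the expectation that the RTX30 FP16 results have room to improve.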
NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode
These Molecular Dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU. These GPUs are so high performance that even the excellent 24-core Xeon 3265W is probably not enough. I will do testing at a later time using AMD Threadripper platforms.
HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)
I did not have the HPCG benchmark set up when I had access to the RTX Titan. HPCG is an interesting benchmark as it is significantly memory bound. The high performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090 is nearly 10 times that performance!
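The "nearly 10 times" figure is just the ratio of the two HPCG results. Here is the same arithmetic for all three GPUs that were tested, as a quick sketch with the numbers taken from this post (the variable names are mine):

```python
CPU_GFLOPS = 14.8  # Xeon 3265W HPCG result quoted above

# GPU HPCG results from the results table (the RTX Titan was not run).
GPU_GFLOPS = {
    "RTX3090": 145.9,
    "RTX3080": 119.3,
    "RTX 2080Ti": 93.4,
}

GPU_OVER_CPU = {gpu: round(gf / CPU_GFLOPS, 2) for gpu, gf in GPU_GFLOPS.items()}

for gpu, ratio in GPU_OVER_CPU.items():
    print(f"{gpu:12s} {ratio:.2f}x the Xeon 3265W")
```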
Conclusions
The new RTX30 series GPUs look to be quite worthy successors to the already excellent RTX20 series GPUs. I am also expecting that the compute performance exhibited in this post will improve significantly after the new GPUs are fully supported with a CUDA and driver update.
I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make for very good and cost effective compute accelerators.
Happy computing! –dbk @dbkinghorn