RTX3070 (and RTX3090 refresh) TensorFlow and NAMD Performance on Linux (Preliminary)

Introduction

This post refreshes my earlier results to include "preliminary" findings for the new RTX3070 GPU. Results from the RTX3090 post are included, with a few of the jobs rerun.

RTX3090 TensorFlow and NAMD Performance on Linux (Preliminary)

My colleagues have had mostly good results on various Windows applications with the RTX3070, and I believe it is also a very good gaming card. My testing is concerned with compute performance (ML/AI and molecular modeling).

The RTX3070 has only 8GB of memory, making it less suitable for ML/AI and other computing work. However, at $500 I was hopeful that it would be a nice GPU for entry level compute tasks in a modest workstation build. Based on my testing so far, I would recommend saving up for an RTX3080 or RTX3090. (This recommendation may change after new drivers and CUDA updates are released.)

This round of testing had far fewer problems than the previous one. There are new drivers now, and the NVIDIA NGC containers I've been using have been updated.

I used my favorite container platform, NVIDIA Enroot. This is a wonderful user space tool for running docker (and other) containers in a user owned "sandbox" environment, and I plan to write about it soon.

There were no significant job run problems! The NGC containers tagged 20.10 for TF1 and TF2 are working correctly.

  • TensorFlow 2 is now running properly. NGC container tagged 20.10-tf2-py3 is working (but not tested in this post)
  • The ptxas assembler is running correctly.

I used the latest containers from NVIDIA NGC for TensorFlow 1.15.

Test system

Hardware

  • Intel Xeon 3265W: 24-cores (4.4/3.4 GHz)
  • Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
  • NVIDIA RTX3070 and RTX3090 (old results for RTX3080, TITAN and RTX2080Ti)

Software

  • Ubuntu 20.04 Linux
  • Enroot 3.3.1
  • NVIDIA Driver Version: 455.38
  • nvidia-container-toolkit 1.3.0-1
  • NVIDIA NGC containers
  • nvcr.io/nvidia/tensorflow:20.10-tf1-py3
  • nvcr.io/hpc/namd:2.13-singlenode
  • nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16
  • NAMD-2.13: apoa1, stmv
  • HPCG (High Performance Conjugate Gradient) "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.10-tf1-py3
  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=32 --precision=fp32
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=64 --precision=fp16
  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
  • OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

Note: I listed docker command lines above for reference. I actually ran the containers with Enroot.

Job run info

  • The batch size used for TensorFlow 1.15 ResNet50 v1 on the RTX3070 was 32 at fp32 and 64 at fp16. The RTX3090 used 192 for both fp32 and fp16.
  • The HPCG benchmark used problem dimensions 128x128x128 (reduced for the 8GB mem on the RTX3070)
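
The batch sizes above map directly onto the resnet.py command lines listed earlier. Here is a minimal sketch of that mapping, assuming a hypothetical helper `resnet_command` and a hypothetical `BATCH_SIZES` table (neither is part of the NGC container). It only assembles the argument list and does not launch anything.

```python
# Batch sizes taken from the job run notes; the dict layout is illustrative.
BATCH_SIZES = {
    "RTX3070": {"fp32": 32, "fp16": 64},
    "RTX3090": {"fp32": 192, "fp16": 192},
}


def resnet_command(gpu, precision):
    """Build the ResNet50 benchmark argument list (hypothetical helper)."""
    batch_size = BATCH_SIZES[gpu][precision]
    return [
        "python",
        "nvidia-examples/cnn/resnet.py",
        "--layers=50",
        f"--batch_size={batch_size}",
        f"--precision={precision}",
    ]


if __name__ == "__main__":
    # Print one command line per GPU and precision combination.
    for gpu in BATCH_SIZES:
        for precision in ("fp32", "fp16"):
            print(gpu, ":", " ".join(resnet_command(gpu, precision)))
```

The smaller RTX3070 batch sizes reflect its 8GB of memory, so the images/sec figures for the two cards were not measured at identical settings.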

HPCG output for RTX3070

1x1x1 process grid
128x128x128 local domain
SpMV  =   64.2 GF ( 404.3 GB/s Effective)   64.2 GF_per ( 404.3 GB/s Effective)
SymGS =   77.5 GF ( 598.2 GB/s Effective)   77.5 GF_per ( 598.2 GB/s Effective)
total =   73.3 GF ( 555.9 GB/s Effective)   73.3 GF_per ( 555.9 GB/s Effective)
final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( 548.7 GB/s Effective)

HPCG output for RTX3090

1x1x1 process grid
256x256x256 local domain
SpMV  =  132.1 GF ( 832.1 GB/s Effective)  132.1 GF_per ( 832.1 GB/s Effective)
SymGS =  162.5 GF (1254.3 GB/s Effective)  162.5 GF_per (1254.3 GB/s Effective)
total =  153.8 GF (1166.5 GB/s Effective)  153.8 GF_per (1166.5 GB/s Effective)
final =  145.9 GF (1106.4 GB/s Effective)  145.9 GF_per (1106.4 GB/s Effective)
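
If you want to pull the numbers out of summaries like the two above, a small parser is enough. This is a sketch only: the regular expression is my assumption, based on the line layout shown here, and `parse_hpcg` is a hypothetical helper, not part of the HPCG distribution.

```python
import re

# Matches summary lines such as:
#   final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( ... )
# Only the first GF / GB/s pair on each line is captured.
LINE_RE = re.compile(r"^(\w+)\s*=\s*([\d.]+) GF \(\s*([\d.]+) GB/s Effective\)")


def parse_hpcg(text):
    """Return {name: (gflops, effective_gb_per_s)} for each summary line."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, gflops, bandwidth = match.groups()
            results[name] = (float(gflops), float(bandwidth))
    return results


RTX3070_OUTPUT = """\
1x1x1 process grid
128x128x128 local domain
SpMV  =   64.2 GF ( 404.3 GB/s Effective)   64.2 GF_per ( 404.3 GB/s Effective)
SymGS =   77.5 GF ( 598.2 GB/s Effective)   77.5 GF_per ( 598.2 GB/s Effective)
total =   73.3 GF ( 555.9 GB/s Effective)   73.3 GF_per ( 555.9 GB/s Effective)
final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( 548.7 GB/s Effective)
"""

if __name__ == "__main__":
    for name, (gflops, bandwidth) in parse_hpcg(RTX3070_OUTPUT).items():
        print(f"{name:6s} {gflops:7.1f} GF {bandwidth:8.1f} GB/s")
```

Keep in mind that the two runs used different local domains (128x128x128 versus 256x256x256), so comparing their "final" values is not strictly like-for-like.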

Results

These results were run on the system, software and GPUs listed above.

| Benchmark Job | RTX3090 | RTX3080 (old) | RTX Titan (old) | RTX 2080Ti (old) | RTX3070 |
|---|---|---|---|---|---|
| TensorFlow 1.15, ResNet50 FP32 | 577 images/sec | 462 images/sec | 373 images/sec | 343 images/sec | 258 images/sec |
| TensorFlow 1.15, ResNet50 FP16 | 1311 images/sec | 1023 images/sec | 1082 images/sec | 932 images/sec | 254 images/sec |
| NAMD 2.13, Apoa1 (old) | 0.0264 day/ns (37.9 ns/day) | 0.0285 day/ns (35.1 ns/day) | 0.0306 day/ns (32.7 ns/day) | 0.0315 day/ns (31.7 ns/day) | 0.0352 day/ns (28.4 ns/day) |
| NAMD 2.13, STMV (old) | 0.3398 day/ns (2.94 ns/day) | 0.3400 day/ns (2.94 ns/day) | 0.3496 day/ns (2.86 ns/day) | 0.3528 day/ns (2.83 ns/day) | 0.355 day/ns (2.82 ns/day) |
| HPCG Benchmark 3.1 | 145.9 GFLOPS | 119.3 GFLOPS | Not run | 93.4 GFLOPS | 72.3 GFLOPS |

Note: (old) means that the results were not updated from those presented in the first RTX3090 performance post. The RTX3090 TensorFlow results are updated using the new driver and the updated NGC TF1 container. The HPCG and NAMD results for the RTX3090 are from the older post (I did recheck them, but there was little change).
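
NAMD reports day/ns (lower is better), and the ns/day figures in parentheses are simply the reciprocal. A minimal sketch of that conversion, using the values from the results above (the dictionary layout and the `ns_per_day` helper are just for illustration):

```python
def ns_per_day(days_per_ns):
    """Convert NAMD's day/ns figure into simulated nanoseconds per day."""
    return 1.0 / days_per_ns


# day/ns values copied from the results above.
NAMD_DAY_PER_NS = {
    "apoa1": {
        "RTX3090": 0.0264, "RTX3080": 0.0285, "RTX Titan": 0.0306,
        "RTX 2080Ti": 0.0315, "RTX3070": 0.0352,
    },
    "stmv": {
        "RTX3090": 0.3398, "RTX3080": 0.3400, "RTX Titan": 0.3496,
        "RTX 2080Ti": 0.3528, "RTX3070": 0.355,
    },
}

if __name__ == "__main__":
    for job, by_gpu in NAMD_DAY_PER_NS.items():
        for gpu, day_ns in by_gpu.items():
            print(f"{job:6s} {gpu:11s} {day_ns:.4f} day/ns = {ns_per_day(day_ns):.2f} ns/day")
```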

Performance Charts

Results from GPU testing prior to the release of the RTX3080 are not included in the charts, since they are not strictly comparable because of improvements in CUDA and TensorFlow for the RTX20 series GPUs.


TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.10-tf1-py3

TensorFlow ResNet50 FP32

The FP32 results for the RTX3070 show performance on par with what I would expect from an older RTX 2080 (not tested here).

TensorFlow ResNet50 FP16

The fp16/Tensor Core performance is very poor for the RTX3070. I suspect this is an issue with the early release driver. I will retest at a later time when I do a more complete GPU performance roundup.

Note that the fp16 performance for the RTX3090 is significantly improved with the new container build and driver update. The previous post for the RTX3090 had 1163 img/sec.
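
To put numbers on the fp16 observations, here is a short sketch that computes the fp16-over-fp32 ratio for each card from the results in this post. The `fp16_speedup` helper is only illustrative. Remember that batch sizes differed between cards, so treat the ratios as rough indicators.

```python
# images/sec from the TensorFlow 1.15 ResNet50 results in this post.
IMAGES_PER_SEC = {
    "RTX3090": {"fp32": 577, "fp16": 1311},
    "RTX3080": {"fp32": 462, "fp16": 1023},
    "RTX Titan": {"fp32": 373, "fp16": 1082},
    "RTX 2080Ti": {"fp32": 343, "fp16": 932},
    "RTX3070": {"fp32": 258, "fp16": 254},
}

# RTX3090 fp16 result reported in the previous post (older driver and container).
RTX3090_FP16_PREVIOUS = 1163


def fp16_speedup(gpu):
    """Ratio of fp16 to fp32 throughput for one GPU."""
    rates = IMAGES_PER_SEC[gpu]
    return rates["fp16"] / rates["fp32"]


if __name__ == "__main__":
    for gpu in IMAGES_PER_SEC:
        print(f"{gpu:11s} fp16/fp32 = {fp16_speedup(gpu):.2f}x")
    gain = IMAGES_PER_SEC["RTX3090"]["fp16"] / RTX3090_FP16_PREVIOUS
    print(f"RTX3090 fp16, new vs previous post: {gain:.2f}x")
```

The RTX3070 is the only card here whose fp16 result is not faster than its fp32 result, which is why I suspect a software issue rather than a hardware limit.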


NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode

NAMD Apoa1
NAMD STMV

These Molecular Dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU. These GPUs are so fast that even the excellent 24-core Xeon 3265W is probably not enough to keep them busy. I will do testing at a later time using AMD Threadripper and EPYC high core count platforms.
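
One rough way to see this is to compare how far apart the RTX3090 and RTX3070 are on each job. The sketch below uses the numbers from this post, and `speedup` is an illustrative helper. A small spread is consistent with a CPU bottleneck, but it does not prove one.

```python
def speedup(rtx3090, rtx3070, higher_is_better):
    """How many times faster the RTX3090 result is than the RTX3070 result."""
    if higher_is_better:
        return rtx3090 / rtx3070
    return rtx3070 / rtx3090


# (job, RTX3090 result, RTX3070 result, higher_is_better)
JOBS = [
    ("TensorFlow ResNet50 FP32 (images/sec)", 577, 258, True),
    ("NAMD apoa1 (day/ns)", 0.0264, 0.0352, False),
    ("NAMD stmv (day/ns)", 0.3398, 0.355, False),
]

if __name__ == "__main__":
    for job, fast, slow, higher in JOBS:
        print(f"{job:40s} RTX3090 is {speedup(fast, slow, higher):.2f}x the RTX3070")
```

On STMV the two cards differ by only a few percent, while on the ResNet50 FP32 job the RTX3090 is more than twice as fast.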


HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)

HPCG

HPCG is an interesting benchmark as it is significantly memory bound. The high performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090 delivers nearly 10 times that performance! The performance of the RTX3070 is limited by its memory size and lower memory bandwidth.
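
As a quick check on the memory-bound claim, the sketch below divides the "final" GFLOPS by the effective GB/s that HPCG reported for each GPU, and also compares each GPU with the CPU result. The helper names are illustrative, and the FLOP per byte figure reflects HPCG's own accounting of data traffic rather than a measured hardware counter.

```python
CPU_GFLOPS = 14.8  # Xeon 3265W HPCG result quoted in this post

# (GFLOPS, effective GB/s) from the "final" line of each HPCG run.
HPCG_FINAL = {
    "RTX3070": (72.3, 548.7),
    "RTX3090": (145.9, 1106.4),
}


def flops_per_byte(gflops, gb_per_s):
    """GFLOPS divided by GB/s: FLOPs per byte of effective memory traffic."""
    return gflops / gb_per_s


def speedup_over_cpu(gflops):
    """GPU HPCG result relative to the CPU result."""
    return gflops / CPU_GFLOPS


if __name__ == "__main__":
    for gpu, (gflops, gb_per_s) in HPCG_FINAL.items():
        print(
            f"{gpu}: {flops_per_byte(gflops, gb_per_s):.3f} FLOP/byte, "
            f"{speedup_over_cpu(gflops):.1f}x the CPU"
        )
```

Both cards come out at roughly 0.13 FLOP per byte, so the reported GFLOPS scale almost exactly with effective memory bandwidth.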

Conclusions

In this round of testing the new RTX3070 GPU did not deliver compelling compute performance. That may change when better driver and CUDA updates for its GA104 chip are available. For now, for compute work I would recommend the RTX3080 or, better, the RTX3090 rather than the RTX3070.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). There is also no P2P support on the PCIe bus. However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make for very good and cost effective compute accelerators.

Happy computing! –dbk @dbkinghorn

