Introduction
The NVIDIA A100 (Compute) GPU is an extraordinary computing device. It's not just for ML/AI types of workloads. General scientific computing tasks requiring high performance numerical linear algebra run exceptionally well on the A100. The NVIDIA RTX 30xx (GeForce) and "Quadro" RTX Ax000 (Professional) GPUs are also good for numerical computing tasks that don't require double precision floating point (FP64) for accuracy. However, the A100 excels at these workloads too, in addition to making traditional high precision numerical computing tasks viable with GPU compute acceleration.
- The A100 provides exceptionally good double precision numerical computing performance.
- Its lower precision performance (FP32, FP16) is also excellent and includes 32-bit Tensor-Cores (TF32), which when used with mixed precision can provide a large performance boost and still provide acceptable accuracy for many applications, especially for ML/AI model training.
- The memory performance of the A100 is also a big plus and can provide 5 times the performance of even the best dual socket CPU systems for memory-bound applications. The A100 GPU comes with 40 or 80 GB of memory!
I have run three "standard" HPC benchmarks that illustrate these remarkable performance characteristics of the A100.
The benchmarks: HPL, HPL-AI, HPCG
- HPL: The HPL Linpack benchmark is used to rank the Top500 supercomputers and is an optimized measure of double precision floating point performance from matrix operations. The benchmark finds a solution to large dense sets of linear equations.
- HPL-AI: The Mixed Precision Benchmark is the same HPL benchmark but using lower/mixed precision that would more typically be used for training ML/AI models. On the A100 this utilizes TF32, 32-bit Tensor-Cores. This benchmark is now also part of the Top500 supercomputer rankings.
- HPCG: High Performance Conjugate Gradients is another benchmark used for ranking on the Top500 list. It is a multigrid preconditioned conjugate gradient algorithm, with sparse matrix-vector multiplication and global IO patterns. It is a workload typical of many problems involving numerical solutions of sets of differential equations. This is very much memory/IO-bound!
These 3 benchmarks provide a good measure of the numerical computing performance of a computer system. These are the benchmarks used to rank the largest supercomputer clusters in the world. Of course, I'm running them on a single server or workstation. Still, having "grown up" with supercomputers, I'm always impressed by the performance of a modern single-node system. The 4 x A100 system I've tested provides more computing performance than the first multi-million-dollar Top500 supercomputer deployment I was involved with!
Keep in mind these are "Benchmarks"! I made an effort to find (large) problem sizes and good parameters that would showcase the hardware. Measured GPU performance is particularly sensitive to problem size (larger is generally better). For the GPUs I have used NVIDIA's optimized "NVIDIA HPC-Benchmarks 21.4" container from NGC. That is their supercomputer benchmark set!
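As a rough illustration of why problem size and GPU memory go together: HPL factors a dense N x N matrix of 8-byte FP64 values, so the matrix alone needs 8*N^2 bytes. The sketch below estimates the largest N that fits a given memory budget, rounded down to a multiple of the block size NB. The helper name hpl_max_n, the 80% memory fraction, and treating a "40GB" card as 40 GiB are assumptions for illustration only, not settings taken from the NVIDIA container.

```shell
#!/bin/sh
# hpl_max_n MEM_GIB_PER_GPU NUM_GPUS FRACTION NB
# Hypothetical helper: estimates the largest HPL problem size N whose
# FP64 matrix (8 * N^2 bytes) fits in FRACTION of the total GPU memory.
hpl_max_n() {
  awk -v mem="$1" -v gpus="$2" -v frac="$3" -v nb="$4" 'BEGIN {
    bytes = mem * 1024 * 1024 * 1024 * gpus * frac  # memory budget for the matrix
    n = int(sqrt(bytes / 8))                        # 8 bytes per FP64 element
    printf "%d\n", int(n / nb) * nb                 # round down to a multiple of NB
  }'
}

# One 40 GiB GPU, 80% of memory for the matrix (assumed), NB = 288
hpl_max_n 40 1 0.8 288
```

With these assumptions the script prints 65376 for a single card, which is in the same neighborhood as the N = 64000 used for the 1 x A100 run in the results. Treat it as a starting point for experimenting, not as a tuned value.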
Test Systems
NVIDIA A100 system
- CPU – 2 x Intel Xeon Platinum 8180 28-core
- Motherboard – Tyan Thunder HX GA88-B5631 Rack Server
- Memory – 12 x 32GB Reg ECC DDR4 (384GB total)
- GPU – 1-4 NVIDIA A100 PCIe 40GB 250W
NVIDIA Titan-V system
- CPU – Intel Xeon W-2295 18 Core
- Motherboard – Asus WS C422 PRO_SE
- Memory – Kingston 128GB DDR4-2400 (8x16GB) [My personal system]
- GPU – 1-2 NVIDIA Titan-V PCIe 12GB
The other machines are from older CPU HPC benchmark posts (HPL and HPCG). See, for example, the recent Intel Rocket Lake post or the AMD Threadripper Pro post for references.
Software
- Ubuntu 20.04
- NVIDIA driver 460
- NVIDIA HPC-Benchmarks 21.4 (NGC containers)
- NVIDIA Enroot 3.3 (for running containers)
Results
Here's the good stuff!
This chart is pretty telling! This is the same HPL Linpack problem run on the CPUs and GPUs. 4 x A100 outperforms the best dual CPU system I've ever tested by a factor of 14. (That dual Xeon is the new Intel Ice Lake, which is very good!) There are arguments for using CPUs, but the A100 is very compelling!
Note: The important HPL.dat parameters I used for the GPU runs are listed in the output lines below.
4 x A100
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03L2L2 144000 288 2 2 48.26 4.122e+04
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0000115 ...... PASSED
================================================================================
2 x A100
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03L2L2 100000 288 2 1 29.17 2.285e+04
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0042121 ...... PASSED
================================================================================
1 x A100
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03L2L2 64000 288 1 1 15.99 1.094e+04
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0038464 ...... PASSED
================================================================================
2 x Titan V
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03L2L2 56000 288 2 1 13.00 9.008e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0041254 ...... PASSED
================================================================================
1 x Titan V
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03L2L2 36000 288 1 1 5.59 5.567e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0040158 ...... PASSED
================================================================================
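The Gflops column in these outputs can be sanity-checked from N and Time using the standard HPL operation count, 2/3 N^3 + 3/2 N^2. The sketch below does that arithmetic and also estimates how much of each GPU's memory the FP64 matrix occupies. The function names are hypothetical, and the per-GPU figure assumes the matrix is spread evenly across the GPUs, which I have not verified for NVIDIA's implementation.

```shell
#!/bin/sh
# hpl_gflops N TIME_SECONDS
# Gflops implied by the standard HPL operation count (2/3 N^3 + 3/2 N^2).
hpl_gflops() {
  awk -v n="$1" -v t="$2" 'BEGIN {
    printf "%.1f\n", ((2.0 / 3.0) * n * n * n + 1.5 * n * n) / t / 1e9
  }'
}

# hpl_matrix_gib N NUM_GPUS
# GiB of FP64 matrix per GPU, assuming an even split (an assumption).
hpl_matrix_gib() {
  awk -v n="$1" -v g="$2" 'BEGIN {
    printf "%.1f\n", 8 * n * n / g / (1024 * 1024 * 1024)
  }'
}

# N and Time values are taken from the output lines above
echo "4 x A100: $(hpl_gflops 144000 48.26) Gflops, $(hpl_matrix_gib 144000 4) GiB per GPU"
echo "1 x A100: $(hpl_gflops 64000 15.99) Gflops, $(hpl_matrix_gib 64000 1) GiB per GPU"
```

The computed values should land close to the reported Gflops column. The printed Time is rounded, so an exact match is not expected.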
HPL-AI
HPCG
NGC HPC Benchmark container setup with enroot
For the sake of repeatability and verification the setup and job run details are given here.
Setting up, running, and optimizing these benchmarks is not a simple task. I would not have considered doing this "from scratch". Fortunately, NVIDIA has made the container setup they use for benchmarking large GPU cluster deployments available on NGC (NVIDIA GPU Cloud). The container used in this post is NVIDIA HPC-Benchmarks 21.4.
This container cannot be "pulled" without an NGC (NVIDIA Developer) account.
Creating the container sandboxes with enroot
Please see my recent post Run "Docker" Containers with NVIDIA Enroot for an introduction and installation instructions for Enroot.
Import the container from NGC
enroot import 'docker://$oauthtoken@nvcr.io#nvidia/hpc-benchmarks:20.10-hpl'
Literally use "$oauthtoken" as the user name. After hitting return you will be asked for a password. That is where you paste in the token you saved earlier.
Do the same for the HPCG container,
enroot import 'docker://$oauthtoken@nvcr.io#nvidia/hpc-benchmarks:20.10-hpcg'
Create the container sandboxes
enroot create --name nv-hpl-bench nvidia+hpc-benchmarks+20.10-hpl.sqsh
enroot create --name nv-hpcg-bench nvidia+hpc-benchmarks+20.10-hpcg.sqsh
Start your container sandbox
Before you start your containers you will need to let them know that you do not have any Mellanox InfiniBand networking devices (these really are supercomputer benchmarks).
export MELLANOX_VISIBLE_DEVICES="none"
Start your "container sandbox instance"
enroot start nv-hpl-bench
You would start the container sandbox for HPCG with a similar command but using the name you used when creating the sandbox (nv-hpcg-bench).
Command-lines for Running the Benchmarks
These benchmarks run with MPI. It took some experimenting to get the mpirun command-line options working correctly for these containers. (you're welcome)
In both containers the startup scripts, .dat files, and executables are in /working. If you set up enroot to mount your home directory (as in my post mentioned above) then you can copy that directory to your account home.
Here are examples for all three benchmarks using 2 x A100 (or Titan V):
#HPL
mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 2 hpl.sh --dat ./HPL.dat --cpu-affinity 0:0 --cpu-cores-per-rank 4 --gpu-affinity 0:1
#HPL-AI
mpirun --mca btl smcuda,self -x MELLANOX_VISIBLE_DEVICES="none" -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 2 hpl.sh --xhpl-ai --dat ./HPL.dat --cpu-affinity 0:0 --cpu-cores-per-rank 4 --gpu-affinity 0:1
#HPCG
mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 2 hpcg.sh --dat ./hpcg.dat --cpu-affinity 0:0 --cpu-cores-per-rank 4 --gpu-affinity 0:1
--cpu-affinity is a colon-separated list of CPU locations, one for each MPI rank. The 2U 4 x A100 system I used had the PCIe root hub on CPU 0 for all 4 GPU slots, so a 4 x GPU run would use --cpu-affinity 0:0:0:0. If the PCIe root hub had been dual, the GPUs would be split across both CPUs and --cpu-affinity 0:0:1:1 would have been appropriate.
--gpu-affinity is the corresponding list of GPUs to use, i.e. for all 4 A100s it was --gpu-affinity 0:1:2:3
The "mapping" can be complex (on a large cluster) but the above provides a simple working example for a single node.
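To make the single-node pattern concrete, here is a small sketch that builds the two affinity lists for a given GPU count and prints the matching HPL command line. The helper names (join_repeat, join_range, hpl_cmd) are made up for this example. It assumes all GPUs hang off CPU 0, as on my test system, and it only prints the command rather than running it. The rank count given to -np should also agree with P x Q in HPL.dat (2 x 2 for the 4 GPU run above).

```shell
#!/bin/sh
# join_repeat VALUE COUNT -> VALUE repeated COUNT times, colon-separated
join_repeat() {
  out=""
  i=0
  while [ "$i" -lt "$2" ]; do
    out="${out:+$out:}$1"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# join_range COUNT -> 0:1:...:COUNT-1
join_range() {
  out=""
  i=0
  while [ "$i" -lt "$1" ]; do
    out="${out:+$out:}$i"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# hpl_cmd NGPUS -> prints (does not run) the mpirun line for NGPUS ranks,
# assuming every GPU is attached to CPU 0
hpl_cmd() {
  ngpus=$1
  printf 'mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np %s hpl.sh --dat ./HPL.dat --cpu-affinity %s --cpu-cores-per-rank 4 --gpu-affinity %s\n' \
    "$ngpus" "$(join_repeat 0 "$ngpus")" "$(join_range "$ngpus")"
}

hpl_cmd 4
```

For a system with GPUs split across two CPUs, the --cpu-affinity list would need to be written out by hand (for example 0:0:1:1) instead of using join_repeat.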
What About the RTX3090?
Of course, you may be curious about something like the RTX3090. The benchmarks ran the same way. However, the single RTX3090 failed the same way too (I did not have 2 to test with). The benchmark where it might have given good relative performance, HPL-AI, failed to pass the residual error check, same as a single A100. The RTX3090 is an excellent GPU for ML/AI model training with FP32 and Tensor Cores. My only hesitations with recommending the 3090 for single precision workloads are the high power requirement (350W) and the large size of most of the available cards, which limits multi-GPU designs. Thankfully, we are seeing more conventional blower fan designs again, and power limits can be set in software, but the cards are still in short supply.
The HPL double precision benchmark ran but, of course, the 3090 is using the GA102 GPU not the compute powerhouse GA100 so the results were over 20 times slower than a single A100.
The 3090 does have very good memory performance and it ran the same HPCG benchmark at about 60% of the performance of the A100.
Conclusion
The A100 is an incredible computing device! I've said that before about NVIDIA GPUs and will probably continue to do so. They just keep making extraordinary performance improvements. Especially on the high-end compute line.
I have been an advocate for using GPU acceleration for scientific programming since it was first possible to do so. It keeps becoming easier and easier to code for GPUs. NVIDIA's developer ecosystem is second to none. GPU acceleration has become so important that building a machine learning framework without GPU support seems unreasonable. And machine learning frameworks are becoming very good general purpose scientific computing systems, often with very little effort needed to run on GPU. There have been other compute accelerator hardware projects in the past, and more are in the works, but nothing has succeeded like NVIDIA GPUs. In fact, essentially all other accelerators have failed to reach significant acceptance and longevity. I strongly encourage you to try GPU acceleration in your new projects and ports of older work.
Happy computing! –dbk @dbkinghorn