Table of Contents
- Hardware
- NAMD
- Running the Million-Atom STMV (Satellite Tobacco Mosaic Virus) Benchmark
- Results
I have been doing validation and performance testing on a very nice dual Xeon Scalable system that supports up to 8 GPUs. It's been a very impressive system! This post will look at the molecular dynamics program NAMD. NAMD has good GPU acceleration but is heavily dependent on CPU performance as well, and it achieves its best performance when there is a proper balance between CPU and GPU. The system under test has 2 Xeon 8180 28-core CPUs, the current top-of-the-line Intel processor. We'll see how many GPUs we can add to those Xeon 8180 CPUs to get an optimal CPU/GPU compute balance with NAMD.
I recently wrote about TensorFlow performance on this system, which shows off its GPU capability: TensorFlow Scaling on 8 1080Ti GPUs – Billion Words Benchmark with LSTM on a Docker Workstation Configuration. For that testing the Xeon 8180s were far more capable CPUs than the workload needed. Here I'll focus more on CPU performance (but still include GPUs for some of the testing).
Hardware
The relevant components of the system under test were as follows,
- Motherboard: TYAN S7109GM2NR-2T [dual root complex with 4 PLX PEX8747 PCIe switches] in chassis B7109F77DV14HR-2T-N
- CPUs: 2 x Intel Xeon Scalable Platinum 8180 @ 2.50GHz, 28-core
- Memory: 768GB DDR4 REG ECC 12 x 64GB 2666MHz
- GPUs: 8 x NVIDIA 1080Ti
Why 1080Ti's? That's what I had 8 of! … besides, they are great GPUs for code like NAMD that is optimized for single-precision GPU acceleration.
NAMD
NAMD is a molecular dynamics program developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. The group at UIUC working on NAMD was an early pioneer of using GPUs for compute acceleration. NAMD has very good GPU acceleration but still has a large dependence on CPU performance; I consider it "CPU bound".
Two NAMD versions used for testing
I used two NAMD builds for this testing.
- For the GPU acceleration and scaling testing I used the GPU-optimized version 2.12 Docker image from the NVIDIA NGC registry (information below).
- For the CPU scaling testing I used the CPU binary build version 2.12 from the NAMD site at UIUC.
There is a detailed description of how to set up an Ubuntu 16.04 workstation for use with Docker and NVIDIA NGC in the following posts,
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 1 Introduction and Base System Setup
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 2 Docker and NVIDIA-Docker-v2
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 3 Setup User-Namespaces
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 4 Accessing the NGC Registry
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning
Running the Million-Atom STMV (Satellite Tobacco Mosaic Virus) Benchmark
GPU Job Runs
The post linked above, "Part 4 Accessing the NGC Registry", has information about accessing the NGC registry. For this testing I used a container instance from the HPC directory on NGC, nvcr.io/hpc/namd:2.12-171025. That container instance contains a directory with NAMD builds for multicore + CUDA in 3 varieties: "standard", "memory optimized", and a build with InfiniBand support. There is an "examples" directory with the needed files for the apoa1 and stmv jobs. apoa1 is way too small for testing on this machine, but the stmv job is a good standard benchmark. The example configuration for stmv in this container image is set up to use the memory-optimized runtime. I tested that and the performance was not as good as with the "standard" CUDA build.
I used the stmv input file that I have used in all of my past NAMD testing posts. I have it configured for 500 time steps and I use the last reported "day/ns" value as my benchmark. This input file and its supporting files are available from the NAMD Utilities page.
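If you don't already have the image locally, the login and pull steps look something like the following (the NGC registry uses the literal username $oauthtoken with your NGC API key as the password; see the "Part 4" post above for details),

docker login nvcr.io
docker pull nvcr.io/hpc/namd:2.12-171025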
After the docker login to the NGC registry, the NAMD container is started with the following command,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/hpc/namd:2.12-171025
The stmv job directory I use is in “projects” on the host so I bind that directory into the container.
From the “projects” directory, where I have the stmv input files, the command to run the job is,
/opt/namd/namd-multicore +p56 +setcpuaffinity +idlepoll +devices 0,1,2,3,4,5,6,7 stmv.namd
- The +devices flag is followed by the list of GPUs to be used. In that example it would be all 8 GPUs. I varied that from 1 to 8 GPUs for the testing results (see the sketch below).
- +p56 was used for the GPU testing to provide all 56 CPU cores.
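As a minimal sketch, a sweep over GPU counts could be scripted from inside the container like this (the grep pattern assumes the NAMD timing lines contain "days/ns"; adjust it to match the actual output, and the log file names are just illustrative),

# run the stmv benchmark on 1, 2, 4, 6 and 8 GPUs and keep the last timing line of each run
for gpus in 0 0,1 0,1,2,3 0,1,2,3,4,5 0,1,2,3,4,5,6,7; do
    /opt/namd/namd-multicore +p56 +setcpuaffinity +idlepoll +devices $gpus stmv.namd > stmv_gpus_${gpus//,/-}.log
    grep "days/ns" stmv_gpus_${gpus//,/-}.log | tail -1
done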
CPU Job Runs
For the CPU performance scaling testing, the non-CUDA multicore NAMD build discussed in the NAMD versions section above was used directly on the host system, i.e. not in Docker.
With the way my directory layout was configured, the job run command line was,
../../NAMD_2.12_Linux-x86_64-multicore/namd2 +p56 +setcpuaffinity +idlepoll stmv.namd
- The +p flag was varied from 1 to 56 cores (see the sketch below).
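I did those runs by hand, but as a sketch, a simple loop like the following could automate the sweep (again, the grep pattern and log file names are just illustrative),

# sweep the CPU-only stmv benchmark over the core counts used in the table below
for p in 1 2 4 8 16 24 32 40 48 56; do
    ../../NAMD_2.12_Linux-x86_64-multicore/namd2 +p$p +setcpuaffinity +idlepoll stmv.namd > stmv_p${p}.log
    grep "days/ns" stmv_p${p}.log | tail -1
done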
Note: In most cases it really doesn't make sense to run NAMD on CPU only, because there is a very good performance boost from adding GPUs. However, I wanted to look at CPU performance on these high-end Xeon 8180 CPUs, and NAMD scales very well on CPU (and multi-node).
Results
The NAMD performance on this system is better than on any system I have tested previously, by a factor greater than 2. That includes previous-generation quad-socket high-end Xeon systems with multiple GPUs.
For the performance measure of the job run results in the tables and plots I am using nanoseconds per day (ns/day) rather than days per nanosecond (day/ns), which is what is reported in the NAMD job output.
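The conversion is just the reciprocal. For example, a reported value of 0.17 day/ns corresponds to 1/0.17 ≈ 5.9 ns/day, which is roughly the 8-GPU result in the table below.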
GPU-Accelerated Results
NAMD STMV Benchmark on Dual Xeon 8180 (56 total cores) and 1-8 1080Ti GPUs
Number of GPUs | Simulation Nanoseconds Per Day (ns/day) | Performance Increase | % Efficiency |
---|---|---|---|
1 | 2.288 | 1 | 100% |
2 | 4.032 | 1.76 | 88.1% |
4 | 4.975 | 2.17 | 54.4% |
6 | 5.291 | 2.31 | 38.5% |
8 | 5.882 | 2.57 | 32.1% |
The multi-GPU scaling in this table is not surprising, since this code is CPU bound when that many GPUs are used. What is surprising is that nearly 6 nanoseconds of dynamics simulation for this million-atom system can be achieved in 1 day on a single node. That is very good! The plot below better illustrates the scaling.
Fitting the data to an Amdahl's Law curve gives a parallel fraction of P = 0.70. [That means the maximum speedup achievable is unlikely to exceed 1/(1-P) = 3.3 with any number of GPUs in the system.] More CPU performance to balance the GPUs would likely be needed to get better overall scaling.
Here's the expression of Amdahl's Law that was used for the regression fit of the data,
performance_(ns/day) = 2.288_(ns/day) / ((1 - P) + (P / num_GPUs))
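As a quick sanity check of that fit, plugging in the 8-GPU case with P = 0.70 gives 2.288 / ((1 - 0.70) + (0.70/8)) = 2.288 / 0.3875 ≈ 5.9 ns/day, close to the measured 5.882 ns/day. The limiting value as the GPU count grows is 2.288 / (1 - 0.70) ≈ 7.6 ns/day, i.e. the 3.3x maximum speedup mentioned above.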
Below is a plot of that curve,
CPU-Only Results
Part of the purpose of this testing is to look at the scaling performance of these high-end Xeon 8180 CPUs. NAMD actually scales nearly perfectly in parallel. However, the Intel Scalable CPUs reduce the core clock as the number of in-use cores increases. (And there are separate clocks for core-only, AVX2 and the new AVX512 vector units!) On the Xeon 8180 CPUs the AVX512 clock drops from 3.5GHz for 1-2 cores down to 2.3GHz for 25-28 cores. It's that clock reduction that reduces the apparent parallel scaling with NAMD. Still, it is interesting to see the "real" performance scaling on these (amazingly good) processors. I added the AVX512 core clock speed in the last column of the table.
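As a side note, if you want to watch this clock behavior during a run, something like the following will show the per-core frequencies the kernel reports,

# run in a second terminal while the benchmark is running
watch -n 1 'grep "cpu MHz" /proc/cpuinfo | sort -t: -k2 -rn | head -8'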
NAMD STMV Benchmark on Dual Xeon 8180 with 1 to 56 total cores
Number of CPU Cores | Simulation Nanoseconds Per Day (ns/day) | Performance Increase | % Efficiency | AVX512 Clock (GHz) |
---|---|---|---|---|
1 | 0.0141 | 1 | 100% | 3.5 |
2 | 0.0275 | 1.95 | 97.6% | 3.5 |
4 | 0.0520 | 3.69 | 92.3% | 3.5 |
8 | 0.102 | 7.20 | 90.1% | 3.3 |
16 | 0.197 | 14.0 | 87.6% | 3.2 |
24 | 0.263 | 18.7 | 77.9% | 3.1 |
32 | 0.341 | 24.2 | 75.6% | 2.8 |
40 | 0.448 | 31.8 | 79.5% | 2.6 |
48 | 0.507 | 36.0 | 75.0% | 2.4 |
56 | 0.595 | 42.2 | 75.4% | 2.3 |
A regression plot using a fit to Amdahl's Law, done the same way as for the GPU scaling case, gives the following,
I hope you found this interesting. This is definitely the highest performance single node system I’ve ever tested!
Happy computing –dbk