We’ve got a quick test of CUDA performance on three generations of NVIDIA’s Titan GPUs for you. NVIDIA released the Pascal GeForce Titan X much earlier than people were expecting (including NVIDIA :-). Just like the older Titan cards, the new Pascal-based card did not disappoint! For comparison we have a GTX 1080 in the mix too.
This is just a brief look at performance using the nbody code from the CUDA samples and a Molecular Dynamics simulation (stmv) using NAMD.
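The short story: the Titan X (Pascal) is an amazing video card and CUDA compute performance is stunning! It reached 7507 GFLOP/s single precision in nbody, about 1.75 times the result for the Maxwell TITAN X.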
The setup for this testing is Ubuntu 16.04 with CUDA 8.0 RC (as of the date of this writing you need to be an NVIDIA registered developer to access the release candidate). The NVIDIA display driver is 367.35 from the “Graphics-Drivers ppa”.
Base systems
- The Peak Tower Single
  - CPU: Intel Core-i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
  - Memory: 64 GB DDR4 2133MHz Reg ECC
  - PCIe: (4) X16-X16 v3
- The Peak Tower Quad
  - CPU: (4) Intel Xeon E7 8867v3 16-core @ 2.5GHz (2.7GHz All-Core-Turbo)
  - Memory: 512 GB DDR4 2133MHz Reg ECC
  - PCIe: (4) X16-X16 v3
Note: The Quad was used for NAMD for two reasons: 1) I had it on the bench and 2) the GPUs are so fast that I wanted lots of CPU performance so we could see the difference between the GPUs! (I’m also using some older v3 CPUs; we use E5/E7 v4 in our current Quad builds.)
Video cards used for testing (data from nvidia-smi)
| Card | CUDA cores | GPU clock MHz | Memory clock MHz* | Application clock MHz** | FB Memory MiB |
|---|---|---|---|---|---|
| Titan X (Pascal) | 3584 | 1911 | 5005 | 1417 | 12186 |
| TITAN X (Maxwell) | 3072 | 1392 | 3505 | 1000 | 12206 |
| Titan Black | 2880 | 1202 | 3500 | 888 | 6082 |
| GTX 1080 | 2560 | 1911 | 5005 | 1607 | 8113 |
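As a rough sanity check on the table above, theoretical peak single-precision throughput can be estimated as 2 FLOPs (one fused multiply-add) per CUDA core per clock. The sketch below applies that to both clock columns. Which clock a card actually sustains under load is an assumption here, so treat the numbers as bounds rather than measurements. The function and variable names are made up for illustration.

```python
# Estimate theoretical peak FP32 throughput from the nvidia-smi table.
# Assumption: each CUDA core retires one fused multiply-add (2 FLOPs) per clock.

# name -> (CUDA cores, GPU clock MHz, application clock MHz), copied from the table
CARDS = {
    "Titan X (Pascal)": (3584, 1911, 1417),
    "TITAN X (Maxwell)": (3072, 1392, 1000),
    "Titan Black": (2880, 1202, 888),
    "GTX 1080": (2560, 1911, 1607),
}


def peak_gflops(cores, clock_mhz):
    """Peak single-precision GFLOP/s for a given core count and clock in MHz."""
    return 2 * cores * clock_mhz / 1000.0


for name, (cores, gpu_mhz, app_mhz) in CARDS.items():
    at_app = peak_gflops(cores, app_mhz)
    at_max = peak_gflops(cores, gpu_mhz)
    print(f"{name:18s} {at_app:8.0f} GFLOP/s at application clock, "
          f"{at_max:8.0f} GFLOP/s at max GPU clock")
```

The real sustained clock sits somewhere between the two columns (or below them if the card throttles), which is why a measured benchmark number is still needed.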
Results
| Card | nbody single precision GFLOP/s | NAMD run time (sec) | NAMD day/ns |
|---|---|---|---|
| Titan X (Pascal) | 7507 | 41 | 0.570 |
| TITAN X (Maxwell) | 4292 | 55 | 0.889 |
| Titan Black | 2302 | 81 | 1.460 |
| GTX 1080 | 5429 | 48 | 0.709 |
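To make the generational differences easier to read, the results table can be turned into speedups relative to the oldest card. This is just arithmetic on the numbers above; the choice of the Titan Black as baseline and the helper name are my own. The last value inverts day/ns into the more commonly quoted ns/day.

```python
# Relative performance computed from the results table.

# name -> (nbody GFLOP/s, NAMD run time in seconds, NAMD day/ns)
RESULTS = {
    "Titan X (Pascal)": (7507, 41, 0.570),
    "TITAN X (Maxwell)": (4292, 55, 0.889),
    "Titan Black": (2302, 81, 1.460),
    "GTX 1080": (5429, 48, 0.709),
}


def relative_to(results, baseline="Titan Black"):
    """Return name -> (nbody speedup, NAMD run-time speedup, ns/day)."""
    base_gflops, base_secs, _ = results[baseline]
    out = {}
    for name, (gflops, secs, days_per_ns) in results.items():
        out[name] = (
            gflops / base_gflops,   # higher GFLOP/s is better
            base_secs / secs,       # lower run time is better
            1.0 / days_per_ns,      # convert day/ns to ns/day
        )
    return out


for name, (nbody_x, namd_x, ns_day) in relative_to(RESULTS).items():
    print(f"{name:18s} nbody {nbody_x:4.2f}x  NAMD {namd_x:4.2f}x  {ns_day:4.2f} ns/day")
```

The NAMD speedup is smaller than the nbody speedup because NAMD also depends on the CPU side of the system, while nbody runs almost entirely on the GPU.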
Notes:
- nbody was run as: nbody -benchmark -numbodies=256000 -device={one of 0,1,2,3}
- NAMD was run as: namd2 +p 64 +setcpuaffinity stmv.namd
  This is more CPU cores than needed for balance with the GPU, but I wanted the GPU to be the performance-limiting component.
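If you want to repeat the nbody run on each card in turn, the command lines in the notes above can be generated in a loop. The following is only a sketch: the `./nbody` path, the helper names, and the idea of driving the runs from Python are assumptions, and the script prints the commands rather than executing them.

```python
# Build the benchmark command lines from the notes, one nbody run per device.
# Assumption: the nbody binary is in the current directory as ./nbody.
import shlex


def nbody_command(device, numbodies=256000, binary="./nbody"):
    """Argument list for one nbody benchmark run on the given CUDA device index."""
    return [binary, "-benchmark", f"-numbodies={numbodies}", f"-device={device}"]


def namd_command(cores=64, config="stmv.namd", binary="namd2"):
    """Argument list for the NAMD stmv run with CPU affinity enabled."""
    return [binary, "+p", str(cores), "+setcpuaffinity", config]


if __name__ == "__main__":
    for device in range(4):  # devices 0..3, as in the notes
        print(shlex.join(nbody_command(device)))
    print(shlex.join(namd_command()))
    # To actually run a command, pass the list to subprocess.run(cmd, check=True).
```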
I’m looking forward to setting up NVIDIA DIGITS 4 (just released to developers) and seeing what kind of performance we see with training sets on Caffe using the new Titan X.
Extra data … if you like that sort of thing
kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./bandwidthTest 0
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: TITAN X
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11853.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12854.1
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 342979.6
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
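For context on the host/device numbers from bandwidthTest, they can be compared with the theoretical limit of a PCIe 3.0 x16 slot (8 GT/s per lane with 128b/130b encoding). This is a rough sketch: it assumes the "MB/s" in the output means 10^6 bytes per second, which may not match how the sample computes it, and the function names are illustrative.

```python
# Compare measured pinned-memory transfer rates with the PCIe 3.0 x16 limit.
# Assumption: bandwidthTest "MB/s" is treated as 10**6 bytes per second.


def pcie3_peak_mb_s(lanes=16):
    """Theoretical one-direction PCIe 3.0 bandwidth in MB/s (10**6 bytes/s)."""
    transfers_per_s = 8e9          # 8 GT/s per lane
    encoding = 128.0 / 130.0       # 128b/130b line encoding overhead
    bytes_per_s_per_lane = transfers_per_s * encoding / 8.0
    return lanes * bytes_per_s_per_lane / 1e6


# Values copied from the bandwidthTest output for device 0
MEASURED_MB_S = {"Host to Device": 11853.2, "Device to Host": 12854.1}


def efficiency(measured_mb_s, lanes=16):
    """Fraction of the theoretical link bandwidth that was achieved."""
    return measured_mb_s / pcie3_peak_mb_s(lanes)


for direction, mb_s in MEASURED_MB_S.items():
    print(f"{direction}: {mb_s:.1f} MB/s, {efficiency(mb_s):.1%} of PCIe 3.0 x16 peak")
```

Packet and protocol overhead keep real transfers below the link limit, so results in the 75 to 80 percent range are plausible for a healthy x16 slot.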
kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "TITAN X"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12186 MBytes (12778274816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1531 MHz (1.53 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX TITAN Black"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6082 MBytes (6377439232 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 980 MHz (0.98 GHz)
Memory Clock rate: 3500 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 9 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 3: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8113 MBytes (8507555840 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from TITAN X (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from TITAN X (GPU0) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from TITAN X (GPU0) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX 1080 (GPU3) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN X (GPU2) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X, Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN X, Device3 = GeForce GTX 1080
Result = PASS
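The memory clock and bus width reported by deviceQuery are enough to estimate each card's theoretical memory bandwidth, using the usual rule of thumb of two transfers per reported memory clock. The sketch below does that for the four devices. The doubling factor is the standard double-data-rate assumption, and the names are illustrative.

```python
# Theoretical device memory bandwidth from deviceQuery's memory clock and bus width.
# Assumption: two transfers per reported memory clock (double data rate).

# name -> (memory clock MHz, memory bus width in bits), copied from deviceQuery
DEVICES = {
    "TITAN X (Pascal)": (5005, 384),
    "GeForce GTX TITAN Black": (3500, 384),
    "GeForce GTX TITAN X": (3505, 384),
    "GeForce GTX 1080": (5005, 256),
}


def peak_bandwidth_gb_s(mem_clock_mhz, bus_width_bits):
    """Theoretical memory bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8.0
    return 2.0 * mem_clock_mhz * 1e6 * bytes_per_transfer / 1e9


for name, (clock, width) in DEVICES.items():
    print(f"{name:24s} {peak_bandwidth_gb_s(clock, width):7.2f} GB/s")
```

For device 0 this gives about 480 GB/s, so the roughly 343,000 MB/s device-to-device figure from bandwidthTest is somewhere around 70 to 75 percent of peak, depending on how the sample defines a megabyte.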
Happy computing –dbk