Table of Contents
- The good news
- The bad news
- Test system
- TensorFlow performance with 1-2 RTX Titan GPUs
- TensorFlow CNN: ResNet-50
- TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset
- Should you get an RTX Titan for machine learning work?
The RTX Titan is NVIDIA’s latest release of the venerable Titan line. Does it live up to expectations? How is the compute performance for machine learning? Is the peer-to-peer performance the same as with the other new RTX 20xx cards? I’ll answer these questions and give some recommendations for using the new RTX Titan.
I was able to get some testing time with two of the new RTX Titan GPUs. There is good news and bad news.
The good news
The RTX Titan has good fp32 and fp16 compute performance. It is similar in character to the RTX 2080 Ti but has twice the memory and better performance.
Having 24GB of memory opens up some new possibilities:
- Larger batch sizes for deep learning jobs. That can increase performance and improve convergence in some circumstances.
- Input data with a very large number of features, for example bigger images.
- Having more memory helps to avoid the dreaded OOM (out-of-memory) messages that can come up in a variety of situations.
The larger amount of memory on the RTX Titan is probably its best distinguishing feature for compute. Sometimes not-enough-memory is a “show-stopper”. GPU memory is expensive, so I feel the RTX Titan is quite reasonably priced for a 24GB card. The similar (but better) Quadro RTX 6000 with 24GB of memory is more than twice the price of the RTX Titan.
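If you are not sure what batch size will fit in 24GB, an easy approach is to try a large batch and back off until the OOM errors stop. Here's a minimal sketch of that idea in TensorFlow 1.x; the tiny conv "model" is just a placeholder for a real network:

```python
# Find the largest batch size that fits in GPU memory by retrying after
# an OOM error. The toy graph here is a placeholder -- swap in your own
# model. TensorFlow 1.x style.
import tensorflow as tf

def try_batch_size(batch_size):
    """Build a toy graph and run one training step at this batch size."""
    tf.reset_default_graph()
    x = tf.random_normal([batch_size, 224, 224, 3])      # dummy image batch
    net = tf.layers.conv2d(x, 64, 7, strides=2, padding='same')
    logits = tf.layers.dense(tf.layers.flatten(net), 1000)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        tf.reduce_mean(logits))
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True               # allocate as needed
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)

for bs in (2048, 1024, 512, 256, 128, 64):
    try:
        try_batch_size(bs)
        print("batch size %d fits" % bs)
        break
    except tf.errors.ResourceExhaustedError:
        print("batch size %d -> OOM, trying smaller" % bs)
```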
The bad news
OK, now for the "bad": one thing that limits some configuration options, and a couple of other minor disappointments.
The worst thing about the RTX Titan is the cooling solution for the card.
At this time the RTX Titan is only available with a dual-fan, open-air cooler. That arrangement requires at least one empty slot next to the card to allow for proper cooling. That means you will be limited to two RTX Titans even on a motherboard with four x16 slots and good chassis cooling.
Our production engineers have worked up a chassis and cooling setup that works well for two RTX Titans. This can be configured with an NVLINK bridge using the wide-spacing bridge. This configuration may not be listed on our usual configuration pages, so please contact [email protected] or 1 888 784-3872 extension 1 for more information.
A minor disappointment with the RTX Titan is the lack of peer-to-peer memory access over PCIe, i.e., you have to have an NVLINK bridge to get “P2P”. In my opinion this is not that important, since in practice it does not seem to have much impact on most multi-GPU workloads. Note that the Quadro RTX 6000 does have P2P over PCIe when not using the NVLINK bridge (the same as the Pascal-based cards). The other reason this is not a big concern is that you can't use more than two RTX Titans in a workstation anyway (because of cooling), so if you do use two cards you may as well get the NVLINK bridge too.
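If you want to check P2P status on a system yourself, the p2pBandwidthLatencyTest sample I built from the CUDA source is the thorough way, but a quick check is easy from Python too. This is just a sketch and assumes PyCUDA is installed (it is not part of the test setup listed below):

```python
# Quick peer-to-peer capability check. Assumes PyCUDA (pip install pycuda).
# p2pBandwidthLatencyTest from the CUDA samples reports the same capability
# matrix plus actual bandwidth and latency numbers.
import pycuda.driver as cuda

cuda.init()
num_gpus = cuda.Device.count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i == j:
            continue
        ok = cuda.Device(i).can_access_peer(cuda.Device(j))
        print("GPU %d -> GPU %d : P2P %s" % (i, j, "yes" if ok else "no"))
```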
The last disappointment is really not an issue with the RTX Titan; it's just part of the design of Turing GPUs. I'm referring to the lack of good fp64 (double precision) performance. The RTX GPUs are a lot like the older GTX NVIDIA cards, where double precision performance is only a small fraction of single precision (fp32) performance. This is not important for the majority of applications, since fp32 is typically what's used for GPU compute. It's only a disappointment to me because I love the Titan V for its great fp64 performance, which matters for my own general scientific computing work. The Titan V is basically a desktop version of the high-performance Tesla V100 compute accelerator. I'm looking forward to the next Titan card based on a Tesla-class GPU.
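If you want to see the fp32 vs fp64 gap on your own card, a rough matrix-multiply timing is enough to show it. This is a quick sketch in TensorFlow 1.x; on GeForce/Titan RTX class GPUs expect the fp64 rate to be roughly 1/32 of fp32, while on the Titan V it is about 1/2:

```python
# Rough fp32 vs fp64 throughput comparison using a large matrix multiply.
# TensorFlow 1.x style; timings include some RNG overhead, so treat the
# numbers as approximate -- the fp32/fp64 ratio is what matters.
import time
import tensorflow as tf

def matmul_gflops(dtype, n=4096, iters=20):
    """Return approximate GFLOP/s for n x n matmuls at the given precision."""
    tf.reset_default_graph()
    a = tf.random_normal([n, n], dtype=dtype)
    b = tf.random_normal([n, n], dtype=dtype)
    c = tf.matmul(a, b)
    with tf.Session() as sess:
        sess.run(c)                                 # warm-up
        start = time.time()
        for _ in range(iters):
            sess.run(c)
        elapsed = time.time() - start
    return 2.0 * n ** 3 * iters / elapsed / 1e9    # 2*n^3 flops per matmul

print("fp32: %10.1f GFLOP/s" % matmul_gflops(tf.float32))
print("fp64: %10.1f GFLOP/s" % matmul_gflops(tf.float64))
```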
Test system
Hardware
- Puget Systems Peak Single (I used my personal system that is similar to this, but not quite as nice!)
- Intel Xeon W-2175 14-core
- 128GB Memory
- 1TB Samsung NVMe M.2
- RTX Titan (2)
Software
- Ubuntu 18.04
- NVIDIA display driver 410.79 (from CUDA install)
- CUDA 10.0
- Source builds of:
  - p2pBandwidthLatencyTest
- TensorFlow 1.10 and 1.4
- Docker 18.06.1-ce
- NVIDIA-Docker 2.0.3
- NVIDIA NGC container registry
- Container image: nvcr.io/nvidia/tensorflow:18.09-py3 for “Big LSTM”
- Container image: nvcr.io/nvidia/tensorflow:18.03-py2 linked with NCCL and CUDA 9.0 for multi-GPU “CNN”
I used the NVIDIA NGC docker images that I have used in several recent posts. There are newer versions of these container images, and I will be using them in future posts. I have worked with these newer images and they do give better performance with the Turing RTX GPUs. However, the relative performance is similar to the older versions, and I wanted to include my older test results for other GPUs in this post.
For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning.
TensorFlow performance with 1-2 RTX Titan GPUs
I am including relevant results from all of my recent testing with the RTX GPUs. The two latest posts are P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs and RTX 2080Ti with NVLINK – TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V); both may be of interest. After this post I will be updating my testing benchmarks and plan on doing a large multi-GPU comparison post.
The CNN code I am using is from an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. There are more recent versions of this docker image that use “horovod” on top of MPI for multi-GPU parallelism; I will be using those in future posts. The LSTM “Billion Word” benchmark is running on a newer image (but not the newest) with TensorFlow 1.10 linked with CUDA 10.0.
I’ll give the command-line inputs for reference.
With the addition of the RTX Titan test results the tables and plots are getting bigger and now include GTX 1080 Ti, RTX 2070, 2080, 2080 Ti, Titan V and RTX Titan.
TensorFlow CNN: ResNet-50
Docker container image tensorflow:18.03-py2 from NGC,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2
Example command line for job start,
NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16
Note, --fp16 means “use tensor-cores”. I used batch sizes of 64 for most fp32 tests and 128 for fp16; on the RTX Titan I used 192 for fp32 and 384 for fp16, taking advantage of the RTX Titan's 24GB of memory.
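To be clear about what --fp16 is doing: the heavy math runs in half precision (which is what the tensor-cores accelerate) while master weights stay in fp32 for numerical stability. Here is a minimal sketch of that pattern in TensorFlow 1.x (not the actual nvcnn.py code path, just the core idea):

```python
# Mixed-precision pattern in miniature: fp32 master weights, fp16 compute.
# On Volta/Turing the fp16 matmul is eligible for tensor-cores.
import tensorflow as tf

x = tf.random_normal([128, 1024])                  # fp32 activations
w = tf.get_variable("w", [1024, 1024])             # fp32 "master" weights
y16 = tf.matmul(tf.cast(x, tf.float16),            # math done in fp16
                tf.cast(w, tf.float16))
y = tf.cast(y16, tf.float32)                       # back to fp32 downstream

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(tf.reduce_mean(y)))
```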
[ResNet-50] – GTX 1080Ti, RTX 2070, 2080, 2080Ti, Titan V and RTX Titan – using TensorFlow, Training performance (Images/second)
| GPU | FP32 Images/sec | FP16 (Tensor-cores) Images/sec |
|---|---|---|
| RTX 2070 | 192 | 280 |
| GTX 1080 Ti | 207 | N/A |
| RTX 2080 | 207 | 332 |
| RTX 2080 Ti | 280 | 437 |
| RTX Titan | 294 | 481 |
| Titan V | 299 | 547 |
| 2 x RTX 2080 | 364 | 552 |
| 2 x GTX 1080 Ti | 367 | N/A |
| 2 x RTX 2080+NVLINK | 373 | 566 |
| 2 x RTX 2080 Ti | 470 | 750 |
| 2 x RTX 2080 Ti+NVLINK | 500 | 776 |
| 2 x RTX Titan | 572 | 941 |
| 2 x RTX Titan+NVLINK | 577 | 958 |
TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset
Docker container image tensorflow:18.09-py3 from NGC,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3
Example job command-line,
/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256
“Big LSTM” – GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V and RTX Titan – TensorFlow – Training performance (words/second)
| GPU | words/second |
|---|---|
| RTX 2070 | 4740 |
| RTX 2080 | 5071 |
| GTX 1080 Ti | 6460 |
| Titan V (Note 1) | 7066 |
| Titan V (Note 2) | 8373 |
| 2 x RTX 2080 | 8882 |
| RTX 2080 Ti | 8945 |
| RTX Titan | 9095 |
| 2 x RTX 2080+NVLINK | 9711 |
| 2 x GTX 1080 Ti | 11462 |
| 2 x RTX 2080 Ti | 15770 |
| 2 x RTX 2080 Ti+NVLINK | 16977 |
| 2 x RTX Titan | 17863 |
| 2 x RTX Titan+NVLINK | 18118 |
- Note: With only 8GB of memory on the RTX 2070 and 2080, I had to drop the batch size down to 256 to keep from getting “out of memory” errors. Batch size 448 was used for the GTX 1080 Ti and RTX 2080 Ti, and batch size 640 was used for the RTX Titan.
- Note 1: For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080 Ti performance was very good!
- Note 2: I re-ran the “big-LSTM” job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of “big-LSTM”.
Should you get an RTX Titan for machine learning work?
It was pretty easy to say “yes” to the RTX 2080 Ti as a compute accelerator, especially since it is available with good blower-based coolers. Therein lies the problem with the RTX Titan: the cooling is not very good for a GPU at this performance level. It generates a lot of heat under load and dumps that heat right into your chassis. It takes extra care to design chassis cooling adequate to remove that heat and keep the GPU from throttling its clocks down under heavy load. (We have solved that cooling problem.)
The RTX Titan would be excellent for a single GPU setup. The 24GB of memory will allow for development work on problems that would be difficult or impossible without it. Using two RTX Titans in a system along with a wide-spaced NVLINK bridge is a good high performance configuration as long as the overall system is designed to provide sufficient cooling.
For a multi-GPU (more than two GPU) system that needs this level of capability and performance, I would recommend the Quadro RTX 6000. This Quadro card has the same amount of memory, it has P2P over PCIe enabled, and it has a great cooling design. The only downside of the Quadro RTX is the cost. (I have done testing with two Quadro RTX 6000's.)
Overall, all of the RTX GPUs are very good compute devices. For machine learning workloads they are worthy updates to the (wonderful) “Pascal”-based GTX GPUs, with better performance and the addition of “tensor-cores”. The RTX GPUs are also innovative! Outside of compute, I am looking forward to seeing what developers do with the ray-tracing capabilities of these cards.
Happy computing! –dbk