Table of Contents
- System configuration
- Test Jobs
- Results
  - PCIe X16 vs X8 — GoogLeNet Training, TensorFlow FP32 and FP16 (Tensor-cores)
  - PCIe X16 vs X8 — Billion Words Benchmark LSTM Train, TensorFlow
  - PCIe X16 vs X8 — VGG, Keras (TensorFlow) Memory-streaming 25000 Images
  - PCIe X16 vs X8 — VGG in Keras (TensorFlow) Disk-streaming 25000 Images
- Appendix A: Peer to peer bandwidth and latency test results
- Appendix B: VGG model train on cat vs dog image set (25000 images)
PCIe X16 vs PCIe X8 when doing GPU accelerated computing … How much difference does it make? I get asked that question a lot. The answer, of course, is "it depends". A slightly better answer is that it probably won't have much effect for most real-world applications.
In this post I'll be looking at some common Machine Learning (Deep Learning) job runs on a system with 4 Titan V GPUs. I will repeat the job runs at PCIe X16 and X8.
I looked at this issue a couple of years ago and wrote it up in this post, PCIe X16 vs X8 for GPUs when running cuDNN and Caffe. I've been doing a lot more Machine Learning work since then, and the question of PCIe bandwidth limitations keeps coming up. The machines I run jobs on always have their GPUs running at X16, either from lanes provided by the CPU and chipset or with the help of PLX (PEX) PCIe switches. On current Intel single-socket systems a PCIe switch is required to get 4 full X16 slots for GPUs. The alternative for 4 GPUs on boards that don't have PCIe switches is to run the cards at X8 (assuming the motherboard has enough slots to support that). The question is how much performance you lose by doing that. Let's find out.
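For reference, the theoretical numbers behind X16 vs X8 are simple to work out. PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, which gives roughly 985 MB/s per lane in each direction:

```python
# Theoretical peak PCIe 3.0 bandwidth per direction (8 GT/s, 128b/130b encoding)
per_lane = 8e9 * (128 / 130) / 8          # bytes/sec per lane, ~0.985 GB/s
print('X16: %.2f GB/s' % (16 * per_lane / 1e9))   # ~15.75 GB/s
print('X8 : %.2f GB/s' % ( 8 * per_lane / 1e9))   # ~7.88 GB/s
```

So X8 is exactly half the peak bandwidth. The interesting question is how much of that halving shows up in real job runs.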
System configuration
The system I’m using for this testing is the same system I’ve used recently for tests with Titan V. The recent post on multi-GPU scaling has more details about the configuration and setup, Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation.
Hardware
System under test,
- Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
- Intel Xeon W-2195 18 core (Skylake-W with AVX512)
- 256GB Reg ECC memory (up to 512GB)
- 4 x NVIDIA Titan V GPUs
- Samsung 256GB NVMe M.2
Software
- Ubuntu 16.04
- Docker 18.03.0-ce
- NVIDIA Docker V2
- TensorFlow 1.7 (running on NVIDIA NGC docker image; see the example launch command below)
- Keras 2.1.5 (with TensorFlow back-end from a local install with Anaconda Python)
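For reference, this is roughly how a container is started with NVIDIA Docker V2. The image tag here is just a placeholder; use whatever NGC TensorFlow tag you have pulled:

```
docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tensorflow:18.03-py2
```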
“Sticky Note” method for switching from X16 to X8
In order to keep everything about the system under test the same, the PCIe bandwidth was physically changed on the GPUs by blocking off half of the pins on the cards with cut-down "sticky notes". Don't try this at home! This is equivalent to putting the cards into X8 slots. The following photo shows 4 Titan V GPUs with half of the pins blocked off.
Test Jobs
Performance was tested with a mixture of benchmark jobs, a "real-world" model training job and, for completeness, a peer-to-peer bandwidth and latency test.
GoogLeNet Training
This job was run using the TensorFlow docker image from NVIDIA NGC. The application is cnn in the nvidia-examples directory, and I am using synthetic data for image input. For details on this job run see NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning. [Note: this job shows a clear benefit from using Tensor-cores, so that data is included in the results.]
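The command lines look roughly like the following. The script name and flags are as I recall them from the nvidia-examples/cnn directory in the NGC image; check the script's own help before relying on them:

```
cd /opt/tensorflow/nvidia-examples/cnn
python nvcnn.py --model=googlenet --batch_size=256 --num_gpus=4          # FP32
python nvcnn.py --model=googlenet --batch_size=512 --num_gpus=4 --fp16   # FP16 (Tensor-cores)
```

Note that --batch_size is per GPU, which is why the tables below report the total batch size in parentheses.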
Billion Words Benchmark LSTM Train
This job also uses the docker image mentioned above. The application is big_lstm using the billion-word news-feed corpus. See TensorFlow Scaling on 8 1080Ti GPUs – Billion Words Benchmark with LSTM on a Docker Workstation Configuration for example usage.
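The invocation is roughly as follows. The paths here are assumptions based on the layout of the NGC image and where you unpack the benchmark data:

```
cd /opt/tensorflow/nvidia-examples/big_lstm
python single_lm_train.py --mode=train --logdir=./logs --num_gpus=4 \
    --datadir=./data/1-billion-word-language-modeling-benchmark-r13output
```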
VGG model in Keras
For a "real-world" test I did an implementation of the VGG CNN using Keras (with the TensorFlow back-end). This was run in a Jupyter notebook with Anaconda Python installed on the machine. The data used is 25000 images from the "dogs vs. cats" dataset downloaded from Kaggle. The source is included in Appendix B.
Peer-to-peer bandwidth and latency
The p2pBandwidthLatencyTest from the NVIDIA CUDA samples was used to show the direct bandwidth-halving effect of moving from X16 to X8. The output is in Appendix A.
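If you want to run it yourself, it builds like any other CUDA sample. The samples path below depends on your CUDA version; 9.1 is an assumption:

```
cd ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
```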
Results
The following tables and charts give a pretty clear indication that X16 has only a small advantage over X8. In some cases there is no apparent difference. This does not mean that this will always be the case! These jobs achieve most of their parallelism by distributing batches across the GPUs, so there is very little card-to-card communication. This means that most of the effect of the lower bandwidth at X8 occurs during the transfer of data from CPU space to GPU space. How much that affects your particular job will vary. However, these results suggest that the effect may be modest in most cases.
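A quick back-of-the-envelope calculation shows why. Take the single-GPU GoogLeNet FP32 job below, and assume 224x224x3 float32 input images:

```python
# Host-to-device transfer time for one 256-image GoogLeNet batch
# (224x224x3 float32 input assumed)
batch_bytes = 256 * 224*224*3 * 4                 # ~154 MB
for name, bw in [('X16', 15.75e9), ('X8', 7.88e9)]:
    print('%s: %.1f ms' % (name, batch_bytes / bw * 1e3))
# X16: ~9.8 ms, X8: ~19.6 ms -- small next to the ~300 ms of compute per
# batch implied by 851 images/sec, and the copy can overlap with compute
```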
PCIe X16 vs X8 — GoogLeNet Training, TensorFlow FP32 and FP16 (Tensor-cores)
Titan V GPUs [Images/second (total batch size)]
| Number of GPUs | PCIe X16 FP32 | PCIe X16 FP16 | PCIe X8 FP32 | PCIe X8 FP16 |
|---|---|---|---|---|
| 1 | 851.3 (256) | 1370.6 (512) | 838 | 1319 |
| 2 | 1525.1 (512) | 2517.0 (1024) | 1519 | 2424 |
| 3 | 2272.3 (768) | 3661.3 (1536) | 2153 | 3572 |
| 4 | 3080.2 (1024) | 4969.6 (2048) | 2943 | 4707 |
The code for this job run is highly optimized for GPU, and there is only a minor difference between X16 and X8, though the difference does increase with more GPUs. This chart also shows the nice improvement from using Tensor-cores (FP16) on the Titan V!
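For concreteness, the worst-case (4 GPU) slowdown from the table works out to about 4-5%:

```python
# Percent slowdown going from X16 to X8 at 4 GPUs (numbers from the table above)
for label, x16, x8 in [('FP32', 3080.2, 2943.0), ('FP16', 4969.6, 4707.0)]:
    print('%s: %.1f%% slower at X8' % (label, (x16 - x8) / x16 * 100))
# FP32: 4.5%   FP16: 5.3%
```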
PCIe X16 vs X8 — Billion Words Benchmark LSTM Train, TensorFlow
Titan V GPUs [Words per second]
| Number of GPUs | PCIe X16 | PCIe X8 |
|---|---|---|
| 1 | 8373 | 8176 |
| 2 | 15483 | 14686 |
| 3 | 20462 | 19178 |
| 4 | 22058 | 20565 |
Again, the difference between X16 and X8 is minor, and it increases with more GPUs.
PCIe X16 vs X8 — VGG, Keras (TensorFlow) Memory-streaming 25000 Images
Titan V GPUs [Training time for 4 epochs (seconds)]
| Number of GPUs | PCIe X16 | PCIe X8 |
|---|---|---|
| 1 | 476 | 476 |
| 2 | 262 | 269 |
| 3 | 222 | 202 |
| 4 | 188 | 195 |
In this Keras implementation of VGG there is even less performance difference between X16 and X8. The multi-GPU scaling beyond 2 GPUs is also not as good as in the previous jobs. The image data was loaded into memory and fed to the model through Python variables. The Python source that was used for this job is given in Appendix B.
PCIe X16 vs X8 — VGG in Keras (TensorFlow) Disk-streaming 25000 Images
Titan V GPUs [Training time for 4 epochs (seconds)]
| Number of GPUs | PCIe X16 | PCIe X8 |
|---|---|---|
| 1 | 445 | 445 |
| 2 | 320 | 314 |
| 3 | 316 | 318 |
| 4 | 316 | 316 |
For this job run the image data was fed to the model in batches loaded from disk and put into the proper format on-the-fly. This caused multi-GPU scaling to fall off completely after 2 GPUs. And, again, dropping to X8 had almost no effect on performance. The real bottleneck here is the time taken to convert the image data pulled from disk. [Note: the data was stored on a fast NVMe SSD.]
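The disk-streaming variant isn't listed in Appendix B; it differs from the memory-streaming code only in how the generator is used. A minimal sketch, assuming the same directory layout and batch size as Appendix B:

```python
# Disk-streaming variant: read, decode, and rescale images in batches on-the-fly
# instead of loading all 25000 into memory first
from keras.preprocessing.image import ImageDataGenerator

batch_size = 64
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    './train', target_size=(224,224),
    batch_size=batch_size,          # normal batch size, not all 25000 at once
    class_mode='binary')

# model is the same VGG network defined in Appendix B
model.fit_generator(train_generator, epochs=4, verbose=1,
                    steps_per_epoch=25000 // batch_size)
```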
There you have it, PCIe X16 vs PCIe X8: not a lot of difference, from what I've seen in the testing above. Still, I think I would always prefer to have X16 for my GPUs!
What happens if you drop even further? I don't really know, but I suspect that X4 may still be OK. If you are thinking about turning your X1 "coin mining rig" into a machine learning box, I think it would probably work. If you try that, let me know!
Happy computing! –dbk
Appendix A: Peer to peer bandwidth and latency test results
For completeness, I wanted to include the results from running p2pBandwidthLatencyTest (source available from the "CUDA samples").
The bandwidth and latency for this test system look very good. You do see some expected bandwidth lowering and latency increase across devices 2 and 3, which are on the PLX switch.
[Screenshots of the p2pBandwidthLatencyTest output at PCIe X16 and PCIe X8 appeared here.]
Appendix B: VGG model train on cat vs dog image set (25000 images)
```python
import numpy as np
np.random.seed(42)

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import TensorBoard
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import multi_gpu_model
```
```python
model = Sequential()
model.add(Conv2D(64, 3, activation='relu', input_shape=(224,224,3)))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(128, 3, activation='relu'))
model.add(Conv2D(128, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
```

```python
model.summary()
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 64)      1792
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 220, 220, 64)      36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 110, 110, 64)      0
_________________________________________________________________
batch_normalization_1 (Batch (None, 110, 110, 64)      256
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 108, 108, 128)     73856
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 106, 106, 128)     147584
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 53, 53, 128)       0
_________________________________________________________________
batch_normalization_2 (Batch (None, 53, 53, 128)       512
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 51, 51, 256)       295168
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 49, 49, 256)       590080
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 47, 47, 256)       590080
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 23, 23, 256)       0
_________________________________________________________________
batch_normalization_3 (Batch (None, 23, 23, 256)       1024
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 21, 21, 512)       1180160
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 19, 19, 512)       2359808
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 17, 17, 512)       2359808
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 8, 8, 512)         0
_________________________________________________________________
batch_normalization_4 (Batch (None, 8, 8, 512)         2048
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 6, 6, 512)         2359808
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 4, 4, 512)         2359808
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 2, 2, 512)         2359808
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 1, 1, 512)         0
_________________________________________________________________
batch_normalization_5 (Batch (None, 1, 1, 512)         2048
_________________________________________________________________
flatten_1 (Flatten)          (None, 512)               0
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              2101248
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0
_________________________________________________________________
dense_2 (Dense)              (None, 4096)              16781312
_________________________________________________________________
dropout_2 (Dropout)          (None, 4096)              0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 4097
=================================================================
Total params: 33,607,233
Trainable params: 33,604,289
Non-trainable params: 2,944
_________________________________________________________________
```
```python
#parallel_model = multi_gpu_model(model, gpus=2)
#parallel_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```
Setup data
```python
tensorbrd = TensorBoard('./logs/vgg-1')
batch_size = 64

train_dir = './train'
img_size = (224,224)
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=img_size,
    #batch_size=batch_size,
    # the following batch size loads all 25000 images at once
    batch_size=25000,
    class_mode='binary')
```

```
Found 25000 images belonging to 2 classes.
```

```python
for X, Y in train_generator:
    print('data X shape:', X.shape)
    print('labels Y shape:', Y.shape)
    break
```

```
data X shape: (25000, 224, 224, 3)
labels Y shape: (25000,)
```

```python
#parallel_model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd],
#                             steps_per_epoch=25000 // batch_size,
#                             max_queue_size=25000, workers=16)
#model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd])
#parallel_model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])
model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])
```