Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation



I have been qualifying a 4 GPU workstation for Machine Learning and HPC use, and it is looking very good! The last confirmation testing I wanted to do was running some TensorFlow benchmarks on 4 NVIDIA Titan V GPUs. I now have that system up and running, and the multi-GPU scaling looks good.


System configuration

Note: At Puget Systems we build a lot of machines for Machine Learning development work, and we have been anxious to have a PCIe X16 x 4 GPU configuration on single-socket Xeon-W. There have been several delays caused by the Intel Spectre/Meltdown mess: hardware release delays, broken BIOS updates, broken firmware updates, etc. Things are settling down now, but there are still some problems with component availability. We expect to have an "available" release of the system used in this post in a few weeks' time.

Hardware

System under test,

  • Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
  • Intel Xeon W-2195 18 core (Skylake-W with AVX512)
  • 256GB Reg ECC memory (up to 512GB)
  • 4 x NVIDIA Titan V GPUs
  • Samsung 256GB NVMe M.2

… Full configuration options will be available on release.

Software

  • Ubuntu 16.04
  • Docker 18.03.0-ce
  • NVIDIA Docker V2
  • TensorFlow 1.7 (running on NVIDIA NGC docker image)

For details on the system environment setup, please see my earlier posts.
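
As a quick sanity check that the NVIDIA container runtime can see all four GPUs, something like the following should list all four Titan V cards (it uses the same NGC image as the testing below, so do the docker login described there first),

docker run --runtime=nvidia --rm nvcr.io/nvidia/tensorflow:18.03-py2 nvidia-smi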


Testing Setup

This multi-GPU scaling test uses the same convolutional neural network models, implemented with TensorFlow, that I used in my recent post GPU Memory Size and Deep Learning Performance (batch size) 12GB vs 32GB — 1080Ti vs Titan V vs GV100. The code I'm running is from the TensorFlow docker image on NVIDIA NGC. The application is "cnn" in the nvidia-examples directory, and I am using synthetic data for image input.

My command line to start the container (after doing docker login nvcr.io to access the NGC docker registry) is,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Note: there was a newer image available tagged 18.04-py, but when I tried using that image all of the CNN jobs failed.
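
Inside the container the jobs are launched from the nvidia-examples/cnn directory. As a rough illustration (the script name and flags shown here are from memory of that image and may differ between NGC releases), a 4 GPU FP16 ResNet-50 run looks something like,

python nvcnn.py --model=resnet50 --num_gpus=4 --batch_size=128 --fp16 --num_batches=80

Assuming --batch_size is per GPU, a per-GPU batch size of 128 corresponds to the 512 total batch size reported for the 4 GPU FP16 ResNet-50 result below.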


Results

Results are for training the convolutional neural networks GoogLeNet, ResNet-50 and Inception-4. These are increasingly complex models. The training uses synthetic image data and measures the forward and backward propagation through the networks for 80 batches of images at the batch sizes given in the tables. Reported results are images per second.

The scaling looks very good, as seen in the tables, bar charts and Amdahl's Law curve-fit plots.

The parallel GPU scaling performance does decrease with the increasing complexity of the model, but overall it looks quite good. This reflects positively on the hardware being tested and on the quality of TensorFlow and the code implementation for these models.

I have included results for both FP32 (32-bit single precision floating point) and FP16 (16-bit half precision floating point). FP16 is what is used by the Tensor-cores on Volta-based GPUs like the Titan V. Tensor-cores can significantly increase performance at the risk of numerical instability in the job runs. I believe they should be trialled for real-world work since the performance gains are compelling. I have a discussion of Tensor-cores in this post, NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning.

When looking at parallel scaling I like to include a fit of the data to Amdahl's Law. These curves give a representation of the deviation from ideal linear scaling. The curve fit also gives a parallel fraction P that is an indication of the maximum speedup achievable. The maximum speedup is unlikely to exceed 1/(1-P) with any number of compute devices in the system (GPUs in our case).

Here's the expression of Amdahl's Law that I did a regression fit of the data to,

Images_per_second = Images_per_second_for_one_GPU / ((1 - P) + (P / Number_of_GPUs))
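
If you want to reproduce the curve fits, a minimal sketch using scipy.optimize.curve_fit with the GoogLeNet FP32 numbers from the first table below would be,

import numpy as np
from scipy.optimize import curve_fit

# GoogLeNet FP32 results from the table below (Images/sec for 1-4 GPUs)
gpus = np.array([1.0, 2.0, 3.0, 4.0])
imgs_per_sec = np.array([851.3, 1525.1, 2272.3, 3080.2])

# Amdahl's Law scaled by the single GPU rate; P is the parallel fraction
def amdahl(n, P):
    return imgs_per_sec[0] / ((1.0 - P) + P / n)

(P_fit,), _ = curve_fit(amdahl, gpus, imgs_per_sec, p0=[0.9])
print("Parallel fraction P = %.3f" % P_fit)
print("Speedup bound 1/(1-P) = %.1f" % (1.0 / (1.0 - P_fit)))

The same fit applied to each model and precision gives the Amdahl's Law plots shown with the tables.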

GoogLeNet Multi-GPU scaling with 1-4 Titan V GPUs using TensorFlow - Training performance (Images/second)

Number of GPUs    FP32 Images/sec (total batch size)    FP16 Images/sec (total batch size)
1                  851.3 (256)                          1370.6 (512)
2                 1525.1 (512)                          2517.0 (1024)
3                 2272.3 (768)                          3661.3 (1536)
4                 3080.2 (1024)                         4969.6 (2048)

GoogLeNet multi-GPU scaling bar chart

GoogLeNet Amdahl's Law curve fit


ResNet-50 Multi-GPU scaling with 1-4 Titan V GPUs using TensorFlow - Training performance (Images/second)

Number of GPUs    FP32 Images/sec (total batch size)    FP16 Images/sec (total batch size)
1                  293.0 (64)                            571.4 (128)
2                  510.8 (128)                           978.6 (256)
3                  701.8 (192)                          1375.9 (384)
4                  923.4 (256)                          1808.9 (512)

ResNet-50 multi-GPU scaling bar chart

ResNet-50 Amdahl's Law curve fit


Inception-4 Multi-GPU scaling with 1-4 Titan V GPUs using TensorFlow - Training performance (Images/second)

Number of GPUs    FP32 Images/sec (total batch size)    FP16 Images/sec (total batch size)
1                   89.2 (32)                            189.9 (64)
2                  153.4 (64)                            321.9 (128)
3                  205.3 (96)                            442.7 (192)
4                  265.9 (128)                           585.7 (256)

Inception-4 multi-GPU scaling bar chart

Inception-4 Amdahl's Law curve fit

Happy computing! –dbk


Appendix: Peer to peer bandwidth and latency test results

For completeness, I wanted to include the results from running p2pBandwidthLatencyTest (source available in the "CUDA samples").
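
The test is easy to build from the CUDA samples tree. The path below assumes a default CUDA toolkit install (adjust for your CUDA version) and copies the samples somewhere writable before running make,

cp -r /usr/local/cuda/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/p2pBandwidthLatencyTest
make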

The bandwidth and latency for this test system look very good. You do see the expected bandwidth reduction and latency increase across devices 2 and 3, which are on the PLX switch.

./p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.75   5.74   5.76
     1   5.86 554.87   5.72   5.74
     2   5.87   5.87 554.87   5.77
     3   5.75   5.76   5.81 555.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 554.87   6.03   6.02   6.02
     1   6.05 554.87   6.01   6.03
     2   4.39   4.39 554.08   4.26
     3   4.40   4.40   4.27 553.29
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 566.53  11.05  11.03  10.99
     1  11.07 564.49  11.15  10.92
     2  11.15  11.16 562.86   6.18
     3  11.05  11.01   6.19 564.49
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 564.05  10.83   8.40   8.43
     1  10.84 563.67   8.40   8.43
     2   8.40   8.39 564.90   8.07
     3   8.43   8.42   8.11 563.67
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.94  16.23  16.50  16.51
     1  16.41   3.03  16.53  16.52
     2  17.40  17.54   3.74  18.90
     3  17.42  17.36  18.73   3.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.01   5.38   5.55   5.57
     1   5.26   3.00   5.79   5.80
     2   6.81   6.91   3.00   6.64
     3   6.81   6.89   6.62   3.01