TensorFlow Performance with 1-4 GPUs — RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V



Introduction

This post is an update and expansion of much of the GPU testing I have been doing over the last several months. I am using current (as of this post date) TensorFlow builds on NVIDIA NGC and the most recent display driver, and I have results for up to 4 GPUs, including NVLINK, with several of the cards under test. This is something that I have been promising to do!


Test system

Hardware

  • Puget Systems Peak Single (I used a test-bed system with components that we typically use in the Peak Single configured for Machine Learning)
  • Intel Xeon W-2175 14-core
  • 128GB Memory
  • 2TB Intel 660p NVMe M.2
  • RTX Titan (1-2), 2080Ti (1-4), 2080 (1-4), 2070 (1-4)
  • GTX 1660Ti (1-2), 1080Ti (1-4)
  • Titan V (1-2)

Software

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series of posts: How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning
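
If you already have Docker and the NVIDIA container runtime configured, pulling the container image used for this testing looks roughly like the following. This is a minimal sketch and assumes you have an NGC account and API key (the API key is used as the password for the $oauthtoken username):

# Log in to the NGC registry (enter your NGC API key when prompted for a password)
docker login nvcr.io --username '$oauthtoken'

# Pull the TensorFlow container image used for the job runs in this post
docker pull nvcr.io/nvidia/tensorflow:19.02-py3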


TensorFlow Multi-GPU performance with 1-4 NVIDIA RTX and GTX GPUs

This is all fresh testing using the updates and configuration described above. Hopefully it will give you a comparative snapshot of multi-GPU performance with TensorFlow in a workstation configuration.

CNN (fp32, fp16) and Big LSTM job run batch sizes for the GPUs

Batch size does affect performance, and larger sizes are usually better. The batch size is limited by the amount of memory available on the GPUs. “Reasonable” values that would run without giving “out of memory” errors were used. Multi-GPU jobs used the same batch size settings as the single-GPU jobs since they are set per process. That means the “effective” batch size is a multiple of the per-GPU batch size since the jobs are “data parallel”. For example, the 4-GPU fp16 ResNet-50 run on the RTX 2080 Ti uses a per-GPU batch size of 128, for an effective batch size of 4 × 128 = 512 images per step. The batch size information for the different cards and job types is in the table below.

CNN [ResNet-50] fp32, fp16 and RNN [Big LSTM] job batch sizes for the GPUs tested

GPU            ResNet-50 FP32    ResNet-50 FP16 (Tensor-cores)    Big LSTM
               batch size        batch size                       batch size
RTX Titan      192               384                              640
RTX 2080 Ti    64                128                              448
RTX 2080       64                128                              256
RTX 2070       64                128                              256
GTX 1660 Ti    32                64                               128
Titan V        64                128                              448
GTX 1080 Ti    64                N/A                              448
GTX 1070       64                N/A                              256

TensorFlow CNN: ResNet-50

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3
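
Once inside the container, a quick sanity check (not part of the original benchmark commands) is to confirm that all of the GPUs you expect are visible:

# List the GPUs visible inside the container
nvidia-smi -L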

Example command lines for starting jobs,

# For a single GPU
python resnet.py  --layers=50  --batch_size=64  --precision=fp32

# For multi-GPU (2 GPUs shown)
mpiexec --allow-run-as-root -np 2 python resnet.py  --layers=50  --batch_size=64  --precision=fp32

Notes:

  • Setting --precision=fp16 means “use tensor-cores”.
  • --batch_size= is varied to take advantage of the available memory on the GPUs.
  • Multi-GPU execution in this version of the CNN docker image uses “Horovod” for parallel execution. That means it uses MPI, and in particular the OpenMPI that is included in the container image. The numbers in the charts for 1, 2 and 4 GPUs show very good parallel scaling with Horovod, in my opinion! (A 4-GPU fp16 example command line is shown below.)
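
For reference, a 4-GPU fp16 (Tensor-core) run just combines the options above. This is a sketch using the RTX 2080 Ti batch size from the table; adjust --batch_size for the card you are running on,

# For 4 GPUs with fp16 (Tensor-cores), per-GPU batch size 128 (RTX 2080 Ti)
mpiexec --allow-run-as-root -np 4 python resnet.py  --layers=50  --batch_size=128  --precision=fp16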

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPUs


[ResNet-50 fp16] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPUs


The charts above mostly speak for themselves. One thing to notice for these jobs is that the peer-to-peer communication advantage of using NVLINK has only a small impact. That will not be the case for the LSTM job runs.


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example job command-line,

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 \
--datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
--hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,\
learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,\
state_size=8192,num_sampled=8192,batch_size=448

Note:

  • --num_gpus= and batch_size= are the only parameters changed for the different job runs. (A 4-GPU example is shown below.)
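
As a sketch of what that looks like, here is the same job on 4 GPUs using the RTX 2070 batch size from the table above; only --num_gpus= and batch_size= differ from the 2-GPU example,

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=4 \
--datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
--hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,\
learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,\
state_size=8192,num_sampled=8192,batch_size=256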

[Big LSTM] TensorFlow, Training performance (words/second) with 1-4 NVIDIA RTX and GTX GPUs


Notes:

  • Batch size and GPU-to-GPU (peer-to-peer) communication have a significant impact on performance with this recurrent neural network. The higher-end GPUs have the advantage of both more compute cores and larger memory spaces for data and instructions, as well as the possibility of using high-performance NVLINK for communication.
  • NVLINK significantly improved performance. This improvement was even apparent when using 2 NVLINK pairs with 4 GPUs. I was a little surprised by this since I expected it to bottleneck on the “memcpy” to CPU memory space needed between the remaining non-NVLINK-connected pairs. (A quick way to check how your GPUs are connected is shown below.)
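
If you want to confirm how the GPUs in a system are connected (NVLINK vs. traversing the PCIe/CPU path), nvidia-smi can report the interconnect topology and NVLINK status. A quick check, assuming a reasonably recent NVIDIA driver,

# Show the GPU interconnect topology matrix (NV# entries indicate NVLINK connections)
nvidia-smi topo -m

# Show NVLINK link status for GPU 0
nvidia-smi nvlink --status -i 0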

Happy computing –dbk