P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs



In my recent testing with NVIDIA’s new RTX 20xx GPUs I had been focused on application performance. That included testing with the NVLINK bridge for direct GPU-to-GPU communication, i.e. peer-to-peer (P2P). I deliberately used an older version of NVIDIA’s Docker image for the “CNN” TensorFlow test application so that I would be using code linked with the NCCL library, giving multi-GPU capability and taking advantage of P2P communication. The results on dual RTX 2080Ti showed only a modest, ~7%, performance increase when using NVLINK.

NVLINK provides an impressive bidirectional bandwidth of nearly 95 GB/sec! “Normal” mem-copy provides 11.5 GB/s bidirectional bandwidth on the RTX 2080Ti. That is roughly 8 times less than with NVLINK but …

…NVLINK didn’t make as large an impact as I had expected. I assumed that was partly because “normal” P2P over the PCIe bus was already pretty effective and the 4-5 fold bandwidth increase with NVLINK just didn’t matter that much. What I hadn’t noticed was that the communication bandwidth difference was actually closer to a factor of eight! I had assumed that P2P worked as usual over PCIe with the new RTX cards, but in fact it is disabled on the new RTX GPUs unless you have the NVLINK bridge attached.

It was pointed out in the comments on one of my testing posts that P2P was disabled on RTX 20xx cards. That surprised me, since I had looked at test output that showed it but hadn’t noticed the anomaly. I was just focused on application performance during the short time I had with the cards!

I have used the term “disabled”; however, I don’t know that P2P over PCIe is actually disabled. This may just be a design trade-off in the Turing architecture. It is a complicated GPU with new features for Ray-Tracing, etc. I’m sure there were engineering decisions that had to be made to accommodate the many uses for these cards. P2P is only relevant for multi-GPU applications, and most applications do not specifically use it, even compute-focused ones. For a detailed discussion of the Turing architecture see the (86 page!) white paper “NVIDIA Turing GPU Architecture”. (I have not read through this document, yet.)

I have only tested with 2 GPUs. There may be more impact from the lack of P2P when 4 or more GPUs are being used. For code whose performance depends on GPU-GPU communication, more cards means more possibility of memory access contention when using mem-copy. I will be testing with 4 cards in a few weeks and will report my findings. I suspect that “real” application performance will not be heavily impacted. The other consideration is that NVLINK on the RTX cards only supports two devices. So, if you have code that uses P2P and depends on it for performance, you will get great performance with two cards but could have difficulty with more GPUs.


What is NVIDIA CUDA Peer-to-Peer (P2P)?

OK, so what is P2P? In a very simplified description, P2P is functionality in NVIDIA GPUs that allows CUDA programs to access and transfer data from one GPU’s memory to another without having to go through a shared pool of system memory attached to a CPU. It has been a feature of NVIDIA GPUs for 8 or 9 years. P2P together with “Unified Virtual Addressing” (UVA) were big improvements in CUDA for GPU computing. They allowed efficient use of multi-GPU and multi-node system environments and simplified programming for highly parallel code.
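
To make that concrete, here is a minimal CUDA runtime sketch (my own illustration, not code from my testing) of what P2P looks like to a programmer: query whether two devices can access each other, enable peer access, and copy a buffer directly from GPU 0 to GPU 1. The 64MB buffer size and device ordinals are arbitrary. This is essentially what the simpleP2P sample shown in the appendix does.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Can device 0 access device 1's memory (and vice versa)?
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %s, 1->0: %s\n", can01 ? "yes" : "no", can10 ? "yes" : "no");

    const size_t bytes = 64 << 20;   // 64 MB test buffers
    float *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // Peer access is enabled per direction, from the current device to a peer.
    if (can01) { cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0); }
    if (can10) { cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0); }

    // With peer access enabled this copy goes directly over PCIe or NVLINK;
    // without it, the runtime falls back to staging through system memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(1); cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}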

On a workstation with 2-4 GPUs, P2P and UVA can give a modest performance improvement for some programs. For large GPU-accelerated supercomputers it allows a GPU on one node (system) to access memory on a GPU on another node by using RDMA (Remote Direct Memory Access) over a high-speed network like InfiniBand. That is very important for massively parallel supercomputing and is driving the performance of the fastest computer systems in the world. On workstations it is significant but not nearly as important, since standard mem-copy back and forth through a CPU’s memory pool is reasonably efficient. Note: this is exactly why a single-CPU system is often a better choice for GPU-accelerated workstations; a dual-CPU system can have slow-downs caused by memory transfers from one CPU’s memory space to a GPU that is attached to PCIe lanes on the other CPU!

Following are diagrams sourced from older NVIDIA developer blogs that give a good visual representation of P2P and UVA.


This diagram shows how memory is transferred using a shared pool of memory (“SysMem”) that is attached to the CPU. “Chipset” would be the PCIe bus.

P2P direct


Here is an illustration of two types of P2P communication, Access and Transfer.

P2P access


For completeness, this diagram shows how Unified Virtual Addressing (UVA) appears to the CPU and GPUs. You should understand that UVA is largely a convenience for programmers. It makes memory management easier by providing a single memory address space and reduces the amount of code that has to be written. On a single workstation it probably doesn’t improve performance. UVA is not the same thing as “memory-pooling”. I believe memory-pooling is something that some NVIDIA Quadro cards can do to provide a large graphics frame-buffer across multiple cards. (I’m sure someone will point it out in the comments if I’m wrong.)

UVA
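
As a small illustration (again my own sketch, not from NVIDIA’s posts), UVA is what lets the CUDA runtime figure out which device a pointer belongs to, so a plain cudaMemcpy with cudaMemcpyDefault works between devices without the programmer specifying a transfer direction. The buffer size here is arbitrary.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;    // 1 MB, arbitrary
    float *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // With UVA, the runtime can tell which device owns an address.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d0);
    printf("d0 resides on device %d\n", attr.device);

    // cudaMemcpyDefault: the copy direction is inferred from the
    // unified addresses instead of being specified by the programmer.
    cudaMemcpy(d1, d0, bytes, cudaMemcpyDefault);
    cudaDeviceSynchronize();

    cudaSetDevice(1); cudaFree(d1);
    cudaSetDevice(0); cudaFree(d0);
    return 0;
}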


Look at the diagrams above and replace PCIe with NVLINK. NVLINK provides the same functionality as PCIe but with lower latency and higher bandwidth. NVLINK bandwidth is about 4-5 times that of PCIe v3 X16. However, keep in mind that NVLINK only comes into play during memory access or transfer between GPUs (except on IBM Power, where it connects to the CPUs too). Good parallel code will try to minimize “communication”. NVLINK is definitely a good thing, but its high performance may not have as large an impact as you might think unless communication is the major bottleneck in your code. [NVLINK and UVA become a significant performance boost when the communication is over network fabric between nodes in a cluster.]
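
For reference, the synthetic bandwidth numbers in the results below come from NVIDIA’s p2pBandwidthLatencyTest CUDA sample (its output is in the appendix). A stripped-down, unidirectional sketch of the same idea, timing repeated peer copies with CUDA events, would look roughly like the following. The buffer size and repetition count are arbitrary, peer access is assumed to have been enabled as in the earlier sketch, and the real sample also runs copies in both directions at once for the bidirectional numbers.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // 64 MB per copy
    const int    reps  = 100;
    float *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);   // GPU0 -> GPU1 on the default stream
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb_per_s = (double)bytes * reps / (ms / 1000.0) / 1.0e9;
    printf("GPU0 -> GPU1: %.1f GB/s over %d copies\n", gb_per_s, reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaSetDevice(1); cudaFree(dst);
    cudaSetDevice(0); cudaFree(src);
    return 0;
}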

Note: We should see PCIe version 4 chip-sets from Intel and AMD, along with motherboards that support it, by the end of 2019. That will be a significant improvement over the current PCIe v3.


Test setup and Results

The most important thing to keep in mind with the results that follow is that I am only looking at P2P and NVLINK with two GPUs. With 4 or more GPUs, the lack of P2P will possibly have more impact. NVLINK on all of the RTX cards only supports a connection between two devices.

Testing hardware and software

Hardware (My personal Workstation)

  • Puget Systems Peak Single
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPUs:
  • 2 x GTX 1080Ti
  • 2 x RTX 2080Ti

Software

Two TensorFlow builds were used, since the latest version of the TensorFlow Docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the “Big LSTM billion word” model training I use the latest container, with TensorFlow 1.10 linked against CUDA 10.0. Both of the test programs are from “nvidia-examples” in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post, along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning.


Results Summary

The following two tables present a condensed summary of the main results of the P2P impact testing. There is an appendix with the (trimmed) raw output data.

Direct “Synthetic” Measurement of P2P Performance for 2 x RTX 2080Ti and 2 x GTX 1080Ti

Test                                                2 x RTX 2080Ti   2 x 2080Ti + NVLINK   2 x GTX 1080Ti
P2P Fabric                                          NONE             NVLINK                PCIe
Bidirectional Bandwidth, P2P Enabled                11.5 GB/s        93.6 GB/s             20.3 GB/s
Bidirectional Bandwidth, P2P Disabled (“memcopy”)   11.5 GB/s        11.3 GB/s             20.2 GB/s
GPU-GPU Latency, P2P Enabled                        12.5 us          1.7 us                1.2 us
GPU-GPU Latency, P2P Disabled                       11 us            11.9 us               11 us

Notes:

  • The numbers in these results generally have a 10-15% error margin.
  • The bandwidth for the 2080Ti’s is closer to what would be expected with GPUs connected to PCIe X8 slots. All of the tests were run with the cards in PCIe X16 slots!
  • The bandwidth for the 1080Ti’s was invariant to enabling P2P but the latency showed significant improvement.

The next results are using the machine learning applications that I have been using recently in my other GPU performance posts.

  • CNN – A Convolutional Neural Network job measuring training performance for the ResNet-50 model.
  • BigLSTM – A Long Short-Term Memory network training job on a billion-word corpus.

Application Measurement of P2P Performance for 2 x RTX 2080Ti and 2 x GTX 1080Ti — CNN and LSTM Deep-Learning

Test                             2 x RTX 2080Ti    2 x 2080Ti + NVLINK    2 x GTX 1080Ti
CNN (ResNet-50, train)           476.6 images/s    490.1 images/s         366.8 images/s
CNN (ResNet-50, train) fp16      735.4 images/s    760.9 images/s         N/A
Big-LSTM (billion word, train)   15496 words/s     16753 words/s          11462 words/s

Notes:

  • fp16 means “using Tensor cores”. The GTX 10xx GPUs do not have Tensor cores available.

Conclusions

The bottom line is that NVLINK and P2P have amazing performance when measured directly, but the impact on application performance is likely to be minimal for most multi-GPU programs running on a workstation. However, that is not always going to be the case. I don’t know of any programs that are severely limited by P2P, but YOU may have code that you know is! In that case a 2-GPU system with NVLINK may give you great performance. However, if you want to use 4 or more GPUs, then the RTX GPUs may not be so great if you are dependent on P2P. This is especially true since it looks like “normal” mem-copy is slower with the RTX GPUs. I will be testing a variety of GPUs in configurations up to 4 GPUs. I will be particularly cognizant of the effects of P2P, or rather, the lack of it.

I also did all of the testing in this post on a pair of RTX Titan GPUs, and the results were similar to what I observed with the 2080Ti. (I’ll be writing an RTX Titan post soon.) There is no magic greatness with the Titan over the 20xx cards … but the performance is pretty good, and 24GB of memory is nice too!

My colleague William George is doing some testing with RTX Quadro cards right now and it looks like there is no magic there either as far as P2P goes. You may see some comments in this post from him.

I think what we are seeing is just part of the design trade-offs of the Turing architecture. This is a great GPU with some very interesting and innovative features. I am impressed with the compute performance and feel they are a nice improvement over the 10xx cards. I’m looking forward to seeing what developers do with the Ray-Tracing features too!

When I get my comprehensive multi-GPU testing done I’ll have more to say about Turing GPUs for compute. From what I’ve seen so far I’m pleased with the performance.

Comments are welcome!

Happy computing –dbk

Appendix: Raw output (somewhat trimmed for space)

If you want details on what I did and what I saw during this testing the following output has my command lines and program output. Enjoy!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 1080Ti ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/Documents/Puget/blog-posts$ nvidia-smi
Fri Jan  4 14:32:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:65:00.0  On |                  N/A |
| 28%   34C    P8    12W / 250W |    146MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 28%   25C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce GTX 1080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce GTX 1080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce GTX 1080 Ti (GPU0) supports UVA: Yes
> GeForce GTX 1080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 9.67GB/s

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 351.91  11.48
     1  11.52 354.79
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 352.55  10.40
     1  10.38 355.11
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 355.00  20.18
     1  20.14 356.41
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 354.62  20.30
     1  20.28 355.28
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.46  10.85
     1  12.11   1.63

   CPU     0      1
     0   3.04   7.42
     1   7.32   2.99
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.46   1.17
     1   1.22   1.63

   CPU     0      1
     0   3.23   2.02
     1   2.02   3.05

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@ae162b83b0ff:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:13:01.204755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.61GiB
2019-01-04 23:13:01.390387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b3:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-01-04 23:13:01.391392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:13:01.394017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:13:01.394025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-04 23:13:01.394029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-04 23:13:01.394037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-01-04 23:13:01.394059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    ...
    47     1   365.6  12.486 0.10000
    48     1   365.7  12.407 0.10000
    49     1   368.1  12.393 0.10000
    50     1   367.4  12.333 0.10000
----------------------------------------------------------------
Images/sec: 366.8 +/- 0.3 (jitter = 1.0)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@96f4797b1511:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 02:00:25.535364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 02:00:25.535371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y
2019-01-05 02:00:25.535377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N
2019-01-05 02:00:25.538173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10254 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-01-05 02:00:25.640711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00033-of-00100
Finished processing!
Iteration 1, time = 20.76s, wps = 863, train loss = 13.0247
Iteration 2, time = 17.34s, wps = 1033, train loss = 12.9863
Iteration 3, time = 1.56s, wps = 11452, train loss = 12.9114
Iteration 4, time = 1.57s, wps = 11430, train loss = 12.8277
Iteration 5, time = 1.57s, wps = 11418, train loss = 12.6549
Iteration 6, time = 1.57s, wps = 11436, train loss = 11.7663
Iteration 7, time = 1.56s, wps = 11462, train loss = 26.3362

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 2080Ti ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~$ nvidia-smi
Fri Jan  4 15:35:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0  On |                  N/A |
| 41%   38C    P0    63W / 260W |    169MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 41%   32C    P8    11W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : No
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     0
     1	     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.74   5.79
     1   5.82 532.37
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 530.74   5.79
     1   5.81 532.32
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.21  11.52
     1  11.57 535.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 526.45  11.57
     1  11.50 527.09
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.89  11.44
     1  14.94   1.31

   CPU     0      1
     0   3.04   7.11
     1   7.70   2.83
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.89  16.14
     1  12.53   1.33

   CPU     0      1
     0   3.05   7.19
     1   7.26   2.82

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@f11011faf31d:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:46:03.045773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-04 23:46:03.299061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-04 23:46:03.299165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:46:03.301627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:46:03.301636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N
2019-01-04 23:46:03.301640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y
2019-01-04 23:46:03.301648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-04 23:46:03.301654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    47     1   477.9  11.907 0.10000
    48     1   472.7  11.862 0.10000
    49     1   474.7  11.885 0.10000
    50     1   474.9  11.872 0.10000
----------------------------------------------------------------
Images/sec: 476.6 +/- 0.5 (jitter = 2.7)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@f11011faf31d:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=128 --num_gpus=2 --fp16
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=128
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  256 global
             128 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:56:44.953679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-04 23:56:45.217624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-04 23:56:45.217745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:56:45.217827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:56:45.217834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N
2019-01-04 23:56:45.217840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y
2019-01-04 23:56:45.217849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-04 23:56:45.217856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   733.9  10.043 0.10000
    47     1   733.1  10.038 0.10000
    48     1   733.2  10.020 0.10000
    49     1   734.2  10.033 0.10000
    50     1   734.4  10.034 0.10000
----------------------------------------------------------------
Images/sec: 735.4 +/- 0.6 (jitter = 3.3)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@98756a012bec:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 01:44:32.742663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 01:44:32.742672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N
2019-01-05 01:44:32.742677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N
2019-01-05 01:44:32.745709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10008 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 01:44:32.839719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10171 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00097-of-00100
Finished processing!
Iteration 1, time = 20.43s, wps = 877, train loss = 13.0220
Iteration 2, time = 17.73s, wps = 1011, train loss = 12.9759
Iteration 3, time = 1.17s, wps = 15269, train loss = 12.9492
Iteration 4, time = 1.16s, wps = 15486, train loss = 12.8480
Iteration 5, time = 1.18s, wps = 15229, train loss = 12.6204
Iteration 6, time = 1.16s, wps = 15496, train loss = 11.6590
Iteration 7, time = 1.15s, wps = 15548, train loss = 29.8347
Iteration 8, time = 1.16s, wps = 15439, train loss = 55.7615

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 2080Ti + NVLINK +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~$ nvidia-smi  nvlink -c
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-10f085c9-3950-42ce-cca9-e3a7f64520d9)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
	 Link 1, P2P is supported: true
	 Link 1, Access to system memory supported: true
	 Link 1, P2P atomics supported: true
	 Link 1, System memory atomics supported: true
	 Link 1, SLI is supported: true
	 Link 1, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-463f5ae1-b594-9afc-359f-def62ca73137)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
	 Link 1, P2P is supported: true
	 Link 1, Access to system memory supported: true
	 Link 1, P2P atomics supported: true
	 Link 1, System memory atomics supported: true
	 Link 1, SLI is supported: true
	 Link 1, Link is supported: false

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : Yes
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 Ti (GPU0) supports UVA: Yes
> GeForce RTX 2080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 43.58GB/s

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.58   5.78
     1   5.82 533.16
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 528.50  46.95
     1  46.97 532.39
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 533.91  11.32
     1  11.29 536.62
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.74  93.59
     1  93.73 535.10
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.88  11.80
     1  12.15   1.82

   CPU     0      1
     0   2.93   7.74
     1   7.69   2.97
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.87   1.73
     1   1.72   1.82

   CPU     0      1
     0   3.08   2.16
     1   2.14   2.96

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6e8ef3f22155:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-05 00:24:38.886533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-05 00:24:39.140620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-05 00:24:39.140675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-05 00:24:39.140687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-05 00:24:39.140693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-05 00:24:39.140697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-05 00:24:39.140705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:24:39.140712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   489.6  11.960 0.10000
    47     1   492.0  11.914 0.10000
    48     1   499.8  11.869 0.10000
    49     1   498.0  11.888 0.10000
    50     1   499.9  11.874 0.10000
----------------------------------------------------------------
Images/sec: 490.1 +/- 0.6 (jitter = 1.2)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6e8ef3f22155:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=128 --num_gpus=2 --fp16
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=128
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  256 global
             128 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-05 00:26:04.149449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-05 00:26:04.400957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-05 00:26:04.401010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-05 00:26:04.401019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-05 00:26:04.401023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-05 00:26:04.401027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-05 00:26:04.401036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:26:04.401042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   760.5  10.059 0.10000
    47     1   759.2  10.051 0.10000
    48     1   761.7  10.032 0.10000
    49     1   753.6  10.044 0.10000
    50     1   763.0  10.044 0.10000
----------------------------------------------------------------
Images/sec: 760.9 +/- 0.5 (jitter = 2.8)
----------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6a2b08b4e5cd:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 00:43:00.303275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 00:43:00.303281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y
2019-01-05 00:43:00.303285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N
2019-01-05 00:43:00.306050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10014 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:43:00.400029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10171 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00074-of-00100
Finished processing!
Iteration 1, time = 20.42s, wps = 877, train loss = 12.9972
Iteration 2, time = 16.79s, wps = 1067, train loss = 12.9753
Iteration 3, time = 1.10s, wps = 16339, train loss = 12.9377
Iteration 4, time = 1.10s, wps = 16297, train loss = 12.8125
Iteration 5, time = 1.07s, wps = 16732, train loss = 12.6209
Iteration 6, time = 1.10s, wps = 16316, train loss = 11.4865
Iteration 7, time = 1.07s, wps = 16753, train loss = 33.7183
Iteration 8, time = 1.08s, wps = 16561, train loss = 61.7707
Iteration 9, time = 1.07s, wps = 16673, train loss = 20.1651