Open
Description
It would be very useful to compare real training performance on AMD and NVIDIA cards.
For Nvidia cards we have a lot of graphs and tests, for example:
https://github.com/u39kun/deep-learning-benchmark
But for AMD cards there are no performance metrics.
It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.
Activity
pricebenjamin commented on Nov 8, 2018
If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the `benchmarks/scripts/tf_cnn_benchmarks` directory provides some example usage. Those scripts were used for the benchmarks shown on TensorFlow's website.
I've run the following on a Vega FE (`tensorflow-rocm==1.11.0` and `rocm-dkms==1.9.211`). This yields the following.
For comparison, the same command was run on a Tesla P100-PCIE-16GB (`CUDA==9.2`, `cuDNN==7.1.4`, and `tf.__version__ == '1.11.0'`). Bear in mind, I haven't done anything to try to optimize performance on the Vega FE. These are essentially "out-of-the-box" results.
Mandrewoid commented on Nov 17, 2018
@pricebenjamin when I try to run that same script (I cloned the repo), I get an import error:
ImportError: No module named 'tensorflow.python.data.experimental'
pricebenjamin commented on Nov 17, 2018
@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.
`cd /path/to/benchmarks`
`git checkout cnn_tf_v1.11_compatible`
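The compatibility branches follow a recognizable naming pattern, so matching an installed TensorFlow version to a branch can be scripted. A small sketch, assuming the `cnn_tf_vX.Y_compatible` pattern seen above holds for the release in question (the helper name is hypothetical):

```python
# Sketch: derive the benchmarks-repo branch name matching an installed
# TensorFlow version, assuming the cnn_tf_vX.Y_compatible naming pattern
# shown above applies to that release.
def benchmarks_branch(tf_version: str) -> str:
    major, minor = tf_version.split(".")[:2]
    return f"cnn_tf_v{major}.{minor}_compatible"

print(benchmarks_branch("1.11.0"))  # cnn_tf_v1.11_compatible
print(benchmarks_branch("1.12.0"))  # cnn_tf_v1.12_compatible
```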
Mandrewoid commented on Nov 17, 2018
Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12; rookie mistake.
kazulittlefox commented on Nov 23, 2018
I have tried running the benchmarks in my environment (kernel 4.15, ROCm 1.9.2, TF 1.12 with an RX 580).
The results are as follows:
In my environment, VGG16 has not been running well.
fshi98 commented on Nov 30, 2018
I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, TF 1.12:
1. resnet50: `python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50`
   - 1080 Ti: 212 images/sec (278 fp16)
   - Vega 64: 191 images/sec (190 fp16)
2. resnet101: `python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101`
   - 1080 Ti: 121.14 images/sec (168 fp16)
   - Vega 64: 101.15 images/sec (93 fp16); with fp16, `--batch_size` can be 64, while with fp32, 64 will crash
3. inception3: `python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3`
   - 1080 Ti: 140.08 images/sec (166 fp16)
   - Vega 64: 99.02 images/sec (50 fp16)
4. mobilenet: `python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet`
   - 1080 Ti: 2865 images/sec
   - Vega 64: 462 images/sec
The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10 and Ubuntu 18.04.
There are two values that didn't add up:
Considering that Vega 64 supports native half precision and fp16 should be a good selling point for AMD Vega, how is it slower when using fp16? I guess this is probably due to software support, especially ROCm. Can anyone please test it with `--use_fp16` and see if you get similar results?
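For reference, the fp16 speedups implied by the numbers above can be computed directly. A quick sketch (the throughput figures are copied from the results posted in this thread):

```python
# Throughput numbers (images/sec) copied from the results posted above.
results = {
    # model: (1080ti_fp32, 1080ti_fp16, vega64_fp32, vega64_fp16)
    "resnet50":   (212.0, 278.0, 191.0, 190.0),
    "resnet101":  (121.14, 168.0, 101.15, 93.0),
    "inception3": (140.08, 166.0, 99.02, 50.0),
}

for model, (nv32, nv16, amd32, amd16) in results.items():
    print(f"{model}: 1080 Ti fp16 speedup {nv16 / nv32:.2f}x, "
          f"Vega 64 fp16 speedup {amd16 / amd32:.2f}x")
```

On these numbers, fp16 gives the 1080 Ti roughly a 1.2-1.4x boost, while the Vega 64 gets slightly slower (or, for inception3, about twice as slow), which matches the anomaly described above.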
@kazulittlefox my Vega runs smoothly with VGG16 at 105 images/sec.
Mandrewoid commented on Dec 1, 2018
@fshi98 that might be because of
#143 (comment)
fshi98 commented on Dec 1, 2018
@Mandrewoid Thanks. That may be the reason. However, my rocBLAS version is 0.14.3.0,
and I tested `//tensorflow/python/kernel_tests:batch_matmul_op_test`, which passed all 47 tests in 10.653s, as in #143.
I also tested and passed ROCm/rocBLAS#340.
This may not be the same bug as #143, but there may be some performance issue.
pricebenjamin commented on Feb 16, 2019
@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.
sebpuetz commented on Feb 16, 2019
#288
Radeon VII
- `rocm==2.1.96`, installed through apt
- `tensorflow==1.12`, installed through pip
- no further tuning

FP16:
This one made the GPU sound like a jet engine:
FP16:
sunway513 commented on Feb 18, 2019
Hi @sebpuetz, maybe you can also try enabling fusion support :-) The doc is as follows:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/rocm-port-overview.md#fusion-support
sebpuetz commented on Feb 18, 2019
Improvements across the board with `TF_ROCM_FUSION_ENABLE=1`. The displayed temp in `rocm-smi` went above 90°C on all tests; the `rocm-smi` output didn't include clocks, so I can't tell whether any thermal throttling was happening.
sunway513 commented on Feb 18, 2019
Hi @sebpuetz, thanks for the update!
However, the performance numbers don't seem right.
Can you provide the VBIOS version of your board? The following command will do:
/opt/rocm/bin/rocm-smi -v
WannaBeOCer commented on Apr 20, 2019
Radeon VII at stock, using Ubuntu 18.04 with ROCm 2.3. Around a 28% improvement from 2.2.
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
sunway513 commented on Apr 20, 2019
Hi @WannaBeOCer, thank you for posting the numbers. However, it's a bit less than what I'd expected. Could you run the benchmark again after applying the following commands? It would be helpful if you could provide the performance numbers with batch size 128 as well.
rm -rf ~/.cache && rm -rf ~/.config
cd ~/ && mkdir -p .config/miopen && cd .config/miopen && wget https://www.dropbox.com/s/yd9v7jtc9aydnfy/gfx906_60.cd.updb.txt && cd ~
WannaBeOCer commented on Apr 20, 2019
@sunway513 Thanks for the update; I applied the changes and I do see a performance uplift.
Batch size of 64 without Fusion:
Batch size of 128 without Fusion:
Before
After
With Fusion:
sunway513 commented on Apr 20, 2019
Thank you @WannaBeOCer, would you mind posting the numbers for FP32 with fusion enabled?
WannaBeOCer commented on Apr 21, 2019
@sunway513 Here are the numbers on FP32 with Fusion enabled.
Before
After