Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN? #173

Open

Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN?#173

opened

on Sep 20, 2018

It would be very useful to compare real training performance on amd and nvidia cards.
For Nvidia cards we have a lot of graphs and tests, for example:
https://github.com/u39kun/deep-learning-benchmark
But for AMD cards there is no performance metrics.
It will be great to made direct comparsion between AND and NVIDIA with last cuDNN.

pricebenjamin

If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.

Those scripts were used for the benchmarks shown on TensorFlows website.

I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

This yields the following.

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 182.2 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 182.3 +/- 0.1 (jitter = 0.2)	8.170
20	images/sec: 182.3 +/- 0.1 (jitter = 0.3)	8.247
30	images/sec: 182.1 +/- 0.1 (jitter = 0.3)	8.369
40	images/sec: 182.0 +/- 0.1 (jitter = 0.4)	8.401
50	images/sec: 181.9 +/- 0.1 (jitter = 0.5)	8.147
60	images/sec: 181.8 +/- 0.1 (jitter = 0.6)	8.340
70	images/sec: 181.6 +/- 0.1 (jitter = 0.7)	8.120
80	images/sec: 181.3 +/- 0.2 (jitter = 0.9)	8.415
90	images/sec: 180.5 +/- 0.3 (jitter = 1.1)	8.278
100	images/sec: 179.5 +/- 0.4 (jitter = 1.4)	8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------

For comparison, the same command being run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0')

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 248.6 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 248.6 +/- 0.2 (jitter = 0.6)	8.164
20	images/sec: 248.5 +/- 0.1 (jitter = 0.8)	8.251
30	images/sec: 248.4 +/- 0.1 (jitter = 0.7)	8.355
40	images/sec: 248.3 +/- 0.1 (jitter = 0.6)	8.417
50	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.152
60	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.353
70	images/sec: 248.1 +/- 0.1 (jitter = 0.7)	8.109
80	images/sec: 247.7 +/- 0.1 (jitter = 0.8)	8.405
90	images/sec: 247.5 +/- 0.1 (jitter = 0.9)	8.266
100	images/sec: 247.2 +/- 0.2 (jitter = 1.2)	8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------

Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.

Mandrewoid

@pricebenjamin when I try to run that same script ( I cloned the repo ) I get an import error:

ImportError: No module named 'tensorflow.python.data.experimental'

pricebenjamin

@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.

cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible

Mandrewoid

Nice that seems to have done it. I did not realize mainline TF had already advanced to 1.12 rookie mistake

kazulittlefox

I have tried runnning benchmarks on my environment(Kernel 4.15, ROCm1.9.2, TF1.12 with RX 580).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64)  \ 
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)

result are as follow:

AlexNet        batch:32 397.27/sec
                     batch:64 518.03/sec
InceptionV3 batch:32   47.78/sec
                    batch:64   50.66/sec
googLeNet batch:32 239.28/sec
                   batch:64 256.05/sec
ResNet50   batch:32  86.81/sec
                 batch:64  98.57/sec

In my environment, Vgg16 has not runnning well.

fshi98

I have tested with vega64, ubuntu18.04, ROCm1.9.2, tf1.12:
1 resnet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
1080ti: 212 images/sec (278 fp16)
vega64: 191 images/sec (190 fp16)
2 resnet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
1080ti: 121.14 images/sec (168 fp16)
vega64: 101.15 images/sec (93 fp16), if fp16, --batch_size could be 64, while fp32, 64 will crash
3. inception3: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3
1080ti: 140.08 images/sec (166 fp16)
vega64: 99.02 images/sec (50 fp16)

4 mobilenet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
1080ti: 2865 images/sec
vega64: 462 images/sec

The nv gtx1080 ti was tested on another machine with cuda10, ubuntu 18.04.

There are two values didn't add up:

for mobilenet, the 1080ti result doesn't make sense.
i also tested with --use_fp16, which gives fair amount of speedup for 1080ti. However, for vega64, it ends up slower in all tests if using --use_fp16. This is especially true for inception3.

Considering vega64 supports native half precision and fp16 should be a good selling point for amd vega. how is it slower if using fp16? I guess this is probably due to software support, especially ROCm. Can anyone please test it with --use_fp16 and see if having similar results.

@kazulittlefox my vega runs smoothly with vgg16 @105images/sec

Mandrewoid

@fshi98 that might be because of
#143 (comment)

fshi98

@Mandrewoid Thanks. That may be the reason. However, my rocblas version is 0.14.3.0,
and I tested //tensorflow/python/kernel_tests:batch_matmul_op_test, and passed all 47 tests in 10.653s as in #143
Also, i tested and passed ROCm/rocBLAS#340

This may not be the same error bugs as #143, but may be some performance issues

fshi98

mentioned this

on Dec 1, 2018

fp16 performance issues #256

pricebenjamin

@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.

sebpuetz

#288
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 190.3 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 195.7 +/- 0.9 (jitter = 3.1)	8.123
20	images/sec: 196.4 +/- 0.5 (jitter = 1.8)	8.231
30	images/sec: 196.8 +/- 0.4 (jitter = 1.1)	8.268
40	images/sec: 197.1 +/- 0.3 (jitter = 0.9)	8.355
50	images/sec: 197.2 +/- 0.2 (jitter = 0.8)	8.013
60	images/sec: 197.3 +/- 0.2 (jitter = 0.7)	8.263
70	images/sec: 196.8 +/- 0.3 (jitter = 1.1)	8.304
80	images/sec: 196.9 +/- 0.2 (jitter = 1.1)	8.228
90	images/sec: 196.9 +/- 0.2 (jitter = 0.9)	8.283
100	images/sec: 197.0 +/- 0.2 (jitter = 0.8)	8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Step	Img/sec	total_loss
1	images/sec: 262.9 +/- 0.0 (jitter = 0.0)	8.162
10	images/sec: 261.9 +/- 0.6 (jitter = 0.7)	8.211
20	images/sec: 260.4 +/- 0.6 (jitter = 2.6)	8.375
30	images/sec: 260.6 +/- 0.5 (jitter = 2.6)	8.264
40	images/sec: 259.6 +/- 0.6 (jitter = 3.1)	8.116
50	images/sec: 259.6 +/- 0.5 (jitter = 3.1)	8.169
60	images/sec: 259.9 +/- 0.5 (jitter = 2.6)	8.325
70	images/sec: 259.3 +/- 0.5 (jitter = 3.5)	8.374
80	images/sec: 259.4 +/- 0.4 (jitter = 3.4)	8.041
90	images/sec: 259.3 +/- 0.4 (jitter = 3.6)	8.298
100	images/sec: 259.4 +/- 0.3 (jitter = 3.5)	8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------

This one made the GPU sound like a jet engine:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 216.3 +/- 0.0 (jitter = 0.0)	8.219
10	images/sec: 215.9 +/- 0.3 (jitter = 0.3)	8.289
20	images/sec: 216.0 +/- 0.2 (jitter = 0.3)	8.064
30	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.310
40	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.197
50	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.277
60	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.162
70	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.159
80	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.139
90	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.196
100	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------

FP 16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 288.2 +/- 0.0 (jitter = 0.0)	8.209
10	images/sec: 283.8 +/- 1.1 (jitter = 2.7)	8.189
20	images/sec: 284.0 +/- 0.9 (jitter = 4.6)	8.316
30	images/sec: 284.9 +/- 0.7 (jitter = 4.5)	8.195
40	images/sec: 284.5 +/- 0.6 (jitter = 4.0)	8.180
50	images/sec: 284.3 +/- 0.5 (jitter = 3.7)	8.402
60	images/sec: 285.0 +/- 0.5 (jitter = 4.8)	8.271
70	images/sec: 285.4 +/- 0.4 (jitter = 3.7)	8.134
80	images/sec: 285.7 +/- 0.4 (jitter = 2.7)	8.299
90	images/sec: 286.0 +/- 0.4 (jitter = 1.5)	8.349
100	images/sec: 286.2 +/- 0.3 (jitter = 1.4)	8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------

sebpuetz

mentioned this

on Feb 16, 2019

Question: will the radeon vii be optimized for fp16? Pls reply when You can. :) (i know it's not out yet) #288

sunway513

Hi @sebpuetz , maybe you can also try to enable fusion support :-) Doc is as following:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/rocm-port-overview.md#fusion-support

sebpuetz

Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The displayed temp in rocm-smi went above 90°C on all tests, the rocm-smi output didn't include clocks so I can't tell whether any termal throttling was happening.

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 208.4 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 207.6 +/- 0.5 (jitter = 0.5)	8.124
20	images/sec: 207.7 +/- 0.3 (jitter = 0.5)	8.235
30	images/sec: 207.3 +/- 0.4 (jitter = 0.4)	8.268
40	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.357
50	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.012
60	images/sec: 207.2 +/- 0.3 (jitter = 0.4)	8.248
70	images/sec: 207.1 +/- 0.3 (jitter = 0.4)	8.305
80	images/sec: 207.0 +/- 0.3 (jitter = 0.5)	8.223
90	images/sec: 205.7 +/- 0.9 (jitter = 0.5)	8.322
100	images/sec: 205.7 +/- 0.8 (jitter = 0.5)	8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 273.0 +/- 0.0 (jitter = 0.0)	8.171
10	images/sec: 272.6 +/- 0.9 (jitter = 1.0)	8.223
20	images/sec: 271.5 +/- 1.1 (jitter = 0.9)	8.375
30	images/sec: 272.0 +/- 0.8 (jitter = 0.9)	8.282
40	images/sec: 272.1 +/- 0.6 (jitter = 0.9)	8.122
50	images/sec: 272.1 +/- 0.6 (jitter = 0.8)	8.144
60	images/sec: 272.0 +/- 0.5 (jitter = 0.8)	8.333
70	images/sec: 271.5 +/- 0.5 (jitter = 1.0)	8.357
80	images/sec: 271.2 +/- 0.5 (jitter = 1.3)	8.034
90	images/sec: 271.2 +/- 0.4 (jitter = 1.3)	8.289
100	images/sec: 270.9 +/- 0.4 (jitter = 1.5)	8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 227.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.6 +/- 0.5 (jitter = 2.2)	8.289
20	images/sec: 225.5 +/- 0.4 (jitter = 1.9)	8.068
30	images/sec: 225.7 +/- 0.3 (jitter = 1.8)	8.304
40	images/sec: 225.4 +/- 0.5 (jitter = 1.2)	8.183
50	images/sec: 225.5 +/- 0.4 (jitter = 1.0)	8.261
60	images/sec: 225.6 +/- 0.4 (jitter = 1.1)	8.203
70	images/sec: 225.6 +/- 0.3 (jitter = 1.1)	8.165
80	images/sec: 225.6 +/- 0.3 (jitter = 1.0)	8.168
90	images/sec: 225.7 +/- 0.3 (jitter = 1.0)	8.196
100	images/sec: 225.6 +/- 0.2 (jitter = 1.1)	8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 302.0 +/- 0.0 (jitter = 0.0)	8.213
10	images/sec: 300.2 +/- 0.5 (jitter = 1.5)	8.181
20	images/sec: 298.7 +/- 0.8 (jitter = 2.5)	8.324
30	images/sec: 297.7 +/- 0.8 (jitter = 2.2)	8.197
40	images/sec: 297.7 +/- 0.6 (jitter = 3.0)	8.173
50	images/sec: 297.9 +/- 0.6 (jitter = 3.0)	8.400
60	images/sec: 297.9 +/- 0.5 (jitter = 3.0)	8.267
70	images/sec: 298.4 +/- 0.5 (jitter = 2.8)	8.140
80	images/sec: 298.6 +/- 0.4 (jitter = 2.7)	8.283
90	images/sec: 298.6 +/- 0.4 (jitter = 2.8)	8.337
100	images/sec: 298.7 +/- 0.4 (jitter = 2.6)	8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------

sunway513

Hi @sebpuetz , thanks for the update!
However, the performance numbers seem not right.
Can you provide me the VBIOS version of your board? The following command would do:
/opt/rocm/bin/rocm-smi -v

61 remaining items

sunway513

mentioned this in 2 issues

on Apr 20, 2019

WannaBeOCer

Radeon VII at stock using 18.04 w/ ROCm 2.3. Around a 28% improvement from 2.2

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step    Img/sec total_loss
1       images/sec: 344.8 +/- 0.0 (jitter = 0.0)        8.123
10      images/sec: 345.6 +/- 0.3 (jitter = 0.7)        7.752
20      images/sec: 345.3 +/- 0.3 (jitter = 0.7)        7.913
30      images/sec: 345.3 +/- 0.3 (jitter = 0.5)        7.785
40      images/sec: 345.3 +/- 0.2 (jitter = 0.6)        7.917
50      images/sec: 345.4 +/- 0.2 (jitter = 0.6)        7.874
60      images/sec: 345.4 +/- 0.2 (jitter = 0.6)        7.720
70      images/sec: 345.4 +/- 0.1 (jitter = 0.6)        8.016
80      images/sec: 345.5 +/- 0.1 (jitter = 0.6)        7.773
90      images/sec: 345.6 +/- 0.1 (jitter = 0.6)        7.800
100     images/sec: 345.6 +/- 0.1 (jitter = 0.6)        8.027
----------------------------------------------------------------
total images/sec: 345.29
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step    Img/sec total_loss
1       images/sec: 363.3 +/- 0.0 (jitter = 0.0)        8.117
10      images/sec: 363.8 +/- 0.3 (jitter = 0.7)        7.754
20      images/sec: 363.8 +/- 0.4 (jitter = 1.2)        7.906
30      images/sec: 363.7 +/- 0.3 (jitter = 0.8)        7.780
40      images/sec: 363.8 +/- 0.2 (jitter = 0.9)        7.919
50      images/sec: 363.8 +/- 0.2 (jitter = 0.9)        7.889
60      images/sec: 363.7 +/- 0.2 (jitter = 0.9)        7.726
70      images/sec: 363.7 +/- 0.2 (jitter = 0.8)        8.015
80      images/sec: 363.4 +/- 0.2 (jitter = 1.0)        7.772
90      images/sec: 363.3 +/- 0.2 (jitter = 1.1)        7.816
100     images/sec: 363.4 +/- 0.2 (jitter = 1.0)        8.028
----------------------------------------------------------------
total images/sec: 363.13
----------------------------------------------------------------

sunway513

Hi @WannaBeOCer thank you for posting the numbers. However, it's a bit left than what I'd expected. Could you run the benchmark again after applying the following commands? It'll be helpful if you can provide the performance numbers with batch size 128 as well.

clean your MIOpen cache
rm -rf ~/.cache && rm -rf ~/.config
apply the updated performance databse for Radeon VII
cd ~/ && mkdir -p .config/miopen && cd .config/miopen && wget https://www.dropbox.com/s/yd9v7jtc9aydnfy/gfx906_60.cd.updb.txt && cd ~

WannaBeOCer

@sunway513 Thanks for the update, I applied the changes and I do see a performance uplift.

Batch size of 64 without Fusion:

Done warm up
Step    Img/sec total_loss
1       images/sec: 364.2 +/- 0.0 (jitter = 0.0)        8.119
10      images/sec: 365.6 +/- 0.3 (jitter = 0.8)        7.747
20      images/sec: 365.6 +/- 0.2 (jitter = 0.9)        7.912
30      images/sec: 365.6 +/- 0.1 (jitter = 0.5)        7.791
40      images/sec: 364.8 +/- 0.5 (jitter = 0.7)        7.926
50      images/sec: 365.0 +/- 0.4 (jitter = 0.6)        7.891
60      images/sec: 364.9 +/- 0.4 (jitter = 0.7)        7.703
70      images/sec: 364.9 +/- 0.3 (jitter = 0.7)        7.995
80      images/sec: 364.9 +/- 0.3 (jitter = 0.8)        7.771
90      images/sec: 364.9 +/- 0.2 (jitter = 0.8)        7.819
100     images/sec: 365.0 +/- 0.2 (jitter = 0.8)        8.027
----------------------------------------------------------------
total images/sec: 364.68
----------------------------------------------------------------

Batch size of 128 without Fusion:

Before

Done warm up
Step Img/sec total_loss
1 images/sec: 382.1 +/- 0.0 (jitter = 0.0) 7.876
10 images/sec: 381.4 +/- 0.4 (jitter = 1.1) 7.951
20 images/sec: 381.6 +/- 0.3 (jitter = 0.7) 7.950
30 images/sec: 381.8 +/- 0.2 (jitter = 0.8) 7.942
40 images/sec: 381.7 +/- 0.2 (jitter = 0.7) 7.960
50 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.709
60 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.914
70 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.834
80 images/sec: 381.8 +/- 0.1 (jitter = 0.7) 7.966
90 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.803
100 images/sec: 381.6 +/- 0.1 (jitter = 0.9) 7.756
----------------------------------------------------------------
total images/sec: 381.48
----------------------------------------------------------------

After

Done warm up
Step    Img/sec total_loss
1       images/sec: 399.5 +/- 0.0 (jitter = 0.0)        7.875
10      images/sec: 399.9 +/- 0.1 (jitter = 0.5)        7.956
20      images/sec: 399.8 +/- 0.3 (jitter = 0.5)        7.954
30      images/sec: 399.9 +/- 0.2 (jitter = 0.5)        7.939
40      images/sec: 399.9 +/- 0.2 (jitter = 0.7)        7.950
50      images/sec: 399.8 +/- 0.1 (jitter = 0.6)        7.715
60      images/sec: 399.8 +/- 0.1 (jitter = 0.6)        7.920
70      images/sec: 399.8 +/- 0.1 (jitter = 0.7)        7.833
80      images/sec: 399.8 +/- 0.1 (jitter = 0.7)        7.992
90      images/sec: 399.7 +/- 0.1 (jitter = 0.6)        7.802
100     images/sec: 399.7 +/- 0.1 (jitter = 0.6)        7.784
----------------------------------------------------------------
total images/sec: 399.58
----------------------------------------------------------------

With Fusion:

Done warm up
Step    Img/sec total_loss
1       images/sec: 421.5 +/- 0.0 (jitter = 0.0)        7.878
10      images/sec: 422.0 +/- 0.2 (jitter = 0.6)        7.957
20      images/sec: 421.9 +/- 0.1 (jitter = 0.6)        7.952
30      images/sec: 421.8 +/- 0.1 (jitter = 0.6)        7.946
40      images/sec: 421.5 +/- 0.2 (jitter = 0.6)        7.966
50      images/sec: 421.5 +/- 0.1 (jitter = 0.6)        7.708
60      images/sec: 421.6 +/- 0.2 (jitter = 0.6)        7.910
70      images/sec: 421.7 +/- 0.1 (jitter = 0.7)        7.839
80      images/sec: 421.8 +/- 0.1 (jitter = 0.7)        7.960
90      images/sec: 421.7 +/- 0.1 (jitter = 0.7)        7.801
100     images/sec: 421.6 +/- 0.1 (jitter = 0.7)        7.769
----------------------------------------------------------------
total images/sec: 421.41
----------------------------------------------------------------

sunway513

Thank you @WannaBeOCer , would you mind to post the numbers on FP32, fusion enabled?

WannaBeOCer

@sunway513 Here are the numbers on FP32 with Fusion enabled.

Before

Done warm up
Step    Img/sec total_loss
1       images/sec: 277.0 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 277.7 +/- 0.1 (jitter = 0.2)        7.856
20      images/sec: 277.8 +/- 0.1 (jitter = 0.4)        7.913
30      images/sec: 277.9 +/- 0.1 (jitter = 0.4)        7.731
40      images/sec: 278.0 +/- 0.1 (jitter = 0.5)        7.971
50      images/sec: 278.0 +/- 0.1 (jitter = 0.5)        8.027
60      images/sec: 278.0 +/- 0.0 (jitter = 0.4)        7.890
70      images/sec: 277.9 +/- 0.0 (jitter = 0.4)        7.983
80      images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.799
90      images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.792
100     images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.810
----------------------------------------------------------------
total images/sec: 277.79
----------------------------------------------------------------

After

Done warm up
Step    Img/sec total_loss
1       images/sec: 299.5 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 300.4 +/- 0.1 (jitter = 0.1)        7.856
20      images/sec: 300.5 +/- 0.1 (jitter = 0.2)        7.914
30      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.732
40      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.972
50      images/sec: 300.7 +/- 0.0 (jitter = 0.3)        8.026
60      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.894
70      images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.992
80      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.803
90      images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.786
100     images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.795
----------------------------------------------------------------
total images/sec: 300.51
----------------------------------------------------------------

80 remaining items

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN? #173

61 remaining items

80 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN? #173

Description

Activity

pricebenjamin commented on Nov 8, 2018

Mandrewoid commented on Nov 17, 2018

pricebenjamin commented on Nov 17, 2018

Mandrewoid commented on Nov 17, 2018

kazulittlefox commented on Nov 23, 2018

fshi98 commented on Nov 30, 2018

Mandrewoid commented on Dec 1, 2018

fshi98 commented on Dec 1, 2018

pricebenjamin commented on Feb 16, 2019

sebpuetz commented on Feb 16, 2019

sunway513 commented on Feb 18, 2019

sebpuetz commented on Feb 18, 2019

sunway513 commented on Feb 18, 2019

61 remaining items

WannaBeOCer commented on Apr 20, 2019

sunway513 commented on Apr 20, 2019

WannaBeOCer commented on Apr 20, 2019

sunway513 commented on Apr 20, 2019

WannaBeOCer commented on Apr 21, 2019

80 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions