In the previous article, I did the initial setup of the hardware and software platform:

拓荒犬: Building a Machine Learning Platform on an All-AMD (3A) Platform (Part 1): Initial Setup of the Hardware and Software (zhuanlan.zhihu.com)

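With the platform in place, I wanted to get a sense of its overall performance, so I decided to run some benchmarks. Since TensorFlow has the largest user base, I focused on it and did not have the energy to experiment with other frameworks.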

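I learned that TensorFlow has its own official benchmark suite, tensorflow/benchmarks. Its tf_cnn_benchmarks script covers models such as resnet50, resnet152, inception3, vgg16, googlenet and alexnet: you pass a few parameters, the test starts, and it reports the platform's throughput.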
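To make "pass a few parameters" concrete, here is a minimal sketch that assembles the tf_cnn_benchmarks command line for a given model and batch size, and optionally sets the TF_ROCM_FUSION_ENABLE environment variable that appears later in this article. It only uses flags that appear in this article; the helper names `build_command` and `run` are hypothetical, and it assumes you run it from inside the tf_cnn_benchmarks directory with TensorFlow installed.

```python
import os
import shlex
import subprocess


def build_command(model, batch_size, num_gpus=1, fusion=False):
    """Return (argv, env) for one tf_cnn_benchmarks.py run.

    Hypothetical helper: only flags used in this article are set.
    """
    argv = [
        "python",
        "tf_cnn_benchmarks.py",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
        f"--model={model}",
    ]
    env = dict(os.environ)  # inherit the current environment
    if fusion:
        # tensorflow-rocm switch for operator fusion (see its "Fusion Support" notes)
        env["TF_ROCM_FUSION_ENABLE"] = "1"
    return argv, env


def run(model, batch_size, fusion=False):
    """Run one benchmark and return its stdout (needs TensorFlow and the benchmarks checkout)."""
    argv, env = build_command(model, batch_size, fusion=fusion)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Print the commands for a small sweep instead of executing them
    for model, batch_size in [("resnet50", 64), ("resnet50", 32), ("alexnet", 512)]:
        argv, _ = build_command(model, batch_size)
        print(shlex.join(argv))
```

Keeping argv as a list (rather than one shell string) avoids quoting problems, and passing `env` explicitly means the fusion switch only affects the runs that ask for it.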

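Since this was my first time using TensorFlow, I ran into quite a few pitfalls, many of them caused by my own inexperience, and I record all of them here. If you only want to see how the VEGA64 performs in the benchmarks, skip straight to the results at the end.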


The first problem: after cloning TensorFlow benchmarks from GitHub, running the following command failed with an error saying a file was missing.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

The error said the gradients_util file was missing.

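I then noticed that the TensorFlow repository on GitHub does contain gradients_util.py, so I suspected that AMD's tensorflow-rocm package was incomplete and started adding the missing files by hand. But every time I added one file, the error moved on to another missing file. After several rounds of this I decided it was not a viable approach, and wondered whether installing tensorflow-gpu could fill in the missing files. I knew that tensorflow-gpu explicitly requires CUDA and has to run on NVIDIA GPUs, but I tried it anyway, just to see.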

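Installing tensorflow-gpu just to give it a try.
As expected, once installed it failed with an error saying the corresponding CUDA 9.0 was missing.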

So I had to uninstall tensorflow-gpu and look for other benchmarks that could run on an AMD platform. Searching GitHub for "AMD TensorFlow", I actually found this project:

noxouille/nvamdbench (github.com)

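This project also uses the official TensorFlow benchmarks, but it runs them in ROCm's Docker image, so would no files be missing this time, because the two versions match each other? In fact, I had already thought about matching the TensorFlow and benchmark versions before installing tensorflow-gpu. There is a pitfall here, which I will come back to later.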

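I had never used Docker before, but from a quick read of its introduction I understood it roughly as a packaged image that is leaner and more efficient than a traditional image; I will study it in more depth when I have time. I installed Docker and pulled rocm/tensorflow. The image is fairly large: about 2 GB to download and 5.21 GB once extracted.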
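For reference, a container only sees the AMD GPU if the ROCm device nodes are passed into it. The sketch below assembles such a `docker run` command for the rocm/tensorflow image. The `--device=/dev/kfd`, `--device=/dev/dri` and `--group-add video` options follow ROCm's container instructions as I understand them; the helper name, the bind-mount and the `/root/benchmarks` path are assumptions for illustration, so check the image's own documentation for your ROCm version.

```python
import shlex


def docker_run_argv(image="rocm/tensorflow", host_dir=None, container_dir="/root/benchmarks"):
    """Build a `docker run` argv that exposes the AMD GPU to the container.

    Assumption: ROCm containers need /dev/kfd (compute interface),
    /dev/dri (render nodes) and membership in the `video` group.
    """
    argv = [
        "docker", "run", "-it",
        "--device=/dev/kfd",
        "--device=/dev/dri",
        "--group-add", "video",
    ]
    if host_dir is not None:
        # Hypothetical bind mount so a host checkout of the benchmarks is visible inside
        argv += ["-v", f"{host_dir}:{container_dir}"]
    argv.append(image)
    return argv


if __name__ == "__main__":
    print(shlex.join(docker_run_argv(host_dir="/home/user/benchmarks")))
```

Mounting the benchmarks checkout from the host keeps it outside the image, so switching branches does not require rebuilding or re-pulling anything.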

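With Docker set up, the test scripts finally worked and I ran the benchmarks. Checking the GPU with rocm-smi, I saw that it was not consistently at 100% utilization; this may be due to Docker, and I will look into it later. At first it also seemed that running in Docker was slower than using TensorFlow directly. However, when I later compared the Docker results with those from TensorFlow installed directly in Python, the performance was nearly identical: in some tests the Docker numbers were even a few points higher, but overall the two were about the same. So Docker really is useful. If installing TensorFlow on ROCm yourself seems like too much trouble, or you are afraid of messing up your environment, you can pull a ROCm Docker image and see how it performs first.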

With results in hand, I wanted to compare them with results others had posted on ROCm's GitHub: Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN? · Issue #173 · ROCmSoftwarePlatform/tensorflow-upstream

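While reading through other people's results, I found someone who had hit the same ImportError as me. The suggested fix was to run git checkout cnn_tf_v1.11_compatible in the benchmarks directory, i.e. check out the branch that matches your TensorFlow version. I actually already knew this, because the tf_cnn_benchmarks README states explicitly that the versions must match: the latest benchmarks on GitHub target the tf-nightly-gpu build.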

Note that the master branch of tf_cnn_benchmarks requires the latest nightly version of TensorFlow. You can install the nightly version by running pip install tf-nightly-gpu in a clean environment, or by installing TensorFlow from source. We sometimes will create a branch of tf_cnn_benchmarks, in the form of cnn_tf_vX.Y_compatible, that is compatible with TensorFlow version X.Y. For example, branch cnn_tf_v1.9_compatible works with TensorFlow 1.9.
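The naming scheme quoted above is regular enough to derive mechanically. A minimal sketch, assuming the version string looks like `1.12.0` (as `tensorflow.__version__` typically does); the function name is hypothetical, and whether a branch exists for a given X.Y still has to be checked on GitHub.

```python
import re


def compatible_branch(tf_version):
    """Map a TensorFlow version string to a tf_cnn_benchmarks branch name.

    Follows the cnn_tf_vX.Y_compatible naming from the README; the branch
    may not exist for every X.Y, so verify it on GitHub.
    """
    match = re.match(r"(\d+)\.(\d+)", tf_version)
    if match is None:
        raise ValueError(f"unrecognized TensorFlow version: {tf_version!r}")
    major, minor = match.groups()
    return f"cnn_tf_v{major}.{minor}_compatible"


if __name__ == "__main__":
    # In practice you would pass tensorflow.__version__ here
    print(compatible_branch("1.12.0"))  # cnn_tf_v1.12_compatible
```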

So what had I done at the time? I switched to the v1.12 branch on the web page, then simply copied the repository address and cloned it.

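I have no idea what I was thinking: I never noticed that the address did not even contain a version number. I am just too unfamiliar with GitHub. I rarely use it and only know the basic clone and pull, and I even forgot that switching versions requires git checkout.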

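I immediately used git checkout to switch the benchmarks to the v1.12 branch, and sure enough the error was gone and everything ran normally. This one small oversight cost me a lot of time. The lesson: when you hit a problem, calm down and analyze it carefully before trying to fix it. If a method does not seem to work, it may not be that the method is ineffective; you may simply not have carried it out correctly.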
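The root cause is that the clone URL copied from a branch page is the same for every branch, and `git clone` checks out the repository's default branch. The sketch below shows the two correct options as argv lists that could be handed to subprocess: clone and then `git checkout`, or clone with `git clone -b` in one step. The helper names are hypothetical, and the URL is the usual GitHub address of tensorflow/benchmarks.

```python
import subprocess

# Assumed clone URL of the official benchmarks repository
REPO_URL = "https://github.com/tensorflow/benchmarks.git"


def checkout_commands(branch, dest="benchmarks", single_step=False):
    """Return the git commands that leave `dest` on `branch`, as argv lists."""
    if single_step:
        # -b makes clone check out `branch` instead of the default branch
        return [["git", "clone", "-b", branch, REPO_URL, dest]]
    return [
        ["git", "clone", REPO_URL, dest],
        # -C runs git as if it had been started inside `dest`
        ["git", "-C", dest, "checkout", branch],
    ]


def run_all(commands):
    """Execute the commands in order, stopping at the first failure (needs network access)."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in checkout_commands("cnn_tf_v1.12_compatible"):
        print(" ".join(argv))
```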

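Now that it worked, it was time to measure each model's benchmark score. Before that there was one more blunder: I set batch_size to 128 and TensorFlow reported a ResourceExhaustedError. I learned that this error means the GPU ran out of resources. Batch size is the amount of data processed in one batch; simply put, I was processing too much data per batch, and the GPU's meager 8 GB of memory was not enough.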

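Fortunately that was all it was, and the TensorFlow platform itself was fine. I had not come across the concept because my Machine Learning course has only just reached backpropagation. I rarely work with top-level applications and the software layer, so I lack experience there, but I know CPU and GPU microarchitecture fairly well: roughly which modules there are, what each module does, and how they work together. If I have time later, I may share some of what I know about the underlying hardware.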
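One simple way to avoid guessing the batch size is to start large and halve it until a run no longer runs out of memory. The sketch below assumes a hypothetical `run_benchmark(batch_size)` callable that raises on OOM; in real TensorFlow 1.x code that exception is `tf.errors.ResourceExhaustedError`, replaced here by a local stand-in class so the example runs without TensorFlow.

```python
class ResourceExhaustedError(Exception):
    """Local stand-in for tf.errors.ResourceExhaustedError so the sketch runs without TensorFlow."""


def largest_fitting_batch(run_benchmark, start=512, minimum=1):
    """Halve the batch size until run_benchmark(batch_size) stops running out of memory.

    run_benchmark is a hypothetical callable that raises ResourceExhaustedError
    on OOM. Returns (batch_size, result) for the first size that succeeds.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            return batch_size, run_benchmark(batch_size)
        except ResourceExhaustedError:
            # Not enough GPU memory: retry with half as many images per step
            batch_size //= 2
    raise RuntimeError("no batch size down to the minimum fits in GPU memory")


if __name__ == "__main__":
    # Fake benchmark: pretend anything above 64 images per step runs out of memory
    def fake_run(batch_size):
        if batch_size > 64:
            raise ResourceExhaustedError(f"OOM at batch_size={batch_size}")
        return f"ran at batch_size={batch_size}"

    print(largest_fitting_batch(fake_run, start=128))  # (64, 'ran at batch_size=64')
```

Halving only finds the largest power-of-two fraction of the starting size that fits, which lines up with the batch sizes (32, 64, 128, ...) used in this article. Running each attempt as a separate process, as with the command-line runs here, keeps a failed attempt from affecting the next one.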


The ROCm version is reported as 2.1.96 and the OS is Ubuntu 18.04. Here is the hardware information printed by TensorFlow:

name: Vega [Radeon RX Vega]

AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.63
pciBusID 0000:43:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB

TensorFlow: 1.12

For every test below, each configuration was run 5 times and the best result was kept:

  • ResNet50

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Step Img/sec total_loss

1 images/sec: 190.0 +/- 0.0 (jitter = 0.0) 8.217
10 images/sec: 190.7 +/- 0.4 (jitter = 1.3) 8.124
20 images/sec: 190.7 +/- 0.4 (jitter = 0.9) 8.230
30 images/sec: 190.6 +/- 0.4 (jitter = 0.8) 8.256
40 images/sec: 190.4 +/- 0.3 (jitter = 0.7) 8.347
50 images/sec: 190.5 +/- 0.3 (jitter = 0.7) 8.003
60 images/sec: 190.6 +/- 0.2 (jitter = 0.7) 8.260
70 images/sec: 190.6 +/- 0.2 (jitter = 0.7) 8.289
80 images/sec: 190.6 +/- 0.2 (jitter = 0.6) 8.232
90 images/sec: 190.6 +/- 0.2 (jitter = 0.6) 8.307
100 images/sec: 190.6 +/- 0.2 (jitter = 0.6) 8.305
----------------------------------------------------------------
total images/sec: 190.58
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step Img/sec total_loss

1 images/sec: 167.7 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 171.9 +/- 0.6 (jitter = 0.9) 7.998
20 images/sec: 171.7 +/- 0.5 (jitter = 0.7) 8.260
30 images/sec: 171.7 +/- 0.4 (jitter = 0.7) 8.339
40 images/sec: 171.6 +/- 0.4 (jitter = 0.7) 8.186
50 images/sec: 171.5 +/- 0.4 (jitter = 0.7) 7.749
60 images/sec: 171.5 +/- 0.3 (jitter = 0.8) 8.064
70 images/sec: 171.6 +/- 0.3 (jitter = 0.9) 8.473
80 images/sec: 171.7 +/- 0.3 (jitter = 0.8) 8.303
90 images/sec: 171.7 +/- 0.2 (jitter = 0.8) 8.024
100 images/sec: 171.8 +/- 0.2 (jitter = 0.8) 7.985
----------------------------------------------------------------
total images/sec: 171.70
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Step Img/sec total_loss

1 images/sec: 206.4 +/- 0.0 (jitter = 0.0) 8.217
10 images/sec: 208.3 +/- 0.8 (jitter = 0.8) 8.120
20 images/sec: 208.7 +/- 0.5 (jitter = 1.1) 8.230
30 images/sec: 208.3 +/- 0.4 (jitter = 2.4) 8.266
40 images/sec: 208.5 +/- 0.3 (jitter = 2.0) 8.359
50 images/sec: 208.8 +/- 0.3 (jitter = 1.4) 7.999
60 images/sec: 208.8 +/- 0.3 (jitter = 1.4) 8.282
70 images/sec: 208.8 +/- 0.2 (jitter = 1.4) 8.312
80 images/sec: 208.7 +/- 0.2 (jitter = 1.4) 8.221
90 images/sec: 208.8 +/- 0.2 (jitter = 1.2) 8.310
100 images/sec: 208.8 +/- 0.2 (jitter = 1.1) 8.288
----------------------------------------------------------------
total images/sec: 208.78
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step Img/sec total_loss

1 images/sec: 181.9 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 187.6 +/- 0.9 (jitter = 1.2) 7.997
20 images/sec: 187.2 +/- 0.7 (jitter = 1.4) 8.261
30 images/sec: 187.5 +/- 0.5 (jitter = 1.3) 8.337
40 images/sec: 187.7 +/- 0.4 (jitter = 1.0) 8.181
50 images/sec: 187.8 +/- 0.4 (jitter = 1.1) 7.751
60 images/sec: 187.7 +/- 0.4 (jitter = 1.2) 8.063
70 images/sec: 187.7 +/- 0.3 (jitter = 1.2) 8.483
80 images/sec: 187.8 +/- 0.3 (jitter = 1.2) 8.306
90 images/sec: 187.8 +/- 0.3 (jitter = 1.2) 8.038
100 images/sec: 187.8 +/- 0.3 (jitter = 1.2) 8.000
----------------------------------------------------------------
total images/sec: 187.76
----------------------------------------------------------------

  • AlexNet

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=alexnet

Step Img/sec total_loss

1 images/sec: 1575.9 +/- 0.0 (jitter = 0.0) 7.200
10 images/sec: 1576.9 +/- 1.5 (jitter = 2.1) 7.198
20 images/sec: 1573.9 +/- 3.4 (jitter = 2.1) 7.198
30 images/sec: 1575.4 +/- 2.4 (jitter = 3.3) 7.200
40 images/sec: 1575.9 +/- 2.1 (jitter = 4.7) 7.199
50 images/sec: 1570.4 +/- 3.1 (jitter = 5.6) 7.199
60 images/sec: 1570.1 +/- 2.9 (jitter = 5.6) 7.199
70 images/sec: 1571.4 +/- 2.5 (jitter = 5.4) 7.199
80 images/sec: 1572.2 +/- 2.2 (jitter = 5.4) 7.199
90 images/sec: 1572.7 +/- 2.0 (jitter = 4.9) 7.201
100 images/sec: 1573.3 +/- 1.8 (jitter = 4.7) 7.199
----------------------------------------------------------------
total images/sec: 1573.01
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=alexnet

Step Img/sec total_loss

1 images/sec: 1402.2 +/- 0.0 (jitter = 0.0) 7.200
10 images/sec: 1397.5 +/- 6.8 (jitter = 15.7) 7.201
20 images/sec: 1413.7 +/- 5.2 (jitter = 19.6) 7.200
30 images/sec: 1412.1 +/- 5.5 (jitter = 21.2) 7.200
40 images/sec: 1414.8 +/- 4.3 (jitter = 18.6) 7.196
50 images/sec: 1417.3 +/- 3.5 (jitter = 15.9) 7.198
60 images/sec: 1418.7 +/- 3.0 (jitter = 15.0) 7.199
70 images/sec: 1419.4 +/- 2.6 (jitter = 11.2) 7.199
80 images/sec: 1420.1 +/- 2.4 (jitter = 13.0) 7.199
90 images/sec: 1419.9 +/- 2.3 (jitter = 15.2) 7.200
100 images/sec: 1421.1 +/- 2.1 (jitter = 15.2) 7.199
----------------------------------------------------------------
total images/sec: 1420.65
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=alexnet

Step Img/sec total_loss

1 images/sec: 1316.0 +/- 0.0 (jitter = 0.0) 7.202
10 images/sec: 1331.5 +/- 4.2 (jitter = 17.9) 7.199
20 images/sec: 1337.5 +/- 3.0 (jitter = 12.9) 7.200
30 images/sec: 1338.9 +/- 2.3 (jitter = 12.7) 7.200
40 images/sec: 1341.2 +/- 1.9 (jitter = 12.2) 7.200
50 images/sec: 1342.5 +/- 1.6 (jitter = 11.1) 7.199
60 images/sec: 1344.0 +/- 1.5 (jitter = 10.2) 7.198
70 images/sec: 1343.9 +/- 1.4 (jitter = 10.7) 7.199
80 images/sec: 1344.3 +/- 1.5 (jitter = 10.4) 7.199
90 images/sec: 1345.5 +/- 1.4 (jitter = 10.8) 7.199
100 images/sec: 1346.6 +/- 1.3 (jitter = 11.4) 7.199
----------------------------------------------------------------
total images/sec: 1345.73
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=alexnet

Step Img/sec total_loss

1 images/sec: 1136.7 +/- 0.0 (jitter = 0.0) 7.197
10 images/sec: 1147.9 +/- 3.8 (jitter = 16.4) 7.200
20 images/sec: 1150.9 +/- 2.8 (jitter = 11.5) 7.200
30 images/sec: 1152.7 +/- 2.2 (jitter = 12.9) 7.200
40 images/sec: 1153.1 +/- 1.9 (jitter = 8.7) 7.200
50 images/sec: 1152.9 +/- 1.7 (jitter = 8.5) 7.200
60 images/sec: 1153.0 +/- 1.4 (jitter = 7.6) 7.200
70 images/sec: 1152.4 +/- 1.5 (jitter = 7.8) 7.199
80 images/sec: 1152.6 +/- 1.4 (jitter = 7.6) 7.199
90 images/sec: 1152.9 +/- 1.3 (jitter = 7.4) 7.200
100 images/sec: 1153.1 +/- 1.2 (jitter = 7.9) 7.200
----------------------------------------------------------------
total images/sec: 1151.98
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=alexnet

Step Img/sec total_loss

1 images/sec: 968.5 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 976.0 +/- 4.3 (jitter = 9.1) nan
20 images/sec: 975.0 +/- 2.9 (jitter = 10.5) nan
30 images/sec: 974.4 +/- 2.2 (jitter = 9.7) nan
40 images/sec: 974.6 +/- 1.8 (jitter = 9.1) nan
50 images/sec: 973.9 +/- 1.6 (jitter = 8.0) nan
60 images/sec: 974.1 +/- 1.4 (jitter = 7.0) nan
70 images/sec: 974.1 +/- 1.3 (jitter = 7.4) nan
80 images/sec: 973.2 +/- 1.7 (jitter = 7.4) nan
90 images/sec: 973.2 +/- 1.5 (jitter = 7.0) nan
100 images/sec: 973.4 +/- 1.4 (jitter = 7.0) nan
----------------------------------------------------------------
total images/sec: 971.85
----------------------------------------------------------------

  • Inception v3

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3

Step Img/sec total_loss

1 images/sec: 103.7 +/- 0.0 (jitter = 0.0) 7.421
10 images/sec: 104.2 +/- 0.1 (jitter = 0.3) 7.395
20 images/sec: 104.1 +/- 0.1 (jitter = 0.4) 7.483
30 images/sec: 103.9 +/- 0.1 (jitter = 0.5) 7.402
40 images/sec: 103.8 +/- 0.1 (jitter = 0.5) 7.352
50 images/sec: 103.9 +/- 0.1 (jitter = 0.6) 7.392
60 images/sec: 103.9 +/- 0.1 (jitter = 0.5) 7.394
70 images/sec: 103.8 +/- 0.1 (jitter = 0.6) 7.384
80 images/sec: 103.8 +/- 0.1 (jitter = 0.6) 7.353
90 images/sec: 103.8 +/- 0.1 (jitter = 0.5) 7.412
100 images/sec: 103.8 +/- 0.1 (jitter = 0.5) 7.355
----------------------------------------------------------------
total images/sec: 103.82
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=inception3

Step Img/sec total_loss

1 images/sec: 97.7 +/- 0.0 (jitter = 0.0) 7.401
10 images/sec: 98.4 +/- 0.4 (jitter = 0.9) 7.438
20 images/sec: 98.6 +/- 0.2 (jitter = 0.9) 7.350
30 images/sec: 98.5 +/- 0.2 (jitter = 0.9) 7.479
40 images/sec: 98.5 +/- 0.2 (jitter = 0.8) 7.359
50 images/sec: 98.5 +/- 0.1 (jitter = 0.8) 7.330
60 images/sec: 98.5 +/- 0.1 (jitter = 0.7) 7.402
70 images/sec: 98.5 +/- 0.1 (jitter = 0.7) 7.354
80 images/sec: 98.5 +/- 0.1 (jitter = 0.6) 7.342
90 images/sec: 98.5 +/- 0.1 (jitter = 0.7) 7.508
100 images/sec: 98.5 +/- 0.1 (jitter = 0.7) 7.443
----------------------------------------------------------------
total images/sec: 98.50
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3

Step Img/sec total_loss

1 images/sec: 109.2 +/- 0.0 (jitter = 0.0) 7.428
10 images/sec: 109.8 +/- 0.2 (jitter = 0.7) 7.415
20 images/sec: 109.9 +/- 0.1 (jitter = 0.5) 7.486
30 images/sec: 109.8 +/- 0.1 (jitter = 0.4) 7.405
40 images/sec: 109.8 +/- 0.1 (jitter = 0.4) 7.348
50 images/sec: 109.8 +/- 0.1 (jitter = 0.4) 7.397
60 images/sec: 109.7 +/- 0.1 (jitter = 0.3) 7.364
70 images/sec: 109.7 +/- 0.1 (jitter = 0.3) 7.413
80 images/sec: 109.7 +/- 0.1 (jitter = 0.3) 7.368
90 images/sec: 109.7 +/- 0.1 (jitter = 0.3) 7.421
100 images/sec: 109.7 +/- 0.0 (jitter = 0.3) 7.348
----------------------------------------------------------------
total images/sec: 109.66
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=inception3

Step Img/sec total_loss

1 images/sec: 105.3 +/- 0.0 (jitter = 0.0) 7.364
10 images/sec: 105.0 +/- 0.5 (jitter = 0.4) 7.371
20 images/sec: 105.2 +/- 0.3 (jitter = 0.7) 7.312
30 images/sec: 105.2 +/- 0.2 (jitter = 0.7) 7.501
40 images/sec: 105.3 +/- 0.2 (jitter = 0.6) 7.362
50 images/sec: 105.3 +/- 0.1 (jitter = 0.6) 7.338
60 images/sec: 105.2 +/- 0.1 (jitter = 0.6) 7.421
70 images/sec: 105.2 +/- 0.1 (jitter = 0.6) 7.309
80 images/sec: 105.2 +/- 0.1 (jitter = 0.6) 7.377
90 images/sec: 105.2 +/- 0.1 (jitter = 0.6) 7.520
100 images/sec: 105.2 +/- 0.1 (jitter = 0.6) 7.367
----------------------------------------------------------------
total images/sec: 105.20
----------------------------------------------------------------

  • VGG16

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=vgg16

Step Img/sec total_loss

1 images/sec: 101.8 +/- 0.0 (jitter = 0.0) 7.291
10 images/sec: 101.8 +/- 0.2 (jitter = 0.7) 7.265
20 images/sec: 101.9 +/- 0.1 (jitter = 0.5) 7.277
30 images/sec: 102.0 +/- 0.1 (jitter = 0.2) 7.248
40 images/sec: 102.0 +/- 0.1 (jitter = 0.3) 7.282
50 images/sec: 102.0 +/- 0.1 (jitter = 0.3) 7.267
60 images/sec: 102.0 +/- 0.1 (jitter = 0.3) 7.269
70 images/sec: 102.0 +/- 0.0 (jitter = 0.2) 7.247
80 images/sec: 102.0 +/- 0.0 (jitter = 0.3) 7.273
90 images/sec: 102.0 +/- 0.0 (jitter = 0.3) 7.254
100 images/sec: 102.0 +/- 0.0 (jitter = 0.3) 7.275
----------------------------------------------------------------
total images/sec: 101.95
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=vgg16

Step Img/sec total_loss

1 images/sec: 91.2 +/- 0.0 (jitter = 0.0) 7.231
10 images/sec: 92.0 +/- 0.1 (jitter = 0.2) 7.219
20 images/sec: 92.0 +/- 0.1 (jitter = 0.2) 7.287
30 images/sec: 91.9 +/- 0.1 (jitter = 0.1) 7.204
40 images/sec: 91.9 +/- 0.0 (jitter = 0.2) 7.279
50 images/sec: 91.9 +/- 0.1 (jitter = 0.2) 7.289
60 images/sec: 91.9 +/- 0.0 (jitter = 0.2) 7.244
70 images/sec: 91.9 +/- 0.0 (jitter = 0.2) 7.251
80 images/sec: 91.8 +/- 0.0 (jitter = 0.2) 7.245
90 images/sec: 91.8 +/- 0.0 (jitter = 0.2) 7.278
100 images/sec: 91.8 +/- 0.0 (jitter = 0.2) 7.281
----------------------------------------------------------------
total images/sec: 91.80
----------------------------------------------------------------

  • GoogLeNet

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=googlenet

Step Img/sec total_loss

1 images/sec: 496.7 +/- 0.0 (jitter = 0.0) 7.097
10 images/sec: 499.8 +/- 1.1 (jitter = 1.1) 7.094
20 images/sec: 500.2 +/- 0.7 (jitter = 1.3) 7.087
30 images/sec: 499.6 +/- 0.8 (jitter = 2.3) 7.104
40 images/sec: 499.7 +/- 0.6 (jitter = 2.3) 7.102
50 images/sec: 499.4 +/- 0.5 (jitter = 2.2) 7.094
60 images/sec: 499.3 +/- 0.5 (jitter = 2.4) 7.101
70 images/sec: 499.2 +/- 0.4 (jitter = 2.4) 7.111
80 images/sec: 499.2 +/- 0.4 (jitter = 2.2) 7.083
90 images/sec: 499.0 +/- 0.3 (jitter = 2.3) 7.095
100 images/sec: 498.9 +/- 0.3 (jitter = 2.1) 7.094
----------------------------------------------------------------
total images/sec: 498.73
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=googlenet

Step Img/sec total_loss

1 images/sec: 474.7 +/- 0.0 (jitter = 0.0) 7.084
10 images/sec: 474.6 +/- 1.7 (jitter = 2.9) 7.046
20 images/sec: 475.9 +/- 1.0 (jitter = 3.0) 7.104
30 images/sec: 475.0 +/- 1.0 (jitter = 1.9) 7.093
40 images/sec: 474.9 +/- 0.8 (jitter = 2.5) 7.109
50 images/sec: 474.6 +/- 0.7 (jitter = 2.3) 7.101
60 images/sec: 474.4 +/- 0.7 (jitter = 2.7) 7.087
70 images/sec: 474.5 +/- 0.7 (jitter = 2.7) 7.081
80 images/sec: 474.3 +/- 0.7 (jitter = 2.4) 7.097
90 images/sec: 474.4 +/- 0.6 (jitter = 2.4) 7.096
100 images/sec: 474.3 +/- 0.6 (jitter = 2.5) 7.079
----------------------------------------------------------------
total images/sec: 474.07
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=googlenet

Step Img/sec total_loss

1 images/sec: 418.9 +/- 0.0 (jitter = 0.0) 7.190
10 images/sec: 424.7 +/- 2.9 (jitter = 7.4) 7.098
20 images/sec: 424.9 +/- 1.6 (jitter = 4.3) 7.095
30 images/sec: 425.7 +/- 1.3 (jitter = 4.1) 7.077
40 images/sec: 425.0 +/- 1.3 (jitter = 3.6) 7.082
50 images/sec: 424.6 +/- 1.1 (jitter = 3.7) 7.069
60 images/sec: 424.8 +/- 1.1 (jitter = 3.8) 7.114
70 images/sec: 424.8 +/- 1.0 (jitter = 3.9) 7.097
80 images/sec: 424.7 +/- 1.0 (jitter = 3.8) 7.102
90 images/sec: 424.7 +/- 0.9 (jitter = 3.6) 7.111
100 images/sec: 424.7 +/- 0.8 (jitter = 3.3) 7.118
----------------------------------------------------------------
total images/sec: 424.32
----------------------------------------------------------------

  • ResNet152

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet152

Step Img/sec total_loss

1 images/sec: 68.4 +/- 0.0 (jitter = 0.0) 9.906
10 images/sec: 68.9 +/- 0.2 (jitter = 1.1) 9.657
20 images/sec: 68.7 +/- 0.1 (jitter = 0.3) 9.674
30 images/sec: 68.8 +/- 0.1 (jitter = 0.4) 9.938
40 images/sec: 68.9 +/- 0.1 (jitter = 0.5) 9.922
50 images/sec: 68.9 +/- 0.1 (jitter = 0.5) 10.064
60 images/sec: 68.7 +/- 0.1 (jitter = 0.7) 10.307
70 images/sec: 68.7 +/- 0.1 (jitter = 0.7) 10.006
80 images/sec: 68.7 +/- 0.1 (jitter = 0.7) 9.873
90 images/sec: 68.7 +/- 0.1 (jitter = 0.6) 10.233
100 images/sec: 68.7 +/- 0.1 (jitter = 0.6) 10.008
----------------------------------------------------------------
total images/sec: 68.71
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet152

Step Img/sec total_loss

1 images/sec: 76.6 +/- 0.0 (jitter = 0.0) 9.889
10 images/sec: 76.2 +/- 0.2 (jitter = 0.4) 9.667
20 images/sec: 76.1 +/- 0.1 (jitter = 0.5) 9.706
30 images/sec: 76.0 +/- 0.1 (jitter = 0.6) 9.921
40 images/sec: 76.1 +/- 0.1 (jitter = 0.6) 9.963
50 images/sec: 76.1 +/- 0.1 (jitter = 0.6) 10.088
60 images/sec: 76.0 +/- 0.1 (jitter = 0.6) 10.259
70 images/sec: 75.9 +/- 0.1 (jitter = 0.8) 10.023
80 images/sec: 75.8 +/- 0.1 (jitter = 0.8) 9.911
90 images/sec: 75.8 +/- 0.1 (jitter = 0.8) 10.248
100 images/sec: 75.8 +/- 0.1 (jitter = 0.8) 10.052
----------------------------------------------------------------
total images/sec: 75.81
----------------------------------------------------------------
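Each per-step line in the logs above reports the step number, the running mean of images/sec with its uncertainty (+/-), a jitter figure and total_loss, and each run ends with a total images/sec line. A small regex-based parser, sketched below, turns such output into records, extracts the final throughput (handy for keeping the best of several runs) and flags nan losses like those in the AlexNet batch_size=32 run. The field and function names are my own labels, and the regexes assume the line format shown in this article.

```python
import math
import re
from typing import NamedTuple

# Matches lines such as "10 images/sec: 190.7 +/- 0.4 (jitter = 1.3) 8.124"
STEP_RE = re.compile(
    r"^(?P<step>\d+)\s+images/sec:\s*(?P<mean>[0-9.]+)\s*\+/-\s*(?P<unc>[0-9.]+)"
    r"\s*\(jitter = (?P<jitter>[0-9.]+)\)\s+(?P<loss>\S+)$"
)
# Matches the summary line "total images/sec: 190.58"
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9.]+)")


class StepRecord(NamedTuple):
    step: int
    mean_images_per_sec: float
    uncertainty: float
    jitter: float
    total_loss: float


def parse_step_line(line):
    """Parse one per-step line; return None for headers, separators and totals."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return StepRecord(
        step=int(match["step"]),
        mean_images_per_sec=float(match["mean"]),
        uncertainty=float(match["unc"]),
        jitter=float(match["jitter"]),
        total_loss=float(match["loss"]),  # float("nan") is accepted
    )


def parse_log(text):
    """Return the step records found in a block of benchmark output."""
    records = (parse_step_line(line) for line in text.splitlines())
    return [rec for rec in records if rec is not None]


def total_throughput(text):
    """Return the final 'total images/sec' value of one run."""
    match = TOTAL_RE.search(text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


def best_of(outputs):
    """Highest total throughput among several runs of the same configuration."""
    return max(total_throughput(text) for text in outputs)


def has_nan_loss(records):
    """True if any step reported a nan total_loss."""
    return any(math.isnan(rec.total_loss) for rec in records)
```

Fed the AlexNet batch_size=32 log above, `has_nan_loss(parse_log(log))` would be true and `total_throughput(log)` would return 971.85.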

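After a series of tests, I found that once the GPU had cooled down, the second run usually gave the best result. The first run acts roughly as a warm-up (even though the benchmark has its own warm-up steps), the second run then tends to produce the best number, and later runs may drop slightly because back-to-back testing keeps the temperature high. In any case, run each test several times: the gap between the best and the worst result can sometimes be noticeable.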
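Since results drift with temperature, a repeat-and-cool-down loop makes the comparison more systematic. The sketch below assumes a hypothetical `run_once()` that returns one throughput value (for example by running the benchmark and reading its total images/sec line). It pauses before every run after the first and reports the best, the worst and the relative spread; the 60-second default pause is an arbitrary placeholder, not a measured value.

```python
import time


def repeat_with_cooldown(run_once, runs=5, cooldown_s=60.0, sleep=time.sleep):
    """Call run_once() several times with a pause before every run after the first.

    run_once is a hypothetical callable returning one throughput value
    (e.g. images/sec). Returns (best, worst, spread), where
    spread = (best - worst) / best.
    """
    results = []
    for i in range(runs):
        if i > 0:
            # Pause so the GPU can cool down between back-to-back runs
            sleep(cooldown_s)
        results.append(run_once())
    best, worst = max(results), min(results)
    return best, worst, (best - worst) / best
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; in normal use the default `time.sleep` applies.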

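I also tried adding the --use_fp16 option and found little difference compared with running without it. Perhaps the relevant handling will change in later ROCm versions; I will look into it when I have time.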

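Adding TF_ROCM_FUSION_ENABLE=1 does improve ResNet performance to some extent; if you want to dig deeper, see the Fusion Support section of the page below. Because TF_ROCM_FUSION_ENABLE has little effect on some models, results with this option are not given for every model.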

ROCmSoftwarePlatform/tensorflow-upstream (github.com)
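To put a number on "to some extent", the best totals from the logs above can be compared directly. The short script below computes the relative gain from TF_ROCM_FUSION_ENABLE=1 for every configuration in this article that was run both with and without it; the values are copied from the results above.

```python
# Best "total images/sec" values copied from the logs above:
# (model, batch_size): (without fusion, with TF_ROCM_FUSION_ENABLE=1)
RESULTS = {
    ("resnet50", 64): (190.58, 208.78),
    ("resnet50", 32): (171.70, 187.76),
    ("inception3", 64): (103.82, 109.66),
    ("inception3", 32): (98.50, 105.20),
    ("resnet152", 32): (68.71, 75.81),
}


def fusion_gain(baseline, fused):
    """Relative throughput improvement; 0.1 means 10% more images/sec."""
    return fused / baseline - 1.0


if __name__ == "__main__":
    for (model, batch_size), (baseline, fused) in RESULTS.items():
        gain = fusion_gain(baseline, fused)
        print(f"{model:<10} bs={batch_size:<3} {baseline:7.2f} -> {fused:7.2f}  {gain:+.1%}")
```

This works out to about +9.5% and +9.4% for ResNet50 at batch sizes 64 and 32, +10.3% for ResNet152 at batch size 32, and +5.6% and +6.8% for Inception v3 at batch sizes 64 and 32.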

Summary of results:

Later I may analyze specific hardware resources in more detail and look at finer-grained performance metrics such as clock (clk), power, bandwidth, latency and hit/miss rates.

