NAMD Molecular Dynamics Performance on NVIDIA GTX 1080 and 1070 GPUs

The new NVIDIA GeForce GTX 1080 and GTX 1070 GPUs are out and I’ve received a lot of questions about NAMD performance. The short answer is: performance is great! I’ve got some numbers to back that up below. We’ve also thrown new Broadwell Xeon and Core-i7 CPUs into the mix. The new hardware refresh gives a nice step up in performance.

This post is a follow-up to, and refresh of, an earlier post on NAMD performance on workstations with GPU acceleration. It will mostly focus on new performance numbers.

NAMD

NAMD is a widely used molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.

It is proprietary software licensed by the University of Illinois and is made freely available, including source code, under a non-exclusive, non-commercial use license.

The group at UIUC working on NAMD was an early pioneer of using GPUs for compute acceleration, and NAMD achieves very good acceleration with NVIDIA CUDA.

Obtaining NAMD

NAMD is available as source that you can compile yourself or in a variety of binary builds.

The binary builds used for testing are the version 2.11 builds:

  • Linux-x86_64-multicore for CPU-based SMP parallel tests
  • Linux-x86_64-multicore-CUDA for GPU-accelerated parallel tests
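
For reference, the sketch below shows how these two builds get launched for the runs in this post. It is a minimal Python wrapper, assuming the binary tarballs were unpacked in the working directory (the paths and helper names are mine); +p, +devices, and +idlepoll are the usual NAMD 2.x launch options.

    import subprocess

    # Paths are assumptions -- adjust to wherever the NAMD 2.11
    # binary builds were unpacked.
    NAMD_CPU = "./NAMD_2.11_Linux-x86_64-multicore/namd2"
    NAMD_CUDA = "./NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2"

    def run_cpu(config, nthreads, logfile):
        # CPU-only SMP run: +p sets the number of worker threads
        with open(logfile, "w") as log:
            subprocess.check_call([NAMD_CPU, "+p%d" % nthreads, config],
                                  stdout=log)

    def run_gpu(config, nthreads, devices, logfile):
        # CUDA run: +devices selects GPUs by index; +idlepoll keeps
        # idle threads polling the GPU (recommended for the CUDA build)
        dev = ",".join(str(d) for d in devices)
        with open(logfile, "w") as log:
            subprocess.check_call([NAMD_CUDA, "+p%d" % nthreads,
                                   "+idlepoll", "+devices", dev, config],
                                  stdout=log)

    # Example: 24 "real" cores on the dual-Xeon box with two GPUs
    # run_cpu("apoa1.namd", 48, "apoa1_cpu.log")   # HT on for CPU-only
    # run_gpu("apoa1.namd", 24, [0, 1], "apoa1_gpu.log")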

Test configurations

We are looking at single node GPU accelerated workstation performance and will test on three base system configurations.

The Peak Tower Dual
CPU: (2) Intel Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz All-Core-Turbo)
Memory: 256 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

The Peak Tower Single
CPU: (1) Intel Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz All-Core-Turbo)
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

The Peak Tower Single
CPU: Intel Core-i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

Note: All-Core-Turbo is the “real” clock speed, i.e., the speed when all of the cores are at 100% load.

OS

The system software for the testing was Ubuntu 16.04 plus updates. The NVIDIA driver was version 367.27 from the graphics-drivers PPA.

GPUs

I firmly believe using GeForce cards for scientific computation is OK, especially for workstations. However, I should note that the Tesla cards from NVIDIA are designed for compute, offer some additional features, and are a better choice for multi-node cluster applications. NVIDIA will be releasing a PCIe version of the Pascal-based Tesla later this year. Those cards should have the same 1:2 ratio of double- to single-precision performance as the GP100 SXM2 modules. The GeForce Pascal cards have very poor double precision performance. (However, most GPU accelerated applications make very good use of single precision floating point on the GPU!)

Newer GeForce cards have very good thermal and power design. Our experience with cards from top-tier vendors like EVGA and ASUS has been excellent, with very low failure rates even under heavy computational load. It used to be that the most important consideration when picking cards for compute was to avoid anything that was overclocked. These days it’s actually hard to find any cards that are not overclocked! Manufacturers will by default overclock cards that are well within design specs. That makes me nervous because I’ve been doing this for a long time, but I concede that the newer Maxwell and Pascal GeForce cards are excellent designs. I would not be too concerned about overclocked cards, but I would still probably recommend avoiding “superclocked” cards. [I think “superclocked” is the new “overclocked”.]

Video cards from top-tier manufacturers with good cooling hardware will give good performance and hold up well to heavy load. They are also inexpensive enough that if they show any sign of failure or inconsistency you should plan to replace them without hesitation. Budget for that! My personal expectation when using GeForce cards for compute is that you will be replacing some cards in 6 to 9 months if they are under constant heavy load. They may very well hold up for several years, and by then you will be replacing them with faster cards anyway!

The GTX 1080 and 1070 cards used in this testing are “Founders Edition” cards.

Video cards used for testing (data from nvidia-smi):

Card       CUDA cores   GPU clock MHz   Memory clock MHz*   Application clock MHz**   FB Memory MiB
GTX 1070   1920         1506            4004                1506                      8110
GTX 1080   2560         1607            5005                1607                      8192
TITAN X    3072         1392            3505                1000                      12287

Notes: * Marketing magic often reports twice that number as MT/s, ** a.k.a. base clock
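
If you want to pull those numbers yourself, here is a small Python sketch that scripts the same query. The field names are standard nvidia-smi --query-gpu properties; the script itself is just my convenience wrapper.

    import subprocess

    # Query the same properties shown in the table above
    fields = ["name", "clocks.gr", "clocks.mem",
              "clocks.applications.graphics", "memory.total"]

    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + ",".join(fields),
         "--format=csv,noheader"],
        universal_newlines=True)

    for line in out.strip().splitlines():
        # e.g. "GeForce GTX 1080, 1607 MHz, 5005 MHz, 1607 MHz, 8192 MiB"
        print(line)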

Testing Simulations

The test simulation data and configuration files can be downloaded from the NAMD utilities page. All jobs were run using the default configuration files (500 time steps). A small driver script for the full set of runs is sketched after the list below.

ApoA1 benchmark [ apoa1.namd ]
Apolipoprotein A-I
92,224 atoms, periodic, PME (Particle Mesh Ewald)
ATPase benchmark [ f1atpase.namd ]
Adenosine tri-phosphate (ATP) synthase
327,506 atoms, periodic, PME
STMV benchmark [ stmv.namd ]
Satellite Tobacco Mosaic Virus
1,066,628 atoms, periodic, PME
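
Driving all three benchmarks across the GPU combinations is straightforward to script. A sketch reusing the run_cpu/run_gpu helpers from earlier; the device-index lists and thread counts are assumptions matching a two-GPU, dual-Xeon box.

    # Run all three benchmarks CPU-only and on each GPU combination,
    # reusing run_cpu()/run_gpu() from the earlier sketch.
    configs = ["apoa1.namd", "f1atpase.namd", "stmv.namd"]
    gpu_sets = [[0], [1], [0, 1]]   # single cards and the pair

    for config in configs:
        name = config.replace(".namd", "")
        run_cpu(config, 48, name + "_cpu.log")   # HT cores for CPU-only
        for devices in gpu_sets:
            tag = "-".join(str(d) for d in devices)
            run_gpu(config, 24, devices, "%s_gpu%s.log" % (name, tag))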

Results

The numbers mostly speak for themselves but do read the “Notes:” section at the end of each table.

There are results for each of the test platforms for “CPU only” and “CPU + GPU” job runs. NAMD scales essentially linearly with CPU core count. Also, hyperthreading improved performance in every CPU-only case. However, with GPU accelerated job runs hyperthreading slowed job run times.

The GPU acceleration for NAMD is very good. Adding nearly any NVIDIA CUDA capable GPU will significantly improve performance. There are diminishing returns once the GPU capability exceeds the CPU’s ability to keep up. These results indicate that with fast modern GPUs this version of NAMD (2.11) is mostly CPU bound.

Caveats:

Heavy compute on GeForce cards can shorten their lifetime! I believe it is perfectly fine to use these cards but keep in mind that you may fry one now and then!

The numbers should not be taken as definitive benchmark results! There can be considerable variation in both runtime and the all-important day/ns numbers. The “days per nanosecond” figure has the most variability since I just used the number reported at the last “benchmark” phase of each job run, and that is not necessarily the best result of all of the “benchmark” reports during the run. Also, these jobs were only run for 500 time steps; a “real” job would run significantly longer.
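
Incidentally, pulling every benchmark report out of a job log is easy to script. A minimal sketch, assuming NAMD’s usual “Info: Benchmark time: ... s/step ... days/ns” output line (the helper name is mine):

    import re

    # Matches lines like:
    #   Info: Benchmark time: 24 CPUs 0.0415 s/step 0.480 days/ns ...
    BENCH = re.compile(r"Benchmark time:.*?([\d.eE+-]+) s/step\s+"
                       r"([\d.eE+-]+) days?/ns")

    def days_per_ns(logfile):
        values = []
        with open(logfile) as log:
            for line in log:
                m = BENCH.search(line)
                if m:
                    values.append(float(m.group(2)))
        return values

    # vals = days_per_ns("apoa1_gpu.log")
    # print("last report:", vals[-1], " best report:", min(vals))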

Peak Tower Dual – (2) Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           19.7          0.370     55.7          1.09      181.4         3.73
Titan X       5.67          0.0954    16.4          0.289     48.3          0.851
(2) Titan X   4.27          0.0593    11.9          0.168     36.7          0.548
GTX 1070      4.94          0.0757    14.7          0.246     56.0          0.796
(2) GTX 1070  4.19          0.0477    11.7          0.154     36.4          0.532
GTX 1080      4.45          0.0653    13.1          0.207     40.3          0.652
(2) GTX 1080  4.08          0.0472    11.7          0.147     35.4          0.504
Notes:
Hyperthreading was enabled for the CPU results (48 HT cores). For the GPU results only “real” cores were used (24 cores).
It is notable that the GTX 1070 outperformed the Titan X.

Peak Tower Single – Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           31.3          0.629     88.0          1.83      297.2         6.33
Titan X       5.48          0.0896    17.0          0.278     52.5          0.944
(2) Titan X   5.29          0.0749    15.8          0.237     47.7          0.808
GTX 1070      5.25          0.0785    16.3          0.261     51.4          0.908
(2) GTX 1070  5.19          0.0708    16.1          0.238     47.9          0.811
GTX 1080      5.11          0.0731    16.1          0.243     48.5          0.831
(2) GTX 1080  5.13          0.0716    16.2          0.239     47.7          0.809
Notes:
Hyperthreading was enabled for the CPU results (28 HT cores). For the GPU results only “real” cores were used (14 cores).
These results are CPU bound when GPU acceleration is used!

Peak Tower Single – Core-i7 6900K 8-core @ 3.2GHz (3.5GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           45.8          0.975     134.0         2.92      457.0         10.0
Titan X       6.16          0.102     20.4          0.338     62.4          1.14
(2) Titan X   6.44          0.102     20.4          0.338     62.6          1.13
GTX 1070      6.26          0.103     20.5          0.342     62.2          1.13
(2) GTX 1070  6.38          0.102     20.6          0.338     62.7          1.13
GTX 1080      6.41          0.102     20.5          0.343     61.9          1.12
(2) GTX 1080  6.43          0.102     20.5          0.337     63.2          1.12
Notes:
Hyperthreading was enabled for the CPU results (16 HT cores). For the GPU results only “real” cores were used (8 cores).
These results are CPU bound when GPU acceleration is used!
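
As a quick cross-check before the conclusions, here are the GPU speedups on the Core-i7 6900K system, computed directly from the day/ns numbers in the table above:

    # day/ns from the Core-i7 6900K table: CPU-only vs. (2) GTX 1080
    runs = {"apoa1": (0.975, 0.102),
            "f1atpase": (2.92, 0.337),
            "stmv": (10.0, 1.12)}

    for sim, (cpu, gpu) in sorted(runs.items()):
        print("%s: %.1fx speedup" % (sim, cpu / gpu))
    # apoa1: 9.6x, f1atpase: 8.7x, stmv: 8.9x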

Conclusions and recommendations

Running NAMD with GPU acceleration can increase performance by a factor of 8-10 over CPU alone! This is enough performance to allow moderate sized MD simulations to run in a reasonable amount of time on a single-node workstation. The folks at UIUC are constantly working on NAMD, and I’m sure future versions will move more of the workload onto the GPU, since there is still significant performance to be gained. Even though the Peak systems can accommodate four X16 GPUs, it seems that more than two (or even just two) may be more GPU compute capability than the CPUs can keep up with. I think either a Peak Single with a GTX 1080, or a Peak Dual with two GTX 1080s or two GTX 1070s and as much CPU as your budget will allow, would be excellent for NAMD at this point in time.

Happy computing –dbk