NAMD Molecular Dynamics Performance on NVIDIA GTX 1080 and 1070 GPUs

The new NVIDIA GeForce GTX 1080 and GTX 1070 GPUs are out and I’ve received a lot of questions about NAMD performance. The short answer is: performance is great! I’ve got some numbers to back that up below. We’ve also thrown new Broadwell Xeon and Core-i7 CPUs into the mix. The new hardware refresh gives a nice step up in performance.

This post is a follow-up to, and refresh of, an earlier post on NAMD performance on workstations with GPU acceleration. It will mostly focus on new performance numbers.

NAMD

NAMD is a widely used molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.

It is proprietary software licensed by the University of Illinois and is made freely available, including source code, under a non-exclusive, non-commercial use license.

The group at UIUC working on NAMD was an early pioneer of using GPUs for compute acceleration, and NAMD achieves very good acceleration with NVIDIA CUDA.

Obtaining NAMD

NAMD is available as source that you can compile yourself or in a variety of binary builds.

The binary builds used for testing are the version 2.11 builds:

  • Linux-x86_64-multicore for CPU-based SMP parallel tests
  • Linux-x86_64-multicore-CUDA for GPU-accelerated parallel tests
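
For reference, the sketch below shows how these two builds get launched for the runs in this post. It is a minimal Python wrapper, assuming the binary tarballs were unpacked in the working directory (the paths and helper names are mine); +p, +devices, and +idlepoll are the usual NAMD 2.x launch options.

    import subprocess

    # Paths are assumptions -- adjust to wherever the NAMD 2.11
    # binary builds were unpacked.
    NAMD_CPU = "./NAMD_2.11_Linux-x86_64-multicore/namd2"
    NAMD_CUDA = "./NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2"

    def run_cpu(config, nthreads, logfile):
        # CPU-only SMP run: +p sets the number of worker threads
        with open(logfile, "w") as log:
            subprocess.check_call([NAMD_CPU, "+p%d" % nthreads, config],
                                  stdout=log)

    def run_gpu(config, nthreads, devices, logfile):
        # CUDA run: +devices selects GPUs by index; +idlepoll keeps
        # idle threads polling the GPU (recommended for the CUDA build)
        dev = ",".join(str(d) for d in devices)
        with open(logfile, "w") as log:
            subprocess.check_call([NAMD_CUDA, "+p%d" % nthreads,
                                   "+idlepoll", "+devices", dev, config],
                                  stdout=log)

    # Example: 24 "real" cores on the dual-Xeon box with two GPUs
    # run_cpu("apoa1.namd", 48, "apoa1_cpu.log")   # HT on for CPU-only
    # run_gpu("apoa1.namd", 24, [0, 1], "apoa1_gpu.log")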

Test configurations

We are looking at single node GPU accelerated workstation performance and will test on three base system configurations.

The Peak Tower Dual
CPU: (2) Intel Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz All-Core-Turbo)
Memory: 256 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

The Peak Tower Single
CPU: (1) Intel Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz All-Core-Turbo)
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

The Peak Tower Single
CPU: Intel Core-i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

Note: All-Core-Turbo is the “real” clock speed, i.e., the speed when all of the cores are at 100% load.

OS

The system software for the testing was Ubuntu 16.04 plus updates. The NVIDIA driver was version 367.27 from the graphics-drivers PPA.

GPUs

I firmly believe using GeForce cards for scientific computation is OK, especially for workstations. However, I should note that the Tesla cards from NVIDIA are designed for compute, offer some additional features, and are a better choice for multi-node cluster applications. NVIDIA will be releasing a PCIe version of the Pascal-based Tesla later this year. Those cards should have the same 1:2 ratio of double- to single-precision performance as the GP100 SXM2 modules. The GeForce Pascal cards have very poor double precision performance. (However, most GPU accelerated applications make very good use of single precision floating point on the GPU!)

Newer GeForce cards have very good thermal and power design. Our experience with cards from top-tier vendors like EVGA and ASUS has been excellent, with very low failure rates even under heavy computational load. It used to be that the most important consideration when picking cards for compute was to avoid anything that was overclocked. These days it’s actually hard to find any cards that are not overclocked! Manufacturers will by default overclock cards that are well within design specs. That makes me nervous because I’ve been doing this for a long time, but I concede that the newer Maxwell and Pascal GeForce cards are excellent designs. I would not be too concerned about overclocked cards, but I would still probably recommend avoiding “superclocked” cards. [I think “superclocked” is the new “overclocked”.]

Video cards from top-tier manufacturers with good cooling hardware will give good performance and hold up well to heavy load. They are also inexpensive enough that if they show any sign of failure or inconsistency you should plan to replace them without hesitation. Budget for that! My personal expectation when using GeForce cards for compute is that you will be replacing some cards in 6 to 9 months if they are under constant heavy load. They may very well hold up for several years, and by then you will be replacing them with faster cards anyway!

The GTX 1080 and 1070 cards used in this testing are “Founders Edition” cards.

Video cards used for testing (data from nvidia-smi):

Card       CUDA cores   GPU clock MHz   Memory clock MHz*   Application clock MHz**   FB Memory MiB
GTX 1070   1920         1506            4004                1506                      8110
GTX 1080   2560         1607            5005                1607                      8192
TITAN X    3072         1392            3505                1000                      12287

Notes: * Marketing magic often reports twice that number as MT/s, ** a.k.a. base clock
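
If you want to pull those numbers yourself, here is a small Python sketch that scripts the same query. The field names are standard nvidia-smi --query-gpu properties; the script itself is just my convenience wrapper.

    import subprocess

    # Query the same properties shown in the table above
    fields = ["name", "clocks.gr", "clocks.mem",
              "clocks.applications.graphics", "memory.total"]

    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + ",".join(fields),
         "--format=csv,noheader"],
        universal_newlines=True)

    for line in out.strip().splitlines():
        # e.g. "GeForce GTX 1080, 1607 MHz, 5005 MHz, 1607 MHz, 8192 MiB"
        print(line)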

Testing Simulations

The test simulation data and configuration files can be downloaded from the NAMD utilities page. All jobs were run using the default configuration files (500 time steps). A small driver script for the full set of runs is sketched after the list below.

ApoA1 benchmark [ apoa1.namd ]
Apolipoprotein A-I
92,224 atoms, periodic, PME (Particle Mesh Ewald)
ATPase benchmark [ f1atpase.namd ]
Adenosine tri-phosphate (ATP) synthase
327,506 atoms, periodic, PME
STMV benchmark [ stmv.namd ]
Satellite Tobacco Mosaic Virus
1,066,628 atoms, periodic, PME
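
Driving all three benchmarks across the GPU combinations is straightforward to script. A sketch reusing the run_cpu/run_gpu helpers from earlier; the device-index lists and thread counts are assumptions matching a two-GPU, dual-Xeon box.

    # Run all three benchmarks CPU-only and on each GPU combination,
    # reusing run_cpu()/run_gpu() from the earlier sketch.
    configs = ["apoa1.namd", "f1atpase.namd", "stmv.namd"]
    gpu_sets = [[0], [1], [0, 1]]   # single cards and the pair

    for config in configs:
        name = config.replace(".namd", "")
        run_cpu(config, 48, name + "_cpu.log")   # HT cores for CPU-only
        for devices in gpu_sets:
            tag = "-".join(str(d) for d in devices)
            run_gpu(config, 24, devices, "%s_gpu%s.log" % (name, tag))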

Results

The numbers mostly speak for themselves but do read the “Notes:” section at the end of each table.

There are results for each of the test platforms for “CPU only” and “CPU + GPU” job runs. NAMD scales essentially linearly with CPU core count. Also, hyperthreading improved performance in every CPU-only case. However, with GPU accelerated job runs hyperthreading slowed job run times.

The GPU acceleration for NAMD is very good. Adding nearly any NVIDIA CUDA capable GPU will significantly improve performance. There are diminishing returns once the GPU capability exceeds the CPU’s ability to keep up. These results indicate that with fast modern GPUs this version of NAMD (2.11) is mostly CPU bound.

Caveats:

Heavy compute on GeForce cards can shorten their lifetime! I believe it is perfectly fine to use these cards but keep in mind that you may fry one now and then!

The numbers should not be taken as definitive benchmark results! There can be considerable variation in both runtime and the all-important day/ns numbers. The “days per nanosecond” figure has the most variability since I just used the number reported at the last “benchmark” phase of each job run, and that is not necessarily the best result of all of the “benchmark” reports during the run. Also, these jobs were only run for 500 time steps; a “real” job would run significantly longer.
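
Incidentally, pulling every benchmark report out of a job log is easy to script. A minimal sketch, assuming NAMD’s usual “Info: Benchmark time: ... s/step ... days/ns” output line (the helper name is mine):

    import re

    # Matches lines like:
    #   Info: Benchmark time: 24 CPUs 0.0415 s/step 0.480 days/ns ...
    BENCH = re.compile(r"Benchmark time:.*?([\d.eE+-]+) s/step\s+"
                       r"([\d.eE+-]+) days?/ns")

    def days_per_ns(logfile):
        values = []
        with open(logfile) as log:
            for line in log:
                m = BENCH.search(line)
                if m:
                    values.append(float(m.group(2)))
        return values

    # vals = days_per_ns("apoa1_gpu.log")
    # print("last report:", vals[-1], " best report:", min(vals))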

Peak Tower Dual – (2) Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           19.7          0.370     55.7          1.09      181.4         3.73
Titan X       5.67          0.0954    16.4          0.289     48.3          0.851
(2) Titan X   4.27          0.0593    11.9          0.168     36.7          0.548
GTX 1070      4.94          0.0757    14.7          0.246     56.0          0.796
(2) GTX 1070  4.19          0.0477    11.7          0.154     36.4          0.532
GTX 1080      4.45          0.0653    13.1          0.207     40.3          0.652
(2) GTX 1080  4.08          0.0472    11.7          0.147     35.4          0.504
Notes:
Hyperthreading was enabled for the CPU results (48 HT cores). For the GPU results only “real” cores were used (24 cores).
It is notable that the GTX 1070 outperformed the Titan X.

Peak Tower Single – Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           31.3          0.629     88.0          1.83      297.2         6.33
Titan X       5.48          0.0896    17.0          0.278     52.5          0.944
(2) Titan X   5.29          0.0749    15.8          0.237     47.7          0.808
GTX 1070      5.25          0.0785    16.3          0.261     51.4          0.908
(2) GTX 1070  5.19          0.0708    16.1          0.238     47.9          0.811
GTX 1080      5.11          0.0731    16.1          0.243     48.5          0.831
(2) GTX 1080  5.13          0.0716    16.2          0.239     47.7          0.809
Notes:
Hyperthreading was enabled for the CPU results (28 HT cores). For the GPU results only “real” cores were used (14 cores).
These results are CPU bound when GPU acceleration is used!

Peak Tower Single – Core-i7 6900K 8-core @ 3.2GHz (3.5GHz) [CPU and GPU Acceleration Results]

              apoa1                   f1atpase                stmv
              wall time(s)  day/ns    wall time(s)  day/ns    wall time(s)  day/ns
CPU           45.8          0.975     134.0         2.92      457.0         10.0
Titan X       6.16          0.102     20.4          0.338     62.4          1.14
(2) Titan X   6.44          0.102     20.4          0.338     62.6          1.13
GTX 1070      6.26          0.103     20.5          0.342     62.2          1.13
(2) GTX 1070  6.38          0.102     20.6          0.338     62.7          1.13
GTX 1080      6.41          0.102     20.5          0.343     61.9          1.12
(2) GTX 1080  6.43          0.102     20.5          0.337     63.2          1.12
Notes:
Hyperthreading was enabled for the CPU results (16 HT cores). For the GPU results only “real” cores were used (8 cores).
These results are CPU bound when GPU acceleration is used!
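
As a quick cross-check before the conclusions, here are the GPU speedups on the Core-i7 6900K system, computed directly from the day/ns numbers in the table above:

    # day/ns from the Core-i7 6900K table: CPU-only vs. (2) GTX 1080
    runs = {"apoa1": (0.975, 0.102),
            "f1atpase": (2.92, 0.337),
            "stmv": (10.0, 1.12)}

    for sim, (cpu, gpu) in sorted(runs.items()):
        print("%s: %.1fx speedup" % (sim, cpu / gpu))
    # apoa1: 9.6x, f1atpase: 8.7x, stmv: 8.9x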

Conclusions and recommendations

Running NAMD with GPU acceleration can increase performance by a factor of 8-10 over CPU alone! This is enough performance to allow moderate sized MD simulations to run in a reasonable amount of time on a single-node workstation. The folks at UIUC are constantly working on NAMD, and I’m sure future versions will move more of the workload onto the GPU, since there is still significant performance to be gained. Even though the Peak systems can accommodate four X16 GPUs, it seems that more than two (or even just two) may be more GPU compute capability than the CPUs can keep up with. I think either a Peak Single with a GTX 1080, or a Peak Dual with two GTX 1080s or two GTX 1070s and as much CPU as your budget will allow, would be excellent for NAMD at this point in time.

Happy computing –dbk