AMD Threadripper 3970x Compute Performance Linpack and NAMD

Introduction

AMD Threadripper 3970x 32-core! This is the third new AMD processor I've had the pleasure of trying recently. I'm running it through the same double precision floating point performance tests as the recently tested Ryzen processors: Linpack and NAMD.

I've recently done testing with the Ryzen 3950x 16-core ("AMD Ryzen 3950x Compute Performance Linpack and NAMD") and with the Ryzen 3900x 12-core ("AMD 3900X (Brief) Compute Performance Linpack and NAMD"). Performance has been very impressive. The Ryzen 3950x and TR 3970x have benefited from the much improved AMD BLIS (BLAS) library v2.0.

In the past I was pretty impressed with performance but wished there were a more optimal BLAS library for the Zen2 architecture. There is now a version 2.0 of the AMD "BLIS" library, and it gives significantly better Linpack performance than the v1.3 that was used in older posts.

This post revisits the recent Ryzen posts and adds new results for the Threadripper 3970x. I'm including NAMD Molecular Dynamics results for my usual test molecule, STMV, as well as a smaller molecular system, ApoA1. ApoA1 seems to be a popular system for benchmarking on CPU with NAMD. GPU-accelerated results are also reported for the ApoA1 job.

System Configuration

Hardware:

(see the 2 posts linked in the Introduction for the Ryzen configurations)

  • AMD Threadripper 3970x
  • Motherboard Gigabyte TRX40 AORUS EXTREME
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO NVMe M.2
  • NVIDIA RTX 2080Ti GPU

Software:

Notes:

  • I used Ubuntu 18.04 for this testing rather than the 19.10 that was used for the Ryzen testing. I had wanted to use 19.10 in order to have newer libs and kernel, but there were some motherboard issues that kept it from booting and not enough time to sort them out. 18.04 installed and worked fine with the TR 3970x.
  • New results in this post are for Threadripper 3970x only. The other results are from previous testing.

Linpack

Notes:

  • The pre-built multi-threaded HPL binary provided by AMD worked well, so I didn't bother rebuilding from source. This is the "MT" build, but it still looks for MPI header files on start-up and uses the HPL.dat file for job run configuration.
  • AMD BLIS (a.k.a. AMD's BLAS library) has been updated to version 2.0 with specific support for Zen2.
  • Several combinations of MPI ranks together with OMP threads were tried. The best results were obtained using only OMP threads and the pre-built binary without MPI. 1 OMP thread per "real" core, i.e. 32 OMP threads, gave the best result.
  • There is a detailed description of HPL Linpack testing for Threadripper 2990WX in the post "How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper — 2990WX 32-core Performance". The 2990WX result from that testing, which is presented in this post, could probably be improved with the new BLIS lib.
  • The Intel CPUs were tested with the (highly) optimized Linpack benchmark program included with the Intel MKL performance library.
  • A large problem size (Ns=114000), approx. 90% of available memory (128GB), was used in order to maximize performance results.
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR12R2R4      114000   768     1     1             746.42             1.3233e+03
HPL_pdgesv() start time Sat Nov 23 12:20:42 2019

HPL_pdgesv() end time   Sat Nov 23 12:33:08 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.04947690e-03 ...... PASSED
================================================================================
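
As a sanity check on the output above, the Gflops column can be recomputed from N and the wall time. This is a minimal sketch, assuming the standard HPL operation count of (2/3)N^3 + (3/2)N^2; the function name `hpl_gflops` is just for illustration and is not part of HPL.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimate HPL GFLOPS from problem size N and wall time in seconds.

    Assumes the usual HPL operation count: (2/3)*N^3 + (3/2)*N^2.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # N and Time taken from the WR12R2R4 result line above.
    print(f"{hpl_gflops(114000, 746.42):.1f} GFLOPS")
```

With N=114000 and Time=746.42 this should land at roughly 1323 GFLOPS, in line with the 1.3233e+03 reported by HPL.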

Here is the HPL.dat file used. This file automates runs over 3 problem sizes (Ns) and 3 block sizes (NBs). Also note that P and Q are set to 1, i.e. 1 MPI rank; parallelism was from OMP threads.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
112000 113000 114000 Ns
3            # of NBs
256 512 768  NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
...
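
To get a feel for what this HPL.dat asks for, here is a small sketch that lists the (N, NB) combinations and estimates the memory needed for the N x N double precision matrix. It only counts the matrix itself (8 bytes per element); HPL workspace and the OS need memory on top of that. The helper name `hpl_matrix_gib` is hypothetical.

```python
from itertools import product


def hpl_matrix_gib(n: int) -> float:
    """Size in GiB of an N x N matrix of 8-byte doubles (matrix only)."""
    return 8 * n * n / 2**30


# Values copied from the HPL.dat above.
ns = [112000, 113000, 114000]
nbs = [256, 512, 768]

# HPL runs every problem size with every block size.
runs = list(product(ns, nbs))

if __name__ == "__main__":
    print(f"{len(runs)} runs in total")
    for n in ns:
        print(f"N={n}: {hpl_matrix_gib(n):.1f} GiB for the matrix")
```

By this estimate the largest problem, N=114000, needs about 97 GiB for the matrix alone.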

The following environment variables were set for the Ryzen Linpack runs:

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=32   # (16 for 3950x ...)
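
If you drive the benchmark from a script rather than an interactive shell, the same settings can be passed per run. The following is a rough sketch, not how the runs in this post were launched: `omp_env` is a hypothetical helper, and the binary name `./xhpl` in the usage comment is an assumption.

```python
import os


def omp_env(physical_cores: int) -> dict:
    """Copy the current environment and add the OpenMP settings shown above.

    One thread per physical core, pinned to cores.
    """
    env = dict(os.environ)
    env.update(
        {
            "OMP_PROC_BIND": "TRUE",
            "OMP_PLACES": "cores",
            "OMP_NUM_THREADS": str(physical_cores),
        }
    )
    return env


# Usage (assumed binary name, run from the directory containing HPL.dat):
#   import subprocess
#   subprocess.run(["./xhpl"], env=omp_env(32), check=True)
```

Passing the core count explicitly keeps the thread count tied to "real" cores rather than to the logical CPU count the OS reports with SMT enabled.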

The AMD Threadripper 3970x gave better than expected results!

The following plot shows HPL Linpack results (in GFLOPS). Best results for Ryzen were with Ns=114000 and NB=768.

TR3970X Linpack

The TR3970x results are exceptionally good for a processor with AVX2!

The Intel processors with AVX-512 vector units have a big advantage for Linpack. Also, the Linpack used for the Intel processors is built with the BLAS library from Intel's excellent MKL (Math Kernel Library).
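
For a rough sense of how close that is to the hardware limit, here is a back-of-the-envelope estimate. It rests on two assumptions that are not measured in this post: 16 double precision FLOPs per cycle per Zen2 core (two 256-bit FMA units), and the 3.7 GHz base clock as the sustained all-core frequency. The real clock under an AVX2 load may be higher or lower, so treat the efficiency as indicative only.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores * clock (GHz) * FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# Assumption: 2 FMA units x 4 doubles (256-bit) x 2 ops (mul + add) = 16.
ZEN2_DP_FLOPS_PER_CYCLE = 16
# Assumption: all cores sustain the 3.7 GHz base clock.
BASE_CLOCK_GHZ = 3.7

peak = peak_gflops(32, BASE_CLOCK_GHZ, ZEN2_DP_FLOPS_PER_CYCLE)
measured = 1323.3  # GFLOPS, from the HPL result in this post
efficiency = measured / peak

if __name__ == "__main__":
    print(f"peak ~{peak:.0f} GFLOPS, HPL efficiency ~{efficiency:.0%}")
```

Under these assumptions the measured result is about 70% of the base-clock peak.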

NAMD

Now on to the real world! … sort of … NAMD is one of my favorite programs to use for benchmarking because it has great parallel scaling across cores (and cluster nodes). It does not significantly benefit from linking with the Intel MKL library and it runs on a wide variety of hardware and OS platforms. It's also a very important Molecular Dynamics research program.

When I said "sort of" above I'm referring to the fact that NAMD also has very good GPU acceleration. Adding CUDA capable GPUs will increase throughput by an order of magnitude. However, with NAMD and other codes like it, only a portion of the heavy compute can be offloaded to GPU. A good CPU is necessary to achieve balanced performance. I like NAMD as a CPU benchmark because I believe it is an excellent representative of scientific applications and reflects performance characteristics of many other programs in this domain.

This plot shows the performance of a molecular dynamics simulation on the million atom "stmv" (satellite tobacco mosaic virus). These job runs are with CPU only. Performance is in "day/ns" (days to compute a nanosecond of simulation time). This is the standard output for NAMD. If you prefer ns/day then just take the reciprocal.

NAMD Ryzen 3950X
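
Since NAMD reports day/ns (lower is better) and many people quote ns/day (higher is better), here is a tiny conversion sketch. The sample values are made up for illustration and are not results from this post.

```python
def ns_per_day(days_per_ns: float) -> float:
    """Convert NAMD's day/ns figure to ns/day by taking the reciprocal."""
    if days_per_ns <= 0:
        raise ValueError("day/ns must be positive")
    return 1.0 / days_per_ns


if __name__ == "__main__":
    # Hypothetical day/ns values, only to show the direction of the metric.
    for d in (0.5, 0.25):
        print(f"{d} day/ns = {ns_per_day(d)} ns/day")
```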

The Threadripper 3970x gave excellent performance, significantly improving the already great result with last generation Threadripper 2990WX.

This last set of results is using the smaller ApoA1 problem (it's still pretty big with 92000 atoms!)

I ran this job for two reasons: 1) to show how well the TR3970x does compared to the Xeon-W 2175 14-core, and 2) to provide a reality check about how much adding a GPU can increase performance for programs that have good GPU acceleration. Adding NVIDIA 2080Ti GPU acceleration increased performance by over a factor of ten!

TR 3970x + 2080Ti NAMD ApoA1

The 32-core TR3970x is very good with this Molecular Dynamics code! The upcoming 64-core version should scale perfectly and provide the needed CPU performance to keep up with the code running on multiple NVIDIA 2080Ti's. My guess is that the upcoming TR3990x 64-core CPU together with 2-4 NVIDIA 2080Ti's will "set the bar" for performance as a workstation platform for this class of applications. I'm looking forward to testing that!

Conclusion

At this point I really don't have any serious reservations in recommending any of the new AMD Zen 2 core based processors for compute intensive workloads. They give excellent performance and value. This is not to say that I have reservations recommending the new Intel Core-X or Xeon-W processors in this price class. The Intel processors come with a solid mature platform and also offer excellent performance (and now, with the new price cuts, good value). Intel also has the advantage of a very strong development ecosystem. You can get great performance with the new AMD processors, but to achieve that performance there may be times when you have to do a little extra work recompiling code or things like that.

I will be doing more CPU testing in a few weeks after all of this year's new processors from Intel and AMD are released. So, expect another post with LOTS of new CPUs in it!

Happy computing! –dbk @dbkinghorn

