LAMMPS Optimized for Intel on Quad Socket Xeon

LAMMPS

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a high quality, well parallelized collection of packages for molecular dynamics (MD) research. It has a nice collection of “atom styles” and force fields and a large number of contributed packages. It’s open source under the GPL. LAMMPS can run on a single processor or on the largest parallel supercomputers, and it has packages that provide force calculations accelerated on GPUs and, recently, Xeon Phi. It is capable of doing simulations with billions of atoms!

You can find information for obtaining LAMMPS on the download page. There are prebuilt packages for Ubuntu (deb) and RPM-based Linux distros, and for OS X and Windows too! It is probably most common for a LAMMPS installation to be built from source, since there are a large number of optional packages and features that can be tailored to a variety of use cases and hardware platforms.

Before we go any further, I would like to include the following quote from the Getting Started section in the docs:

Read this first: Building LAMMPS can be non-trivial. You may need to edit a makefile, there are compiler options to consider, additional libraries can be used (MPI, FFT, JPEG, PNG), LAMMPS packages may be included or excluded, some of these packages use auxiliary libraries which need to be pre-built, etc. Please read this section carefully. If you are not comfortable with makefiles, or building codes on a Unix platform, or running an MPI job on your machine, please find a local expert to help you. Many compiling, linking, and run problems that users have are often not LAMMPS issues – they are peculiar to the user's system, compilers, libraries, etc. Such questions are better answered by a local expert.

This is a typical situation with scientific research codes: getting a good installation running requires system administration skills AND an understanding of how the software is used. Usually those skills are possessed by different people. Sysadmins know how to configure and compile codes for the hardware they are in charge of, but not what needs to be built for the jobs to be run. Users know what needs to be built for their work, but not how to optimize the installation. You are usually best off making this a team effort between a skilled admin and a knowledgeable user.

Since I was interested in some of the latest code (some of which was added in the middle of the week that I was doing my testing!), I cloned a fresh copy of the source from their git repository.

git clone https://github.com/lammps/lammps lammps

When compiling LAMMPS it’s really easy to get sucked into playing with makefiles and trying different compilers, build options, package combinations, etc. It’s generally a well behaved code base and you can be rewarded for your efforts. My initial interest was mainly in testing the code on our quad Xeon machine. (I’ll be doing GPU and Phi testing along with other systems in some other posts.) I played with Intel compilers, gcc, different FFT libraries, etc. In the middle of all that fun, there was an announcement of a package, USER-INTEL. This package provides some well optimized atom pair styles that performed significantly better than the builds I was doing (using the Intel compilers). If you couple this package with USER-OMP, which has some Intel optimizations for bend and dihedral styles, you get a good performing build on Intel hardware when using the command line switch “-suffix intel”.

The package USER-INTEL also adds a very useful makefile, “Makefile.intel”. I installed Intel MPI (impi) and used this makefile without modification. (I wish Intel would bundle impi with Parallel Studio! It is a good MPICH-based MPI with thread-safe libraries for use with hybrid OpenMP/MPI builds. It’s very useful even on a single SMP node.)

Here are my build lines:

[kinghorn@tbench src]$ make yes-molecule yes-granular yes-manybody yes-misc yes-rigid yes-kspace
Installing package molecule
Installing package granular
Installing package manybody
Installing package misc
Installing package rigid
Installing package kspace


[kinghorn@tbench src]$ make yes-user-intel yes-user-omp
Installing package user-intel
Installing package user-omp


[kinghorn@tbench src]$ make -j20 intel

That should get you a build rich enough to run all the benchmark jobs in the /bench directory.

Note: Here’s a fix for the broken input files in.chute and in.chute.scaled. Comment out the communicate line and use comm_modify instead:

#communicate    single vel yes
comm_modify     vel yes
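If you would rather not hand-edit both files, the same change can be scripted. The following is only a sketch: the function name `fix_chute` is my own, it writes a patched copy rather than editing in place, and it assumes the old line reads exactly `communicate single vel yes` (with any amount of whitespace between the words).

```shell
#!/usr/bin/env bash
# Sketch: swap the old "communicate single vel yes" line for comm_modify.
# fix_chute is a hypothetical helper name; it writes a patched copy and
# leaves the original input file untouched.
fix_chute() {
    local infile="$1" outfile="$2"
    # Comment out the old command (& is the matched line) and put the
    # replacement command on the line that follows it.
    sed -E 's/^communicate[[:space:]]+single[[:space:]]+vel[[:space:]]+yes[[:space:]]*$/#&\
comm_modify     vel yes/' "$infile" > "$outfile"
}
```

For example, `fix_chute in.chute in.chute.fixed` would leave in.chute as it is and write the patched copy next to it. Check the result with diff before running it.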

To use the Intel optimized code, add the following at the top of the input files:

package intel * 
package omp *
suffix $s

and then enable the intel "suffix" on the command line with

-v s intel 

That will get you the nicely optimized code from the USER-INTEL package and automatically fall back to the Intel optimizations in the USER-OMP package for other potential terms, etc. You can then, for example, run a 256000 atom Rhodopsin benchmark job like the following:

mpiexec  -np 40 ../src/lmp_intel -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -v s intel

To disable the suffix so you can see the effect of the optimizations, you can use:

-v s none
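To compare the two settings across core counts, it helps to generate the command lines rather than type them. Here is a minimal sketch that only prints the commands. The function name is hypothetical, the core counts are the ones used in the results tables, and the binary path and input file are taken from the example command above. It assumes the input file already has the three `package`/`suffix` lines at the top.

```shell
#!/usr/bin/env bash
# Sketch: print the Rhodopsin strong-scaling commands for one suffix setting.
# Nothing is executed here; the commands are only echoed so they can be
# reviewed first. strong_scaling_cmds is an illustrative name.
strong_scaling_cmds() {
    local suffix="$1"   # "intel" or "none"
    local np
    for np in 1 4 8 10 16 20 32 40; do
        echo "mpiexec -np $np ../src/lmp_intel -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -v s $suffix"
    done
}

strong_scaling_cmds intel
strong_scaling_cmds none
```

Printing first makes it easy to check the list, and the output can be piped to a shell once it looks right.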

Test System

  • Puget Systems Peak Quad Xeon:

    • 4 x Intel Xeon E5-4624L v2 @1.9GHz 10-core
    • 64GB DDR3 1600 Reg ECC

Software

  • LAMMPS source: LAMMPS_VERSION "15 Aug 2014"
  • Linux CentOS 6.5
  • Intel Parallel Studio XE 2013 SP1 update3 with Intel impi 5.0.0

Note: "update1" was recommended by the author of USER-INTEL for "performance reasons", however, I got good results with update3. I think the issues may be with the build for off-load to Xeon Phi.

Results

We will use two jobs from the /bench directory, the Rhodopsin protein job and a million atom run of the Lennard-Jones Liquid job. Both jobs will be run for 100 time steps. The Rhodopsin job will be run with both strong and weak scaling. The Lennard-Jones Liquid job will be run with strong scaling.

Note: Strong scaling is running with a fixed problem size and increasing parallel processes, i.e. Amdahl’s law type of scaling. Weak scaling is increasing problem size with increasing parallel processes, i.e. like Gustafson-Barsis’ law.

  • million atom timesteps/sec = (# of atoms) * timesteps / runtime * 10^-6
  • speedup = (runtime for 1 core) / (runtime for n cores)
  • parallel efficiency % (strong scaling) = 1/cores * speedup * 100
  • parallel efficiency % (weak scaling) = (runtime at 1 core problem size) / (runtime at n core problem size) * 100
  • % improvement with suffix intel = ( 1 – (runtime intel) / (runtime none) ) * 100
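The definitions above are easy to check against the tables with a few lines of awk. This is a sketch with function names of my own choosing, not anything from LAMMPS. The rounding (two decimals or one) is picked to match how the tables are printed.

```shell
#!/usr/bin/env bash
# Sketch of the metrics listed above, using awk for the floating-point math.
# All function names are illustrative.

# million atom timesteps/sec: atoms, timesteps, runtime (sec)
matom_steps_per_sec() { awk -v a="$1" -v t="$2" -v r="$3" 'BEGIN { printf "%.2f\n", a * t / r * 1e-6 }'; }

# speedup: 1-core runtime, n-core runtime
speedup() { awk -v r1="$1" -v rn="$2" 'BEGIN { printf "%.2f\n", r1 / rn }'; }

# strong-scaling parallel efficiency %: cores, 1-core runtime, n-core runtime
strong_eff() { awk -v n="$1" -v r1="$2" -v rn="$3" 'BEGIN { printf "%.1f\n", r1 / rn / n * 100 }'; }

# weak-scaling parallel efficiency %: 1-core runtime, n-core runtime
weak_eff() { awk -v r1="$1" -v rn="$2" 'BEGIN { printf "%.1f\n", r1 / rn * 100 }'; }

# % improvement: runtime with suffix intel, runtime with suffix none
improvement() { awk -v ri="$1" -v rn="$2" 'BEGIN { printf "%.1f\n", (1 - ri / rn) * 100 }'; }
```

As a check, the 40-core Rhodopsin strong-scaling run (256000 atoms, 100 steps, 8.47 sec with suffix intel, 11.14 sec with suffix none, 238.41 sec on 1 core) gives 3.02, 28.15, 70.4 and 24.0 from `matom_steps_per_sec`, `speedup`, `strong_eff` and `improvement` respectively.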

Quad Xeon LAMMPS parallel performance, Rhodopsin strong scaling 256000 atoms, 100 time steps

 
| Cores | Million atom timesteps/sec | Speedup | Parallel efficiency % | Runtime (sec), suffix intel | Runtime (sec), suffix none | % improvement with suffix intel |
|---|---|---|---|---|---|---|
| 40 | 3.02 | 28.15 | 70.4 | 8.47 | 11.14 | 24.0 |
| 32 | 2.75 | 25.58 | 80.0 | 9.32 | 12.36 | 26.2 |
| 20 | 1.66 | 15.44 | 77.2 | 15.44 | 20.72 | 25.5 |
| 16 | 1.38 | 12.81 | 80.1 | 18.61 | 25.15 | 26.0 |
| 10 | 0.938 | 8.74 | 87.4 | 27.28 | 36.02 | 24.3 |
| 8 | 0.772 | 7.19 | 89.9 | 33.15 | 45.07 | 26.5 |
| 4 | 0.409 | 3.81 | 95.2 | 62.61 | 84.65 | 26.0 |
| 1 | 0.107 | 1 | 100 | 238.41 | 317.95 | 25.0 |

Notes: I didn't account for "turbo-boost" on the runs with few cores, which could skew the scaling somewhat. The code that is utilized from USER-INTEL gives a nice 25% speedup on this job, even though it does not cover all of the formulas used in this calculation. Hopefully this package will be expanded to include more atom styles, potentials, etc.

Quad Xeon LAMMPS parallel performance, Rhodopsin weak scaling, 32000 to 1280000 atoms, 100 time steps

 
| Cores | Million atom timesteps/sec | Atoms | Parallel efficiency % | Runtime (sec), suffix intel | Runtime (sec), suffix none | % improvement with suffix intel |
|---|---|---|---|---|---|---|
| 40 | 3.16 | 1280K | 74.2 | 40.47 | 54.49 | 25.7 |
| 32 | 2.77 | 1024K | 81.1 | 37.01 | 49.60 | 25.4 |
| 20 | 1.73 | 640K | 81.2 | 36.95 | 49.71 | 25.7 |
| 16 | 1.41 | 512K | 82.9 | 36.21 | 49.06 | 22.0 |
| 10 | 0.931 | 320K | 87.3 | 34.36 | 46.40 | 26.0 |
| 8 | 0.771 | 256K | 90.3 | 33.22 | 45.30 | 26.7 |
| 4 | 0.406 | 128K | 95.1 | 31.55 | 43.24 | 27.0 |
| 1 | 0.107 | 32K | 100 | 30.01 | 40.13 | 25.2 |

Notes: Weak scaling shows similar scaling and performance to the strong scaling runs. I'm not sure if I am surprised by this or not! I am impressed that a dynamics problem with 1.28 million atoms runs in well under a minute on this machine!

The following "diff" shows the changes I made to the standard benchmark file in.lj to create the million atom job file with support for the Intel optimizations.

[kinghorn@tbench bench-intel]$ diff in.lj in.lj-mil
2,5c2,8
< 
< variable      x index 1
< variable      y index 1
< variable      z index 1
---
> package intel * 
> package omp *
> suffix $s
> 
> variable      x index 4
> variable      y index 4
> variable      z index 2
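The x, y and z variables replicate the benchmark box, so the atom count is the base count times x*y*z. A small sketch of that arithmetic follows. The function name is hypothetical, and the 32000-atom base is an assumption: it is what the 1,024,000-atom figure implies for x=4, y=4, z=2, and it matches the 1-core Rhodopsin size in the weak scaling table.

```shell
#!/usr/bin/env bash
# Sketch: atom count for a replicated benchmark box.
# Assumes the unscaled job (x = y = z = 1) holds 32000 atoms; pass a fourth
# argument to override that base count. atom_count is an illustrative name.
atom_count() {
    local x="$1" y="$2" z="$3" base="${4:-32000}"
    echo $(( x * y * z * base ))
}

atom_count 4 4 2   # Lennard-Jones job from the diff above: prints 1024000
atom_count 2 2 2   # Rhodopsin strong-scaling job: prints 256000
```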

Quad Xeon LAMMPS parallel performance, Lennard-Jones liquid million atom (1024000) strong scaling, 100 time steps

 
| Cores | Million atom timesteps/sec | Speedup | Parallel efficiency % | Runtime (sec), suffix intel | Runtime (sec), suffix none | % improvement with suffix intel |
|---|---|---|---|---|---|---|
| 40 | 50.69 | 24.4 | 60.9 | 2.02 | 3.02 | 33.1 |
| 32 | 47.63 | 22.9 | 71.5 | 2.15 | 3.41 | 37.0 |
| 20 | 33.25 | 16.0 | 79.9 | 3.08 | 5.07 | 39.3 |
| 16 | 25.04 | 12.0 | 75.1 | 4.09 | 6.63 | 38.3 |
| 10 | 18.72 | 9.0 | 90.0 | 5.47 | 9.25 | 40.9 |
| 8 | 15.54 | 7.47 | 93.3 | 6.59 | 11.28 | 41.6 |
| 4 | 8.29 | 3.98 | 99.6 | 12.35 | 21.44 | 42.4 |
| 1 | 2.08 | 1 | 100 | 49.20 | 85.72 | 42.6 |

Notes: The code that is utilized from USER-INTEL gives a nice 40% speedup on this job!

The following plot presents the strong and weak scaling data for the Rhodopsin benchmarks. I have done curve fitting against Amdahl's and Gustafson's laws. The Amdahl's law fit is better for both data sets, though neither fit is that great, since it is pretty obvious that performance is rolling off sharply after 32 cores. However, I've included the plot for "completeness".

  Amdahl(cores) = 0.107/((1-a) + (a/cores))
  Gustafson(cores) = .107 * ( cores - b*(cores-1) )
  performance scaled by 1-core "millions of timesteps/sec"
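The two model curves can be evaluated directly for any trial value of a or b. The sketch below does that with awk. The function names are my own and the fitted values of a and b are not given here, so none are assumed. The last helper is not the curve fit used for the plot. It is only a single-point estimate of a, obtained by solving speedup = 1/((1-a) + a/cores) for a.

```shell
#!/usr/bin/env bash
# Sketch: evaluate the Amdahl and Gustafson model curves given above.
# 0.107 is the measured 1-core rate in million atom timesteps/sec.
# Function names are illustrative; a and b are trial parameters.

# Amdahl(cores) = 0.107 / ((1 - a) + a / cores)
amdahl() { awk -v n="$1" -v a="$2" 'BEGIN { printf "%.3f\n", 0.107 / ((1 - a) + a / n) }'; }

# Gustafson(cores) = 0.107 * (cores - b * (cores - 1))
gustafson() { awk -v n="$1" -v b="$2" 'BEGIN { printf "%.3f\n", 0.107 * (n - b * (n - 1)) }'; }

# Single-point estimate of a from one measured speedup s on n cores:
# a = (1 - 1/s) / (1 - 1/n). This is not a least-squares fit.
amdahl_fraction() { awk -v n="$1" -v s="$2" 'BEGIN { printf "%.4f\n", (1 - 1 / s) / (1 - 1 / n) }'; }
```

For instance, `amdahl_fraction 40 28.15` uses the 40-core strong-scaling speedup from the first table and returns roughly 0.989, meaning about 99% of the work behaves as parallel at that point.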

Discussion

LAMMPS is an important application in computational chemistry/biology and the problems used in these benchmarks are non-trivial, yet they completed remarkably quickly on this quad Xeon workstation. Real-world calculations would likely be run for many more time steps, but even then the expected throughput is excellent. Keep in mind that this test machine is running at a fairly slow clock speed of 1.9GHz! Also, since the scaling falls off fairly sharply after 32 cores, my recommendation for a LAMMPS workstation would be based around Xeon v2 processors, specifically my current standard favorite: a quad 8-core Xeon configuration.

Happy computing! –dbk