Intel Scalable Processors Xeon Skylake-SP (Purley) Buyers Guide


Intel Purley platform, Skylake-SP, Xeon “Scalable” processors (Platinum, Gold, Silver, Bronze) are here. All 58 of them! How are you going to sort out that mess? Well, hopefully this post will help.

I’ll try to sort these processors out by “use-case” to come up with a more manageable list of choices. I’ll trim the list of processors and do a price performance analysis for different usage scenarios. … What do I mean by “use-case”?

Use-case Performance by Program Execution Characteristics

Application software will have some kind of performance-limiting behavior. This is generally determined by problem domain, algorithm choice, inherent parallel scalability, memory I/O demands, programmer skill, and choice of performance libraries and compiler tools. These performance determiners map more generally to compute hardware as follows:

Serial CPU (single thread)

Software that has no parallelization is common. No multi-threading, no vectorization. This is common for code in scripting languages like Python, for programs that are not performance critical, and for legacy code that no one has bothered to optimize. [Note that scripting languages like Python are typically used as front ends for code that is highly parallel, … like numpy linked to the MKL library.]

  • For these types of programs the most important CPU attribute is likely to be the Max Turbo-Clock Frequency, and the Non-AVX All-Core-Turbo clock frequency if you are running many instances of these types of programs simultaneously. [Possibly typical “business” server applications like web apps, email, etc.]

Multi-Threaded Non-Vectorized

This large class of programs has a multi-threaded implementation but does not take advantage of vectorization. These are programs that may have “task parallelism” but are not making heavy use of matrix vector math operations. These programs may have limited scalability unless they are embarrassingly parallel.

  • For embarrassingly parallel programs that have long run-times, lots of cores are usually an advantage. Number of cores and Non-AVX All-Core-Turbo frequency are important.

  • Many programs in this class will have limited scalability. They may even be “hard-coded” to a limited number of threads. In this case fewer cores with high clock frequencies, in particular a high Non-AVX All-Core-Turbo, would be good.

Multi-Threaded and Vectorized

This is the ideal case for utilizing the features of these new processors. In this case high core-count and AVX-512 All-Core-Turbo (or AVX2 All-Core-Turbo for software that has not been updated) will likely determine performance. These are programs typical of HPC workloads like simulation and machine learning. Programs that make heavy use of matrix vector math operations should perform very well when utilizing AVX-512 and FMA operations. This of course assumes that the programmer has optimized their code! Programs that make library calls to Intel’s MKL (Math Kernel Library) or DAAL (Data Analytics Acceleration Library) should give excellent performance on the new Xeon processors.

Note: when the AVX vector units are under load the CPU clock frequency can be significantly lowered! This is true for AVX2 and especially significant for AVX-512.

Memory I/O Bound

This is one of the most disturbing cases. I occasionally hear someone say “I got the latest, best, hardware and my program doesn’t run any faster!” If the software is inherently I/O bound, or is poorly written with bad memory layout, lots of cache misses, etc., it may be difficult to improve performance with “better” hardware. In some cases a large per-core cache may help. Also, keep in mind that the L3 Cache (Last Level Cache) is shared across all cores. Therefore a high core-count CPU with a large total L3 Cache may help. If job run feasibility is limited by the amount of available memory, then the Xeon processors that are marked by a trailing M will support up to 1.5TB memory each (I will not consider these processors in my analysis).

Who Cares about the CPU!

Let’s face it, there are a lot of important modern programs, frameworks, and libraries that get their performance from GPU acceleration. In particular, programs that utilize NVIDIA CUDA on Tesla or GeForce graphics cards can outperform their CPU-based counterparts by an order of magnitude! Some of these programs will still need good CPU support for sections of the code that cannot be accelerated on the GPU. Those code sections will fall into one of the classes listed above. For programs that get nearly all of their performance from the GPU, you just need enough CPU to support the GPUs.

Let’s now start the process of weeding through this big pile of new processors.


Trimming the List of Processors (initial)

There are too many Intel Xeon Scalable processor product IDs.

Table of all 58 Processor Numbers

8180M 8180 8176M 8176F 8176 8170M 8170 8168 8164 8160T
8160M 8160F 8160 8158 8156 8153 6154 6152 6150 6148F
6148 6146 6144 6142M 6142F 6142 6140M 6140 6138T 6138F
6138 6136 6134M 6134 6132 6130T 6130F 6130 6128 6126T
6126F 6126 5122 5120T 5120 5119T 5118 5115 4116T 4116
4114T 4114 4112 4110 4109T 4108 3106 3104

There they are. Which one do you want?!

Intel has sorted these processors into “metal” classes: Platinum (81xx), Gold (61xx and 51xx), Silver (41xx), and Bronze (31xx).

Let’s try to make this a bit more manageable. There are a lot of processors that can be eliminated from consideration (for our purposes).

To start, note that some of the processors have a letter at the end of their product ID number (M, F, T). These represent:

  • M — Large memory capable, up to 1.5TB (“normal” is 768GB)

  • F — Intel Omni-Path fabric (high speed network fabric)

  • T — Thermal optimized ( for 10 year life cycle )

M we might occasionally be interested in, but F and T we can eliminate for sure. Let’s also eliminate M since those processors are much more expensive than their “normal” version. They would also have the same performance as the “normal” version unless you really needed the large memory capacity. We can also remove the 2 “Bronze” processors since they are (lame) 1.7GHz with no Turbo-Boost and no Hyper-Threading. That gets rid of 25 processors from the list.

There are 2 more “redundant” processors. 8156 and 8158 are the same as 5122 and 6136 except the 8156 and 8158 processors allow up to 8 socket systems as compared to a maximum of 4 sockets for the 5122 and 6136. Also, the 8156 and 8158 both cost $7007 while the 5122 is $1221 and the 6136 is $2460. I don’t think you want to pay an extra $6000 per processor unless you really, really have to have an 8 socket system! That brings our elimination count up to 27 … I think we can eliminate a few more …

I’ll drop the processors that have a Non-AVX All-Core-Turbo clock frequency less than 2.5GHz. They drop down to 1.6GHz or less for AVX-512 All-Core-Turbo. Those are 4108, 4110, 4116 and 8153. That gives us 31 to eliminate, leaving 27 to look at. That is still too many, but we’ll need to look at use-case price performance to find the best processors. Here are the processor IDs we will look at in more detail.

8180 8176 8170 8168 8164 8160 6154 6152 6150 6148
6146 6144 6142 6140 6138 6136 6134 6132 6130 6128
6126 5122 5120 5118 5115 4114 4112
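The trimming steps above can be reproduced mechanically. The following is a minimal sketch in Python; the function and variable names are made up for illustration, and the rules are simply the ones described in the text (drop the M, F and T variants, the Bronze parts, the two 8-socket duplicates, and the four low-clock parts).

```python
# All 58 processor numbers from the table above.
ALL_PROCESSORS = """
8180M 8180 8176M 8176F 8176 8170M 8170 8168 8164 8160T
8160M 8160F 8160 8158 8156 8153 6154 6152 6150 6148F
6148 6146 6144 6142M 6142F 6142 6140M 6140 6138T 6138F
6138 6136 6134M 6134 6132 6130T 6130F 6130 6128 6126T
6126F 6126 5122 5120T 5120 5119T 5118 5115 4116T 4116
4114T 4114 4112 4110 4109T 4108 3106 3104
""".split()

# Explicit removals named in the text.
EIGHT_SOCKET_DUPLICATES = {"8156", "8158"}
LOW_CLOCK = {"4108", "4110", "4116", "8153"}


def trim(processors):
    """Apply the elimination rules from the text, preserving list order."""
    kept = []
    for proc in processors:
        if proc[-1] in "MFT":      # large-memory, Omni-Path, thermal variants
            continue
        if proc.startswith("31"):  # Bronze
            continue
        if proc in EIGHT_SOCKET_DUPLICATES or proc in LOW_CLOCK:
            continue
        kept.append(proc)
    return kept


shortlist = trim(ALL_PROCESSORS)
print(len(ALL_PROCESSORS), "->", len(shortlist))
print(" ".join(shortlist))
```

Changing the rule sets (for example keeping the M parts) is then a one-line edit if your needs differ from mine.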

Data Analysis Price Performance and Use-Case

Performance measure

The first thing we need is a discriminatory metric to differentiate the processor performance. I’m going to use the following approximation to the theoretical performance,

Performance = Cores x TurboFreq x VecWidth x #FMA

Where,

  • Cores — is the number of CPU cores (not considering Hyper-Threads)

  • TurboFreq — is the relevant clock frequency (more on that below)

  • VecWidth — is the AVX vector width for double precision floating point numbers ( It will be 4 for AVX2 and 8 for AVX-512 ).

  • #FMA — This will be the number of AVX FMA units (Fused Multiply-Add). A few of the lower performance processors have 1 FMA unit and the rest have 2.

This would give a number that could be interpreted as GFLOP/s (Billions of floating point operations per second). This is not a very good estimate of peak floating point performance for these complicated processors but it will serve our needs as a performance measure. [We’ll also consider cache size for memory bound applications]
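As a quick illustration of the formula, here is a small Python sketch. The function name and the dictionary layout are hypothetical; the core counts, AVX-512 All-Core-Turbo clocks and FMA counts are taken from the processor table later in this post.

```python
AVX512_DP_WIDTH = 8  # double-precision values per AVX-512 vector
AVX2_DP_WIDTH = 4    # double-precision values per AVX2 vector


def relative_performance(cores, turbo_ghz, vec_width, fma_units):
    """Performance = Cores x TurboFreq x VecWidth x #FMA."""
    return cores * turbo_ghz * vec_width * fma_units


# name: (cores, AVX-512 All-Core-Turbo in GHz, number of FMA units)
EXAMPLES = {
    "8168": (24, 2.5, 2),
    "6154": (18, 2.7, 2),
    "4114": (10, 1.4, 1),
}

for name, (cores, ghz, fma) in EXAMPLES.items():
    score = relative_performance(cores, ghz, AVX512_DP_WIDTH, fma)
    print(f"{name}: {score:.1f}")
```

The same function works for the AVX2 or Non-AVX cases by passing the matching clock and vector width.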

CPU Clock Frequencies

There are 5 different CPU clocks for these new Xeon processors!

  • Base Clock — This is the clock frequency that the processors would run at if all “Turbo-Boost” and “power management” were disabled in the system BIOS. This would be the frequency that would achieve the TDP power draw for the processor. It is basically useless information for most end users. It is not used for any performance estimation. However, it is typically used as part of the label for the CPUs.

  • Max Turbo Frequency — This is the highest clock for the processor and is generally achieved when no more than 2 processes are running on a many-core processor, i.e., the other cores are idle. This number is important for non-parallel (serial) jobs that may be important in routine work. It’s the clock that determines how “snappy” your system feels.

  • Non-AVX All-Core-Turbo — This is the most important clock frequency for the CPU. This is the maximum clock that the CPU can run all of its cores at when the AVX vector units are not being utilized. Programs that are not optimized (or optimal) for matrix vector operations but that do have good multi-threaded scaling will likely be performance limited by this clock. This also applies to work-flows that require many simultaneous application programs.

  • AVX2 All-Core-Turbo — Yes! When the AVX units are active the CPU core clocks decrease, and they decrease differently for AVX2 and the newer, twice the bit-width, AVX-512 vector units. The Xeon Scalable processors support SSE4.2, AVX, AVX2 and AVX-512 vector operations. These operations can have a huge impact on well optimized programs that make heavy use of matrix vector math. This is the performance that I personally judge processors by. It’s the performance that is exposed by the (Intel optimized) Linpack benchmark. The vector units can give a performance boost of 4- to 16-fold but it doesn’t come for free. It requires a lot of power draw to run those parts of the CPU. There is a limited amount of power that can be safely run through a processor, and to stay within that limit the CPU has to clock down its core frequency in most cases.

  • AVX-512 All-Core-Turbo — When these high performance vector units are loaded it is the worst case for the power draw on the processor. This vector unit can have a large throttling effect on the core clock frequency.
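To see what the AVX clock reduction costs in practice, the sketch below compares AVX-512 and AVX2 throughput on the same chip using the simple width-times-clock model from the performance measure above. The function name is hypothetical, and the model ignores everything except vector width and clock, so treat the ratios as rough guidance only. Clock values come from the processor table below.

```python
def avx512_vs_avx2_gain(avx2_ghz, avx512_ghz):
    """Ratio of AVX-512 to AVX2 throughput for equal cores and FMA units.

    AVX-512 handles 8 doubles per vector versus 4 for AVX2, so the ideal
    gain is 2x; the lower AVX-512 clock pulls the ratio below that.
    """
    return (8 * avx512_ghz) / (4 * avx2_ghz)


# name: (AVX2 All-Core-Turbo, AVX-512 All-Core-Turbo) in GHz
CLOCKS = {
    "8168": (3.0, 2.5),
    "5122": (3.6, 3.3),
    "4114": (2.2, 1.4),
}

for name, (avx2, avx512) in CLOCKS.items():
    print(f"{name}: {avx512_vs_avx2_gain(avx2, avx512):.2f}x")
```

Under this model a processor whose AVX-512 clock stays close to its AVX2 clock gets nearer to the ideal doubling.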

Intel has worked hard to optimize the maximum performance they can get out of the design for these new Xeons. There is variability in the quality of the chips and they try to get the most performance they can. By doing this they end up with a lot of subtly (or not so subtly) different processors. They are trying to not waste silicon but they really have produced way too many processors in my opinion.

Let’s look at some of the important numbers for these processors and then we’ll look at price performance plots for different software characteristics.

Differentiating Processor Data

The following table has data for each of the processors we are considering without including data that is common to all of them.

Processor ID  Price  Cores  Base Clock  Max Turbo  All Core  AVX2  AVX512  #FMA  Cache  Cache per Core  Mem Clock  TDP
8180          10009  28     2.50        3.80       3.2       2.8   2.3     2     38.5   1.375           2666       205
8176          8719   28     2.10        3.80       2.8       2.4   1.9     2     38.5   1.375           2666       165
8170          7411   26     2.10        3.70       2.8       2.4   1.9     2     35.75  1.375           2666       165
8168          5890   24     2.70        3.70       3.4       3.0   2.5     2     33     1.375           2666       205
8164          6120   26     2.00        3.70       2.7       2.3   1.8     2     35.75  1.375           2666       150
8160          4708   24     2.10        3.70       2.8       2.5   2.0     2     33     1.375           2666       150
8153          3115   16     2.00        2.80       2.3       2.0   1.6     2     22     1.375           2666       125
6154          3543   18     3.00        3.70       3.7       3.3   2.7     2     24.75  1.375           2666       200
6152          3661   22     2.10        3.70       2.8       2.4   2.0     2     30.25  1.375           2666       140
6150          3358   18     2.70        3.70       3.4       3.0   2.5     2     24.75  1.375           2666       165
6148          3078   20     2.40        3.70       3.1       2.6   2.2     2     27.5   1.375           2666       150
6146          3286   12     3.20        4.20       3.9       3.3   2.7     2     24.75  2.0625          2666       165
6144          2925   8      3.50        4.20       4.1       3.5   2.8     2     24.75  3.094           2666       150
6142          2952   16     2.60        3.70       3.3       2.9   2.2     2     22     1.375           2666       150
6140          2451   18     2.30        3.70       3.0       2.6   2.1     2     24.75  1.375           2666       140
6138          2618   20     2.00        3.70       2.7       2.3   1.9     2     27.5   1.375           2666       125
6136          2460   12     3.00        3.70       3.6       3.3   2.7     2     24.75  2.0625          2666       150
6134          2220   8      3.20        3.70       3.7       3.4   2.7     2     24.75  3.094           2666       130
6132          2111   14     2.60        3.70       3.3       2.9   2.3     2     19.25  1.375           2666       140
6130          1900   16     2.10        3.70       2.8       2.4   1.9     2     22     1.375           2666       125
6128          1697   6      3.40        3.70       3.7       3.6   2.9     2     19.25  3.208           2666       115
6126          1776   12     2.60        3.70       3.3       2.9   2.3     2     19.25  1.604           2666       125
5122          1227   4      3.60        3.70       3.7       3.6   3.3     2     16.5   4.125           2666       105
5120          1561   14     2.20        3.20       2.6       2.2   1.6     1     19.25  1.375           2400       105
5118          1273   12     2.30        3.20       2.7       2.3   1.6     1     16.5   1.375           2400       105
5115          1221   10     2.40        3.20       2.8       2.4   1.6     1     13.75  1.375           2400       85
4114          704    10     2.20        3.00       2.5       2.2   1.4     1     13.75  1.375           2400       85
4112          483    4      2.60        3.00       2.9       2.6   1.4     1     8.25   2.0625          2400       85

Notes: The price is Intel’s suggested price in USD. The CPU clocks are in GHz, the Memory clock is in MHz, the Cache sizes are in MB, and the TDP is in Watts.

I like looking at numbers but I will have several plots below to look at to find the best choice processors. A few processors in the table stand out to me because of their obviously interesting features.

  • The 4112 is the least expensive and could be a good choice when the CPU doesn’t matter that much, i.e., maybe it’s just there to support 4 or 8 NVIDIA GPUs for compute.

  • The 5122 stands out as the processor with the highest AVX-512 All-Core-Turbo. It is just 4 cores but they are all running full speed. It also has the largest per core Cache. This could be a good processor for memory bound programs and/or programs that don’t have good parallel scaling but do take advantage of AVX512 vectorization.

  • The 6144 and 6146 have a high Max-Turbo and All-Core-Turbo with large Cache per Core.

  • The 8168 stands out as a good value for high core count and it has good AVX-512 All-Core-Turbo too.

Let’s see what the plots have to show us.


Price vs Relative Performance Plots

I’m going to start with the Price vs Relative Performance plots for the best case software. That’s software that is highly parallel (Multi-threaded) and highly vectorized for AVX-512. This is the software that would expose the best performance of the processors.

Higher performance is to the right and higher price is toward the top in the plots.

[Plot: AVX-512 price vs. performance]

There are 11 processors that don’t look very attractive in this plot and I have the advantage of knowing that they don’t look very good in any of the other many plots that I made. I am going to remove these from the data set and redo this plot and use the reduced dataset for the remainder of the plots.

[Removing 8180, 8176, 8170, 8164, 8160, 6152, 6150, 8153, 5115, 5118, 5120]

The next plot is excluding these processors.

[Plot: AVX-512 price vs. performance, reduced processor list]

This plot gives a clearer picture of the relative performance. The trend and relative placement of the processors differs only slightly for the AVX2 All-Core-Turbo case so I won’t include that plot.
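For a rough numeric companion to the plot, the sketch below computes the AVX-512 performance metric per dollar for a few of the remaining processors, using prices, core counts, clocks and FMA counts from the table above. The names are illustrative only, and ranking by a single ratio is a simplification assumed here; the plot keeps price and performance on separate axes, which a ratio cannot show.

```python
# name: (price in USD, cores, AVX-512 All-Core-Turbo in GHz, FMA units)
PROCESSORS = {
    "8168": (5890, 24, 2.5, 2),
    "6154": (3543, 18, 2.7, 2),
    "6148": (3078, 20, 2.2, 2),
    "6140": (2451, 18, 2.1, 2),
    "6130": (1900, 16, 1.9, 2),
    "5122": (1227, 4, 3.3, 2),
}

AVX512_DP_WIDTH = 8  # doubles per AVX-512 vector


def perf_per_dollar(price, cores, ghz, fma_units):
    """AVX-512 metric (Cores x TurboFreq x VecWidth x #FMA) divided by price."""
    return cores * ghz * AVX512_DP_WIDTH * fma_units / price


# Sort from best to worst performance per dollar.
ranked = sorted(
    ((name, perf_per_dollar(*spec)) for name, spec in PROCESSORS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.3f} per USD")
```

A ratio like this tends to favor the cheaper parts, so it is worth reading alongside absolute performance: a processor with the lowest ratio here can still be the one to pick when you need the most cores in a socket.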

The next plot uses the All-Core-Turbo frequency scaled by the per-core Cache. This gives more weight to the processors that may do better for jobs that are memory bound.

[Plot: price vs. performance weighted by per-core Cache, reduced processor list]

This plot favors the high clock processors with large per-core Cache.
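The exact weighting behind this plot is not spelled out above. One plausible reading, assumed purely for illustration here, is the Non-AVX All-Core-Turbo clock multiplied by the per-core Cache. The sketch below uses that assumed score with hypothetical names and figures from the table.

```python
def cache_per_core(total_cache_mb, cores):
    """Shared L3 Cache divided evenly across cores, in MB."""
    return total_cache_mb / cores


def cache_weighted_score(all_core_ghz, total_cache_mb, cores):
    """ASSUMED score: All-Core-Turbo clock times per-core Cache."""
    return all_core_ghz * cache_per_core(total_cache_mb, cores)


# name: (Non-AVX All-Core-Turbo in GHz, total L3 Cache in MB, cores)
CANDIDATES = {
    "8168": (3.4, 33.0, 24),
    "6144": (4.1, 24.75, 8),
    "6134": (3.7, 24.75, 8),
    "5122": (3.7, 16.5, 4),
}

ranked_names = sorted(
    CANDIDATES,
    key=lambda name: cache_weighted_score(*CANDIDATES[name]),
    reverse=True,
)

for name in ranked_names:
    print(f"{name}: {cache_weighted_score(*CANDIDATES[name]):.2f}")
```

Under this assumed score the low core count, high clock parts come out ahead of the 24-core 8168, which matches the general direction described for the plot.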

The last plot uses the All-Core-Turbo frequency. This has a similar distribution to the AVX-512 plot but does shift a few processors around without making large changes in relative “value”.

[Plot: All-Core-Turbo price vs. performance, reduced processor list]

From these last 3 plots the following processors stand out as offering good performance and value for a variety of use-cases. I’ll now try to break down that usage.

[8168, 6154, 6148, 6144, 6140, 6134, 6130, 6128, 6126, 5122, 4114, 4112]


My picks for best Xeon Scalable Skylake-SP processors (by Usage)

Processor ID  Price  Cores  Base Clock  Max Turbo  All Core  AVX2  AVX512  #FMA  Cache  Cache per Core  Mem Clock  TDP
8168          5890   24     2.70        3.70       3.4       3.0   2.5     2     33     1.375           2666       205
6154          3543   18     3.00        3.70       3.7       3.3   2.7     2     24.75  1.375           2666       200
6148          3078   20     2.40        3.70       3.1       2.6   2.2     2     27.5   1.375           2666       150
6144          2925   8      3.50        4.20       4.1       3.5   2.8     2     24.75  3.094           2666       150
6140          2451   18     2.30        3.70       3.0       2.6   2.1     2     24.75  1.375           2666       140
6134          2220   8      3.20        3.70       3.7       3.4   2.7     2     24.75  3.094           2666       130
6130          1900   16     2.10        3.70       2.8       2.4   1.9     2     22     1.375           2666       125
6128          1697   6      3.40        3.70       3.7       3.6   2.9     2     19.25  3.208           2666       115
6126          1776   12     2.60        3.70       3.3       2.9   2.3     2     19.25  1.604           2666       125
5122          1227   4      3.60        3.70       3.7       3.6   3.3     2     16.5   4.125           2666       105
4114          704    10     2.20        3.00       2.5       2.2   1.4     1     13.75  1.375           2400       85
4112          483    4      2.60        3.00       2.9       2.6   1.4     1     8.25   2.0625          2400       85

Notes: The price is Intel’s suggested price in USD. The CPU clocks are in GHz, the Memory clock is in MHz, the Cache sizes are in MB, and the TDP is in Watts.

Here’s a breakdown along the use-case ideas I presented near the top of the post.

Serial CPU (Single Thread)

  • 6144, 6128, 5122 These have 8, 6 and 4 cores with high Turbo frequencies and larger caches. They will all give excellent single thread performance and they maintain respectable clock frequency if many jobs are run at the same time. They would also have good parallel performance for jobs that have limited scalability.

Multi-Threaded (Non-Vectorized)

  • 8168, 6154, 6148, 6140, 6130, 6126, 4114 These have 24, 18, 20, 18, 16, 12, and 10 cores. They have good All-Core-Turbo speeds. My favorite would be a dual 6154; it has great performance and value.

Multi-Threaded Vectorized (Highly Optimized Software)

  • 8168, 6154, 6148, 6140, 6130, 6126 These are mostly the same as above! These processors all have good AVX-512 All-Core-Turbo frequency and high core count.

Memory I/O Bound

  • 6144, 6134, 5122 or 8168, 6148, 6140 The first group has large Cache per Core, and the second has higher core count and larger overall shared L3 Cache.

CPU doesn’t Matter

  • 6128, 5122, 4114, 4112 The first 2 would be excellent for supporting a system using GPUs for compute where there is still need for fast CPU processing and fast memory transport to and from the GPUs. The last 2 are simply the lowest cost processors on my list and would be good when CPU really doesn’t matter much.

There you have it! Those are my picks and recommendations. You may have a different opinion! Hopefully the charts and tables give you something to consider if you have a use case that I didn’t include.

One thing that I didn’t discuss here is parallel scaling and Amdahl’s Law. This is important! I recommend that you look at the post I did for the Broadwell Xeons. There is an interactive chart in that post that painfully shows how less-than-perfect scaling can affect the performance of high core count systems: Intel Xeon E5 v4 Broadwell Buyers Guide (Parallel Performance)
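For a quick feel for the effect, here is a minimal Python sketch of Amdahl’s Law in its standard form, speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that parallelizes and n is the number of cores. The function name and the 90% parallel fraction are hypothetical values chosen for illustration, not measurements.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical program that is 90% parallel, on core counts from the picks table.
for cores in (4, 8, 18, 24):
    print(f"{cores:2d} cores: {amdahl_speedup(0.90, cores):.2f}x")
```

Even with 90% of the work running in parallel, the speedup stays well below the core count, which is why the lower core count, higher clock processors can be the better buy for software with limited scaling.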


  • The full list of Intel Xeon Scalable processors on Intel Ark. Pro tip: If you want a full spreadsheet with all of the Ark specifications, then click the check-box for “All” on the “compare tab”. You can then click “Compare” and you will see an option for “Export comparison”. That will give you an XML spreadsheet that you can load into something like Excel.

  • For technical details on the processors and errata see the “Intel Xeon Processor Scalable Family Specification Update”. This PDF document has loads of information, including tables of all of the different clock frequencies for any number of active cores for all of the processors.

Happy computing! –dbk