The Intel Xeon E5-2600 v4 Broadwell processors are available now and I’ve got a dual 2687W v4 system on my desk! Time to see what performance is like.
Broadwell is the “Tick” to Haswell’s “Tock”. It’s a die shrink from 22nm to 14nm, with increases in core count, memory bandwidth, and power handling. There are many new features and improvements:
- AVX Optimizations (power and clock improvements)
- Reduced cycle times for several floating point ops and improved division ops
- TLB and Gather op improvements
- TSX support
- Better cache management
- Higher maximum memory clocks
- … lots of good small changes
If you want details on the many subtle differences, a Google search will be fruitful 🙂 Let’s see how these changes affect performance.
Test Hardware and Software
- Test System: Peak Tower Dual
- CPU: Intel Xeon E5 2687W v4 12-core @ 3.0/3.2/3.5GHz
- Memory: 256GB DDR4 2133MHz
- OS: CentOS 7.2
- ** CentOS 7.2 will not recognize the new Broadwell processor during install and will report “Unknown Hardware”. It does install OK and after updates the CPU is detected properly.
- Test programs:
- Linpack benchmark from Intel MKL version 11.3
- NAMD version 2.10
- STMV benchmark [ stmv.namd ]
- Satellite Tobacco Mosaic Virus
- 1,066,628 atoms, periodic, PME, CPU only
Note: NAMD is a molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.
Note: Base clock is 3.0GHz, All-Core-Turbo is 3.2GHz and Max-Turbo is 3.5GHz. For the Haswell 2687W v3 these clocks were 3.1GHz, 3.2GHz, and 3.5GHz. The number that really matters is All-Core-Turbo, since that is what the processors run at under full load with proper cooling.
Linpack benchmark on Xeon E5 v4 Broadwell
We’re running the Intel optimized Linpack binary contained in the MKL benchmarks directory from MKL version 11.3. This is the latest version of MKL as of this writing, and the code is highly optimized for Intel processors. I feel it is a good measure of double precision floating point performance.
Broadwell Xeon E5 2687W v4 Linpack 1078 GFLOP/s!
That is outstanding performance from a single-node, dual-socket system! By comparison, the testing I did with its Haswell predecessor, the 10-core 2687W v3, gave 788 GFLOP/s. Keep in mind that the v3 had 2 fewer cores per processor, and that I ran that test with a smaller (16GB) problem size and an older version of MKL. Also keep in mind that the Broadwell v4 version is the same price as the Haswell v3 version of this processor!
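To put the two Linpack numbers side by side, here is a small sketch that works out the node-level and per-core gains from the figures quoted above. The fraction-of-peak part rests on assumptions that are not from the post: 16 double precision FLOPs per cycle per core (two 256-bit FMA units) and the nominal 3.2GHz All-Core-Turbo clock. Heavy AVX code generally runs at lower clocks than the nominal ones, so treat that last number as a rough lower bound on efficiency, not a measurement.

```python
# Figures quoted in the post.
broadwell_gflops = 1078.0   # dual E5-2687W v4
haswell_gflops = 788.0      # dual E5-2687W v3 (smaller problem size, older MKL)
broadwell_cores = 24        # 2 sockets x 12 cores
haswell_cores = 20          # 2 sockets x 10 cores


def per_core(gflops, cores):
    """GFLOP/s delivered per physical core."""
    return gflops / cores


node_gain = broadwell_gflops / haswell_gflops
core_gain = (per_core(broadwell_gflops, broadwell_cores)
             / per_core(haswell_gflops, haswell_cores))

# ASSUMPTION (not from the post): 16 DP FLOPs per cycle per core,
# evaluated at the nominal 3.2 GHz all-core turbo clock.
FLOPS_PER_CYCLE = 16
ALL_CORE_TURBO_GHZ = 3.2
nominal_peak = broadwell_cores * FLOPS_PER_CYCLE * ALL_CORE_TURBO_GHZ

print(f"node-level gain: {node_gain:.2f}x")
print(f"per-core gain:   {core_gain:.2f}x")
print(f"nominal peak:    {nominal_peak:.1f} GFLOP/s")
print(f"fraction of nominal peak: {broadwell_gflops / nominal_peak:.1%}")
```

Because the two runs used different problem sizes and MKL versions, the gains computed here are indicative only.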
Here’s some output from the Linpack testing:
“Standard” problem sizes (up to ~ 16GB)
Mon May 16 18:09:08 EDT 2016
Intel(R) Optimized LINPACK Benchmark data
...
Number of CPUs: 2
Number of cores: 24
Number of threads: 24
...
Maximum memory requested that can be used=16200901024, at the size=45000
Performance Summary (GFlops)
...
Size LDA Align. Average Maximal
1000 1000 4 93.2748 115.6553
2000 2000 4 257.3304 268.6970
5000 5008 4 598.5752 612.6550
10000 10000 4 786.9355 812.9519
15000 15000 4 790.3809 799.4511
18000 18008 4 902.1919 903.0113
20000 20016 4 926.9572 927.2577
22000 22008 4 929.5465 930.6058
25000 25000 4 923.1123 924.7006
26000 26000 4 925.6606 928.4090
27000 27000 4 917.0465 917.0465
30000 30000 1 924.4760 924.4760
35000 35000 1 937.2601 937.2601
40000 40000 1 947.9258 947.9258
45000 45000 1 946.1127 946.1127
Residual checks PASSED
End of tests
Large problem size output (up to ~200GB):
Maximum memory requested that can be used=204803201024, at the size=160000
Performance Summary (GFlops)
Size LDA Align. Average Maximal
50000 50000 1 947.9553 947.9553
55000 55000 1 951.8988 951.8988
60000 60000 1 963.6150 963.6150
70000 70000 1 1032.6451 1032.6451
80000 80000 1 1042.3514 1042.3514
120000 120000 1 1069.4436 1069.4436
160000 160000 1 1077.8674 1077.8674
Residual checks PASSED
End of tests
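The "Maximum memory requested" values in the output line up with the storage needed for a dense N x N matrix of 8-byte doubles. The sketch below checks that arithmetic for the two largest problem sizes. The function name and the idea that the few extra bytes are workspace overhead are my own assumptions, not something the benchmark output states.

```python
BYTES_PER_DOUBLE = 8


def matrix_bytes(n):
    """Bytes needed to store a dense n x n matrix of doubles."""
    return BYTES_PER_DOUBLE * n * n


# problem size -> "Maximum memory requested" value from the Linpack output
reported = {45000: 16200901024, 160000: 204803201024}

for n, reported_bytes in reported.items():
    estimate = matrix_bytes(n)
    extra = reported_bytes - estimate  # assumed to be workspace overhead
    print(f"N={n}: matrix {estimate / 1e9:.1f} GB, "
          f"reported {reported_bytes / 1e9:.1f} GB, extra {extra} bytes")
```

This is why the 160000 run needs a machine with well over 200GB of memory, and why the "standard" sizes top out around 16GB.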
NAMD Molecular Dynamics test (no GPU acceleration)
NAMD is a good CPU test code in my opinion. The parallel scaling of NAMD with threads or message passing is excellent. I have used the “standard” binary build for x86_64 multicore CPU version 2.10. There are more recent versions but I have some test data on Haswell for this version.
Note: If you are running NAMD without GPU acceleration, you should probably reconsider, since NAMD has excellent GPU acceleration!
One of the things I like about this program for testing is that there is not that much advantage in recompiling using Intel compilers (see my blog post about NAMD). That means we get a test that may better represent how existing programs will perform on the new Broadwell Xeon E5s.
The gains with the Broadwell Xeon over the Haswell Xeon are much more modest in this case.
NAMD STMV simulation, 500 time steps (CPU only): Intel Xeon E5-2687W v3 (Haswell) vs E5-2687W v4 (Broadwell)
CPU cores | Haswell v3 wall time (s) | Haswell v3 days/ns | Broadwell v4 wall time (s) | Broadwell v4 days/ns |
---|---|---|---|---|
1 | 4220.0 | 96.29 | 3747 | 85.3 |
2 | 2167.0 | 48.66 | 1919 | 43.8 |
4 | 1150.1 | 26.91 | 1001 | 22.8 |
8 | 612.9 | 13.59 | 547 | 12.3 |
10 | 494.6 | 11.65 | 440 | 9.79 |
12 | n/a | n/a | 369 | 8.18 |
16 | 313.9 | 6.93 | 281 | 6.17 |
20 | 268.3 | 5.51 | 231 | 4.97 |
24 | n/a | n/a | 195 | 4.16 |
40(HT) | 228.0 | 4.80 | n/a | n/a |
48(HT) | n/a | n/a | 175 | 3.62 |
- Notes:
- These processors run at 3.5GHz Max-Turbo for the 1-, 2-, and 4-core jobs and then at 3.2GHz All-Core-Turbo for the rest.
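For readers not used to the days/ns figure, here is a rough sketch of how it relates to time per step and to the wall times in the table. It assumes a 1 fs timestep and wall times in seconds. Neither is stated in the post, although both are consistent with the numbers in the table, since the stepping time comes out slightly below the wall time, as expected once startup is included.

```python
SECONDS_PER_DAY = 86_400

# ASSUMPTION (not stated in the post): 1 fs timestep.
TIMESTEP_FS = 1.0
STEPS_PER_NS = 1_000_000 / TIMESTEP_FS  # 1 ns = 1,000,000 fs
N_STEPS = 500                           # run length from the table caption


def seconds_per_step(days_per_ns):
    """Convert a days/ns figure into wall-clock seconds per MD step."""
    return days_per_ns * SECONDS_PER_DAY / STEPS_PER_NS


def ns_per_day(days_per_ns):
    """Simulated nanoseconds per day (the reciprocal of days/ns)."""
    return 1.0 / days_per_ns


# Broadwell rows from the table: cores -> (wall time, days/ns)
# ASSUMPTION: wall times are in seconds.
rows = {1: (3747, 85.3), 24: (195, 4.16)}

for cores, (wall, dpn) in rows.items():
    stepping = seconds_per_step(dpn) * N_STEPS
    print(f"{cores:>2} cores: {seconds_per_step(dpn):.3f} s/step, "
          f"{stepping:.0f} s stepping vs {wall} s wall, "
          f"{ns_per_day(dpn):.3f} ns/day")
```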
The speedup using the v4 Broadwell Xeon is not nearly so dramatic with NAMD, but it is still a nice speedup, and there are 2 extra cores per processor!
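The table can also be read in terms of parallel scaling and same-core-count gains. The sketch below computes both from the wall times above (non-Hyper-Threaded rows only). The helper names are mine. Note that the 1-core baseline runs at 3.5GHz Max-Turbo while the many-core runs drop to 3.2GHz, so the efficiency figures mix clock effects with scaling effects.

```python
# Wall times from the table, keyed by core count (HT rows left out).
haswell = {1: 4220.0, 2: 2167.0, 4: 1150.1, 8: 612.9,
           10: 494.6, 16: 313.9, 20: 268.3}
broadwell = {1: 3747, 2: 1919, 4: 1001, 8: 547, 10: 440,
             12: 369, 16: 281, 20: 231, 24: 195}


def scaling(times):
    """Return {cores: (speedup, efficiency)} relative to the 1-core run."""
    t1 = times[1]
    return {n: (t1 / t, t1 / t / n) for n, t in times.items()}


def same_core_gain(old, new):
    """Ratio of old to new wall time at core counts present in both."""
    return {n: old[n] / new[n] for n in sorted(old.keys() & new.keys())}


for cores, (speedup, efficiency) in scaling(broadwell).items():
    print(f"{cores:>2} cores: speedup {speedup:5.2f}x, "
          f"efficiency {efficiency:.0%}")

for cores, gain in same_core_gain(haswell, broadwell).items():
    print(f"{cores:>2} cores: Broadwell is {gain:.2f}x faster than Haswell")
```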
I’ll have another post up soon as a buyer’s guide for the new Broadwell processors that will show price, theoretical performance, and Amdahl’s Law scaling. That will be an update to the post that shows this information for the Haswell Xeons.
Happy computing! –dbk