Need the most compute capability you can get in a single box for a well-written, multithreaded application? We’ll take a look at one such application, Zemax OpticStudio14, running on a quad socket Ivy Bridge Xeon system. Performance was excellent!
Quad socket, high core count systems can provide optimal performance for software that has good SMP thread scaling but is not designed for, or just not practical to use with, traditional HPC distributed memory, multi-node cluster systems. With current generation many-core quad socket systems it’s possible to have the compute capability of a small cluster in a single system. This has the advantage of only needing to maintain a single system, and it may have an advantage for software requiring commercial licensing too. As long as that software can take advantage of threaded parallel execution on many compute cores, a quad socket system is going to be as good as it gets. This is ideal for many compute intensive Windows applications that have been developed using efficient multithreading libraries, which typically allow for as many as 64 threads. Another use case for many-core quad socket systems would be workloads that require many independent applications/instances, such as virtualization. However, here we are considering only the case of a well-written, thread-parallel, compute intensive application.
The software we are looking at for this test case is Zemax OpticStudio. This is high-end engineering software for optical, illumination, and laser systems design running on Windows. Testing was done as a courtesy for a prospective customer with cooperation from the software vendor, Zemax. The program test version was Zemax OpticStudio14 Sp1.
The performance metric is “million ray-surfaces per second” using the Zemax sample system “Double Gauss 28 degree field”, which is their standard benchmark job run. The job has a relatively short run time of less than a minute, but it did show the thread scaling that we were most interested in. The software is designed to take advantage of up to 64 process threads.
The test system was a Puget Peak Quad Xeon Tower
- 4 x Intel Xeon E5-4624Lv2 (1.9GHz) TEN CORE
- 16 x 4GB DDR3-1600 REG ECC memory
- …
- Windows Server 2008R2
Note: this is a testbench system using quad socket Ivy Bridge engineering sample CPUs. We normally configure our Quad Xeon systems with higher clocked CPUs.
Our biggest open question when we started the testing was: how well will the code scale on the 40 cores of the test system? It turns out that going from 10 cores to 20 cores gave nearly perfect linear scaling, essentially doubling the performance, and going from 10 to 40 cores gave a speedup of 3.7 over the 10-core performance, which is still very good scaling. We expect the “sweet spot” for thread scaling to be at 32 cores. Couple that with a higher CPU clock and you have the basis for our recommended optimal system for this application.
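To put those scaling numbers in perspective, here is a minimal sketch that converts the reported speedups into parallel efficiency (speedup divided by the increase in core count). The function name is just for illustration, and the 2.0 speedup at 20 cores is an assumption standing in for "essentially doubling"; only the 3.7 figure is quoted exactly above.

```python
def parallel_efficiency(speedup: float, core_ratio: float) -> float:
    """Fraction of ideal linear scaling achieved for a given increase in cores."""
    return speedup / core_ratio


# Speedups relative to the 10-core run.
# 2.0 is an approximation of "essentially doubling"; 3.7 is the reported value.
observed_speedups = {20: 2.0, 40: 3.7}

for cores, speedup in observed_speedups.items():
    efficiency = parallel_efficiency(speedup, cores / 10)
    print(f"{cores} cores: {efficiency:.1%} of linear scaling")
```

By this measure the 40-core run keeps roughly 92% of ideal linear scaling, which is why we describe it as still very good.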
Performance results, “Double Gauss 28 degree field”
* The baseline reference was the customer’s dual Xeon system, E5640 @2.66GHz (8 total cores)
** Based on the scaling and performance we can make a good prediction about our recommended system for this application: 4 x Intel Xeon E5-4627v2 (3.3GHz) EIGHT CORE. The typical job times for the customer we were doing this testing for were over an hour. We are therefore confident that a higher clocked processor, for optimal individual thread performance, combined with a lower core count, for optimal thread scaling, would be the best overall system recommendation.
Performance is predicted from the results obtained at 20 and 40 cores on the test system, scaled to the recommended configuration as [observed performance * thread scale factor * clock scale factor]:
508 * (32/40.0) * (3.3/1.9) = 705.85
272 * (32/20.0) * (3.3/1.9) = 755.87
Thus the recommended quad 8-core 3.3GHz system is predicted to achieve performance between 706 and 756 million ray-surfaces per second.
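The same estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above: the function and parameter names are made up for illustration, and it assumes performance scales linearly with both core count and CPU clock, which is an approximation rather than a measured result.

```python
def predict_performance(observed: float, test_cores: int, target_cores: int,
                        test_ghz: float = 1.9, target_ghz: float = 3.3) -> float:
    """Scale an observed result (million ray-surfaces per second) to a target
    configuration, assuming linear scaling in core count and clock speed."""
    thread_scale = target_cores / test_cores
    clock_scale = target_ghz / test_ghz
    return observed * thread_scale * clock_scale


# Observed results on the 1.9GHz test system, scaled to 32 cores at 3.3GHz.
from_40_cores = predict_performance(508, test_cores=40, target_cores=32)
from_20_cores = predict_performance(272, test_cores=20, target_cores=32)

print(f"Scaled down from 40 cores: {from_40_cores:.2f}")
print(f"Scaled up from 20 cores:   {from_20_cores:.2f}")
```

Scaling down from the 40-core result gives the lower estimate and scaling up from the 20-core result gives the higher one, which together bracket the predicted range.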
Our test numbers were enthusiastically received by the Zemax team and we were told that these were the best performance numbers they have seen reported for their software!
A quad socket, high core count, high CPU clock, Xeon Ivy Bridge based box is going to be hard to beat for a compute intensive, multithreaded SMP application with good thread scaling. The improvements Intel has made in the Ivy Bridge version of their 4600v2 series Xeon processors make it a formidable single box compute platform. A few years ago I would not have recommended a quad socket system except in unusual circumstances. I have done some other testing with our testbench quad Xeon system and have been pleasantly surprised that problems with memory contention, processor affinity, and poor thread scheduling have mostly disappeared. In general, a quad socket system will cost more than multiple dual socket systems, but if you are restricted to running your jobs on a single box the performance is impressive!
Happy computing –dbk