8 socket AMD server scalability issues and possible solution

Posted by: Billy Newport on

I was recently involved with a customer that was using 8 socket AMD based servers running a Unix. The machines had 32 cores each and a lot of RAM, a pretty impressive specification. However, we had a lot of problems getting the most out of those servers. The customer was running a single operating system instance (not Linux) on each server, with two or three WebSphere JVMs per server. We could not get total CPU utilization above about 50-60%. We tried running more JVMs but the upper limit did not move. There were no bottlenecks that I could see from a threading or I/O point of view.

My theory is that the 8 socket AMD box is basically 8 copies of [one socket with 4 cores, a memory controller and 1/8 of the total RAM]. If a core needs local memory (memory attached to its own socket) then access is fast. If it needs memory attached to another socket then it has to reach that memory through the other socket. This seems to be where the bottleneck is. Our JVMs' memory and threads may have been spread across the memory on all 8 sockets. Thus a thread running on a core probably had a low chance (1 in 8) of using memory attached to that core's socket. Assigning a process to specific cores (a common trick for improving Lx cache utilization) wouldn't seem to help much here, because the RAM the process needs is likely on a different socket than the assigned cores.
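To put rough numbers on the 1 in 8 argument, here is a back-of-the-envelope sketch. It assumes memory pages are spread uniformly across sockets and that every remote access costs the same. The latency figures are hypothetical placeholders, not measurements from the customer's hardware, and the function names are mine.

```python
def local_access_probability(sockets: int) -> float:
    """Chance that a uniformly placed page sits on the accessing core's own socket."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return 1.0 / sockets


def expected_latency_ns(sockets: int, local_ns: float, remote_ns: float) -> float:
    """Average memory latency when pages are spread evenly over all sockets.

    Simplification: every remote socket is assumed to cost the same,
    which ignores multi-hop routes between sockets.
    """
    p_local = local_access_probability(sockets)
    return p_local * local_ns + (1.0 - p_local) * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0   # hypothetical local access latency
    REMOTE_NS = 200.0  # hypothetical remote access latency
    for sockets in (1, 2, 4, 8):
        p = local_access_probability(sockets)
        avg = expected_latency_ns(sockets, LOCAL_NS, REMOTE_NS)
        print(f"{sockets} socket(s): P(local)={p:.3f}, average latency={avg:.1f} ns")
```

Under these assumptions the average latency drifts toward the remote figure as sockets are added, which is consistent with the theory above but does not prove it.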

A better configuration for the box might have been to run VMware on it with one virtual machine PER socket, that is, 8 virtual machines each running its own operating system. I would then have run a single WAS JVM per virtual machine, and that would likely perform very well. This would keep everything local from a memory/thread affinity point of view, and would likely result in much better performance, as each virtual machine would have its own socket/memory controller/memory to itself.
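The layout I have in mind can be sketched as a simple plan: one virtual machine per socket, each given that socket's cores and an equal share of RAM. This is only an illustration. The `VmPlan` type and `plan_per_socket_vms` function are hypothetical names, the contiguous core numbering is an assumption (real hardware may number cores differently), and the 64 GB total in the usage line is a placeholder since I only know the boxes had "a lot of RAM". It does not call any VMware API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VmPlan:
    socket: int              # socket this virtual machine would be confined to
    cores: Tuple[int, ...]   # core IDs on that socket (assumed contiguous)
    ram_gb: float            # equal share of the machine's RAM


def plan_per_socket_vms(sockets: int, cores_per_socket: int,
                        total_ram_gb: float) -> List[VmPlan]:
    """Build one plan entry per socket, splitting cores and RAM evenly."""
    plans = []
    for socket in range(sockets):
        first_core = socket * cores_per_socket
        cores = tuple(range(first_core, first_core + cores_per_socket))
        plans.append(VmPlan(socket=socket, cores=cores,
                            ram_gb=total_ram_gb / sockets))
    return plans


if __name__ == "__main__":
    # 8 sockets x 4 cores as in the post; 64 GB is a made-up total.
    for plan in plan_per_socket_vms(sockets=8, cores_per_socket=4,
                                    total_ram_gb=64.0):
        print(plan)
```

Each entry would then host a single WAS JVM, so that a JVM's threads and heap stay on one socket.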

I found this benchmark which basically bears this out. They added a virtual machine each time they added another socket/processor card. Each virtual machine was basically running within a single processor card/socket, and thus they saw very good scalability. It's a shame, though, that they didn't test a single virtual machine's performance spread across multiple sockets, which is basically what I think I was seeing on the hardware with one operating system image. It would likely be a much different story in that case.


About Billy Newport

Billy is a Distinguished Engineer at IBM. He's been at IBM since 2001. Billy was the lead on the WorkManager/Scheduler APIs which were later standardized by IBM and BEA and are now the subject of JSR 236 and JSR 237. Billy led the design of the WebSphere 6.0 non-blocking IO framework (channel framework) and the WebSphere 6.0 high availability/clustering (HAManager). Billy currently works on WebSphere XD and ObjectGrid. He's also the lead persistence architect and runtime availability/scaling architect for the base application server.

Before IBM, Billy worked as an independent consultant at investment banks, telcos, publishing companies and travel reservation companies. He wrote video games in C and assembler on the ZX Spectrum, Atari ST and Commodore Amiga as a teenager. He started programming on an Apple IIe when he was eleven; his first programming language was 6502 assembler.

Billy's current interests are lightweight non-invasive middleware, complex event processing systems and grid based OLTP frameworks.
