I was recently involved with a customer that was using 8-socket AMD-based servers running a Unix. The machines had 32 cores each and a lot of RAM. Pretty impressive specification. However, we had a lot of problems getting the most out of those servers. The customer was running a single operating system instance (not Linux) on each server, with two or three WebSphere JVMs on top. We could not get total CPU utilization above about 50-60%. We tried running more JVMs but the upper limit did not move. There were no bottlenecks that I could see from a threading or I/O point of view.
My theory is that the 8-socket AMD box is basically eight copies of one building block: a socket with 4 cores, a memory controller and 1/8 of the total RAM. If a core needs local memory (memory attached to its own socket) then access is fast. If it needs memory attached to another socket then it has to go through that other socket to reach it. This seems to be where the bottleneck is. Our JVMs' memory and threads may have been spread across all 8 sockets. Thus a thread running on a core probably had a low chance (about 1 in 8) of using memory attached to that core's socket. Assigning a process to specific cores (a common trick for improving Lx cache utilization) wouldn't seem to help much here, because the RAM the process needs is likely attached to a different socket than the assigned cores.
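To put a rough number on the "1 in 8" argument, here is a minimal sketch. It assumes memory pages are spread uniformly across sockets and that every remote access costs the same. The latency figures are hypothetical placeholders, not measurements from the customer's servers, and the function names are mine.

```python
def local_fraction(sockets: int) -> float:
    """Chance that a uniformly placed page sits on the accessing core's own socket."""
    return 1.0 / sockets


def expected_latency_ns(sockets: int, local_ns: float, remote_ns: float) -> float:
    """Average memory latency when only 1/sockets of accesses are socket local."""
    p_local = local_fraction(sockets)
    return p_local * local_ns + (1.0 - p_local) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to illustrate the trend.
    LOCAL_NS = 100.0
    REMOTE_NS = 200.0
    for sockets in (1, 2, 4, 8):
        avg = expected_latency_ns(sockets, LOCAL_NS, REMOTE_NS)
        print(f"{sockets} socket(s): {local_fraction(sockets):.1%} local, "
              f"about {avg:.1f} ns average")
```

With these made-up figures, the average access on 8 sockets ends up close to the remote cost, which is the effect I suspect we were seeing. A real box will have different latencies depending on how many hops away the other socket is.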
A better configuration for the box might have been to run VMware on it with one virtual machine per socket, that is, 8 virtual machines each running its own operating system. I would then have run a single WAS JVM per virtual machine. This would keep everything local from a memory/thread affinity point of view and would likely perform much better, as each virtual machine would have a socket, memory controller and memory to itself.
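The one-VM-per-socket layout can be sketched as a simple partitioning of the box. This is only an illustration: it assumes cores are numbered contiguously per socket (real hardware may enumerate them differently), and the function and field names are hypothetical. It does not show any actual hypervisor configuration.

```python
def per_socket_layout(total_cores: int, sockets: int) -> list[dict]:
    """Split a box into one VM per socket, each with that socket's cores and RAM share."""
    if total_cores % sockets != 0:
        raise ValueError("cores must divide evenly across sockets")
    cores_per_socket = total_cores // sockets
    layout = []
    for socket in range(sockets):
        first_core = socket * cores_per_socket
        layout.append({
            "vm": f"vm{socket}",              # hypothetical VM name
            "socket": socket,
            # Assumes contiguous core numbering within a socket.
            "cores": list(range(first_core, first_core + cores_per_socket)),
            "ram_share": 1.0 / sockets,       # the RAM attached to this socket
            "jvms": 1,                        # a single WAS JVM per VM
        })
    return layout


if __name__ == "__main__":
    for vm in per_socket_layout(total_cores=32, sockets=8):
        print(vm["vm"], "socket", vm["socket"], "cores", vm["cores"])
```

For the 32-core, 8-socket box this gives 8 virtual machines with 4 cores each, so each JVM's threads and heap would stay on one socket.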
I found this benchmark, which basically bears this out. They added virtual machines each time that they added another socket/processor card. Each virtual machine was basically running within a single processor card/socket and thus they saw very good scalability. It's a shame though that they didn't test a single virtual machine's performance spread across multiple sockets, which is basically what I think I was seeing on the hardware with one operating system image. It would likely be a much different story in that case.
