I’ve often been asked to compare big, sophisticated Intel Xeon and AMD Opteron processors against Intel Atom variants, potential AMD “cat” derivatives, and a handful of ARM parts seeping into the server market at the low end. The usual request is to perform an apples-to-apples comparison of “brawny” vs. “wimpy” cores by theoretically removing the big-cored Xeon and Opteron parts from a two-socket or four-socket server and replacing them with “equivalent” small core parts.
That request doesn’t make a whole lot of sense, and it’s a manufactured argument designed to favor the incumbent big-core parts over a short-term time horizon. Here’s why.
The Demand-Side Case for Virtualization
Virtualization was invented a long time ago for one purpose – so that a single, expensive hardware resource might be better utilized by multiple software applications. Early OS implementations were designed to run in batch mode; only one application (or “job”) ran at a time. A job had access to all system resources because it was by definition the only job running. Instead of rewriting entire OS kernels to enable multitasking (which eventually happened via native development of new OS kernels), virtualization was invented to run multiple jobs by running multiple copies of an OS on one processor at the same time.
The result was that smaller jobs did not have to wait for bigger jobs to complete; lengthy storage request delays could be filled with other jobs, et cetera. Crises were averted, utilization improved, and virtualization was virtually forgotten outside of the mainframe market as brand new native multitasking OS kernels like UNIX eventually took over the market.
Fast forward four decades, and the current crop of big core server processors had a different utilization problem. After the invention of multi-threading and single-chip processors, we invented multi-socket servers. For a while those processors were not fast enough, even in multi-socket configurations, to keep up with application processing demand. But as processor speeds increased and two-, four-, eight- and other even more exotic n-way servers evolved, a curious thing happened.
As IT departments configured servers for peak demand, they realized that a growing portion of those servers was substantially underutilized off-peak, and that all of those underutilized servers were spending a lot of power staying on at very low utilization rates…often under 10 percent of capacity.
Therefore, virtualization re-emerged with a new mission: consolidate a bunch of application instances onto fewer servers so that a portion of the datacenter might be better power managed off-peak (i.e. shut some servers off while not needed). Even so, today most IT shops are concerned when they see utilization peaks above 60 percent capacity.
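To put rough numbers on that consolidation argument, here is a back-of-the-envelope sketch. It assumes that load is simply additive across servers and that the fleet size of 100 is hypothetical; only the 10 percent and 60 percent figures come from the text above, and the function name is invented for illustration.

```python
def servers_after_consolidation(server_count, avg_util_pct, target_util_pct):
    """Servers needed to carry the same total load at a higher utilization.

    Assumes load is additive across servers (a simplification).
    Percentages are whole numbers to keep the arithmetic exact.
    """
    total_load = server_count * avg_util_pct
    # Integer ceiling division: round up to a whole server.
    return (total_load + target_util_pct - 1) // target_util_pct


# Hypothetical fleet: 100 servers idling at 10 percent, repacked to a 60 percent ceiling.
needed = servers_after_consolidation(100, 10, 60)
print(needed)  # 17
```

Under these assumptions, roughly five of every six servers could be powered down off-peak, which is the whole appeal of the approach.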
The challenge today is that hardware thread-level delivered performance has, in general, outrun software thread-level performance demand. All of the talk you hear about datacenter consolidation and high demand for virtualized servers is built around this observation.
When you buy a two-socket server with 8 Hyper-Threaded cores per socket, you are buying a server with the capacity to run 32 simultaneous threads. The purpose of virtualization today is to keep all of those hardware threads busy by aggregating a bunch of application instances, many of which use only a handful of threads themselves (maybe 4-8 threads; more is rare in general IT applications).
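The arithmetic behind that thread count is easy to sketch. The snippet below assumes two hardware threads per Hyper-Threaded core and that each application instance keeps a fixed number of threads busy; the function names are made up for illustration.

```python
def hardware_threads(sockets, cores_per_socket, threads_per_core):
    """Total simultaneous hardware threads in one server."""
    return sockets * cores_per_socket * threads_per_core


def instances_to_fill(total_threads, threads_per_instance):
    """Whole application instances needed to occupy every hardware thread."""
    return total_threads // threads_per_instance


total = hardware_threads(sockets=2, cores_per_socket=8, threads_per_core=2)
print(total)                        # 32
print(instances_to_fill(total, 8))  # 4 instances at 8 threads each
print(instances_to_fill(total, 4))  # 8 instances at 4 threads each
```

In other words, it takes somewhere between four and eight typical application instances to keep a single such server fully occupied, and that assumes every one of them is busy at the same time.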
To keep pace with this escalation in thread resource proliferation, socket- and rack-level I/O and networking subsystems also evolved. Their evolution has been directed at consolidating the wide variety of applications found in a typical enterprise IT datacenter. They have also kept the philosophy that hardware resources are more expensive than software resources, and therefore require certain levels of hardware assurance and redundancy, which incur additional cost.
We’ve come a long way from virtualization enabling access to a single expensive resource. But our current hardware compute resources are still relatively expensive, in that we’re still designing systems as if access to a core and raw core performance are rare commodities. They are obviously not.
Well, What If We Don’t Virtualize?
Now we’re ready to address the heart of the potential small core opportunity. Let’s create a new set of assumptions for a server processor:
- Cores will be right-sized (instruction set, speed, power, etc.) to run a certain class of threads.
- Each core will run exactly one thread at a time (like those old batch-mode mainframes).
- When a core is not running a thread, the core will be powered down.
- When a core is running a thread, the core will run as fast as it is allowed.
- An executive-level application manager exists to match threads with cores – it performs load-leveling, power management, etc. so that the processor doesn’t have to.
- Processor I/O is optimized for the target class of threads and the number of cores on the processor. This includes network I/O and local I/O (memory, disk, etc.).
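As a thought experiment, the assumptions above can be sketched as a toy application manager. Everything here is hypothetical: the class names, the first-free-core placement, and the FIFO waiting queue are illustrative choices, not a description of any real scheduler or product.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Core:
    core_id: int
    thread: Optional[str] = None  # name of the one thread it runs; None when idle

    @property
    def powered(self) -> bool:
        # A core is powered only while it is running a thread.
        return self.thread is not None


class CoreManager:
    """Toy executive that matches threads to cores, one thread per core."""

    def __init__(self, core_count: int):
        self.cores = [Core(i) for i in range(core_count)]
        self.waiting = deque()  # threads with no free core yet (FIFO, an assumption)

    def submit(self, thread_name: str) -> Optional[int]:
        """Place a thread on the first idle core, or queue it if none is free."""
        for core in self.cores:
            if core.thread is None:
                core.thread = thread_name  # powers the core up
                return core.core_id
        self.waiting.append(thread_name)
        return None

    def finish(self, core_id: int) -> None:
        """Thread done: hand the core to a waiting thread, else power it down."""
        core = self.cores[core_id]
        core.thread = self.waiting.popleft() if self.waiting else None

    def powered_cores(self):
        return [c.core_id for c in self.cores if c.powered]


manager = CoreManager(core_count=2)
manager.submit("search-query-1")
manager.submit("search-query-2")
manager.submit("search-query-3")   # no free core, so it waits
manager.finish(0)                  # core 0 picks up the waiting thread
manager.finish(1)                  # nothing waiting, so core 1 powers down
print(manager.powered_cores())     # [0]
```

The point of the sketch is that load-leveling and power decisions live in the manager, so the cores themselves can stay simple.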
The first thing we notice here is that we can now tightly optimize a processor’s external bandwidth for a given class of threads and core count. Cool!
Why would I build a traditional enterprise two-socket server out of such a processor? It limits me to standard Ethernet switch-based, redundant-path in-rack network topologies and traditional SAN and NAS storage architectures, and it leaves little opportunity to differentiate from current processors on compute density per rack unit (U), etc.
Hence my initial statement about such a comparison generating a manufactured conclusion–there is no advantage for small core parts in today’s highly standardized modular enterprise rack architecture. In order to take advantage of small core economies of scale we need new rack-level network topologies (this is where network “fabrics” enter the picture, but detailed discussion of fabrics is out of scope for this column).
Does this mean that we’re not talking about running general enterprise IT applications on small cores? Yes, that is correct.
What Applications Do We Want to Run on a Small Core Processor?
Let’s talk about large datacenters with these qualities:
- They control their own runtime environment; they have access to source code for everything they run.
- They run homogeneous pools of the same application instance over many racks, so that they can leverage economies of scale for purpose-built rack-level architectures.
- Thread-level performance demand for a target application class is well below current big-core processor capabilities.
Many Software-as-a-Service (SaaS) vendors match these qualities, including large portions of search, mapping, advertising delivery, edge caching, etc. These are not small classes of applications. And these folks are all interested in finding lower-cost hardware and operating expense optimization points for their scale-out workloads.
Also note that the last bullet above is a moving target. As smaller cores get faster (and they will, though not at the same pace as in previous decades), more application classes will be open to them.
Given control of all source code, there is substantial opportunity for an alternate ISA such as ARM to make inroads, if it can prove itself competitive within a given application class.
In addition, let’s extend small core use cases to merchant cloud environments and introduce some light business constraints on customers:
- Restrict customers to compiling for a constrained ISA. In a previous column I described the proliferation of x86 ISA extensions added over the years by Intel and AMD and the problems that proliferation causes when migrating virtual machines. This is not an onerous requirement for customers who have a mandate to migrate their code to the cloud.
- Constrain customers to writing for managed runtime environments and high-level abstractions like OpenCL. This does away with specific processor ISAs completely. The overwhelming majority of code being written today, both for B2C and for B2B, is written to managed runtime environments (overwhelmingly J2EE and .NET in IT shops, others in SaaS and cloud).
Both of these options would enable small-core processors for comparatively low performance service level agreements (SLAs). For an application that needs low performance per core and lots of cores, buying time on lots of cheap cores would make economic sense.
Both of these options leave an opening for ARM, though there are stronger legacy constraints that point to small core x86 processors as a more convenient short-term solution for merchant clouds.
Hosting businesses will be somewhat caught in the middle. If their customer target is heavy on legacy IT applications, then they are better off with traditional IT hardware solutions. If they have some control over compile-time or run-time specifications, then they should be able to take advantage of small core processors, although small core x86 processors would be heavily favored.
I don’t believe there are short-term opportunities to “future proof” your large data center against new scale-out rack-level architectures. We’re approaching one of those rare technology transitions where the dominant design for scale-out architecture will change significantly, perhaps radically, in 3-4 years.
A Scary Example
- “When entering World War II, America’s mass production capacity enabled her to rapidly construct thousands of relatively cheap M4 Sherman medium tanks. A compromise all round, the Sherman was reliable and formed a large part of the Anglo-American ground forces, but in a tank-versus-tank battle was no match for the Panther or Tiger. Numerical and logistical superiority and the successful use of combined arms allowed the Allies to overrun the German forces during the Battle of Normandy.”
- “…the fearsome quality of a few German heavy tanks and their crews could sometimes be overcome by the quantity and mobility of the Shermans, supported by artillery and airpower, but sometimes at a great cost in U.S. tanks and crewmen.”
Note the reference to “combined arms” and “supported by artillery and airpower.” I’ll use these as proxies for “balanced system design.” Tanks require infrastructure in battle, just as processors require infrastructure in a datacenter.
Do not buy into the biased language of the “wimpy” vs. “brawny” core size debate. However, it is not enough to simply replace big core processors with small core processors–rack-level network architecture must also evolve to enable small core performance for specific workloads.
Compute cycles are evolving into a fungible commodity–some applications still require a high concentration of compute cycles within a single processor socket, but those are becoming increasingly rare as processor technologies and I/O topologies evolve.