Where Software-Defined Networks Really Belong in a Data Center
Software-defined networks have begun to dominate the news, with experts debating SDNs, server fabrics, and all manner of virtualized resources (machines, I/O, networks, and so on) with respect to new cloud and mega-datacenter build-outs. Which brings up a big question: what is so new and different about these at-scale deployments?

To begin, let’s look at a typical enterprise network “fat tree” hierarchy. There’s a lot of redundancy built into this architecture. Near the backbone, if a switch goes out of service or a network link is lost, there’s plenty of redundancy to keep the datacenter moving. Closer to the rack, though, the redundancy thins out: if a TOR [Top-of-Rack] switch fails, the rack it serves is out of action until the switch can be replaced, and if an aggregation switch fails, every server under its span is out of action until it can be replaced. It works. It is expensive. And all those switches do is move data and runtime images around.

With these new services-oriented, at-scale deployments it gets worse. The problem is that in a modern warehouse-sized datacenter, we’re stuffing more and more server nodes into each rack. That raises the question: what is a “server node” in today’s mega-datacenter context?

For a visual answer, the first photo shows a Quanta processor sled sitting on its 4U cloud chassis (taken at the Intel Developer Forum earlier this year). The sled carries two processors; you can see the big heat-sink stacks above the two Intel Xeon processors and a memory card beside each one. Each processor has its own memory card and support logic. Just as important, the two systems are not connected to each other; this is not a 2S (socket) or 2P (processor) symmetric multiprocessor (SMP) blade. Each processor complex connects to the datacenter through its own Ethernet link. We call each of these independent processor complexes a server “node,” so this card contains two nodes.
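To make the redundancy trade-off concrete, here is a minimal sketch of the failure domains in such a three-tier hierarchy. The rack and switch counts are hypothetical, chosen only for illustration; the behavior follows the description above (core failures absorbed by redundant paths, aggregation and TOR failures taking their whole span offline).

```python
# Hypothetical three-tier fat tree; counts are illustrative, not measured.
SERVERS_PER_RACK = 40
RACKS_PER_AGG_SWITCH = 12
AGG_SWITCHES = 8

total_servers = SERVERS_PER_RACK * RACKS_PER_AGG_SWITCH * AGG_SWITCHES

def servers_lost(failed_tier: str) -> int:
    """How many servers go dark when a single switch fails at each tier."""
    if failed_tier == "core":
        return 0                                        # redundant paths absorb it
    if failed_tier == "aggregation":
        return SERVERS_PER_RACK * RACKS_PER_AGG_SWITCH  # its entire span
    if failed_tier == "tor":
        return SERVERS_PER_RACK                         # just the one rack
    raise ValueError(failed_tier)

for tier in ("core", "aggregation", "tor"):
    print(f"{tier:12s} failure: {servers_lost(tier):4d} of {total_servers} servers offline")
```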
Doing the math, this chassis can host 24 server nodes in 4U of rack space, or 6 nodes per U. (One U is 1.75 inches of rack height.) Quanta is taking an incremental-innovation approach to mega-datacenter infrastructure.

The Calxeda processor card (photo below) contains four server nodes; you can see the four Calxeda EnergyCore SoCs (System on Chip, containing ARM cores) and the memory associated with each processor. The difference between a processor and an SoC here is that there is no other logic on the Calxeda boards: all of the logic needed to boot, manage, and run a server node has been integrated onto one silicon chip. The Boston Viridis chassis hosts 12 of these Calxeda EnergyCards in 2U, for 48 server nodes, or 24 nodes per U.

Just a few years ago, 1 node per U was state of the art. A node contained one or more single-core processor chips (typically 2 or 4 in an SMP configuration), a north bridge chip, a couple of south bridge chips, several memory sticks, et cetera. The two examples above cover 6 nodes per U with big honkin’ Xeon sockets and 24 nodes per U with Calxeda’s ARM-based SoCs. SeaMicro has implemented a server card with 6 Intel Atom processors on it; their 10U chassis hosts 64 of those cards, for 384 nodes, or a little over 38 nodes per U. This is just the start. All of these systems have been shipping for less than a year.

The difference between the traditional Quanta chassis and the more forward-looking Calxeda and SeaMicro chassis lies in the network topologies they implement. If you look closely at the Quanta photo, you’ll see 3 x 1GbE (Gigabit Ethernet) ports on the front of each sled: one per node, plus a third, redundant port on each card in case one of the others fails. That’s one Ethernet port per node (24 cables) plus one redundant port per card (12 more cables), and it all represents power consumption on the sled, cabling complexity, and a huge amount of in-rack switch capacity. The 24 nodes with dedicated Ethernet bandwidth consume 36 switch ports and 24Gbps of non-redundant Ethernet switch capacity, even though each node typically needs nowhere near 1Gbps of dedicated bandwidth. This solution does not scale very well.

The Boston Viridis chassis has only two 10GbE ports (for redundancy) shared by all 48 nodes in the chassis. Two cables. It can talk directly to a TOR switch or, even better, an EOR [End-of-Row] switch. This is starting to scale well.
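As a quick sanity check on the arithmetic above, here is a small sketch using only the figures quoted in the text (nodes, rack units, and Ethernet ports). Anything not described above, such as the SeaMicro chassis’s uplink configuration, is deliberately left out.

```python
# Back-of-the-envelope density and cabling math, using only figures from the text.

def nodes_per_u(nodes: int, rack_units: int) -> float:
    return nodes / rack_units

print(f"Quanta 4U:       {nodes_per_u(24, 4):5.1f} nodes/U")   # 12 sleds x 2 nodes in 4U
print(f"Boston Viridis:  {nodes_per_u(48, 2):5.1f} nodes/U")   # 12 cards x 4 nodes in 2U
print(f"SeaMicro 10U:    {nodes_per_u(384, 10):5.1f} nodes/U") # 64 cards x 6 Atoms in 10U

# Quanta cabling: 12 sleds, each with 3 x 1GbE ports (one per node + one spare).
sleds, nodes_per_sled, spare_ports_per_sled, gbe = 12, 2, 1, 1
active_cables = sleds * nodes_per_sled        # 24 cables carrying traffic
spare_cables = sleds * spare_ports_per_sled   # 12 redundant cables
print(f"Quanta uplinks: {active_cables + spare_cables} cables / switch ports, "
      f"{active_cables * gbe} Gbps of non-redundant switch capacity")

# Boston Viridis cabling: two shared 10GbE uplinks for the whole 48-node chassis.
print("Viridis uplinks: 2 cables, 20 Gbps shared by 48 nodes")
```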