NSF Grant Funds Virtualized Supercomputer Aimed at ‘The Masses’

Heavily virtualized supercomputing cluster “Comet” will join SDSC’s flash-heavy “Gordon” by 2015.

The National Science Foundation has awarded a $12 million grant for the world’s first virtualized supercomputer cluster that will serve research disciplines that rarely have access to supercomputers.

The award from the National Science Foundation (NSF) to the San Diego Supercomputer Center (SDSC) will pay for “Comet,” a supercomputing cluster built from Dell servers, each node carrying a pair of next-generation Intel Xeon processors, 128GB of DRAM and 320GB of flash memory, with a peak capacity of nearly 2 petaflops, according to the SDSC announcement.

A series of large-memory nodes, designed for graphically intensive data visualization or modeling, will boost the DRAM up to 1.5TB and include Nvidia GPUs as accelerators.

Workloads on Comet will run in virtual clusters that SDSC predicts will operate at nearly the speed of the underlying hardware, thanks to Single Root I/O Virtualization (SR-IOV), InfiniBand interconnects and applications built on the Message Passing Interface (MPI) parallel-computing model.

SR-IOV is a specification developed by the PCI-SIG industry consortium that lets a device on a single PCIe bus present itself as several PCIe devices; support built into the hypervisor then allows virtual machines running on the system to use the device as if each had its own dedicated I/O port.
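On Linux hosts, the virtual functions an SR-IOV device exposes are visible through sysfs; as a minimal sketch, a management script could count them by reading the standard `sriov_numvfs` attribute (the attribute name is part of the kernel's PCI sysfs layout, but the PCI address shown in the usage comment is hypothetical):

```python
from pathlib import Path

def sriov_vf_count(device_sysfs: Path) -> int:
    """Return the number of SR-IOV virtual functions a PCI device
    currently exposes, or 0 if the device does not support SR-IOV."""
    numvfs = device_sysfs / "sriov_numvfs"
    if not numvfs.exists():
        return 0
    return int(numvfs.read_text().strip())

# Example (hypothetical PCI address of an InfiniBand adapter):
# sriov_vf_count(Path("/sys/bus/pci/devices/0000:03:00.0"))
```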

Comet, which is due to come online in 2015, will join “Gordon,” another supercomputer of unusual design at SDSC. Gordon, which went into operation early in 2012, was the first high-performance-computing system to use a large amount of flash storage (300TB) to boost performance.

Gordon is designed for workloads whose data would otherwise overwhelm the available memory, forcing the system to run far more slowly by constantly calling data in from disk storage. Gordon’s flash storage lets the system keep more of the relevant data close to the processors, with far lower transport and hardware-access latency than disk.
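A rough model shows why the flash tier matters: when a working set exceeds DRAM, every access that misses DRAM pays the latency of the next-slower tier, so that tier's latency dominates the average. A toy calculation, using illustrative latencies (assumptions for scale, not Gordon measurements):

```python
# Illustrative access latencies in microseconds (assumed, not measured)
DRAM_US, FLASH_US, DISK_US = 0.1, 100.0, 10_000.0

def avg_access_us(dram_hit_rate: float, miss_tier_us: float) -> float:
    """Average access time when DRAM misses fall through to one slower tier."""
    return dram_hit_rate * DRAM_US + (1 - dram_hit_rate) * miss_tier_us

hit = 0.9  # working set only partly fits in DRAM
print(avg_access_us(hit, DISK_US))   # ~1000 us with disk behind DRAM
print(avg_access_us(hit, FLASH_US))  # ~10 us with flash behind DRAM
```

Even with 90 percent of accesses served from DRAM, the disk-backed average is roughly 100 times slower than the flash-backed one, which is the effect Gordon's design targets.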

Gordon is heavily used in video analysis as well as other data-intensive calculations.

Comet will also join Gordon as part of the Extreme Science and Engineering Discovery Environment (XSEDE) project – a scientific-research resource-sharing organization in which member universities make high-performance computing resources available to members of the network in an arrangement akin to traditional time-sharing.

As the first XSEDE system based on high-performance virtualization, Comet will let each member organization define its own virtual clusters and the software stacks that run on them.

The goal is to provide supercomputer time for researchers in disciplines other than chemistry, physics, genomics and the other traditionally heavy users of supercomputers. Most supercomputer time goes to high-end visualization and modeling workloads that can’t run anywhere else, but those represent only a fraction of the research that supercomputers could advance, according to Barry Schneider, program director for Comet in NSF’s Division of Advanced Cyberinfrastructure, in a statement announcing the award.

Comet will serve the “so-called long tail of science” by providing supercomputing resources for analyses too resource-intensive for systems available to most researchers, but far smaller than the type of research to which most supercomputers are dedicated.

“We are supporting Comet to provide a resource not just for the highest end-users, but for scientists and engineers across a broad spectrum of disciplines,” Schneider said.

Comet is designed to run at a peak capacity of just under 2 petaflops – speed that would make it one of the 15 fastest supercomputers in the world if it went into operation today.

In the completed cluster, each rack of 72 nodes will have a full-bisection InfiniBand FDR interconnect (54.54Gbit/sec per node) within the rack and a 4:1 oversubscribed interconnect (13.64Gbit/sec per node) between racks. That heavy interconnection allows the cluster’s capacity to be divided efficiently among many small workloads rather than one or two large ones.
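The per-node figures follow from FDR InfiniBand's link characteristics: four lanes signaling at 14.0625 Gbit/s each, with 64/66 line encoding, give an effective in-rack rate, and the 4:1 oversubscription divides that rate between racks. A quick check of the arithmetic:

```python
# FDR InfiniBand: 4 lanes at 14.0625 Gbit/s signaling, 64/66 encoding
lanes, signaling_per_lane = 4, 14.0625
effective = lanes * signaling_per_lane * 64 / 66   # Gbit/s per node, in-rack

oversubscription = 4                               # 4:1 between racks
between_racks = effective / oversubscription       # Gbit/s per node, cross-rack

print(round(effective, 2), round(between_racks, 2))  # 54.55 13.64
```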

When it comes online in 2015, Comet will replace SDSC’s Trestles cluster, which will be retired in 2014 at the ripe old age of four.

Image: San Diego Supercomputer Center/Univ. Calif.-San Diego
