The fastest supercomputer in the world, Oak Ridge National Laboratory’s “Titan,” has had its final acceptance delayed because an excess of gold on its motherboard connectors has prevented it from working properly.
Titan was originally turned on last October and climbed to the top of the Top500 list of the fastest supercomputers shortly thereafter. However, Oak Ridge Today reports that the so-called acceptance testing was never formally completed, meaning that the system is not considered stable enough for full production. Instead, formal testing could be delayed until the end of April, four months later than originally planned.
Problems with Titan were first discovered in February, when the supercomputer just missed its stability requirement, passing 92 percent of the jobs in a mandatory test of its systems; the threshold was a 95 percent completion rate, according to Frank Munger’s Atomic City Underground blog. At that time, the connectors were identified as the culprit, and ORNL decided to take some of Titan’s 200 cabinets offline and ship their motherboards back to the manufacturer, Cray, for repairs. The faulty connectors impaired the ability of the GPUs in the system to talk to the main processors.
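The pass/fail criterion reported above can be illustrated with a minimal sketch. The function name and the job counts are hypothetical; only the 92 percent result and the 95 percent threshold come from the reporting, and the actual ORNL test harness is not described in the source.

```python
def meets_stability_threshold(jobs_passed: int, jobs_total: int,
                              threshold_percent: int = 95) -> bool:
    """Return True if the share of completed jobs reaches the threshold.

    Integer arithmetic is used so the comparison is exact.
    """
    if jobs_total <= 0:
        raise ValueError("jobs_total must be positive")
    return jobs_passed * 100 >= threshold_percent * jobs_total


# Hypothetical job counts chosen to match the reported percentages.
february_run = meets_stability_threshold(jobs_passed=92, jobs_total=100)
borderline_run = meets_stability_threshold(jobs_passed=95, jobs_total=100)
print(february_run, borderline_run)
```

With these illustrative counts, a run at 92 percent falls short while a run at exactly 95 percent passes.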
In a statement provided to Slashdot, ORNL confirmed Titan’s status. “Titan has not yet completed the full suite of acceptance tests but has successfully passed both the functionality and performance phases of acceptance testing,” the lab said, in a statement jointly attributed to James J. Hack, the director of the National Center for Computational Sciences, and Arthur S. Bland, the project director at the Oak Ridge Leadership Computing Facility. “Moreover, Titan is within 1 percent of passing its stability test, the last component of the acceptance test suite. The original project schedule called for fully completing acceptance testing by June of 2013, a schedule we expect to meet. And, as we proceed through this complex testing procedure, users are making productive use of the system.”
Munger later reported that the equivalent of 24 cabinets were being tested per week at the Cray facility.
Munger also reported the problems with the connector pins, which Oak Ridge Today’s John Huotari noted were due to too much gold mixed in with the solder. Gold is used for connectors because it does not oxidize quickly and because of its high electrical conductivity; however, when it mixes with solder that contains tin, the gold and tin can combine, making the joint brittle under certain conditions. Cray is reportedly replacing the connectors to alleviate the problem.
There are about 20,000 connectors within Titan, connecting the CPUs and GPUs. Each connector has about 100 pins, ORT reported, which works out to roughly 2 million pins in all.
According to the Top500 list, Titan contains 2.2 GHz 16-core AMD Opterons combined with Nvidia K20X Kepler chips, with a grand total of 560,640 cores delivering a maximum (Rmax) performance of 17.6 petaflops. Despite the connector problems, Titan can apparently still be used, albeit without the benefit of the GPUs.
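For readers wondering how the 560,640-core total breaks down, here is a rough sketch of the arithmetic. The node count (18,688) and the figure of 14 streaming multiprocessors per K20X, each counted as one core, are assumptions drawn from Titan's published specifications rather than from this article; only the 16-core Opteron and the grand total appear above.

```python
# Assumption (not stated in the article): Titan has 18,688 compute nodes,
# each pairing one 16-core Opteron with one K20X GPU.
NODES = 18_688
CPU_CORES_PER_NODE = 16   # from the article: 16-core AMD Opteron
GPU_UNITS_PER_NODE = 14   # assumption: K20X multiprocessors counted as cores

cpu_cores = NODES * CPU_CORES_PER_NODE
gpu_cores = NODES * GPU_UNITS_PER_NODE
total_cores = cpu_cores + gpu_cores

print(f"CPU cores: {cpu_cores:,}")
print(f"GPU cores: {gpu_cores:,}")
print(f"Total:     {total_cores:,}")
```

Under those assumptions, nearly half of the listed cores sit on the GPU side, which gives a sense of how much capacity is sidelined when the GPUs cannot talk to the CPUs.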
Once the new components have been received, which is expected by early April, the acceptance test (which ORT said requires 14 days to complete) will be run one more time. Given how close Titan came to completing its last run, it seems likely that the system will pass.
For now, Cray is accepting the cost, as well as (presumably) the blame. ORNL representatives said that Titan’s ranking in the Top500 list was unrelated to its acceptance testing, so Titan may at least retain its crown. Still, as Cray tries to push its latest XC30 supercomputer to new customers, Titan may provoke some uncomfortable questions.