The Software Side of Sears

Metascale leverages quite a few Hadoop-related tools, including MapReduce, Hive, and Pig.

Thanks to Apache Hadoop and other data-analytics technologies, the international retailer Sears has managed to not only transform its IT operations, but also decommission some of its mainframe computers. The company has been so successful with this project that it has spun off the group responsible into a separate company that is now selling its services to others. Call this one of the bigger proof-points of using Hadoop in the enterprise, and come see the software side of Sears.

Several years ago, Sears faced serious IT challenges, including its legacy data systems exceeding mainframe capacity. Sears’ data architecture teams explored a variety of options before choosing Hadoop.

In spite of recent advances in computing, many enterprises are still running batch-oriented core business processes on mainframe computers, because mainframes are effective computing workhorses with high performance and capacity. By using the parallel and distributed processing techniques of Hadoop, Sears found it was able to offload some of its batch processing, cutting costs without altering user-facing business applications. In total, Sears was able to eliminate more than 1,000 mainframe MIPS and save two million dollars a year by retiring one of its mainframes. The first Hadoop project took less than three months, moving more than seven thousand lines of COBOL code to just under 50 lines of Pig. Today Sears is running more than three petabytes of data on Hadoop.
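For readers unfamiliar with the parallel processing style mentioned above, here is a minimal sketch in plain Python of how a batch total can be split into independent "map" steps and a combining "reduce" step. It is an illustration only, not Sears' code: the records, store IDs, and function names are all hypothetical, and a real Hadoop job would run the partitions on separate machines rather than in one process.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sales records: (store_id, amount_in_cents). Not from the article.
RECORDS = [("s1", 1200), ("s2", 450), ("s1", 300), ("s3", 999), ("s2", 50)]


def split(records, n_parts):
    """Partition the input so each part could be handled by a separate worker."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(part):
    """Map step: compute partial totals per store within one partition."""
    totals = defaultdict(int)
    for store, amount in part:
        totals[store] += amount
    return dict(totals)


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for store, amount in right.items():
        merged[store] = merged.get(store, 0) + amount
    return merged


def batch_totals(records, n_parts=2):
    """Run the map step on every partition, then fold the partial results."""
    partials = [map_partition(p) for p in split(records, n_parts)]
    return reduce(merge, partials, {})
```

Because each partition is processed without reference to the others, the result is the same however the input is split, which is what lets this kind of batch work spread across many inexpensive machines.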

Justin Sheppard, who is director of IT transformation for Sears Holdings, led the effort. “While Hadoop isn’t a cure-all and not going to solve every data problem in your organization, it has transformed our thinking about mainframe processing,” he said.

Sheppard presented his findings and methods at the recent Big Data conference StampedeCon in St. Louis. His core concept is not to replace mainframe batch processing with some other proprietary distributed computing platform, but to leverage the power of open-source software and Big Data tools without having to migrate the entire end-to-end application stack.

“Hadoop is not some high-speed SQL database,” he said at the conference. “It isn’t simple, and isn’t going to be a replacement for your current data warehouse. It also isn’t going to be built or operated by your current database administrators, and initially may not even make sense to your data architects.” Yet with all these caveats, the team was able to pull off an impressive migration.

Sheppard now runs technology operations for the Big Data spinoff, called Metascale, which was created partly because Sears kept getting queries from other companies about how they could get started with Hadoop. “We were able to eliminate over 900 MIPS and an entire mainframe for one Fortune 500 client, and for another were able to reduce COBOL batch processing from over six hours to less than ten minutes in Pig/Hadoop,” he said. Metascale offers a full range of Hadoop-related services, including design and build, hosting, performance tuning, training, and managed services.

What were some of the lessons Sears learned in the process of becoming a full-service Hadoop service provider?

First, make Hadoop the single source of data truth within your organization.

Continuously capture all enterprise data at the earliest possible touch points, deliver the data from all source systems to Hadoop, and store it in HDFS. The idea is to use Hadoop as the single centralized data repository, rather than as a rental space, and rely on dimensional modeling to turn it into a low-latency integration platform. While the concept of a data hub isn’t new, what is different is having this data hub under the complete control of the Hadoop ecosystem.

Second, instead of the traditional Extract, Transform and Load (ETL), look at the three T’s of Hadoop: Transfer, Transform, and Translate. Wuheng Luo, a programmer for Metascale, explained this methodology in a presentation at the June Hadoop Summit. The goal is to simplify enterprise data processing and reduce the time that it takes to turn raw data into actionable intelligence that can better support business operations. While both methods use data transformation tools, with the Metascale method the transformations happen “in vivo, within Hadoop, and using various Hadoop-based integration methods,” Luo said.
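One way to picture the difference in ordering is the small Python sketch below. The article does not define the three steps in detail, so the reading here is an assumption: Transfer lands raw data untouched, Transform reshapes it inside the central store, and Translate emits it in a format a consumer expects. The `store` dictionary is a stand-in for HDFS, and the function names, keys, and sample data are all hypothetical.

```python
import csv
import io
import json

# In-memory stand-in for the central store (HDFS in the article).
store = {}


def transfer(name, raw_text):
    """Transfer: land the source data as-is, with no transformation on the way in."""
    store["raw/" + name] = raw_text


def transform(name):
    """Transform: clean and reshape inside the store, keeping the raw copy."""
    rows = csv.DictReader(io.StringIO(store["raw/" + name]))
    cleaned = [
        {"sku": r["sku"].strip().upper(), "qty": int(r["qty"])}
        for r in rows
        if r["qty"].strip()  # skip rows with a missing quantity
    ]
    store["clean/" + name] = cleaned


def translate(name):
    """Translate: emit the transformed data in a format a consumer expects."""
    return json.dumps(store["clean/" + name], sort_keys=True)
```

In this sketch, calling `transfer("orders", ...)`, then `transform("orders")`, then `translate("orders")` leaves both the raw and cleaned copies in the store, so a transformation can be rerun later without going back to the source system. In classic ETL, by contrast, the data is reshaped before it is loaded.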

Third, choose analytical tools that are Flexible, Agile, Interactive and User Friendly. Luo’s presentation said his team uses “batch and streaming tools built on top of Hadoop to interact with data scientists and end users, and produces the business wisdom.” Having Web-based and graphical user interfaces also means data access happens more quickly and is available to a wider audience within the company; in other words, it isn’t just the province of a few IT geeks. Sears now uses Datameer for its business-user reports and queries, and finds it quite flexible. “Still, the learning curve can’t be underestimated,” Sheppard said.

Next, start thinking of how you can make Hadoop more enterprise-friendly, what Luo calls “Big Data 2.0.” Look at in-memory approaches for data transformations, reducing your end-to-end time from when data is created to when you can make decisions on it, and look at ways to increase ad hoc queries rather than coding everything in MapReduce.

Next, training is essential. Sears found that its old-school COBOL programmers were able to learn Pig in just a few weeks and were quite comfortable using the new language. “We had problems finding and hiring skilled Hadoop developers, so we had to grow our own,” Sheppard explained. Sears tried a variety of programming approaches, including MapReduce and Hive, before settling on Pig with Java extensions. “Pig is very efficient and easy to code and easy to learn,” he added. Having Linux and open-source knowledge and skills is also critical.

One of Metascale’s services is to offer training classes using “real data sets solving real business problems,” as their company literature promises.

Look at ways you can combine SQL and Hadoop approaches to leverage the best of both worlds. “You need to use the right tool for the right job,” Sheppard said.

Finally, think outside the box. “Using Hadoop, we can gain insights into the prices of items that we didn’t sell in our stores,” according to Sheppard. That is something to think about, as the insights gained from unsold inventory can complement sales data quite nicely. “We have seen spectacular results in moving our workloads from expensive, proprietary mainframe platforms to Hadoop.”