“Big Data is Big Business with many new opportunities for technical people.” I’m sure many of you have heard something like this. Last year, the Harvard Business Review went so far as to call data science “the sexiest job of the 21st century.” (I actually have this article framed on my office wall because I don’t think anyone will ever again call my job “sexy.”) My point is: Time and again, we’ve been told that there’s a dramatic shortage of people with the right training and education in the technologies that support Big Data.
In CIO offices around the world, the big question being asked is, “Do we have a Big Data program and if not, why not?” Thus, there’s a rush to find data scientists and Big Data engineering practitioners without any real understanding of how the company will use their skills.
What does this mean for those of us in the technical community? In the short term, there’s an unprecedented demand for data scientists, analysts and engineers with backgrounds and experience in Big Data.
It is also driving a craze among professionals to get trained in the technologies that support Big Data (primarily Hadoop/MapReduce), even though they may not understand how and when to use them. Is this yet another bubble waiting to burst? Too many people are rushing into this training with high expectations, but little idea of what they need to get out of it.
Will they learn about Hadoop and MapReduce? Given the right training, yes, but what will they do with that knowledge? What questions will need to be answered? What data will they select from what is available, and why? These questions can’t be addressed by simple training. They call for real education, experience and an understanding of how to use data to derive and justify the answers.
And training isn’t education or knowledge. Technical people aren’t learning about statistics or data analysis, or how and when to use different types of algorithms. That kind of knowledge takes time to acquire. In many cases it requires a good deal of trial and error, testing one hypothesis after another, to find patterns in the data that will hopefully lead to true understanding. And that understanding is necessary in order to address business questions in a rational way.
And what about those patterns in the data? Is there a trend? What does a slope of 15 degrees mean? What about outlier data? Does it say anything, or is the data just noisy? This isn’t something that you’ll understand by getting trained in Hadoop. An education in statistics will help (at least it will provide the basics), but it’s by working through problems and data that you’ll gain the real experience you need.
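To make those questions concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the weekly sales figures are made up, the function names are mine, and the outlier rule (a modified z-score built on the median absolute deviation, with the commonly used 3.5 cutoff) is just one of many reasonable choices. It fits a straight line, expresses the slope as an angle, and flags points that sit far from the line.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_to_degrees(slope):
    """Angle of the line on a plot; it depends entirely on the axis units."""
    return math.degrees(math.atan(slope))


def flag_outliers(xs, ys, slope, intercept, threshold=3.5):
    """Indexes of points whose residual has a modified z-score above threshold."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    if mad == 0:
        # At least half the residuals coincide, so this rule cannot rank the rest.
        return []
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - center) / mad > threshold]


# Hypothetical data: ten weeks of unit sales, with one suspicious week.
weeks = list(range(10))
sales = [100, 102, 104, 106, 108, 150, 112, 114, 116, 118]

slope, intercept = fit_line(weeks, sales)
outliers = flag_outliers(weeks, sales, slope, intercept)

# Refit without the flagged points to see how much they moved the trend.
kept = [i for i in range(len(weeks)) if i not in outliers]
clean_slope, _ = fit_line([weeks[i] for i in kept], [sales[i] for i in kept])

# The same trend measured in hundreds of units gives a very different angle.
slope_in_hundreds, _ = fit_line(weeks, [s / 100 for s in sales])

print("slope:", slope, "angle:", slope_to_degrees(slope))
print("outlier indexes:", outliers)
print("slope without outliers:", clean_slope)
print("angle in hundreds of units:", slope_to_degrees(slope_in_hundreds))
```

With this made-up data the sixth week (index 5) is flagged, and dropping it moves the fitted slope from roughly 2.24 to 2 units per week. The angle is the part to be wary of: a line drawn at 15 degrees rises about 0.27 units of y per unit of x, but rescaling either axis changes the angle without changing the underlying trend. The code can compute all of this. Deciding whether week six is noise, a data-entry error or the most important event in the series is still up to you.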
I don’t mean to dissuade you from becoming a data scientist if that’s what you want to be. I just want to point out that it’s not something you simply train for, get a certification in, and then you’re done. At least not if you want to be good at what you do.
My advice: First, have patience and do not panic. Second, have an educational plan that spells out what training AND what education you need. Your plan should lean heavily on statistics, analysis and computer science. Third, be comfortable using databases. Whether they are traditional SQL or NoSQL doesn’t matter, but they must hold data that means something to you. Fourth, look for data-centric problems along the way and apply what you’ve learned to solving them.
Finally, if you want to be a data scientist, you have to love data and you have to love statistics. If you are a baseball or football fan, dig into the statistics of your favorite player or team and use them to learn how that player or team actually performs.