What Does It Mean to Be a Data Scientist?

At Dice, we’ve had a data-science team for two years. As research and development for the firm, we’ve worked on a number of different projects; although many haven’t hit the site yet (stay tuned), a few of our earlier projects have rolled out over the past 12 months.

For example, last January we replaced the “More Jobs Like This” section on the jobs pages with a custom-built recommender solution. More recently, on the employer side of the site, we released a “More Candidates Like This” section. Also on the employer side, in the candidate search functionality, your search terms will result in suggestions for related skills based on our mining of resume and jobs data.

At this point, we’d like to share with you some of the work we’ve been doing, as well as some of the interesting conversations we’ve had about the data-science space. At times we may get technical, show some neat visualizations of our data, or wax lyrical about current and emerging trends in the industry. Interested readers can expect a post from us at least once a month at around the same time.

As this is the first post in the series, I’m going to define what it means to be a data scientist, at least at Dice.

To be honest, I often don’t tell people I am a data scientist. It’s not that I don’t enjoy my job (I do!) nor that I’m not proud of what we’ve achieved (I am); it’s just that most people don’t really understand what you mean when you say you’re a data scientist, or they assume it’s some fancy jargon for something else (just as we’d use “refuse collector” in place of “bin man” (I’m British) or “garbage man”). I have to admit, I have the same reaction to a lot of modern jobs: I don’t feel I really understand what an actuary does, other than assess risk, which is like saying, “I work with data.”

If I do answer, I normally follow up with analogies. In the modern world, data science is pervasive; it impacts a lot of what we do, particularly online. So I talk about how I work on recommender systems, citing Netflix and Amazon as examples, and work on enhancing our search engine. The latter always involves a reference to Google, the pinnacle of search sophistication. However, that undersells a lot of the more exciting and important work that we’ve been doing and will be doing.

I’m not the first to try to define “data scientist,” and I won’t be the last. What do others have to say? In 2012, in an effort to grab headlines, Harvard Business Review proclaimed “Data Scientist: The Sexiest Job of the 21st Century.” That article spawned a popular meme, leading to a previous employer referring to me as “that sexy data scientist.” The article portrays the data scientist as an intrepid explorer of data, seeking out hidden insights from within a corporation’s datasets in order to revitalize or transform the business. Meanwhile IBM, one of the first companies to truly embrace data science, provides a less romanticized and more pragmatic description.

IBM stresses that a data scientist, above all else, must have strong business acumen and be able to communicate ideas effectively to core decision makers in the organization.

A number of common themes stand out as you read through the different descriptions of the profession online:

High Business Impact

It is often said that data scientists can have a disproportionately large business impact in comparison to their numbers. The goal of the data scientist is to garner deep insights from a corporation’s data to help drive future decision-making processes. If done right, this can help chart the course of a company’s future development.

The Data Scientist as Polymath

While a lot of modern technology professions require a large range of skills (see The Rise and Fall of the Full-Stack Developer), the data scientist may have the most diverse skill set. A typical data scientist has knowledge of statistics, strong math skills (in particular linear algebra and probability theory), and the ability to work with data visualization techniques and tools (such as D3.js or Tableau), SQL, several Big Data technologies (such as MongoDB and Hadoop), and cloud platforms such as AWS; in addition, he or she is an adept programmer, and has a good knowledge and understanding of business.

A Shortage of Qualified Data Scientists

Such diverse and wide-ranging requirements can in part explain the shortage of actual data scientists in the modern labor market. Within the Dice offices, where we pay particular attention to popular skill sets in low supply, we’ve dubbed data scientists with extensive skill sets “pink unicorns.” The shortage was previously blamed on the lack of courses that teach all the skills necessary to become a data scientist, although today there are a lot of Master’s courses and boot camps with the express purpose of teaching data scientists everything from machine learning and Hadoop to statistical analysis.

So now that we’ve defined what a data scientist is, what are some important (and often overlooked) qualities of a good data scientist?

A Scientific Mindset

Probably the most important attribute of a data scientist is the possession of a scientific mindset. It’s been said that any subject with “Science” in its title is not a science, but I don’t think that applies to data science. While it’s important to know the key algorithms and their limitations, it’s nearly impossible to reliably predict which approach will be most effective on your data without running a series of experiments. The experimental method is also important when digging into your data: You need to spot patterns, formulate hypotheses and then test them by formulating queries, running statistical analyses or visualizing the data in some way.
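As a rough illustration of what "running a series of experiments" can look like, here is a minimal Python sketch. It assumes a tiny made-up dataset and two deliberately simple candidate models (a mean baseline and a one-nearest-neighbour predictor); none of these names or numbers come from Dice. The point is only that the choice between approaches is settled by k-fold cross-validation on the data rather than by guesswork.

```python
from statistics import mean


def mean_baseline(train, x):
    """Predict the average target seen in training, ignoring x."""
    return mean(y for _, y in train)


def nearest_neighbour(train, x):
    """Predict the target of the training point whose input is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def cross_validate(model, data, k=5):
    """Return the mean absolute error of `model`, averaged over k folds."""
    fold_errors = []
    for fold in range(k):
        test = data[fold::k]  # every k-th point, starting at index `fold`
        train = [p for i, p in enumerate(data) if i % k != fold]
        fold_errors.append(mean(abs(model(train, x) - y) for x, y in test))
    return mean(fold_errors)


# Hypothetical data: y is roughly 2 * x with a small deterministic wobble.
data = [(x, 2 * x + (1 if x % 2 else -1)) for x in range(20)]

results = {
    "mean_baseline": cross_validate(mean_baseline, data),
    "nearest_neighbour": cross_validate(nearest_neighbour, data),
}
best = min(results, key=results.get)  # the model with the lowest error
```

Every candidate is scored on data it was not fitted to, so the comparison reflects how each approach generalizes rather than how well it memorizes.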

A scientific theory must be tested against empirical evidence. Likewise, as a data scientist, I only believe in what our data tells us. I am suspicious of theories about our business or our customers that our data does not support. All good scientists are skeptics at heart; they require strong empirical evidence to be convinced of a theory. For the same reason, I’ve learned to be suspicious of models that are too accurate, or individual variables that are too predictive. Most of the time, it means some subtle data leakage has occurred, or there’s a bug in your code.
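One simple way to act on that suspicion is to check each variable on its own before trusting a model. The sketch below is an assumption-laden illustration, not a description of how Dice does it: the column names are invented, and the 0.95 threshold is arbitrary. It flags any feature whose correlation with the target is close to perfect, which is often a sign that the feature was recorded after the outcome it is supposed to predict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.95):
    """Names of features whose |correlation| with the target exceeds threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    ]


# Hypothetical example: did a job seeker apply (1) or not (0)?
applied = [1, 0, 1, 1, 0, 0, 1, 0]
columns = {
    "years_experience": [5, 3, 2, 8, 6, 1, 4, 7],
    # Leaky: this field is only filled in *after* an application is made.
    "application_id_present": [1, 0, 1, 1, 0, 0, 1, 0],
}

flagged = suspicious_features(columns, applied)
```

A flagged feature is not proof of leakage, only a prompt to go and find out how and when that column gets populated.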

Solid Programming Skills

Predictive modeling and statistical analysis are important tools in a data scientist’s toolbox. However, first and foremost a data scientist must be a competent programmer. It’s often said that a data scientist spends the majority of his or her time cleaning and preparing data; while I feel this “fact” is a little exaggerated, and very dependent on the data available, being able to program well is a very important skill. Everything that a data scientist does, from predictive modeling to data visualization and automating experiments, requires computer programming. We once interviewed a candidate who was exceptional at creating predictive models; however, we had to turn the person down, as they were unable to write code to process a flat file that was in a simple but non-standard format.
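To give a sense of the kind of task involved, here is a short Python sketch. The actual file format from that interview is not described, so the layout below is invented for illustration: records are separated by blank lines, each line is a `key=value` pair, and lines starting with `#` are comments.

```python
def parse_records(text):
    """Parse blank-line-separated records of `key=value` lines into dicts.

    Lines starting with '#' are comments. This layout is a made-up
    stand-in for "a simple but non-standard format".
    """
    records, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            continue
        if not line:  # a blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        current[key.strip()] = value.strip()
    if current:  # the file may not end with a blank line
        records.append(current)
    return records


sample = """\
# hypothetical export
id=1
title=Data Scientist
skills=python, sql

id=2
title=Data Engineer
skills=hadoop, spark
"""

jobs = parse_records(sample)
```

Nothing here needs a special library; the skill being tested is reading a format description and turning it into a loop that handles the edge cases, such as a missing trailing blank line.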

Promotes Data Science

As a data scientist, it’s important that you are an evangelist for your profession. Despite the awareness of the value of Big Data, it’s still hard for businesspeople to truly understand the scope and power of data science and what it can do for their organization. There are several reasons for this: First, data science is still a very new profession and not practiced much in smaller companies, so it’s hard for some people to understand what it can do if they haven’t worked with a data science team before. Also, the sorts of solutions data science can produce differ widely by industry; what works for one company doesn’t necessarily translate to another.

In addition, the dynamic nature of a predictive model can be hard to understand in comparison to the implementation of some business logic or a user interface. Models make mistakes, often ones that are obvious to a person, and businesspeople can have a hard time understanding that. Thus, it’s important that a data scientist educate the business about data science and how it can be used to effectively solve business problems, and where its limitations lie. For this, domain knowledge is very important, as mentioned earlier.

Uses the Right Tools

There are a plethora of tools for data science, from machine learning to statistical analysis and crunching large datasets. It can be very tempting to spend a lot of time researching different tools, and using the coolest new toys to solve a particular problem. However, it’s important to actually get some work done, and there’s only so much time you can spend evaluating tools: You need to be selective, and listen to what other people in the industry recommend for similar problems.

The technology industry is as much driven by fads as the fashion world, and there is a tendency to try to use new technologies for problems they aren’t suited to handle. The best and most commonly stated example of this is Hadoop. A lot of companies seem to be under the impression that if you’re not using Hadoop, then you are not doing data science. The reality is that a lot of businesses don’t have the amount of data that warrants a Hadoop cluster. For those that do, it may still not be the best tool out there; certain tasks, for instance certain machine learning algorithms, have to be executed in a serial manner and cannot take full advantage of MapReduce.

Similarly, Hadoop is not a good tool for running complex queries, which is one of the reasons that Google has moved away from the pure MapReduce paradigm it invented toward more sophisticated systems such as Spanner. At Dice, we find Amazon’s Redshift more than competent for most of our Big Data-processing needs, and also leverage Apache Spark for some of the most processing-intensive tasks.
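To illustrate the earlier point about algorithms that must run serially, here is a small Python sketch (plain Python, not Hadoop code; the data and the function being minimized are invented). Counting words splits cleanly into independent "map" tasks whose partial results are merged in a "reduce" step. An iterative algorithm such as gradient descent is different: each step needs the result of the one before it.

```python
from collections import Counter
from functools import reduce


def map_count(chunk):
    """Map step: count words in one chunk, independently of all others."""
    return Counter(chunk.split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts; the order does not matter."""
    return left + right


chunks = ["spark hadoop spark", "hadoop redshift", "spark"]
word_counts = reduce(reduce_counts, map(map_count, chunks), Counter())


def gradient_descent(start, learning_rate=0.1, steps=50):
    """Minimise f(w) = (w - 3) ** 2; each step needs the previous w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - learning_rate * gradient  # cannot run before the prior step
    return w


w_final = gradient_descent(start=0.0)
```

The chunks in the first half could be handed to any number of machines at once. In the second half, the work inside a single step can sometimes be distributed, but the steps themselves still have to happen one after another.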

In this post, I’ve hopefully given you a taste of what it means to be a data scientist, and drawn attention to some often-overlooked qualities of a good data scientist. In future posts, we’ll start to explore some of the underlying trends in the industry, share some interesting insights from our data and delve deep into some of the technical solutions we’ve developed using our data to solve real problems.

Simon Hughes is the chief data scientist of the Dice Data Science Team.

Image: Arthimedes/Shutterstock.com

Comments

8 Responses to “What Does It Mean to Be a Data Scientist?”

February 12, 2015 at 9:24 pm, Scott said:

Perhaps you should practice explaining it to a 5 year old.

February 19, 2015 at 12:51 pm, Eric Ayeh said:

Nice article! You mentioned that there is a shortage of data scientists in the market. You also mentioned that candidates do not have the right set of skills that are required. As a senior data scientist, what advice do you have for people like me that are trying to become a data scientist?
Thank you!

February 20, 2015 at 3:37 pm, Maria said:

I aspire to become a data scientist as well. I would love to hear any advice you may have.

March 12, 2015 at 10:02 am, Simon Hughes said:

Sorry for the late response. I can absolutely give some advice, and that might make a great additional post. There are several options, really. I took the hard road: I went back to school and started a PhD program in Computer Science/Machine Learning before the term “Data Scientist” was even popular. However, a PhD program is a lot of work and can take 4-7 years, so that’s obviously not for everyone. You definitely don’t need a PhD to become a data scientist. It will give you an advantage in the jobs market, but demand is so great right now that I wouldn’t recommend it, as demand may not be as great by the time you graduate. Plus, being paid next to nothing for 4-7 years is not an option for everyone. I went part time at DePaul in Chicago, but those programs are also quite rare. So the practical paths I would recommend are:

1. Find a company that needs good data people but lacks them, and lacks the resources to hire experienced people. Most companies could make better use of their data. They may not need a full-time data scientist, but there’s likely a need for someone who can work with their database/ETL packages and also has time to do some data science. There are a lot of good MOOCs out there for learning more about data science and statistics, for instance the excellent Machine Learning course by Andrew Ng on Coursera, and Udacity has a lot of good courses on data science and machine learning.

2. Take a relevant Master’s course in Data Science, Predictive Analytics or Statistics. There are plenty of universities across the country offering two-year programs in these fields.

3. Apply to a fellows program (http://insightdatascience.com/) or a data science boot camp (Google “data science boot camp”; there are a lot popping up). These are typically 10-12 week programs that teach you to become a data scientist. The fellows program is aimed at PhDs from the sciences who want to transition, and is funded by corporations, whereas the boot camps take anyone with reasonable technical skills, but you have to pay them. It’s not cheap, but it’s less costly than a Master’s, and they help place you. I don’t have any first-hand experience with boot camps, however.

March 13, 2015 at 12:42 am, Eric said:

Simon,

Thank you for the detailed response. Indeed, going the PhD route is not for everyone. I actually have a PhD myself but only recently developed an interest in data science. My background was more in statistical signal processing.
I have been taking the data science course offered by Johns Hopkins University (https://www.coursera.org/specialization/jhudatascience/1?utm_medium=sigtrackLanding), which is based on the R programming language. However, looking at job descriptions and talking to recruiters, it seems like most companies focus more on tools in the Big Data realm, such as Hadoop. As a result, I started questioning whether I was on the right path in my quest to become a data scientist.
Your recommendations are thus very welcome and will be followed.

Thanks
Eric
So what are some of the skills that a data

March 30, 2015 at 1:04 pm, Simon Hughes said:

Eric,

I would say your PhD should give you a good background to study this subject. There is an unfortunate focus on Big Data skills for data scientists. The problem is that a lot of companies equate data science with big data, and often when they ask for a data scientist they’re really looking for people to do data engineering (create Hadoop jobs and so on) and maybe a little machine learning analysis. So I would be wary of those job descriptions unless there’s also an emphasis on R/Python and statistics, machine learning or data analysis.

Also note that a lot of companies have very unrealistic expectations about big data skills. Technically I didn’t have a lot of the skills that were originally in my job description when I was hired, so even if you don’t appear to meet all the requirements, as long as you have useful skills and experience I would encourage you to apply anyway. As someone who spends a lot of time analyzing IT job descriptions, I can say this is quite common. As long as you’re honest on your resume, the employer/recruiter should be able to judge whether you meet the criteria or not.

I would also attend any data science/machine learning/statistics meet-ups in the area. That’s how I found out about this job, and also how we hired one of our data scientists.

Simon

April 10, 2015 at 7:44 am, Eric said:

Simon,

Thank you for your response, encouraging words and advice.

Eric

March 12, 2015 at 10:05 am, Simon Hughes said:

Eric, see my response to Maria below about how to become a data scientist.

http://insights.dice.com/2015/02/12/what-does-it-mean-to-be-a-data-scientist/#comment-2618629

I would also mention attending meetups. One of our recent hires taught himself machine learning in his spare time using Coursera courses and started attending data science-related meetups in the area. I also got my current position by bugging the host of a local meetup. So take time to learn the basics yourself (by coding, not just reading) and then use whatever networking opportunities you can. Giving a talk about a technical subject at a meetup can be one way to impress on people that you may have what it takes.
