How to Become a Data Scientist

What is a data scientist? If you ask the Harvard Business Review, it’s the “sexiest job of the 21st century.” If you ask a technologist interested in crunching data, they’ll tell you it’s a potentially lucrative, intellectually fulfilling career. And if you ask a CEO, they’ll probably say that data scientists mean the difference between strategic success and failure. But how do you actually become one?

At the most basic level, data scientists analyze massive datasets for insights that can change how companies operate and strategize. 

The term “data scientist” originated sometime around 2008. Companies such as Facebook and LinkedIn needed specialists who could pioneer new ways to process and analyze massive amounts of data. In the case of social-media giants, the task was to sift through data and come up with matches—connecting people with people, people with ads, people with subject-matter pages, and so on. 

As more companies have awoken to the potential of data, the need for data scientists has only expanded. For example, grocery store chains might want to determine which products sell the most in particular areas, and from there try to predict where to place new products; automotive companies today collect diagnostic information on cars so that they can notify customers if their oil is low or a component is malfunctioning.

While processing data, data scientists might run into a number of issues, including missing or bad data. This requires “cleansing” the data before using it, usually with statistical approaches. For example, if you have a “messy” dataset of weather data, you might be able to use what’s called interpolation to estimate the probable weather on a day when the reading was missing or wrong, based on the readings around it. The same approach works for gaps in other kinds of data.
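
As a minimal sketch of interpolation, the following Python fills gaps in a series of daily temperatures by drawing a straight line between the nearest known readings. The function name `fill_gaps` and the sample readings are made up for illustration; a real project would more likely rely on a library such as pandas than on hand-written code.

```python
from typing import List, Optional


def fill_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate None entries that sit between two known values.

    Gaps at the very start or end have no neighbor on one side,
    so they are left as None rather than guessed.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical daily highs in Celsius; None marks a missing reading.
readings = [20.0, None, 24.0, None, None, 18.0]
print(fill_gaps(readings))
```

Straight-line interpolation is only one option; it assumes the value changes steadily between known points, which may not hold for every dataset.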

Data scientists also make predictions. Companies that sell a product need to know how many items to manufacture each month based on previous sales data. Sales can fluctuate throughout the year; during the holidays, some products sell in massive quantities. For instance, a data scientist’s forecast can keep a company from producing too much of a product (and incurring the inevitable cost overruns).
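
To make the idea concrete, here is a deliberately simple forecasting sketch in Python: it predicts demand for each calendar month as the average of what sold in that same month in earlier years, which captures holiday spikes. The sales figures and the function name `forecast_by_month` are invented for illustration, and real forecasting models are usually far more sophisticated.

```python
from collections import defaultdict
from statistics import mean


def forecast_by_month(sales):
    """Average past unit sales for each calendar month.

    `sales` is a list of (year, month, units) tuples.
    """
    by_month = defaultdict(list)
    for _year, month, units in sales:
        by_month[month].append(units)
    return {month: mean(units) for month, units in by_month.items()}


# Hypothetical sales history: December sells far more than November.
past_sales = [
    (2018, 11, 900), (2018, 12, 2400),
    (2019, 11, 1100), (2019, 12, 2600),
]
forecast = forecast_by_month(past_sales)
print("Units to plan for next December:", forecast[12])
```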

Obviously, these scenarios are complex, which is why data science is considered a “multi-discipline,” pulling liberally from mathematics, statistics, and coding.

Data science is also very much a growth industry. According to Burning Glass, which collects and analyzes millions of job postings from across the country, data scientist positions are expected to grow 19 percent over the next 10 years.

Skills You Need to Become a Data Scientist

Over the past few years, multiple studies have confirmed what many data scientists already know: Python is a necessary programming language when it comes to anything related to data crunching. Although R remains a favored language for data analysis among many academic institutions, Python has the advantage of being an immensely popular “general” programming language, which means that many technologists who enter the data-science field have already learned it in school or on their own.

In addition to its versatility, Python can also scale, which is a feature that eludes some other languages. “Combining R and Python is both reasonable and feasible,” Enriko Aryanto, the CTO and a co-founder of the Redwood City, Calif.-based QuanticMind, a data platform for intelligent marketing, once explained to Dice. “We run them both in our data science platform internally. But if I were starting my career all over again today, I might consider focusing on Python rather than R. It’s a more-general language with broader applications.”

But building out a career in data science doesn’t hinge solely on mastering a handful of languages. The best data scientists approach the daily challenges of their jobs with a logical, ordered mindset. Anthony J. Scriffignano, Ph.D., senior vice president and chief data scientist for Dun & Bradstreet, told Dice in an interview that he breaks a data-science problem into five distinct parts:

  • Discovery: Identifying the data you need.
  • Curation: Determining how that data fits with other data you may already have.
  • Synthesis: Analyzing the data and drawing insights from it.
  • Fabrication: “Packaging” the data for consumption by other stakeholders.
  • Delivery: Actually presenting the data (visualizations help!).

At its core, data science is all about the data, which means that data scientists need to keep in mind where a particular dataset is coming from, whether that source is reliable, and whether they have permission to use that data. (This is especially important as governments worldwide begin to crack down on data usage, most notably through the EU’s General Data Protection Regulation, or GDPR.)

In addition to programming languages and an intuitive grasp of data, there are some key skills that any data scientist must master. On the most fundamental level, data scientists must be highly skilled in statistics. Add to that programming and fluency with data-oriented tools.

Here are some additional tools and technologies you would use as a data scientist; these are all tools you can start learning on your own right now:

  • Python. This is a programming language that’s been around for over three decades, but has lately proven to be ideal for processing data.
  • Python libraries. These are sets of code written in Python to help you do your work. Such libraries include NumPy, Matplotlib, pandas, and seaborn. While learning Python, you can look up these library names and start practicing with them.
  • SQL (which stands for Structured Query Language, and is usually pronounced like ‘sequel’). This is a decades-old language that is still used for querying and managing data stored in tables and for connecting that data through relationships.
  • Tableau. This is a widely used tool that helps data scientists analyze data and then present it in graphs and charts, a skill called data visualization. Data visualization involves plotting data in different types of graphs and representing the data in a way that non-scientists can easily understand.
  • Power BI. This is another data visualization tool, created by Microsoft.
  • Jupyter Notebooks. This tool lets you run Python code right in the browser, with charts and graphs appearing right below each block of code.
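
To give a feel for how Python and SQL fit together, here is a small sketch that uses Python’s built-in `sqlite3` module to store data in two related tables and then query across them. The table names, columns, and figures are all hypothetical, chosen to echo the grocery-store example earlier in this article.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales (
        store_id INTEGER REFERENCES stores(id),
        product TEXT,
        units INTEGER
    );
    INSERT INTO stores VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO sales VALUES
        (1, 'coffee', 40), (1, 'tea', 15),
        (2, 'coffee', 25), (2, 'tea', 30);
""")

# Join the two tables through the store_id relationship and
# list each city's products from best seller to worst.
rows = conn.execute("""
    SELECT stores.city, sales.product, sales.units
    FROM sales
    JOIN stores ON stores.id = sales.store_id
    ORDER BY stores.city, sales.units DESC
""").fetchall()

for city, product, units in rows:
    print(city, product, units)
conn.close()
```

The same query would work largely unchanged against a full-scale database server; SQLite is used here only because it ships with Python.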

What else do data scientists need to know? For an even deeper breakdown, let’s turn again to Burning Glass, which data-mines data scientist job postings for insights into what companies want. Specifically, it divides skills into three categories:

Distinguishing skills are advanced skills, called for only occasionally, that truly differentiate candidates applying for various roles. As you might expect, there’s a lot of education and training necessary to master these.

Defining skills are the skills needed for day-to-day tasks in many roles.

Necessary skills are the lowest barrier to entry; they are also skills that are often found in other professions, providing a springboard for people to launch into a data-science career. 

Here’s the breakdown:

To dig into things a bit more, here are some of the additional platforms and tools that data scientists use on a regular basis as they sort through terabytes (and sometimes petabytes!) of data—some of it unstructured, sadly—for understanding:

Given the rapidly evolving nature of the data-science field, new tools are coming online all the time, so experienced data scientists know to keep an eye on what their colleagues are using. As data scientists progress through their careers, they also develop a sense of intuition about data that allows them to glean insights from massive datasets that others cannot. However, it can take years in the industry—and untold numbers of data-science projects—to reach that particular point. With data science, experience is key.

The combination of analytical and soft skills makes data scientists insanely valuable to an organization. But with so much opportunity out there, it’s sometimes difficult for a budding data scientist to decide which industries and problems to focus their attention on. “The decision to specialize in a couple of distinct subject areas, versus becoming a generalist, is a matter of curiosity, interest, ambition, ability, or simply time,” said Yuri Bykov, Dice’s Director of Data Science. “But whatever the breadth and depth, playing an active part in the value chain, transforming data into something useful (whether for the organization, science, or humankind) is the key reason for the data scientist’s existence.”

Whatever their data mission, all data scientists must know one key thing: how to build out and analyze a dataset. Which brings us to...

The Data Pipeline Process

Here are the steps data scientists take for building a dataset:

Gather the data: Data scientists usually start by gathering data from different data sources. This step is often known as extraction.

Cleanse the data: Although we’d like to believe that the data we receive doesn’t have any errors, in reality many things can go wrong that will cause data to be lost or incorrect. And that means the data must be cleansed. Bad data is either removed or corrected; missing data is either ignored or filled in.

Transform the data: The data might not be in the form you need. For example, weather data might come in as Celsius readings when you need every item in Fahrenheit. Or the data might come in grouped as sales orders for each customer, but you need just sales dates and quantities for each individual product. This requires going through the entire dataset and reworking it into the form you need, with the help of automated tools meant for processing large sets of data.

Join the data: This is technically still part of transformation, but in this step, the data scientist connects data points from multiple sets (such as car diagnostic points with the current weather when the problem occurred).

Load the data: In this step, after the data is in its required form, it gets saved somewhere, typically in the cloud if it’s large. (Note that this data can now also be used in future projects as a new source of data.)

The above is called the ETL pipeline: Extract, Transform, Load. This is only the start of the process; as a data scientist you’ll then begin your analysis to find what you need, whether it’s why the cars are breaking down, or how many items to manufacture for a particular month.
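
The steps above can be sketched end to end in a few lines of Python. This is an illustrative toy, not a production pipeline: the two CSV sources, the fault codes, and the table name `fault_weather` are all invented, and the data is inlined so the example is self-contained.

```python
import csv
import io
import sqlite3

# Two hypothetical sources, inlined as CSV text.
WEATHER_CSV = """date,temp_c
2020-01-01,5.0
2020-01-02,
2020-01-03,-2.0
"""
FAULTS_CSV = """date,fault_code
2020-01-01,P0300
2020-01-03,P0171
"""


def extract(text):
    """Gather: read raw rows from a source."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleanse: drop rows whose temperature reading is missing."""
    return [r for r in rows if r["temp_c"].strip()]


def transform(rows):
    """Transform: convert Celsius to Fahrenheit, keyed by date."""
    return {r["date"]: float(r["temp_c"]) * 9 / 5 + 32 for r in rows}


def join(faults, temps_f):
    """Join: attach the day's temperature to each car fault record."""
    return [
        (f["date"], f["fault_code"], temps_f[f["date"]])
        for f in faults
        if f["date"] in temps_f
    ]


def load(records, conn):
    """Load: save the finished dataset into a database table."""
    conn.execute(
        "CREATE TABLE fault_weather (date TEXT, fault_code TEXT, temp_f REAL)"
    )
    conn.executemany("INSERT INTO fault_weather VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
temps_f = transform(cleanse(extract(WEATHER_CSV)))
load(join(extract(FAULTS_CSV), temps_f), conn)
loaded = conn.execute("SELECT * FROM fault_weather ORDER BY date").fetchall()
print(loaded)
```

Keeping each stage as its own small function makes it easier to test a stage in isolation and to swap in a different source or destination later.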

And finally, you’ll need to share the findings with people where you work. You’ll use different tools for presenting the data graphically and through lists in ways people can understand. (And this step also requires a bit of creativity as you’ll want to make sure the graphs are visually appealing so that people will want to explore the information in them.)

Getting into Data Science

But how does someone actually break in as a data scientist in the first place? (Data science certifications can help in your quest to acquire skills.) Fortunately, companies of all sizes have rushed to seize data-science talent over the past few years, meaning there’s lots of opportunity out there. As you might expect, data science job interviews tend to focus on skills and programming languages, with questions meant to explore your ease with data analytics (as well as your ability to draw conclusions). Technical questions might include:

  • How can you build a predictive model in the absence of labeled data (using unsupervised ML techniques, or keyword-based approaches to generate labels)?
  • What did you do on a daily basis? What did you do on the team?
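
For the first question, it can help to have a concrete picture of what “keyword-based approaches to generate labels” might look like. The sketch below is one possible reading, not a standard recipe: the categories, keywords, and support-ticket texts are all hypothetical. The rough labels it produces could then be used to train a proper model.

```python
from typing import Optional

# Hypothetical keyword lists; a real project would refine these with domain experts.
KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "technical": ("error", "crash", "bug"),
}


def keyword_label(text: str) -> Optional[str]:
    """Return the first label whose keywords appear in the text, else None."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return None


tickets = [
    "I was charged twice and need a refund",
    "The app shows an error and then crashes",
    "How do I change my username?",
]
labeled = [(ticket, keyword_label(ticket)) for ticket in tickets]
for ticket, label in labeled:
    print(label, "|", ticket)
```

Texts that match no keyword come back as `None`; in an interview, it is worth discussing how you would handle those unlabeled cases and how noisy the generated labels are likely to be.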

You'll also inevitably face questions designed to test your ability to think contextually and holistically. It's not just about using tools on a massive dataset; do you have the acumen to apply whatever you find to the needs of the broader business? Questions along these lines might include:

  • What are some of the data sources you would use?
  • How are you going to acquire data, and what’s the end goal?
  • How do you structure the entire project around those considerations before adopting a methodology?
  • Can you take the problem back and think about it more holistically?

This rush for data-science talent is great for those job hunters with the right skills, although it’s created something of an “arms race” among corporations. Large firms such as Walmart have the resources to build out enormous teams dedicated to analyzing petabytes of transactional data for insight, and financial institutions will pay $1 million salaries to data scientists capable of generating profitable insights. That sort of activity by the world’s largest (and most well-monetized) companies can make it difficult for smaller firms with relatively paltry budgets to compete for the best talent (although new generations of data-analytics tools can sometimes allow even the most cash-strapped firms to analyze datasets without needing to hire a room full of data scientists).

Data science touches virtually every industry. In finance, for example, data scientists and machine-learning specialists must crunch huge amounts of market data for insights that will make banks and other institutions tons of money. Lately, there’s been an emphasis on data-analytics skills over specific finance knowledge and experience, opening the door for more data scientists to break into the industry. (Knowing Python is also considered a major asset in finance-related data-crunching; see the skills section above for more on that programming language.)

Data science is also having a substantial impact in medicine, where companies want to analyze everything from drug-study datasets to patient information; in retail and sales, where only data will provide the insights that companies need to differentiate themselves in an aggressively changing marketplace; and in transportation, where cities need public transportation optimized and tech firms want help building the next generation of autonomous-driving software.

Within specific industries, a number of job roles provide an ideal springboard for a data scientist track. For example, within software engineering, software engineers and developers with expertise in many languages and platforms (including Python, Java, and .NET) will find that they already have the skills they need to transition into their first data scientist role. In finance, quants, risk analysts, and financial analysts likewise have the tools and training necessary for a data science career.

Those in business intelligence have an even tighter pathway—business analysts, developers, and engineers are already utilizing the mindset and skills of data scientists as they dig through volumes of corporate data for insights. It’s a similar situation with Big Data specialists such as Hadoop developers, who are already wrangling huge datasets. Then you have academia, stuffed as it is with researchers, statisticians, economists, mathematicians, and physicists. Since academics are already trained to pull in data to find the “big picture,” they’re often very well-suited for data science careers.

From Data Analyst and Data Engineer to Data Scientist

Although some people use the terms “data scientist,” “data analyst,” and “data engineer” interchangeably, there’s actually quite a bit of difference between these positions. In fact, many folks start out as data analysts or data engineers before deciding to plunge into data science full-time. Let’s break down some terms:

Data Scientist: Data scientists combine statistics, Big Data crunching, analytics tools, and even machine learning to transform massive datasets into things that a business can actually use. As mentioned above, they are prognosticators, using data to inform a company’s strategy.

Data Analyst: Data analysts are tasked with analyzing data for insight, often on a much more “tactical” and smaller scale (especially in terms of computational requirements) than that of data scientists. But like data science, data analysis is a complicated job that often involves communicating with multiple stakeholders, which means an emphasis on “soft skills” such as communication. A “typical” analyst might work with end-users to figure out what they need from the data; analyze the data; glean insights; and then communicate those insights to the larger organization.

Key data analyst tools include:

Data Engineer: Just like a “traditional” engineer might build a dam or some other massive piece of infrastructure, data engineers construct and maintain (often massive) repositories for data, such as the customer-information databases that large companies use. They also monitor the movement and status of data through these systems, which can mean tagging and cleaning huge datasets as they become available. Their preferred tools include:

For those engineers and analysts who decide that they want a more strategic role within their organizations, becoming a data scientist is a good career choice. After all, they already have many of the data-centric skills necessary for a scientist role. Plus, organizations desperate for data-science talent might prove amenable to paying for whatever classes are necessary to elevate an engineer or analyst to data scientist; it's potentially easier and cheaper than sourcing someone from the outside.

Data Scientist Career Paths

For those who decide to take the plunge into the data-science industry, there are a few possible career arcs. For many, getting a master’s degree in some aspect of data science, predictive analytics, machine learning, or statistics is a good way to establish a presence in the industry.

Data Scientist Education

But while universities offer plenty of opportunities to learn the fundamentals of data science, not everyone has a chance to spend three or four years in class. With that in mind, there are also data science bootcamps, such as NYC Data Science Academy and Data Science Dojo, that offer accelerated courses; be warned, however, that those bootcamps are often expensive and intense, and you need to be ready to learn a lot of information very, very quickly.

If you’re good at self-directed learning, you can also check out the following online courses:

Harvard: The lecture notes for CS109 Data Science are online, complete with slides and videos. This course breaks down everything from machine learning and Python to projects (there are also some guest lectures).

Udemy: This online-learning platform offers a course in Python for Data Science and Machine Learning, which is valuable given Python’s growing role in many aspects of data science. This course includes crafting algorithms, visualizations, and more.

Coursera: This Data Science Specialization course features a focus on R, using GitHub to manage data projects, and analyzing data.

edX: Founded by MIT and Harvard, edX offers a (pricey) online-learning curriculum. Its MicroMasters Program in Statistics and Data Science tackles some big topics, including probabilistic modeling, machine-learning algorithms, and deep neural networks.

Metis: This Introduction to Data Science course isn’t for strict beginners—it asks for a background in basic statistical concepts, as well as Python knowledge—but it is comprehensive.  

Fortunately, employers desperate for data scientists are often willing to foot the bill for you to attend a bootcamp or similar course; ask your boss if this is an option.

If you're still exploring whether a data scientist career is right for you, there are also lots of online tutorials and documentation that will allow you to learn about the profession at your own pace, including:

Career Paths

With the maturation of artificial intelligence (A.I.) and machine learning, data scientists also have another potential career path, one that leans heavily on both technologies. Given the excitement at many firms over the potential of machine learning, it’s no surprise that data scientists who embrace these technologies can find their careers fast-tracked; but as anyone who’s explored A.I./ML will tell you, there’s a lot of education involved, so be prepared.

As machine learning and artificial intelligence become a more integral part of data science, you’ll also see data scientists and machine-learning engineers freely jumping between more advanced roles: a machine-learning/A.I. engineer could jump into a data science manager/architect role before becoming a director of data science, while a data scientist could hop over to a senior data scientist role that involves a heavy dose of machine learning, before moving from there to senior management.

How do actual data scientists progress through their careers? What are their biggest challenges, and what do they consider important? We asked a few at different points in their career arc for the full picture.

Data Scientist Entry-Level Roles

If you’re just starting out a career in data science, it’s important to seize any opportunities you can to break into the industry—and that can mean everything from participating in forums, to going to meet-ups, to competing in data-centric contests. Take the case of Balázs Gődény, who used to work for software companies in various positions, ranging from test team leader to principal software architect. More than 10 years ago, he joined Topcoder, a crowdsourcing community of developers and data scientists, and started participating in long-competition formats (called “marathon matches”).

At the time, he had a full-time IT job unrelated to data science, but found participating in the competitions was refreshing, giving him intellectual challenges and satisfaction beyond what he was getting where he worked. “Most of the Topcoder competitions that I have seen can be classified as data-science related,” he said. “They involve extracting information from huge data sets, finding correlations, making sense of data, and creating data-driven solutions to problems that are hard for humans to perform efficiently.”

The whole process, starting with the client's needs and ending with verifying whether the members' solutions actually meet those needs, gave him a very good overview of the whole data science landscape. A similar path could prove useful for anyone interested in a data-science career.

Gődény has also served as a researcher for a company specializing in natural language processing, artificial intelligence, and computational linguistics (i.e., the interactions between computers and natural languages). “We had to solve tasks that seem trivial for a human, but actually very challenging for a machine, like sentiment analysis: judging from a piece of text whether it describes its subject in a positive or negative way,” he said. “Think of irony as a good example when such judgment is very hard to make, sometimes even for us, as humans.”

Gődény said that, for junior data scientists, common sense, some math, modeling skills and basic problem-solving abilities are much more valuable than being an expert in any particular technology: “Knowing any piece of technology is only a tool that can be used to perform a certain task, and each task requires different tools, so the capability of working with new tools is much more important than the knowledge of any specific tool.”

Mid-Level Role: Machine Learning Engineer

Machine learning and artificial intelligence (A.I.) are new and burgeoning fields, but data scientists are already establishing career pathways that leverage these technologies. Costas Boulis, chief scientist at Bright Machines, a San Francisco-based company tackling industrial automation with robotics and software, calls machine learning “core to the mission” of the company.

“We are making robots that require less supervision from humans and are able to adapt to the variations and changes that they would encounter during manufacturing,” he explained. “In other words, we are making robots more autonomous. Machine learning is the main way that robots can learn to grow their autonomy.”

For up-and-coming data scientists, the key thing to remember is that machine-learning solutions must be designed end-to-end. Critical skills include the ability to create and evaluate models, deploy models to production, create appropriate monitoring and logging of the model decisions, and quickly visualize lots of data. “Then of course there is the biggest skill of all—the ability to communicate the value of the solution to non-technical stakeholders,” he added.
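
As a rough illustration of two of those skills, evaluating a model on held-out data and logging its decisions, here is a toy Python sketch. It is not Bright Machines’ approach: the “model” is just a threshold between two class averages, the data is synthetic, and names such as `fit_threshold` are invented for this example.

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model")


def fit_threshold(train):
    """Pick the cut-off halfway between the two class means (a deliberately simple model)."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Classify one input and log the decision so it can be audited later."""
    decision = int(x > threshold)
    log.info("input=%.2f threshold=%.2f decision=%d", x, threshold, decision)
    return decision


# Synthetic data: examples labeled 1 tend to have larger values.
rng = random.Random(0)
data = [(rng.gauss(0, 1), 0) for _ in range(100)]
data += [(rng.gauss(3, 1), 1) for _ in range(100)]
rng.shuffle(data)
train, test = data[:150], data[150:]  # hold out 50 examples for evaluation

threshold = fit_threshold(train)
accuracy = sum(predict(threshold, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Evaluating only on examples the model never saw during fitting gives a more honest picture of how it will behave in production, and the per-decision log lines are the raw material for monitoring.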

Boulis said that, when trying to teach autonomy to a robot (or a child, for that matter), you cannot list all the ways they should react to all the situations that could arise. “It is impossible to program a robot to be autonomous the way programs are currently written as a sequence of commands authored by people,” he said. “We need to reverse the situation; we need to present the output and have the robot come up with the steps that would eventually lead to that output.”

Boulis said his goal is to continue expanding the diversity and value of machine-learning solutions. “It’s hard to find a more fascinating machine learning problem than finding ways to make robots more autonomous.”

Other Data Scientist Mid-Level Roles

Once a data scientist has built up a little bit more experience, they can begin targeting mid-level positions. In this “middle tier,” soft skills such as communication begin to come into play just as much as “hard” analytical ones. John Taylor, manager of analytics and data science at Iovation, says he acts as a business-savvy subject matter expert for different types of data analytics technologies, including machine learning. “Broadly, I aid the data science team in applying methods to valuable authentication and fraud prevention use cases within various budgets,” he said.

Taylor, who has been with Iovation for eight years, leads a team that prototypes, improves, and optimizes data products to combat fraud. “We are locked into an evolutionary arms race with fraudsters, which produces complex and ever-changing risk patterns,” he explained. “This necessitates the application of both human and machine intelligence to be effective.”

Taylor noted that he needs a “broad array” of technical and soft skills to do his job well. “On the technical side, I need mathematical, statistical, data handling, and software engineering skills,” he said. “On the soft side, I need creativity, critical thinking, curiosity, and problem-solving, communication, and business skills.”

Understanding parallelization, scalability, and complexity analysis is also important in this environment, especially since machine learning is applied within the tight time-budget of a real-time data flow. Data scientists who have risen into the middle tiers of their profession are usually skilled enough to accomplish tasks quickly; they can also increasingly rely on their intuition to draw conclusions about datasets and results.

“My goal is to continue to leverage my data science expertise in more impactful ways,” Taylor said. “My current position bridges the strategic and tactical perspectives, allowing me to both shape data products that align with our corporate strategy and to provide data insights to our business direction. Ultimately, this will likely lead to strategic data science positions.”

Data Scientist Advanced-Level Roles

In order to climb to the very top as a data scientist, you must demonstrate an ability to guide teams and confidently oversee strategic data analyses of all types. You must also stay aware of the latest technologies.

Prateek Jain works as director of data science at AppZen, and his team is working on solving complex problems in financial auditing, using cutting-edge techniques from natural language processing to computer vision. Prior to joining AppZen, Jain worked for the IBM Watson Research Center and Nuance Communications; he holds a Ph.D. in computer science and engineering from Wright State University in Dayton, Ohio. “In both places, my roles and responsibilities were working on impactful projects for apps in a commercial setting, and AppZen reached out to me a year ago,” he said.

His role at AppZen involves a few different components, including the identification of strategic projects that are useful for customers. He also helps the company grow through the application of technologies such as deep learning and computer vision. His other (critical) role is to hire talented individuals and mentor them.

“I’m responsible for reaching out to candidates, interviewing them, understanding the new technologies and how it fits into our company,” he said. “That includes looking into the current state of the art, reaching out to authors behind the research work, and engaging in a conversation with them.”

He explained that, by keeping up with cutting-edge technologies such as machine learning, he can deliver meaningful results. Those technologies can also draw hot new talent to the company: Many candidates in the data science field are looking for opportunities to work with the next generation of platforms and tools.

“My goal is to constantly become better at what I do, while at the same time help people get started in their career. My role at AppZen puts me in the unique position to have the first component, and the problems we are trying to solve are so complex that I have the opportunity to engage with people just starting their career,” Jain said. “I’m really lucky to be able to satisfy both of my desires with respect to my career.”

While running a company’s entire data-science operation can prove a difficult job, it’s also a rewarding one for those with the right combination of skills. You’re playing with the biggest and most powerful tools on the biggest stage—and the decisions you make can determine whether a company succeeds or fails.

Data Scientist Salaries

A data scientist’s salary correlates with years of experience, as well as specialization. In short, the longer you stay in the industry, and the more skills you learn, the more you advance and the bigger your paychecks. We analyzed Dice’s database and came up with these salary numbers for various data-science roles, as well as the average length of time a data scientist spends in each position. Let's start with salary:

The good news is, it's very possible for someone on a data-scientist track to attain a lucrative position within a relatively short period of time. But when it comes to higher earnings (and job security), specialization and skills are key. As data science becomes “sexier” as a career path, more people are attracted to it, and the talent pool swells. Standing out amidst that pool depends on your ability to do what others simply can’t.

It’s clear that education is key when it comes to data-science positions. According to Burning Glass, some 60.3 percent of those who are already data scientists have a bachelor’s degree, while 30 percent have a master’s degree; another 9.7 percent have a doctorate.

When it comes to experience, some 53.6 percent have been in the data-science industry for 3 to 5 years, while 20.7 percent have been there less than two years. Another 16.4 percent have been around for 6 to 8 years, and a mere 9.3 percent have been there for longer than a decade—not surprising, given the relative newness of the profession.

Data Scientist Demand

But your typical data scientist can’t rest on their proverbial laurels, safe in the assumption that companies are ravenous for anyone who can crunch data. In early 2019, a blog posting by Vicky Boykis, senior manager for data science and engineering at CapTech Ventures, hypothesized that an influx of new people to data science might be leading to an oversupply of talent. “Based on my own participation as a resume screener, mentor to data scientists leaving boot camps, interviewer, interviewee, and from conversations with friends and colleagues in similar positions,” she wrote, “I’ve developed an intuition that the number of candidates per any given data science position, particularly at the entry level, has grown from 20 or so per slot, to 100 or more.”

An analysis of Dice job postings shows that postings for data scientists rose at a steady rate between 2016 and 2018, with a huge spike early last year—only to collapse back to 2017 levels by summer. After that, employer demand for data scientists began to crawl higher again—but much more slowly than before. If the market is indeed becoming saturated (despite the demand), it behooves anyone interested in becoming a data scientist to think about specialization early and often.

Meanwhile, demand for data scientists continues unabated, with California leading the way when it comes to job postings over the past 12 months, median salary, and time needed to fill various data-scientist positions (itself a sign of high demand). Here's the full breakdown of the top states:

Healthcare, technology, and consulting have the biggest need for data-science talent, based on their job postings. And that makes total sense: All of these industries wrestle with the need to glean useful insights from epic amounts of data. If you're hiring hundreds of data scientists per year, data is clearly at the forefront of your strategy as an organization. If you're just starting out your data-science career, you can look at statistics like that and take heart; this is a job that will most likely attract demand for quite some time to come.

That demand means that data scientists can also rise relatively quickly up the career ladder. Take a look at Dice's breakdown of the average years of experience required for each of the following roles:

Indeed, there are lots of opportunities to enter and build a career as a data scientist—provided you have the right mix of skills and experience.

Nate Eddy and Jeff Cogswell provided extensive reporting for this article.