Data Scientist Training: Resources and Tips for What to Learn

Data science is a complex field that requires its practitioners to think strategically. Day to day, the work draws on database administration and data analysis, along with expertise in statistical modeling (and even machine learning algorithms). As you might expect, it also takes a whole lot of training before you can plunge into a career as a data scientist.

There are a variety of training options out there for data scientists at all points in their careers, from those just starting out to those looking to master the most cutting-edge tools. Here are some platforms and training tips for all data scientists.

Just Starting Out? Consider These Resources.

Kevin Young, senior data and analytics consultant at SPR, says that many data scientists treat Kaggle as a go-to learning resource. Kaggle is a Google-owned machine learning competition platform with a series of friendly courses to get beginners started on their data science journey.

Topics covered range from Python to deep learning and more. “Once a beginner gains a base knowledge of data science, they can jump into machine learning competitions in a collaborative community in which people are willing to share their work with the community,” Young says.

In addition to Kaggle, there are lots of other online resources that data scientists (or aspiring data scientists) can use to boost their knowledge of the field. Here are some free resources:

And here are some that will cost (although you’ll earn a certification or similar proof of completion at the end):

This is just a portion of what’s out there, of course. Fortunately, the online education ecosystem for data science is large enough to accommodate all kinds of learning styles.

Building Familiarity with Data Structures, Analysis

Seth Robinson, vice president of industry research at CompTIA, explains that individuals near the beginning of a data science career will need to build familiarity with data structures, database administration, and data analysis.

Database administration is the most established job role within the field of data, and there are many resources teaching the basics of data management, the use of SQL for manipulating databases, and the techniques of ensuring data quality. “Beyond traditional database administration, an individual could learn about newer techniques involving non-relational databases and unstructured data,” he adds.
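As a rough illustration of the SQL and data-quality basics mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its columns, and the sample rows are all hypothetical, invented for this example, and real data-quality work would cover far more than these two checks.

```python
import sqlite3


def load_and_check(rows):
    """Load rows into an in-memory SQLite table and run two basic quality checks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER, email TEXT, signup_year INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

    # Quality check 1: count rows with a missing email.
    missing_email = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL"
    ).fetchone()[0]

    # Quality check 2: list ids that appear more than once.
    duplicate_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
        )
    ]
    conn.close()
    return {"missing_email": missing_email, "duplicate_ids": duplicate_ids}


# Hypothetical sample data: one missing email and one repeated id.
SAMPLE_ROWS = [
    (1, "a@example.com", 2020),
    (2, None, 2021),
    (2, "b@example.com", 2021),
    (3, "c@example.com", 2022),
]

report = load_and_check(SAMPLE_ROWS)
print(report)
```

The same queries would work against most relational databases with little change, which is part of why SQL tends to be an early stop in data training.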

Training for data analysis is newer, but resources such as CompTIA’s Data+ certification can add skills in data mining, visualization, and data governance. “From there, specific training around data science is even more rare, but resources exist for teaching or certifying advanced skills in statistical modeling or strategic data architecture,” Robinson says.

Two Groups of Data Science Training

Young cites two main segments of data science training: model creation and model implementation.

Model creation training is the more academic application of statistical models on an engineered dataset to create a predictive model: This is the training that most intro to data science courses would cover.

“This training provides the bedrock foundations for creating models that will provide predictive results,” he says. “Model creation training is usually taught in Python, and covers the engineering of the dataset, creation of a model and evaluation of that model.”
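The three steps Young lists (engineering the dataset, creating a model, and evaluating it) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the model is a one-feature linear regression, and all function names are hypothetical. An intro course would more likely use a dedicated library for each step.

```python
import random


def make_dataset(n=200, seed=0):
    """Stand-in for dataset engineering: synthetic data where y is about 3*x + 2."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def train_test_split(xs, ys, test_fraction=0.25):
    """Hold out the last part of the data for evaluation."""
    cut = int(len(xs) * (1 - test_fraction))
    return xs[:cut], ys[:cut], xs[cut:], ys[cut:]


def fit_linear(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and true values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs, ys = make_dataset()
x_train, y_train, x_test, y_test = train_test_split(xs, ys)
model = fit_linear(x_train, y_train)
test_mse = mean_squared_error(model, x_test, y_test)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MSE={test_mse:.3f}")
```

Evaluating on held-out rows rather than the training rows is the key habit here: it gives a fairer estimate of how the model would behave on data it has not seen.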

Model implementation training opportunities cover the step after the model is created, which is getting the model into production. This training is often vendor- or cloud-specific, and focuses on getting the model to make predictions on live incoming data. “This type of training would be through cloud providers such as AWS giving in-person or virtual education on their machine learning services such as SageMaker,” Young explains.

These cloud services provide the ability to take machine learning models produced on data scientists’ laptops and persist the model in the cloud, allowing for continual analysis. “This type of training is vital as the time and human capital are usually much larger in the model implementation phase than in the model creation phase,” Young says.

This is because when models are created, they often use a smaller, cleaned dataset from which a single data scientist can build a model. When that model is put into production, engineering teams, DevOps engineers, and/or cloud engineers are often needed to create the underlying compute resources and automation around the solution.

“The more training the data scientist has in these areas, the more likely the project will be successful,” he says.
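To make the hand-off from model creation to model implementation a little more concrete, here is a minimal local sketch of the idea: persist a trained model's parameters, load them in a separate step, and score incoming records. It does not use any cloud service or vendor API. The file name, the parameter values, and the record format are all assumptions made for illustration, and a production setup would add the compute resources and automation described above.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Persist model parameters as JSON so another process can load them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)


def load_model(path):
    """Read model parameters back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict(model, record):
    """Score one incoming record with a one-feature linear model."""
    return model["slope"] * record["x"] + model["intercept"]


# Hypothetical parameters standing in for a model trained elsewhere.
trained = {"slope": 3.0, "intercept": 2.0}

# A temporary directory stands in for durable storage.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(trained, path)
    served = load_model(path)

# Simulated stream of incoming records.
incoming = [{"x": 1.0}, {"x": 2.5}, {"x": 4.0}]
predictions = [predict(served, r) for r in incoming]
print(predictions)
```

Even in this toy version, the saving, loading, and scoring code is separate from the training code, which hints at why implementation tends to involve more people and more engineering than creation.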

Training Remotely Gains Traction

Young says one of the lessons learned during the pandemic is that professionals in technology roles can be productive remotely. “This blurs the lines a bit on the difference between boot camps compared to online courses as many boot camps have moved to a remote model,” he says. “This puts an emphasis on having the ability to ask questions to a subject matter expert irrespective of whether you are in a boot camp or online course.”

He adds that certifications can improve an organization’s standing with software and cloud vendors. “This means that candidates for hire move to the top of the resume stack if they have certifications that the business values,” Young says.

For aspiring data scientists deciding between boot camps and online courses, he says the most important point of comparison is probably the career resources on offer. “A strong boot camp should have a resource dedicated to helping graduates find employment after the boot camp,” he says.

A Lifetime of Learning—Reimbursed by the Organization 

Robinson adds it’s important to note that data science is a relatively advanced field.

“All technology jobs are not created equal,” he explains. “Someone considering a data science career should recognize that the learning journey is likely to be more involved than it would be for a role such as network administration or software development.”

Young agrees, adding that data scientists need to work in a collaborative environment with other data scientists and subject matter experts reviewing their work. “Data science is a fast-developing field,” he says. “Although fundamental techniques do not change, how those techniques are implemented does change as new libraries are written and integrated with the underlying software on which models are built.”

From his perspective, a good data scientist is always learning, and any strongly positioned company should offer reimbursement for credible training resources.

Robinson notes in-house resources vary from employer to employer, but points to a macro trend of organizations recognizing that workforce training needs to be a higher priority. “With so many organizations competing for so few resources, companies are finding that direct training or indirect assistance for skill building can be a more reliable option for developing the exact skills needed, while improving the employee experience in a tight labor market,” he says.