How We Data-Mine Related Tech Skills

One of the earliest and most interesting data science projects we’ve embarked on is automatically learning which professional skills relate to one another, based on our data. For instance, if someone lists data science as a skill, it’s likely they also know machine learning, R or Python, and Hadoop.

This project has a lot of potential uses on our site, from suggesting related skills when a user adds technology skills to their profile, to enhancing our search and job recommendation platforms by matching across related skills. It can also be used to suggest related keywords while a user is typing a search query.


How do we determine if two different words are related? A research area in the field of Natural Language Processing, called “Distributional Semantics,” addresses this problem. Distributional Semantics attempts to determine the meaning of a word by examining the different contexts in which it occurs. To quote John Firth (1957), a pioneer in this field, “a word is characterized by the company it keeps”; by looking at the words that occur within the same context as the original word, such as the same sentence or document, you can understand the word’s meaning. That means if you look at a lot of documents about cars, you’ll see terms such as “tires”, “wheels”, and “automobiles” occurring in a lot of the same documents. You can then use these co-occurrence statistics to determine which terms relate to one another.
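To make the idea of co-occurrence statistics concrete, here is a minimal sketch that counts how many documents each pair of terms shares. The toy documents and the function name `cooccurrence_counts` are hypothetical and chosen only for illustration; they are not taken from our pipeline.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, each reduced to the set of terms it contains.
documents = [
    {"car", "tires", "wheels"},
    {"car", "wheels", "automobile"},
    {"python", "hadoop"},
]


def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of terms appears in together."""
    counts = Counter()
    for terms in docs:
        # Sorting gives every pair a canonical (alphabetical) order.
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence_counts(documents)
print(counts[("car", "wheels")])  # 2: the pair shares two documents
print(counts[("car", "python")])  # 0: the pair never co-occurs
```

Pairs with high counts are candidates for related terms; the rest of the article refines this raw signal with weighting and dimensionality reduction.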


So how did we turn this concept into an algorithm that can learn related technology skills? First we used a parser we developed to extract technology skills from job postings. Then we constructed a “Term-Document Matrix” from these lists of skills. Also called a “Vector Space Representation,” this matrix features a row per skill, along with a column for each document (or job in this case), as illustrated below:

[Figure: term-document matrix, with one row per skill and one column per job]


Incidentally, this is the data structure used by search engines for performing Web searches. In this format, each skill is represented as a vector (a list of numbers) encoding the documents in which it occurs (hence the term “Vector Space Representation”). However, a skill is clearly more important to some documents than others; for instance, Java is a more important skill to a Java developer than to a Scala or Hadoop developer. To reflect this, we use a weighting scheme called tf-idf weighting, also borrowed from the field of Information Retrieval, which studies search engine design.
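As a rough illustration of the term-document matrix described above, the sketch below builds one from a handful of made-up job postings. The `jobs` data and the helper `build_term_document_matrix` are assumptions for this example; the real input comes from our skill parser, which is not shown here.

```python
# Hypothetical parser output: each job posting is a list of extracted skills.
jobs = [
    ["java", "spring", "hibernate"],
    ["java", "hadoop"],
    ["python", "hadoop", "machine learning"],
]


def build_term_document_matrix(jobs):
    """Return (skills, matrix) with one row per skill and one column per job."""
    skills = sorted({skill for job in jobs for skill in job})
    row_of = {skill: i for i, skill in enumerate(skills)}
    matrix = [[0] * len(jobs) for _ in skills]
    for col, job in enumerate(jobs):
        for skill in job:
            # Raw count of how often the skill occurs in this job.
            matrix[row_of[skill]][col] += 1
    return skills, matrix


skills, matrix = build_term_document_matrix(jobs)
for skill, row in zip(skills, matrix):
    print(f"{skill:18s} {row}")
```

Each printed row is the vector for one skill: a non-zero entry in column j means the skill was found in job j.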

The tf-idf calculation multiplies a measure of term frequency (tf)—how often a word occurs in the current document—with inverse document frequency (idf), which measures how rarely the word appears in a set of documents. The intuition behind tf-idf is that words that are important to a document occur in that document much more frequently than they do in most other documents. A lot of prior research has shown that weighting words in this way is very effective for determining which words best represent a document.
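The calculation can be sketched in a few lines. This example assumes one common variant, raw counts for tf and log(N / df) for idf; the article does not say which variant we use in production, so treat the exact formula and the name `tfidf_weight` as illustrative.

```python
import math


def tfidf_weight(matrix):
    """Apply tf-idf to a term-document count matrix (rows = terms, columns = documents).

    Assumed variant: tf is the raw count, idf is log(N / df).
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        # Document frequency: number of documents containing the term.
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


# Hypothetical counts for three terms across four documents.
counts = [
    [1, 1, 1, 1],  # occurs everywhere, so it carries no weight
    [2, 0, 0, 0],  # frequent in one document and rare elsewhere
    [0, 1, 1, 0],
]
weights = tfidf_weight(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

A term that occurs in every document gets a weight of zero, while a term concentrated in a few documents is weighted up, which matches the intuition above.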

Now that we have tf-idf vectors, we can compute similarities between pairs of skills by computing the similarity of their vectors. However, using the raw tf-idf vectors poses two problems. There is a lot of inherent noise in the data, as is typical with textual data, and the vectors are large: we typically take a sample of 100,000 jobs to compute skill similarities, which means each vector has 100,000 elements. We can solve both problems by performing a dimensionality reduction on the data, reducing the number of elements per vector from 100,000 to 1,000, using an algorithm called truncated singular value decomposition. Dimensionality reduction can be considered a form of lossy data compression, wherein we compress the data by discarding the components that capture the least variation in the data, which are mostly noise, and combining columns that are very similar to form the new columns. Combining similar columns is important for matching rare skills, which appear in too few jobs to overlap reliably in the raw vectors.

To create the term-document matrix and perform the dimensionality reduction, we use the Python gensim package and its LSA module. LSA normally reduces the term dimension; we instead reduce the document dimension, so that each skill keeps its own, much shorter vector. LSA is computationally intensive, but gensim uses an iterative approximation, which is very fast and scales very well.
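The reduction over the document dimension can be sketched as follows. Note that this uses NumPy's exact SVD as a stand-in, not gensim's iterative approximation, and the tiny matrix `X` and the function name `reduce_skill_vectors` are made up for illustration.

```python
import numpy as np


def reduce_skill_vectors(X, k):
    """Reduce a skills x documents matrix to skills x k via truncated SVD.

    Stand-in for gensim's LSA: an exact SVD, truncated to the k largest
    singular values (NumPy returns them in descending order).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale each kept left-singular vector by its singular value.
    return U[:, :k] * s[:k]


# Hypothetical tf-idf weights: 4 skills (rows) across 6 jobs (columns).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 1.0],
])
reduced = reduce_skill_vectors(X, k=2)
print(reduced.shape)  # (4, 2): one 2-element vector per skill
```

Every skill still has its own row, but the six job columns have been compressed into two. On real data the same step would take each vector from 100,000 elements down to 1,000.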

Once we have these 1,000-dimensional vectors for each skill, we can use the cosine similarity metric to calculate how similar two skills are from their vectors, and thus compute the most similar skills for a given skill. Here are some examples of skills with their five most similar skills:

Data Science: Machine learning, Apache Flume, HDFS, Big data, Product marketing

Linux: Unix, Linux administration, Red Hat Linux, C++, Apache HTTP Server

Java: J2EE, Spring, Hibernate, Web services, Apache Struts

HTML: CSS, JavaScript, jQuery, Web development, UI

Networking: Cisco, Network engineering, Network management, Hardware, Cisco Certifications

Project Management: PMP, Project planning, IT project management, Budget, Microsoft Project

I think that’s pretty cool, given we’re generating these lists automatically from job descriptions posted on our site. We also tried using the resume dataset, but the results were of lower quality, because the skills extracted from a single resume can come from several different jobs.
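The ranking behind lists like these can be sketched as below. The three-element vectors are invented for the example (the real ones have 1,000 elements), and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(skill, vectors, top_n=5):
    """Rank all other skills by cosine similarity to the given skill."""
    target = vectors[skill]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != skill
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical reduced skill vectors.
vectors = {
    "java": [1.0, 0.2, 0.0],
    "spring": [0.9, 0.3, 0.0],
    "hadoop": [0.1, 0.2, 1.0],
}
for name, score in most_similar("java", vectors, top_n=2):
    print(f"{name}: {score:.3f}")
```

In this toy data, "spring" ranks above "hadoop" for "java" because its vector points in nearly the same direction.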

Another interesting thing: now that we have a way to compute the similarity of skills, we can cluster similar skills into groups of related items. We can use an off-the-shelf clustering algorithm to cluster the skill vectors by their cosine similarity; for this, I used affinity propagation, a popular graph-clustering algorithm, because it’s known to work well on short documents. Unlike other clustering algorithms such as k-means, where you have to specify the number of clusters up front, affinity propagation automatically discovers the number of clusters in a dataset. Here are some example clusters, with some descriptive labels:

Microsoft BI Stack: MDX, Microsoft BI, Microsoft SSAS, Microsoft SSIS, Microsoft SSRS

Design Skills: Adobe CS, Adobe InDesign, Animation, Brand, Graphics, Interface design, Logos, Typography

Java Frameworks: Apache Struts, GWT, Hibernate, J2EE, Java, Spring, Spring MVC

Big Data NoSQL: Apache Cassandra, Big data, MongoDB, NoSQL, Scalability

Web Technologies: Ajax, CSS, HTML, HTML5, JavaScript, Responsive design, UI, Web development, jQuery, jQuery UI

Data Science AI: Artificial intelligence, Data mining, Data science, Machine learning, Natural language processing
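Clustering by cosine similarity with affinity propagation might look like the sketch below. It assumes scikit-learn's `AffinityPropagation` as the off-the-shelf implementation; the article does not name the library we used, and the skill vectors here are invented for illustration.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical reduced skill vectors (the real ones have 1,000 elements).
skills = ["java", "spring", "hibernate", "css", "html", "jquery"]
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 0.2, 1.0],
])

# Cosine similarity matrix: normalise the rows, then take dot products.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

# affinity="precomputed" tells scikit-learn that we pass similarities directly.
# The number of clusters is not specified; the algorithm chooses it.
model = AffinityPropagation(affinity="precomputed", random_state=0)
labels = model.fit_predict(similarity)

clusters = {}
for skill, label in zip(skills, labels):
    clusters.setdefault(int(label), []).append(skill)

for members in clusters.values():
    print(members)
```

The number of clusters found depends on the `preference` parameter, which defaults to the median similarity; lowering it tends to produce fewer, larger clusters.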

Simon Hughes is the chief data scientist of the Dice Data Science Team.


Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. Oxford Philological Society.


5 Responses to “How We Data-Mine Related Tech Skills”

  1. Fred Dietz

    You have just documented the problem with recruiting today and our society in general. Organizations from the US Gov. to the local blood drive want to database and then categorize everyone and everything. Recruiters want to put everyone into a group. Be it CYA or I can only do one thing so that means everyone can only do one thing mentality, it frustrates the best and brightest. A lot of people are good at a number of complex tasks that cross groups such as those in the article. Yes, I know the article is trying to illustrate how skill grouping enhances searches but experience shows that’s not how most recruiters think. Recruiters, if nothing else, are experts at CYA. If a candidate doesn’t work out they want to be able to say: “well, they did that exact task at Wingnutt Industries for 3 years.”
    The best and brightest like to learn and enjoy the challenge of mastering new skills. If you don’t fit neatly in one of these groups or worse, fit into a lot of them, chances are the typical recruiter is going to pass you by. Unfortunately, that leaves a lot of the best and brightest out in the cold while those happy doing the same tasks year after year are rewarded. Nothing wrong with getting good at something and sticking with it but all the while CEOs lament the lack of innovation within their company. DUH!

    • All I describe in the article is a method of determining which skills are related. I show clustering as an illustration of one use of this technique; there are many others. I think you misunderstand the nature and point of the article. This technique can relate to any skill, rare or common. It does not force people into buckets; it just allows us to broaden the search by finding skills similar to the set a person already has. That greatly broadens the diversity of candidates that can be found, rather than narrowing it: people with unusual combinations of skills can now appear in searches where they would previously have been ignored because they don’t fit some skills template.

  2. I got here from yesterday’s post “Dice Data: How Technical Skills Connect” by Yuri Bykow. How do you decide what a “technology skill” is? I.e., how did the document parser make that decision?
    I’m assuming here that it can’t simply be a matter of choosing the most common words from job adverts as I assume that would lead to the inclusion of words common in job adverts which are not skills (e.g. “job”), and to the exclusion of rare skills.
    I’m asking because I’m curious how “translation”, “localization” and “internationalization” connect to other skills (I’m doing research on interdisciplinary collaboration) but unfortunately couldn’t find it in the graph. I would argue that “localization” is not less of a technical skill than “copywriting” and appears in a few job postings. “Internationalization”, I would argue, is fairly technical.
    Anyway, a very interesting posting. Any plans to make a peer-reviewed publication out of this?

  3. Thanks for the article. It does help in one of my works which is related to the same topic. Are the complete results available for viewing? I was wondering if there exists a published baseline for the semantic similarity for technical skills. That would greatly help in comparing and assessing new approaches or improvement. Would be helpful if you could share pointers to research works which establish a fair baseline. If this article evolved to a paper, would be excited to read about it.