Opening Day? Analytics Tools to Predict the World Series Now

Billy Beane and Theo Epstein are right: Baseball isn’t all about skills, tactics and chewing tobacco or bubble gum. It’s about numbers, and maybe even numbers first.

It’s Big Data in the big leagues, baby. That means if you’re trying to figure out how optimistic, or nervous, you should be about your team’s performance this season, analytics can help. For those of you who don’t want to crunch your numbers via nostalgic pencil and paper or Microsoft Excel, there are a whole lot of tools out there to use, like SAS, Cognos and dozens more. They’ll help you get the most out of the models and algorithms you’ll need to sweep your competitors in your fantasy baseball league, or decide what kind of mood you should get ready for come the All-Star break.


As an example, here’s how sports number-crunching junkie Jasper Snoek approaches it:

Having the right requirements for the analysis up front is key. There are many algorithms out there, each with its own quirks, and everyone has their own thoughts about what makes for the best predictions and calculations. For instance, you can create factors for baseball (or any other sport) teams to predict their game scores. The factors learned by your model could correspond to offensive skill and defensive capability, for example. You might take all of the teams’ scores from the past season and plug them into the model, along with which team in each match-up is the home team and which is the away team, then adjust the team factors and game factors to maximize the probability of the real scores.
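To make the idea of team factors concrete, here is a minimal sketch in Python. It is not Snoek’s actual model. It assumes each team’s runs in a game follow a Poisson distribution whose rate depends on that team’s offense factor, the opponent’s defense factor and a shared home-field bonus, and it nudges those factors by gradient ascent to raise the likelihood of the observed scores. The function names, the three teams and every score below are made up for illustration.

```python
import math

def fit_team_factors(games, n_iters=2000, lr=0.01):
    """Fit offense/defense factors by maximizing a Poisson log-likelihood.

    games: list of (home_team, away_team, home_runs, away_runs).
    Assumed model (an illustration, not an established standard):
      log E[home runs] = base + home_adv + offense[home] - defense[away]
      log E[away runs] = base + offense[away] - defense[home]
    """
    teams = sorted({t for g in games for t in g[:2]})
    offense = {t: 0.0 for t in teams}
    defense = {t: 0.0 for t in teams}
    n = len(games)
    # Start from the log of the average runs per team per game.
    base = math.log(sum(g[2] + g[3] for g in games) / (2 * n))
    home_adv = 0.0

    for _ in range(n_iters):
        g_off = {t: 0.0 for t in teams}
        g_def = {t: 0.0 for t in teams}
        g_base = g_home = 0.0
        for home, away, home_runs, away_runs in games:
            lam_h = math.exp(base + home_adv + offense[home] - defense[away])
            lam_a = math.exp(base + offense[away] - defense[home])
            # Poisson log-likelihood gradient w.r.t. the log-rate: observed - expected.
            err_h = home_runs - lam_h
            err_a = away_runs - lam_a
            g_off[home] += err_h
            g_def[away] -= err_h
            g_off[away] += err_a
            g_def[home] -= err_a
            g_base += err_h + err_a
            g_home += err_h
        for t in teams:
            offense[t] += lr * g_off[t] / n
            defense[t] += lr * g_def[t] / n
        base += lr * g_base / n
        home_adv += lr * g_home / n
    return offense, defense, base, home_adv

def expected_runs(model, home, away):
    """Expected (home_runs, away_runs) for a match-up under the fitted model."""
    offense, defense, base, home_adv = model
    lam_home = math.exp(base + home_adv + offense[home] - defense[away])
    lam_away = math.exp(base + offense[away] - defense[home])
    return lam_home, lam_away

# Toy, invented results: (home, away, home_runs, away_runs).
GAMES = [
    ("A", "B", 7, 2), ("B", "A", 3, 6),
    ("A", "C", 8, 1), ("C", "A", 2, 5),
    ("B", "C", 4, 3), ("C", "B", 2, 4),
]

model = fit_team_factors(GAMES)
print(expected_runs(model, "A", "C"))
```

With a full season of real scores in place of the toy list, the fitted offense and defense numbers give a rough ranking, and expected_runs gives a projected score for any match-up. Only the differences between teams’ factors are meaningful here, since the sketch does nothing to pin down their absolute level.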

Or, you can try a model built around these data points:

  • RPI (Rating Percentage Index), which ranks teams based on wins and losses, as well as “strength of schedule.”
  • Wins against top 25 teams.
  • Wins against teams ranked 26-50.
  • “Neutral” venue wins.
  • Record and rank in conference.
  • Strength of conference.
  • Sagarin rankings from USA Today.
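Of the data points above, RPI is the one with a simple closed formula, so it makes a handy worked example. The sketch below assumes the commonly cited basic version: 25 percent the team’s own winning percentage, 50 percent its opponents’ winning percentage (leaving out games against the team itself) and 25 percent its opponents’ opponents’ winning percentage. Some official versions also weight home and road results differently, which this sketch ignores. The rpi function name and the three-game schedule are made up for illustration.

```python
from collections import defaultdict

def rpi(games):
    """Basic RPI for every team. games: list of (winner, loser) pairs."""
    results = defaultdict(list)  # team -> list of (opponent, won)
    for winner, loser in games:
        results[winner].append((loser, True))
        results[loser].append((winner, False))

    def win_pct(team, exclude=None):
        # Winning percentage, optionally ignoring games against one opponent.
        played = [won for opp, won in results[team] if opp != exclude]
        return sum(played) / len(played) if played else 0.0

    def opp_win_pct(team):
        # Average winning percentage of the team's opponents,
        # not counting their games against the team itself.
        opps = [opp for opp, _ in results[team]]
        return sum(win_pct(o, exclude=team) for o in opps) / len(opps)

    def opp_opp_win_pct(team):
        # Average of the opponents' own opponent winning percentages.
        opps = [opp for opp, _ in results[team]]
        return sum(opp_win_pct(o) for o in opps) / len(opps)

    return {
        t: 0.25 * win_pct(t) + 0.50 * opp_win_pct(t) + 0.25 * opp_opp_win_pct(t)
        for t in list(results)
    }

# Toy, invented schedule: A beat B, A beat C, B beat C.
ratings = rpi([("A", "B"), ("A", "C"), ("B", "C")])
for team, value in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(team, round(value, 3))
```

Half of the rating comes from who a team played rather than whether it won, which is where the “strength of schedule” part comes in.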

Of course, to make any of this happen, you’ve got to develop some sort of algorithm. One example is what’s called a “machine learning algorithm.” Machine learning is a branch of artificial intelligence that’s about the construction and study of systems that can learn from data. For example, such a system could be trained to distinguish between spam and non-spam email messages. Then it can act by sorting messages into spam and non-spam folders as they come in.
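The spam example can be sketched in a few lines. The version below uses a naive Bayes classifier, which is one common choice for this job and is picked here only for illustration. It counts how often each word appears in spam and non-spam training messages, then labels a new message with whichever class makes its words more probable. The function names and the four training messages are invented.

```python
import math
from collections import Counter

LABELS = ("spam", "not_spam")

def train(messages):
    """messages: list of (text, label) pairs, label being one of LABELS."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set().union(*(word_counts[label] for label in LABELS))
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the higher naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in LABELS:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, invented training data.
TRAIN = [
    ("win free tickets now", "spam"),
    ("free prize claim now", "spam"),
    ("lineup for tonight game", "not_spam"),
    ("meeting about the game schedule", "not_spam"),
]

model = train(TRAIN)
folders = {label: [] for label in LABELS}
for incoming in ["claim your free prize", "game schedule for tonight"]:
    folders[classify(model, incoming)].append(incoming)
print(folders)
```

The same train-then-classify shape carries over to sports: swap the messages for past games and the labels for wins and losses.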

The core of machine learning deals with representation and generalization. Representing data instances, such as email messages, and the functions evaluated on them is part of all of these systems. Generalization is the property that a system will perform well on data instances it hasn’t seen before. Pattern recognition and the computation of probabilities are also common concepts here.

Similar types of algorithms are used to predict all kinds of outcomes, like weather or the possibility of tornadoes. The algorithms and models that include the best set of predictors will be the ones that make the most accurate predictions.

Which, ironically, brings us back to where we started. The Billy Beanes and Theo Epsteins of the world, who changed the game by making data the focus of their team building and game-day management, still had to rely to a degree on their instincts and human expertise as they built their models and algorithms. In the end, their lineup for each game came from including the right statistics and predictors, not from the observed skill of the players those numbers were based on.

Good luck this season.