In many Big Data analysis blogs, at Big Data meetups and in the halls of the most recent O’Reilly Strata Conference, one of the most-discussed topics is which language is better for data analysis: Python or R. Some of the talk has even taken on “religious” overtones, not unlike earlier debates over Windows vs. Linux or Microsoft’s Internet Explorer vs. Mozilla Firefox.
So what’s the issue here? Why are Big Data analysts so concerned with what language to use? In my honest opinion, the root of the issue probably has more to do with the tools they learned on than anything else. But let’s briefly look at each.
Python is a general-purpose scripting language that can handle many tasks, from complex data processing and data munging to the implementation of mathematical and algorithmic functions for machine learning. Many developers are comfortable with Python since it’s easier to learn than R.
As Python is a scripting language, it allows the data analyst to easily play around with data sources and data parsing ad hoc without using a formal programming model. With additional libraries you can do text mining, vectorize the text data and identify similarities between posts and texts.
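As a rough illustration of vectorizing text and comparing posts, here is a minimal sketch that uses only Python's standard library. The function names `vectorize` and `cosine_similarity` and the sample posts are made up for this example. Bag-of-words counts with cosine similarity are just one simple approach, and a dedicated text-mining library would normally do this work.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample posts, invented for illustration only.
posts = [
    "Python is great for data munging",
    "R is great for statistics",
    "Julia is built for speed",
]
vectors = [vectorize(p) for p in posts]
for i in range(len(posts)):
    for j in range(i + 1, len(posts)):
        score = cosine_similarity(vectors[i], vectors[j])
        print(i, j, round(score, 3))
```

A score near 1 means two posts use much the same words, and a score of 0 means they share none. Weighting schemes such as TF-IDF usually give better results on real text.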
Python also supports object-oriented programming (OOP), so you can write structured, modular applications should that be your choice. This can be seen as an advantage over R.
R is an extremely rich environment, especially when you get into statistics. Inference, statistical modeling and plotting your data as a bar chart, pie chart or histogram are trivial in R, as it’s designed for statistical modeling using vectors and matrices.
As R was created by statisticians for statisticians, someone who has a general knowledge of statistics usually finds it exceptionally easy to master. Programmers of other languages also seem to have an easy time learning and using it.
If you’re a data analyst who wants to see data distributions before drawing conclusions, R allows you to visualize outliers and data density. For probabilistic problems, distributions and linear regression, R’s easy data manipulation with vectors and matrices makes life exceptionally simple.
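For readers coming from the Python side, here is a minimal sketch of the two tasks mentioned above, spotting outliers and fitting a linear regression, written with only Python's standard library. The function names `iqr_outliers` and `fit_line` and the sample numbers are assumptions made for this example. The 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements, invented for illustration only.
readings = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print("outliers:", iqr_outliers(readings))

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print("slope:", slope, "intercept:", intercept)
```

In R, the same ground is covered by built-ins such as `boxplot()` and `lm()`, which is the kind of convenience this section describes.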
With R’s statistics-rich library of algorithms, there’s no need to understand the specifics of data types, as would be required with Python. R has a tremendous following and support, especially from the academic and commercial statistics communities, and now the Big Data analytics community.
Python vs. R?
Should you use one over another in Big Data analytics? I think that both are valuable and you should examine specifically what problem you’re trying to solve. Both Python and R need to be in the data scientist’s and data analyst’s tool box, and a skilled Big Data professional should be ready to use either, depending on the problem they’re working on.
A recent survey of data scientists and data miners by KDNuggets found that “R has a solid lead, and was used by about 77 percent of the voters. Python was used by about 32 percent of voters.” When it comes to pay, the data scientists and data analysts who had the highest salaries knew R, according to Dice.
Is R better than Python? For some things. From a systems performance standpoint, it seems that the performance of R and Python is very much the same.
An Alternative: Julia
What is Julia? It’s “a high-level, high-performance dynamic programming language for technical computing.” It offers many of the mathematical and statistical libraries found in any high-performance environment. It’s also very extensible: There’s a built-in package manager for adding new external libraries and packages.
Julia is built for speed. Applications using it rather than Python or R have been found to be ridiculously fast. Benchmark comparisons are published on the Julia Language website.
How do programs written in Julia run so fast? Because of its LLVM-based just-in-time (JIT) compiler, which is designed for a high performance environment. Julia is also designed for cloud computing and parallelism as it provides a number of key building blocks for distributed computation. That makes it flexible enough to support a number of styles of parallelism, and allows users to add more.
A number of MIT video tutorials for learning Julia are located here.
Will Julia replace Python or R? Not yet, since some libraries useful in performing Big Data analysis are just not available. However, with greater adoption, it could be the case within three years. After all, technological advances move very rapidly, especially when it comes to Big Data.
Would I recommend Julia for Big Data? I think that, like Python and R, it should be a part of every data scientist’s and data analyst’s tool kit.