In my recent article posted on May 16, I compared the functionality of R, Octave and Python at a very high level. The article received many insightful comments, and this follow-up shares what the commenters had to say while clarifying or expanding on some of the points raised.
I will focus on two hot discussion points here: whether Python should be listed as a powerful analytical tool alongside R, and whether R functions well with big data.
Is Python a legitimate data analysis tool?
“It is a programming language, … roughly akin to Perl, Java, Ruby, and other scripting and rapid application development languages. …”
Thanks to comments from Dingles, HuguesT and tb()ne, I realized that I need to differentiate between “Python by itself” and Python with packages:
“You can do your data analysis using scipy, visualization with matplotlib and wrap a gui around it using pyqt or pygtk. … also using numpy vectors in scipy package increases performance by an order of magnitude (my qualitative opinion).” – Dingles
“…the packages numpy, scipy and scikit as well as matplotlib and MayaViz turn python into something much more than a scripting language. … Many of the recent scientific libraries come with python binding (check out ITK, VTK, etc), not perl or lua or even Java.” – HuguesT
My passion for Python started with its natural language processing capability when paired with the Natural Language Toolkit (NLTK). Considering the growing need for text mining to extract content themes and reader sentiments (to name just a few applications), I believe Python plus packages will serve as a mainstream analytical tool beyond the academic arena. There is an insightful blog post on Natural Language Processing with Hadoop and Python:
“… NLTK is a great tool in that it tries to espouse the same “batteries included” philosophy that makes Python a useful programming language. NLTK ships with real-world data in the form of more than 50 raw as well as annotated corpora. In addition, it also includes useful language processing tools like tokenizers, part-of-speech taggers, parsers as well as interfaces to machine learning libraries. …”
My own experience working with NLTK has shown me just how powerful and flexible it can be for analytics professionals. Here is an example:
Suppose we need to extract key themes from a document. First, we will import the NLTK and Regular Expression toolkits:
import nltk # imports the NLTK into Python
import re # imports Regular Expressions
Then import the document we’d like to analyze and list of stopwords:
from yourcorpus import * # imports the document corpus for analysis
from nltk.corpus import stopwords # imports the NLTK stopword vocabulary of common words like 'the', 'and', etc.
stopwords = nltk.corpus.stopwords.words('english') # loads the English version of the stopword list
Start the analysis:
track = [word for word in text if word not in stopwords] # keeps only the non-stopwords in a tracking bucket
remove_punctuation = re.compile('.*[A-Za-z0-9].*') # a regular expression matching tokens that contain at least one alphanumeric character
filtered = [word for word in track if remove_punctuation.match(word)] # filters out most punctuation tokens
freq_distribution = nltk.FreqDist(filtered) # builds the frequency distribution of the filtered words
freq_count = freq_distribution.most_common(50) # the 50 most frequently occurring words with counts
print(freq_count) # prints them
That way, we can easily extract the most frequently used keywords from the document and identify its themes.
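For readers who want to try the same pipeline without installing NLTK and its corpora, here is a minimal self-contained sketch of the idea; the sample text and the tiny hand-rolled stopword list are my own stand-ins (NLTK's English stopword list is far larger):

```python
import re
from collections import Counter

# Stand-in document and a tiny illustrative stopword list
text = ("Python is a great language for text mining. "
        "Text mining with Python and NLTK makes theme extraction easy.")
stopwords = {"is", "a", "for", "with", "and"}

tokens = re.findall(r"[A-Za-z0-9]+", text.lower())    # keep alphanumeric tokens only
filtered = [w for w in tokens if w not in stopwords]  # drop the stopwords
freq_count = Counter(filtered).most_common(3)         # top 3 words with counts
print(freq_count)  # [('python', 2), ('text', 2), ('mining', 2)]
```

The `Counter` here plays the role of NLTK's `FreqDist`; for real corpora you would, of course, use the full NLTK stopword list and tokenizers.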
Is R good with big data?
When I marked R as “not good with Big Data,” I was thinking about terabytes of data. But I am apparently behind the curve in this regard. A company called Revolution Analytics, which specializes in parallel implementations of R, has come up with the RevoScaleR package for big-data import, manipulation and statistical algorithms. It claims its XDF file format makes big-data processing much faster.
The catch is, it’s not free, and I haven’t used it personally. Its actual capability of handling things like large-matrix decomposition needs to be validated (if you have experience with this package, please offer your point of view). Here are some other options:
- A general statistical approach is to randomly sample X% of the data, reducing the volume to something R can handle in memory (multiple gigabytes).
- Combine R with other packages. ceoyoyo, kludge and mpetch recommend a combination of R and Python via Rpy2.
- Design your analysis and optimize the data structure first. For example, if the objective of the analysis warrants it, we can cut a file holding 12 months’ worth of data into four quarterly data files. This approach was well illustrated by an anonymous Slashdot reader:
“… ways around that through smart planning, variable use, and multiple data files for different variables so not all are in memory at once (of course databases implements all three at once internally).”
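The random-sampling option above can be sketched in a few lines; this is my own illustration (the 10% rate and the stand-in rows are arbitrary), showing how one might thin a large line-oriented file before handing it to R:

```python
import random

def sample_lines(lines, fraction, seed=0):
    """Keep roughly `fraction` of the lines, chosen independently at random."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in for a huge file: just 10,000 numbered rows here
rows = ["row %d" % i for i in range(10000)]
sampled = sample_lines(rows, 0.10)
print(len(sampled))  # roughly 1,000 rows survive
```

In practice you would stream the real file line by line instead of holding it in a list, but the independent per-line coin flip is the same.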
Lastly, there are many other packages out there, as dondelelcaro pointed out:
“There are also packages like ff and others which handle absolutely gigantic files by offloading parts of them to storage and only allocating memory for them (and storage) when required.”
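The idea behind ff — keeping only a slice of the data in memory at a time — can be mimicked in plain Python with a chunked pass over the data. This is my own sketch of the general out-of-core pattern, not ff's actual mechanism; the chunk size and the running-mean computation are purely illustrative:

```python
def chunks(seq, size):
    """Yield successive fixed-size slices of seq, so only one chunk is live at a time."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def running_mean(seq, chunk_size=1000):
    """Compute a mean by accumulating per-chunk sums, never touching the whole set at once."""
    total, count = 0.0, 0
    for chunk in chunks(seq, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count

data = list(range(100000))  # stand-in for a dataset too big to process in one piece
print(running_mean(data))   # 49999.5
```

With a real on-disk dataset the chunks would be read from storage on demand, which is exactly the allocation strategy the comment describes.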
On raw performance, another commenter shared this anecdote:
“… one of my students wanted to do spectral analysis on large data sets of power collected from wind rotors. He tried Matlab and the processing lasted for tens of minutes; he switched to Python+Numpy+Scipy (AFAIR) and the thing ran hundreds of times faster.”
In the rapidly evolving era of “big data,” there is no one-size-fits-all solution. However, there is an emerging selection of packages that can offer significant advantages in developing and deploying “big data” solutions, depending on your specific needs. If you have something that works well for you, please share your experience below.
Image: Antonov Roman/Shutterstock.com