Friday, May 3, 2013

pandas on Python kept me from using R

I wanted to analyze this time-series data from GEFCOM visually. The data have missing values and were presented in a way that I didn't want for my analysis.

This looks like a job for R (not numpy/scipy/matplotlib). I had spent a few days playing with R for its supposedly excellent visualization and data manipulation capabilities. I consider myself a Python advocate, but Python's main graphics package, matplotlib, is nowhere near what comes with R. Still, while R's graphics capabilities are great, they weren't enough for my needs; I needed to work with several visualization packages, each with its own input format. In other words, I had to deal with the all-too-common task of data munging. So, was the data manipulation capability of R enough to keep me on R? After reviewing my analysis needs and what Python had to offer, the answer was: no.

The crucial component in this decision is the existence of the pandas package. In one package, I can read a csv with missing values, reformat it as time series, slice it & dice it any way I want, and plot it for a quick exploratory analysis. Then, in Python syntax, I can reformat the data to feed into another program. When I say 'Python syntax', I don't mean something like textwritingfunc("some text", anopenfile). I mean a (readable) pseudocode-like syntax that makes use of set and list comprehensions like

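# alotoftimeseries maps a name to a time series (e.g. a pandas Series);
# mathematicaltransformation is any function of (index, value). Both are placeholders.
out = {name: [mathematicaltransformation(index, value)
              for index, value in atimeseries.items()]
       for name, atimeseries in alotoftimeseries.items()}
# the inner list could be a generator expression to delay processing
# until file writing, to minimize memory usage.


# program 1 wants data in a "stacked" format (comma-separated here, as an example)
somefile1.write(header)  # header metadata, whatever program 1 expects
for name, transformedtimeseries in out.items():
    for value in transformedtimeseries:
        somefile1.write("%s,%s\n" % (name, value))

# program 2 wants data in a "wide" format; maybe prog2 doesn't need a header
for transformedline in zip(*out.values()):
    somefile2.write(",".join(map(str, transformedline)) + "\n")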
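
The pandas half of the workflow (read a csv with missing values, reformat it as time series, slice it) can be sketched in a few lines as well. This is only an illustration: the inline CSV, the column names (date, zone, load) and the choice of linear interpolation are made-up assumptions, not the actual GEFCOM layout or the analysis I ran.

```python
import io
import pandas as pd

# Hypothetical CSV with one missing value; the real GEFCOM files look different.
raw = io.StringIO(
    "date,zone,load\n"
    "2013-01-01,1,10.0\n"
    "2013-01-01,2,20.0\n"
    "2013-01-02,1,\n"
    "2013-01-02,2,22.0\n"
    "2013-01-03,1,14.0\n"
    "2013-01-03,2,24.0\n"
)

# Empty fields are read as NaN; parse_dates turns the column into timestamps.
stacked = pd.read_csv(raw, parse_dates=["date"])

# Reshape from "stacked" (one row per observation) to "wide" (one column per zone).
wide = stacked.pivot(index="date", columns="zone", values="load")

# Slice by date label, and fill the gap by linear interpolation (one choice among many).
january_2nd_on = wide.loc["2013-01-02":]
filled = wide.interpolate()

# wide.plot() would draw a quick exploratory chart if matplotlib is installed.
```

The same DataFrame can then be handed to whatever writes the stacked or wide output files.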

This kind of simple yet powerful syntax takes care of the idiosyncratic data input needs of different visualization software. And, if you're a smart coder, you can generalize your code with object-oriented programming for output to multiple programs with minimal fuss. I'm not sure R can do that elegantly.
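
The object-oriented generalization mentioned above might look something like the following sketch. The class names (SeriesWriter, StackedWriter, WideWriter), the sample data and the comma-separated layouts are all hypothetical; a real target program would dictate its own format.

```python
import io


class SeriesWriter:
    """Base class: subclasses decide how named series are laid out as lines."""

    def lines(self, series_by_name):
        raise NotImplementedError

    def write(self, series_by_name, fileobj):
        # lines() is a generator, so each line is produced only when it is written.
        for line in self.lines(series_by_name):
            fileobj.write(line + "\n")


class StackedWriter(SeriesWriter):
    """One line per (name, value) pair, preceded by a header line."""

    def lines(self, series_by_name):
        yield "name,value"
        for name, values in series_by_name.items():
            for value in values:
                yield f"{name},{value}"


class WideWriter(SeriesWriter):
    """One line per time step, one column per series, no header."""

    def lines(self, series_by_name):
        for row in zip(*series_by_name.values()):
            yield ",".join(str(value) for value in row)


# Made-up data: two short series of equal length.
data = {"a": [1, 2, 3], "b": [4, 5, 6]}

stacked_out, wide_out = io.StringIO(), io.StringIO()
StackedWriter().write(data, stacked_out)
WideWriter().write(data, wide_out)
```

Supporting a third program then means adding one more subclass with its own lines() method; the code that calls write() stays the same.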

  • R is a comprehensive statistical tool for analysis and visualization. If all you need to do is in R, stay in R.
  • Python is a viable alternative to R for basic to intermediate statistical analysis and visualization, and I expect it to acquire more statistical tools with time. Even if that's not the case, Python is already well-known as a "glue" language that lets you pass data between different software, thereby scripting an analysis process.
  • Python is a better general programming language (it's really hard to call this just an opinion).