Tuesday, December 24, 2013

PyData NYC 2013: Breeding Ground for Highly Innovative Data Science Tools

I recently attended the PyData conference in New York City. I'd like to give my general impression (as opposed to showing you details you can get by visiting the links on the website). The impression I want to convey is that of the culture at the conference: hacky, in a good way. That is, tools written by non-specialists to tackle problems in the application domains in which they are specialists. This is partly due to the flexibility of Python and partly due to the highly technical profile of the conference attendees.

This is just an impression, so it's very subjective, and I'm biased by my enthusiasm for Python and by my scientific computing background, a field on which Python has a firm grip. But try comparing PyData to Strata, which takes a higher-level view of the data science world with R as the main analysis tool. The broader data science world is a collision of computer science and statistics, both in theory and in the tools used, while the narrower PyData world has its roots in the more harmonious scientific computing world, where math meets computing, albeit with math that is generally less statistical.

Until data science curricula are developed, I believe a scientific computing background is a better foundation for data science than statistics alone or computer science on its own. Computational, data, and domain expertise are all present in skilled scientific programmers, some of whom attended the conference. The caliber of talent at the conference was astounding. Attendees could, for example, talk about specific CPU and GPU architectures, database systems, compiled languages, distributed systems, and GUIs, as well as Monte Carlo techniques, machine learning, and optimization. Broad knowledge of all of these areas is important for implementing a speedy scientific workflow, which happens to be necessary for data science as well.

I'm also going to claim there are more tech-savvy scientists than there are tech-savvy statisticians. This isn't to diminish the importance of statisticians in data science, but the computational pride of statisticians is the comprehensive but slow and inelegant R language. Meanwhile, scientific programmers know about, and have built ecosystems around, C/C++, Fortran, and Python, along with the CS subjects associated with these languages, including parallel programming, compilation, and data structures. This is just the natural result of statisticians traditionally working with "small data" while scientific programmers often work with "big computation".

The mastery of these issues within the same community is what allows for innovative products such as scidb, bokeh, numba, and ipython, as well as all the small-scale hackery presented at the conference.

Perhaps I should hold off on making such claims until I go to the next Strata conference but this is a blog journal and I'm very late in writing about PyData!

Thursday, September 12, 2013

Simple plots can reveal complex patterns

Visualization is a big topic on its own, which implies that you can get quite sophisticated in making plots. However, you can reveal complex information with simple plots.

I took a shot at visualizing power generation data from the Kaggle competition. My goal was just to make a "heat map" of the power generation data: for every <week, hour of week>, plot the power generated. I had to rearrange the data a bit, but the result was not only pretty but, more importantly, very revealing and efficient. The plot summarizes ~25k data points by revealing cycles over days and months over several years.

Enjoy. Courtesy of pandas for data munging and matplotlib for the plot.
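
The rearrangement behind such a heat map can be sketched roughly as follows. This is not the code used for the plot: it swaps pandas for plain Python, and the function name week_hour_grid and the (timestamp, value) input layout are assumptions made for illustration. Each reading lands in a grid with one row per week and one column per hour of the week.

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24


def week_hour_grid(readings):
    """Arrange (timestamp, value) pairs into rows of weeks and
    168 hour-of-week columns. Missing hours stay None."""
    readings = sorted(readings, key=lambda reading: reading[0])
    if not readings:
        return []
    first = readings[0][0]
    # Start the grid at Monday 00:00 of the week holding the first reading.
    origin = datetime(first.year, first.month, first.day) - timedelta(days=first.weekday())
    n_weeks = (readings[-1][0] - origin).days // 7 + 1
    grid = [[None] * HOURS_PER_WEEK for _ in range(n_weeks)]
    for stamp, value in readings:
        week = (stamp - origin).days // 7
        hour_of_week = stamp.weekday() * 24 + stamp.hour
        grid[week][hour_of_week] = value
    return grid


# Hypothetical hourly data: two weeks starting on a Tuesday.
start = datetime(2013, 1, 1)
demo = [(start + timedelta(hours=h), float(h)) for h in range(336)]
demo_grid = week_hour_grid(demo)
```

A grid like this is the kind of 2-D array that an image-style plotting call, for example matplotlib's imshow, takes as input once the missing hours are filled or masked.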

Friday, May 3, 2013

pandas on Python kept me from using R

I wanted to analyze this time-series data from GEFCOM visually. The data have missing values and were presented in a way that I didn't want for my analysis.

This looks like a job for R (not numpy/scipy/matplotlib). I had spent a few days playing with R for its supposedly excellent visualization and data manipulation capabilities. I consider myself a Python advocate, but Python's main graphics package, matplotlib, is nowhere near what comes with R. Still, while R's graphics capabilities are great, they weren't enough for my needs; I needed to work with several visualization packages, each with its own input data format. So the all-too-common task of data munging is what I had to deal with. Was the data manipulation capability of R enough to keep me on R? After reviewing my analysis needs and what Python had to offer, the answer was: no.

The crucial component in this decision is the existence of the pandas package. In one package, I can read a csv with missing values, reformat it as time series, slice it & dice it any way I want, and plot it for a quick exploratory analysis. Then, in Python syntax, I can reformat the data to feed into another program. When I say 'Python syntax', I don't mean something like textwritingfunc("some text", anopenfile). I mean a (readable) pseudocode-like syntax that makes use of set and list comprehensions like

# Assumes alotoftimeseries maps a series name to its (index, value) pairs.
# A generator expression in place of the inner list would delay processing
# until file writing and minimize memory usage.
out = {name: [mathematicaltransformation(index, value)
              for index, value in atimeseries]
       for name, atimeseries in alotoftimeseries.items()}

# program 1 wants data in a "stacked" format
somefile1.write("series,value\n")  # header metadata (illustrative)
for name, values in out.items():
    somefile1.writelines("%s,%s\n" % (name, v) for v in values)

# program 2 wants data in a "wide" format; maybe it doesn't need a header
for row in zip(*out.values()):
    somefile2.write(",".join(str(v) for v in row) + "\n")

This kind of simple yet powerful syntax takes care of the idiosyncratic data input needs of different visualization software. And, if you're a smart coder, you can generalize your code with object-oriented programming for output to multiple programs with minimal fuss. I'm not sure R can do that elegantly.
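
As a rough illustration of the object-oriented generalization mentioned above, here is a minimal sketch. The class names SeriesWriter, StackedWriter, and WideWriter, the comma-separated layout, and the sample data are all assumptions made for this example, not the formats of any particular visualization program.

```python
import io


class SeriesWriter:
    """Shared writing logic; subclasses only decide the row layout."""

    def __init__(self, header=None):
        self.header = header

    def rows(self, series_by_name):
        raise NotImplementedError

    def write(self, series_by_name, handle):
        if self.header is not None:
            handle.write(self.header + "\n")
        for row in self.rows(series_by_name):
            handle.write(",".join(str(item) for item in row) + "\n")


class StackedWriter(SeriesWriter):
    """One (name, value) pair per line."""

    def rows(self, series_by_name):
        for name, values in series_by_name.items():
            for value in values:
                yield (name, value)


class WideWriter(SeriesWriter):
    """One column per series, one line per time step."""

    def rows(self, series_by_name):
        return zip(*series_by_name.values())


# Hypothetical data and in-memory "files" for demonstration.
series = {"a": [1, 2, 3], "b": [10, 20, 30]}
stacked, wide = io.StringIO(), io.StringIO()
StackedWriter(header="series,value").write(series, stacked)
WideWriter().write(series, wide)
```

Supporting another program then means adding one small subclass with its own rows method, while the file handling stays in one place.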

  • R is a comprehensive statistical tool for analysis and visualization. If all you need to do is in R, stay in R.
  • Python is a viable alternative to R for basic to intermediate statistical analysis and visualization. However, expect Python to acquire more statistical tools with time. Even if that's not the case, Python is already well-known as a "glue" language that allows you to pass around data from/to different software thereby scripting an analysis process.
  • Python is a better general programming language (it's really hard to say that this is an opinion).

Wednesday, March 20, 2013

C vs. Python in the Context of Reading a Simple Text Data File

Problem: Using C, read a Matrix Market file generically and flexibly.
Generically: using the same code (at run-time), interpret a line of data in various ways

I'm trying to do this in C. Coming from Python, C makes me feel CRIPPLED. Laundry list of offenses:

  • Primitive string operations...oh actually they are just character arrays.
    example: char teststring[10]="asdf"; (OK: initialization) teststring="qwer"; (FAIL: the reassignment I expect does not compile. It turns out you must call a string copy function such as strcpy and make sure the destination has enough space.)
  • No negative indexing
  • Can't get length of an array
  • Can't handle types as objects. You can't say: if this object is of this type, then do this. Types are 'hardcoded'. You can't return a type.
  • Case labels must be known at compile-time. The C compiler does not accept a function call as a case label, even when the input is known at compile time and the result could therefore be computed at compile time. As a mathematically-inclined programmer, all I care about is functionality. I don't like to think about the preprocessor as a separate process.
  • Your program can compile and work but in the realm of "undefined behavior". Good luck finding a bug caused by "undefined behavior".
Which brings me to some C items that have overlapping functionality: enums, arrays, and X-Macros. They can all be viewed as lists, but macros live in the preprocessor; you cannot iterate over enums, and they are essentially only a list of integers; and if you want to know the size of an array, you can only get it at compile-time with a macro hack. So if you need functionality on the same list both in the preprocessor and at run-time, you have to violate DRY.
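
For contrast, here is a small sketch of how a few items from the list look in Python. The helper names convert and narrowest_type are made up for this illustration.

```python
# Strings are values: rebinding a name needs no copy call or size check.
teststring = "asdf"
teststring = "qwer"

values = [3, 1, 4, 1, 5]
last = values[-1]    # negative indexing
count = len(values)  # the length is available at run time


def convert(token, kind):
    """Types are ordinary objects, so one can be passed in as an argument."""
    return kind(token)


def narrowest_type(token):
    """Return the first of int, float, complex that can parse the token,
    falling back to str. The result is itself a type."""
    for kind in (int, float, complex):
        try:
            kind(token)
        except ValueError:
            continue
        return kind
    return str
```

Because a type is just another run-time value, the same tuple of types serves for iteration and for dispatch, with no second copy of the list.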

While I do think lower-level functions are essential for some purposes, this exercise makes me appreciate Python more (yes, particularly Python and not other interpreted languages). In Python, I can read the lines in as a matrix of strings and then vectorize the type conversion operations in a few lines of code.

So, I've resorted to code generation to create functions for all cases which solves half the problem. I won't bother trying to make things more generic. If it could be done, I expect that it's going to look messy.

4/17 Update: I've finished my reader. The "main()" file is here. I think I organized my procedure well, but it took a lot of code. It's ridiculous. I could accomplish the same with acceptable speed in about two dozen lines of Python code. I know this because I've written code to read data generically in Python several times.
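
To give a sense of scale, here is a minimal sketch of such a reader in Python. It is not the reader mentioned above: it assumes the coordinate layout only, the function name read_matrix_market and the dictionary-of-entries return value are choices made for this example, and symmetric storage is reported but not expanded.

```python
import io

# How to turn the value tokens of one entry line into a Python value,
# keyed by the "field" word of the Matrix Market banner line.
CONVERTERS = {
    "real": lambda tokens: float(tokens[0]),
    "integer": lambda tokens: int(tokens[0]),
    "complex": lambda tokens: complex(float(tokens[0]), float(tokens[1])),
    "pattern": lambda tokens: True,  # pattern files list positions only
}


def read_matrix_market(handle):
    """Read a coordinate-format file into (shape, symmetry, entries)."""
    banner = handle.readline().split()
    if len(banner) != 5 or banner[0] != "%%MatrixMarket":
        raise ValueError("missing Matrix Market banner")
    obj, layout, field, symmetry = (word.lower() for word in banner[1:])
    if obj != "matrix" or layout != "coordinate":
        raise ValueError("this sketch only handles coordinate matrices")
    convert = CONVERTERS[field]  # the same loop serves every field type
    # Skip comment lines (starting with %) and blank lines.
    lines = (line.split() for line in handle
             if line.strip() and not line.startswith("%"))
    n_rows, n_cols, n_entries = (int(token) for token in next(lines))
    entries = {}
    for tokens in lines:
        # File indices are 1-based; store them 0-based.
        row, col = int(tokens[0]) - 1, int(tokens[1]) - 1
        entries[row, col] = convert(tokens[2:])
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), symmetry, entries


# Hypothetical file contents for demonstration.
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "% a comment line\n"
    "3 3 2\n"
    "1 1 1.5\n"
    "3 2 -2e0\n"
)
shape, symmetry, entries = read_matrix_market(sample)
```

The generic part is the CONVERTERS table: the field named in the banner selects a function at run time, which is the step that needed code generation in C.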

Thursday, February 14, 2013

Mathematica vs. Maple vs. SAGE/scipy UX

This post comes from significant experience with Maple and numpy/scipy/python.

I've been using Maple and numpy for a few years. I stopped upgrading Maple at version 13 because the improvements in newer versions weren't compelling enough for my use to justify paying for them. I chose Maple as a happy medium between the numerical and matrix-focused MATLAB and the symbolically-focused Mathematica. They can all do symbolics and matrix operations, but they differ in how elegantly these are performed. But as I gave Maple more complex mathematical tasks, such as the calculus involved with Fourier transforms and symbolic tensor & matrix operations, it became clear that it wasn't up to the task. I knew Mathematica was superior for those tasks, but I chugged along with Maple until I got to George Mason University, where they had a site license for Mathematica.

Right off the bat I could see a clear superiority in the user experience (mind that I didn't use Maple beyond ver 13). In Mathematica, vectors and matrices are just lists. In Maple there are objects called vectors, lists, tables, and matrices, and you sometimes had to import them from a module. Clearly, in just this area, Mathematica is superior. Mathematica includes a lot of functionality out of the box. In contrast, in Maple I had to import modules for what I considered basic functionality, which makes the experience less consistent.

In comparison to other math environments:

  • SAGE: SAGE uses a web interface which can't come close to a desktop application's functionality. The UI does not come close to Mathematica. Also, the open-source symbolic packages it uses are primitive compared to Maple let alone Mathematica. Oh and good luck installing it on Windows.
  • numpy/scipy: numpy and scipy provide mathematical functionality, and fairly basic functionality at that, rather than a math environment. Their indexing coolness does exist in Mathematica, though.
  • Mathematica has pattern matching; the others don't. Enough said.

For math-centric programming, Mathematica sets a high bar in terms of consistency, documentation, functionality, and general experience. My general gripe about open-source is the lack of an umbrella vision that moves a project in a certain direction. SAGE is the best (the only) hope, but it has a long way to go. Even in the best scenario, getting different python packages to work together seamlessly is doubtful (e.g., look at my sym2num function).

Conclusion: If all you need is basic functionality, Octave, SAGE, scipy, sympy, and maxima could suit you. For more advanced tasks, be prepared to pay up.

PS: This post will change as I learn more.
Support for my opinion.

Tuesday, January 29, 2013

Personal Note: Adjustment in Career Path Towards Data Science

I want this blog to be about the work that I do and not about me personally. But there will be a change in the content of the blog which can only be explained by the change in my professional situation. So far, computational topics have been about FDTD simulation, molecular dynamics simulation, python, and scientific computing in general. I've been a graduate student at Vanderbilt University for four and a half years, where I've learned an incredible amount in the areas of nanoscale science & engineering and computing, both of which are areas I did not have the background for. This was all under a mechanical engineering degree, but I'm leaving with 'just' a master's degree in "mechanical engineering" as I did not pass my PhD qualifying exam in...(classical) mechanical engineering. I was quite disappointed for a while, since I enjoyed my research very much and was looking forward to making contributions to energy conversion device research using computation.

But, looking back only a few months later, I'm glad it happened. While conducting my computational research, I felt that I had a programming talent lurking inside me and that it would be wise to enhance my computational skills, which would qualify me very well for certain job types. So, I've started a (second) master's in computational science at George Mason University, customizing my curriculum for 'big data'. I've identified 'big data' as a field that I want to get into because I feel that I have the right combination of communication, business, and technical skills to be successful in it. This potentially means that I might turn my back on science and engineering as an application area for my skills, which is a bit discomforting to me as I've always seen myself as a scientist and engineer. At the same time, advanced scientists and engineers have the potential to be great data scientists.

What does this mean for the blog? I still expect to post about (pure) scientific computing but not about applications like finite-difference time-domain simulations and molecular dynamics simulations. Python will stay with me as a base of my programming skill but it will be complemented by other computing languages that I will learn like SAS, R, C, and Fortran (working on working in HADOOP!). This is what I'm able to predict for now.

Ending with a positive personal note, for most of the past decade I've been striving to get into a job type that I want. I'm very optimistic that I'm almost there.