Friday, April 11, 2014

PCA and ICA 'Learn' Representations of Time-Series

PCA is usually demonstrated in a low-dimensional context. Time-series, however, are high-dimensional (each time step is a dimension), and ICA might be a more natural technique for reducing their dimensionality.

For my current project, I'm not interested in dimensionality reduction per se; rather, I'm interested in how well the algorithm, given a reduced representation of some base input time-series, can reproduce a new input. If the new input cannot be recreated well, then it is a candidate for being considered an anomaly.

I've set up an experiment where I generated a set of sine waves, each with an even number of cycles over the domain, as the (base/training) input to the algorithms, plus a constant function. Then I try to reconstruct a slightly different even-cycle sine wave, an odd-cycle sine wave, and a constant.

The result is that the even sine wave and constant are somewhat reconstructed while the odd sine wave is not. You can see this in the following graphs where the blue line is a 'target' signal and the green line is the reconstructed signal. I get similar results using PCA.
[Plots: '4 waves' (even, reconstructed), '5 waves' (odd, reconstruction FAILS), 'constant = 3' (reconstructed).]

There is plenty of mathematical rigor and there are algorithmic parameters that I haven't talked about, but this is a post that requires minimal time and technical knowledge to get through. You can find the details in the IPython notebook.
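
If you want to try this yourself, here is a minimal sketch of the reconstruction-error idea, assuming scikit-learn's FastICA; the frequencies, signal length, and component count are illustrative, not necessarily what the notebook uses:

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 2 * np.pi, 200)
# Training set: sine waves with an even number of cycles, plus a constant function.
train = np.array([np.sin(k * t) for k in (2, 4, 6, 8)] + [np.full_like(t, 3.0)])

ica = FastICA(n_components=4, random_state=0)
ica.fit(train)

def reconstruct(signal):
    # Project the new signal onto the learned components, then map it back.
    return ica.inverse_transform(ica.transform(signal.reshape(1, -1))).ravel()

for label, sig in [("even, 4 cycles", np.sin(4 * t + 0.1)),
                   ("odd, 5 cycles", np.sin(5 * t)),
                   ("constant = 3", np.full_like(t, 3.0))]:
    err = np.linalg.norm(sig - reconstruct(sig)) / np.linalg.norm(sig)
    print(f"{label}: relative reconstruction error = {err:.2f}")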

Monday, February 24, 2014

Strata '14 Santa Clara: Lots of data, not so much analytics

While the quality of the conference was excellent overall, there were too many data pipeline platforms and database types showcased. As a quantitatively-oriented person, I really don't care that much for the latest in-memory database or being able to pipeline my data graphically. In my work, I do everything I can to abstract out data handling particulars. I just care that my algorithms work. I realize I do need to know something about the underlying infrastructure if I want to be able to maximize performance...but why should I??

Now, some vendors do have analytic capability in their platforms, but why should I rely on their implementation? I should be able to easily apply an algorithm of my choosing, built on some kind of abstraction, and the vendor should support that abstraction. Furthermore, I should be able to examine the analytic code (it's great that 0xdata recognizes this; their platform is open source).

This is the 'big data' picture I'm seeing, and I'm not liking the silos each vendor tries to build or the (current?) focus on data platforms. The VC panel emphasized that the value from 'big data' is going to come from applications (which of course rely on analytics). Maybe the reported data scientist shortage has something to do with this?

Please inform me if I'm missing something.

This is in contrast to what I've seen at PyData, where the culture of attendees is perhaps more quantitative and technical, with a firm grasp of mathematics as well as computing issues. At that conference, infrastructure use was dictated by analytics in a top-down fashion.

Saturday, January 4, 2014

Extremely Efficient Visualization Makes Use of Color and Symbols

Not too long ago, I posted a 'heat map' visualization and I said it was an efficient visualization because the plot space was filled with color. But it represented only one quantity.

Well now I took that idea further and made a visualization that represents three (or four depending on how you count) quantities. The following picture represents flow around a cylinder with the following quantities:
- flow potential: background color
- flow direction: direction of arrows
- flow magnitude: length of arrows
- pressure: color of arrows
I even threw in a max and min pressure point in there too.

No need for multiple plots or a third dimension!
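
If you want to try something similar, here is a rough sketch of the encoding (not the original figure) using matplotlib's contourf and quiver; the flow field is ideal potential flow past a unit cylinder, and all parameters are illustrative:

import numpy as np
import matplotlib.pyplot as plt

U, R = 1.0, 1.0                                   # free-stream speed, cylinder radius
x, y = np.meshgrid(np.linspace(-3, 3, 25), np.linspace(-3, 3, 25))
r, theta = np.hypot(x, y), np.arctan2(y, x)
mask = r > R                                      # drop points inside the cylinder

# Potential flow past a cylinder: velocity potential and velocity components.
phi = U * (r + R**2 / r) * np.cos(theta)
vr = U * (1 - R**2 / r**2) * np.cos(theta)
vt = -U * (1 + R**2 / r**2) * np.sin(theta)
vx = vr * np.cos(theta) - vt * np.sin(theta)
vy = vr * np.sin(theta) + vt * np.cos(theta)
pressure = 0.5 * (U**2 - (vx**2 + vy**2))         # Bernoulli, up to a constant

fig, ax = plt.subplots()
ax.contourf(x, y, np.where(mask, phi, np.nan), 50)             # background color: flow potential
q = ax.quiver(x[mask], y[mask], vx[mask], vy[mask],            # arrow direction/length: velocity
              pressure[mask], cmap="coolwarm")                  # arrow color: pressure
fig.colorbar(q, label="pressure")
ax.set_aspect("equal")
plt.show()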

Tuesday, December 24, 2013

PyData NYC 2013: A Breeding Ground for Highly Innovative Data Science Tools

I recently attended the PyData conference in New York City. I'd like to give my general impression (as opposed to the details you can get by visiting the links on the website). The impression I want to express is that of the culture at the conference: hacky, in a good way. That is, tools written by people who are not tool-building specialists, to tackle problems in the application domains where they are the specialists. This is partly due to the flexibility of Python and partly due to the highly technical profile of the conference attendees.

This is just an impression, so it's very subjective, and I'm biased by my enthusiasm for Python and my scientific computing background, a field Python has a firm grip on. But try comparing PyData to Strata, which takes a higher-level view of the data science world with R as the main analysis tool. The broader data science world is a collision of computer science and statistics, both in theory and in the tools used. The narrower PyData world, by contrast, has its roots in the more harmonious scientific computing world where math meets computing, albeit math that is generally less statistical.

Until data science curricula are developed, I believe a scientific computing background is a better foundation for data science than statistics alone or computer science alone. Computational, data, and domain expertise are all present in skilled scientific programmers, some of whom attended the conference. The caliber of talent there was astounding. Attendees could, for example, talk about specific CPU and GPU architectures, database systems, compiled languages, distributed systems, and GUIs, as well as Monte Carlo techniques, machine learning, and optimization. Broad knowledge of all of these areas is important for implementing a speedy scientific workflow, which happens to be necessary for data science as well.

I'm also going to claim there are more tech-savvy scientists than there are tech-savvy statisticians. This isn't to diminish the importance of statisticians in data science, but the computational pride of statisticians is the comprehensive yet slow and inelegant R language. Meanwhile, scientific programmers know about, and have built ecosystems around, C/C++, Fortran, and Python, along with all the CS subjects associated with these languages, including parallel programming, compilation, and data structures. This is just the natural result of statisticians traditionally working with "small data" while scientific programmers often work with "big computation".

The mastery of these issues within the same community is what allows for innovative products such as scidb, bokeh, numba, and ipython, as well as all the small-scale hackery presented at the conference.

Perhaps I should hold off on making such claims until I go to the next Strata conference but this is a blog journal and I'm very late in writing about PyData!

Thursday, September 12, 2013

Simple plots can reveal complex patterns


Visualization is a big topic in its own right, which means you can get quite sophisticated in making plots. However, you can also reveal complex information with simple plots.

I took a shot at visualizing power generation data from the Kaggle competition. My goal was just to make a "heat map" of the power generation data: for every <week, hour of week> pair, plot the power generated. I had to rearrange the data a bit, but the result was not only pretty but, more importantly, very revealing and efficient. The plot summarizes ~25k data points, revealing daily and monthly cycles over several years.

Enjoy. Courtesy of pandas for data munging and matplotlib for the plot.
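
If you want to reproduce the idea, here is a hypothetical sketch; the file name and column names ('timestamp', 'load') are placeholders, not the actual competition data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("power.csv", parse_dates=["timestamp"])

# Rearrange: index every observation by (hour of week, running week number).
df["week"] = (df["timestamp"] - df["timestamp"].min()).dt.days // 7
df["hour_of_week"] = df["timestamp"].dt.dayofweek * 24 + df["timestamp"].dt.hour
grid = df.pivot_table(index="hour_of_week", columns="week", values="load")

# One colored cell per <week, hour of week>: daily and seasonal cycles show up as stripes.
plt.imshow(grid, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("week"); plt.ylabel("hour of week"); plt.colorbar(label="power generated")
plt.show()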

Friday, May 3, 2013

pandas on Python kept me from using R

I wanted to analyze this time-series data from GEFCOM visually. The data have missing values and were presented in a way that I didn't want for my analysis.

This looks like a job for R (not numpy/scipy/matplotlib). I had spent a few days playing with R for its supposedly excellent visualization and data manipulation capabilities. I consider myself a Python advocate, but Python's main graphics package, matplotlib, is nowhere near what comes with R. Still, while R's graphics capabilities are great, they weren't enough for my needs; I needed to work with several visualization packages, each with its own way of formatting data input. The all-too-common task of data munging is what I had to deal with. So, was R's data manipulation capability enough to keep me on R? After reviewing my analysis needs and what Python had to offer, the answer was: no.

The crucial component in this decision is the existence of the pandas package. In one package, I can read a CSV with missing values, reformat it as a time series, slice it and dice it any way I want, and plot it for a quick exploratory analysis. Then, in Python syntax, I can reformat the data to feed into another program. When I say 'Python syntax', I don't mean something like textwritingfunc("some text", anopenfile). I mean a readable, pseudocode-like syntax that makes use of dict and list comprehensions, like


# Transform every (index, value) pair of every series. A dict comprehension keeps
# the series names; this could instead be a generator to delay processing until
# file writing and minimize memory usage.
out = {name: [mathematicaltransformation(index, value)
              for index, value in atimeseries]
       for name, atimeseries in alotoftimeseries.items()}

# Program 1 wants the data in a "stacked" (long) format, with header metadata.
with open("out1.txt", "w") as somefile1:
    somefile1.write(header_metadata + "\n")
    for name, transformed in out.items():
        for row in transformed:
            somefile1.write(f"{name}\t{row}\n")

# Program 2 wants the data in a "wide" format (one line per series);
# maybe it doesn't need a header.
with open("out2.txt", "w") as somefile2:
    for name, transformed in out.items():
        somefile2.write("\t".join(str(x) for x in transformed) + "\n")


This kind of simple yet powerful syntax takes care of the idiosyncratic data input needs of different visualization software. And, if you're a smart coder, you can generalize your code with object-oriented programming for output to multiple programs with minimal fuss. I'm not sure R can do that elegantly.
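
For the pandas half of that claim (reading a CSV with missing values, reindexing as a time series, slicing, and plotting), a minimal sketch might look like this; the file and column names are placeholders, not the GEFCOM files:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("load_history.csv", parse_dates=["timestamp"],
                 index_col="timestamp", na_values=["NA", ""])
hourly = df["load"].asfreq("h")          # regular hourly index; gaps become NaN
hourly = hourly.interpolate(limit=3)     # fill only short gaps
winter = hourly["2007-12":"2008-02"]     # slice by date labels
winter.resample("D").mean().plot()       # daily means for a quick exploratory look
plt.show()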

Conclusions:
  • R is a comprehensive statistical tool for analysis and visualization. If all you need to do is in R, stay in R.
  • Python is a viable alternative to R for basic to intermediate statistical analysis and visualization. However, expect Python to acquire more statistical tools with time. Even if that's not the case, Python is already well-known as a "glue" language that allows you to pass around data from/to different software thereby scripting an analysis process.
  • Python is a better general programming language (it's really hard to call this an opinion).



Wednesday, March 20, 2013

C vs. Python in the Context of Reading a Simple Text Data File

Problem: Using C, read a Matrix Market file generically and versatilely.
Generically: using the same code (at run-time), interpret a line of data in various ways

I'm trying to do this in C. Coming from Python, C makes me feel CRIPPLED. Laundry list of offenses:

  • Primitive string operations...oh actually, strings are just character arrays.
    Example: char teststring[10]="asdf"; is a fine initialization, but the reassignment teststring="qwer"; fails to compile. It turns out you must call strcpy and make sure the destination has enough space.
  • No negative indexing
  • Can't get length of an array
  • Can't handle types as objects. You can't say: if this object is of this type, then do this. Types are 'hardcoded', and you can't return a type. (See the short Python contrast after this list.)
  • Case labels must be known at compile time, and the compiler won't accept the result of a function call as a label even when the function's arguments are known at compile time. As a mathematically-inclined programmer, all I care about is functionality; I don't like having to think about the preprocessor as a separate process.
  • Your program can compile and work but in the realm of "undefined behavior". Good luck finding a bug caused by "undefined behavior".
Which brings me to some C items that have overlapping functionality: enums, arrays, and X-macros. They can all be viewed as lists, but macros live in the preprocessor; you cannot iterate over enums, which are essentially just lists of integers; and if you want the size of an array, you can only get it at compile time with a macro hack. So if you need functionality on the same list both in the preprocessor and at run time, you have to violate DRY.
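
For contrast, a few lines of Python illustrating the same points; the converters dict and its field names are hypothetical, not part of my Matrix Market reader:

# Strings reassign freely, negative indexing and len() just work,
# and types are ordinary objects you can store in a dict and call at run time.
teststring = "asdf"
teststring = "qwer"                        # plain reassignment; no strcpy, no buffer sizing
print(teststring[-1], len(teststring))     # -> r 4

converters = {"integer": int, "real": float, "pattern": str}    # hypothetical field kinds
def convert(token, field_kind):
    return converters[field_kind](token)   # a type chosen at run time, used like a function

print(convert("42", "integer"), convert("3.14", "real"))        # -> 42 3.14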

While I do think lower-level languages are essential for some purposes, this exercise makes me appreciate Python more (yes, particularly Python and not other interpreted languages). In Python, I can read the lines in as a matrix of strings and then vectorize the type-conversion operations in a few lines of code.
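
As a sketch of what I mean (this is not the finished reader linked in the update below), assuming a three-column coordinate-format Matrix Market file:

import numpy as np

def read_mm_coordinate(path, value_type=float):
    # Read the lines in as a matrix of strings, then vectorize the type conversions,
    # choosing the value type at run time.
    with open(path) as f:
        lines = [line.split() for line in f
                 if line.strip() and not line.startswith("%")]
    nrows, ncols, nnz = (int(x) for x in lines[0])     # size line
    strings = np.array(lines[1:])                      # matrix of strings, one row per entry
    rows = strings[:, 0].astype(int)
    cols = strings[:, 1].astype(int)
    vals = strings[:, 2].astype(value_type)            # conversion chosen at run time
    return (nrows, ncols), rows, cols, vals

# The same code handles integer or real fields:
# shape, rows, cols, vals = read_mm_coordinate("matrix.mtx", value_type=int)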

So, I've resorted to code generation to create functions for all cases which solves half the problem. I won't bother trying to make things more generic. If it could be done, I expect that it's going to look messy.

4/17 Update: I've finished my reader. The "main()" file is here. I think I organized my procedure well, but it took a lot of code. It's ridiculous. I could accomplish the same thing with acceptable speed in about two dozen lines of Python. I know this because I've written code to read data generically in Python several times.