Tuesday, April 9, 2019

I'm blogging on my personal site


I'm proud that I've kept this blog going since January 2012. That's 7 years!

Throughout this time I've made a transition from 'mechanical engineering' (in quotes because I was formally trained in mechanical engineering but didn't practice it) to data science.

Also throughout this time, I resisted making my own site. I saw that it was 'cool' to use a static site generator from ~2015 onward. Jekyll was looking pretty cool!

I resisted because I wanted to just be able to write my content and publish it and I was okay with forfeiting customization. But you can only resist for so long!

In the end I settled on making my own site using Lektor. Hopefully I can figure out how to include mathematical symbols and Jupyter notebooks...later. :D

Sunday, December 2, 2018

Plot entropy can be used to automatically select scatter plot transparency

Problem: As a data scientist, you make scatter plots to visually assess a distribution of points. However, the points are often too dense, which can give you a wrong impression of the distribution. So you adjust the transparency setting (usually called alpha) a few times. Now, say you change the dataset: you have to repeat the adjustment all over again.
Solution: Optimize the plot image entropy because it quantifies the 'variety' of color in it.


When you adjust the transparency, you are eyeballing a measure of image color 'variety'. I went for an information-theoretic measure of 'variety'/'dispersion' as opposed to a statistical one (like standard deviation). What I like about information theory is that I can be less mindful of specific statistical distributions and models.

The magic line to calculate color image entropy with scipy/numpy is:
entropy(histogramdd(img, bins=[arange(256)]*3)[0].flatten(), base=2)
where img has shape (width*height, 3) with 8-bit resolution for each color channel.
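Fleshing out the magic line into a self-contained sketch (the function name and the bins parameter are my additions; note that using bins+1 edges gives a full bin per 8-bit value, whereas arange(256) as edges would merge values 254 and 255 into one bin):

```python
import numpy as np
from scipy.stats import entropy

def image_entropy(img, bins=256):
    """Shannon entropy (bits) of an RGB image given as an (N, 3) uint8 array."""
    # bins+1 edges per channel so every value 0..255 gets its own bin at bins=256
    edges = [np.linspace(0, 256, bins + 1)] * 3
    counts, _ = np.histogramdd(img, bins=edges)
    return entropy(counts.flatten(), base=2)

# A single-color image has zero entropy; mixing colors raises it.
flat = np.zeros((100, 3), dtype=np.uint8)                    # all black
mixed = np.vstack([np.zeros((50, 3), dtype=np.uint8),
                   np.full((50, 3), 255, dtype=np.uint8)])   # half black, half white

e_flat = image_entropy(flat, bins=8)    # one occupied color bin -> 0 bits
e_mixed = image_entropy(mixed, bins=8)  # two equally likely colors -> 1 bit
```

To pick the transparency automatically, you would render the scatter plot at several alpha values, compute this entropy on each rendered image, and keep the alpha with the highest entropy.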

The figures below show a monochromatic and a color version of a scatter plot of a data set, with alpha (a) and entropy (s) annotated at the top right. Varying the alpha results in different entropy values. The red annotation marks the highest entropy.

I think the utility and the quantification are easier to see in the monochromatic version, perhaps because it is less 'busy'. But the numbers don't lie!

I'd like to package this somehow if there is interest.

Tuesday, October 30, 2018

RAPIDS for data science signals potential maturation of (big) data science computing

The recent RAPIDS announcement by Nvidia was portrayed as 'data science on GPUs'. In my opinion, it's about the convergence of several trends in data science tools and computing that have been developing over at least 5 years. This convergence naturally materializes as 'data science on GPUs'. Nvidia pounced on the opportunity!

These trends address a dream that I have as a data scientist: I'd like to use pandas and sklearn without having to think about whether the data fits on one machine or not and to use GPUs if available. I also would like to use SQL without having to think about whether the system that I'm executing on is a database. In other words, I'd like to use my preferred programming language and associated libraries without regard to the system that executes the program.

What does this have to do with GPUs? You don't need GPUs to have such a system (*cough* Spark). But it seems like Nvidia, after it made deep learning practically possible, realized there was much more it could accelerate upstream in the data science pipeline, and, in doing so, help achieve this ideal user scenario.

So, to get an up-close look at RAPIDS, I recently went to GTC DC. Pondering what I learned at the conference, I've realized that RAPIDS has a place in the following trends, which all lead to 'rapid' data science iteration.

> The distinctions between analytical databases and data processing systems are blurring.

Spark can do SQL. Spark can do dataframe operations. Some pandas operations resemble SQL operations and vice versa. Functionally, the only thing that a data processing framework needs to have in order for it to be considered an analytical database is data management. The best (only?) effort I've seen so far that tries to bridge data processing and databases from a programmer's perspective is Ibis. On the execution side, executing SQL on GPU dataframes (a component of RAPIDS) is already here.
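To illustrate the blurring (a toy illustration, not part of the RAPIDS stack): the same aggregation expressed as a pandas groupby and as SQL against an in-memory SQLite database gives the same answer.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "NYC", "SF"], "sales": [10, 20, 5]})

# Dataframe style
pandas_totals = df.groupby("city")["sales"].sum().to_dict()

# SQL style, same data, same question
con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)
rows = con.execute("SELECT city, SUM(sales) FROM sales GROUP BY city").fetchall()
sql_totals = dict(rows)

# Both give {'NYC': 30, 'SF': 5}
```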

> High-performance computing meets data science.

HPC software and hardware address distributed computing and accelerated computing on GPUs as well as on CPUs. So what's the problem? Well, traditionally, HPC was less concerned with performing operations over massive amounts of data that would have to flow through some I/O bottleneck from disk. There was a mismatch between HPC hardware & software and data science workloads.

Nonetheless, today, 'deep learning' is a data science application that makes use of HPC capabilities. On the hardware side, the HPC community is more committed to building facilities that can support a variety of workloads including machine learning and 'big data analytics'.

> Data scientist tools are being pushed upstream the data science pipeline.

'Big data analytics', in the context of data science, refers to the steps that happen before model training. These often involve relatively simple operations on large amounts of data (that don't fit on one machine), such as joining, selection, filtering, cleaning, and perhaps feature generation. These tasks are usually the responsibility of databases and 'big data' processing systems like Spark.

Unfortunately for data scientists, databases have rigid interfaces and are not easily programmable (SQL is not a programming language, ok?!). Spark offers decent programmability but runs on the Java VM. This is quite foreign to numerical programmers used to R, Python, C, Fortran, and now, Julia.

But that didn't stop data science engineers from pushing their favorite wares into the 'big data' realm.

Currently, the impressive Dask library is pretty much the go-to tool for easy Python-based distributed computing (which readily integrates with compiled code for execution on GPUs or CPUs). More recently, however, Ray has emerged as another library that can be used for distributed computation and which, in my opinion, offers potentially better integrations with RAPIDS than Dask (but that's another subject).

Another interesting work, if all you care about is Tensorflow, is Tensorflow Transform, a framework that integrates the data science pipeline, covering both training and serving situations in one swoop.

> Data science libraries are decoupling their interface from their execution.

It should be easy to argue that numpy, pandas, and sklearn have been successful. Unfortunately, the use of these tools is generally tied to a single CPU on a machine. Nonetheless, due to their success, their interfaces have become models to emulate for distributed data science (ok, pandas not so much :/ ). For example, Ray and Dask have 'distributed pandas'. Nvidia's ML algorithms are copying the sklearn interface. As another example, frameworks used for deep learning like keras and Tensorflow are just thin interfaces that talk to an execution engine.

As a side benefit, this decoupling should allow one to use their favorite programming language to interact with these compute systems.
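A sketch of why the sklearn interface is such an attractive model to emulate: any backend that implements fit/predict can slot in behind the same user code. The estimator below is a toy stand-in I made up, not a real sklearn or RAPIDS class:

```python
import numpy as np

class MeanRegressor:
    """Toy estimator exposing the sklearn-style fit/predict interface.
    A GPU or distributed backend could implement the same two methods
    and user code would not change."""

    def fit(self, X, y):
        # 'Training' here is just remembering the mean of the targets
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Predict the remembered mean for every input row
        return np.full(len(X), self.mean_)

model = MeanRegressor().fit([[1], [2], [3]], [10.0, 20.0, 30.0])
preds = model.predict([[4], [5]])   # array([20., 20.])
```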

> Machine learning is becoming more automated.

You can imagine that, if you structure and parameterize your machine learning pipeline, model selection becomes an optimization problem that benefits greatly from being able to execute many configurations of the pipeline in order to find the best model. See this excellent survey on automated machine learning.
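The optimization framing can be sketched with a plain grid search over a parameterized pipeline (the scoring function below is a made-up stand-in for actually training and validating a model):

```python
from itertools import product

def score(params):
    # Stand-in for training a pipeline config and returning validation accuracy;
    # peaks at depth=4, lr=0.1 by construction
    return 1.0 - abs(params["depth"] - 4) * 0.1 - abs(params["lr"] - 0.1)

grid = {"depth": [2, 4, 8], "lr": [0.01, 0.1, 1.0]}

# Enumerate every configuration of the pipeline and keep the best one
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
best = max(configs, key=score)
# best == {'depth': 4, 'lr': 0.1}
```

Since each configuration is scored independently, the loop is embarrassingly parallel, which is exactly where distributed execution pays off.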

> Distributed machine learning workflows are being developed.

These days you can easily request a compute cluster from, say, Amazon, and be productive within about 10 minutes. Acquiring and deploying software to distributed systems is comparatively easy these days compared to just five years ago.

Kubernetes has been important in allowing developers to focus more on building applications instead of managing a distributed system. Now, users shouldn't have to deal with Kubernetes directly but they can be given a 'handle' to the cluster that details the available resources provided by Kubernetes.

As a result, Kubeflow was developed as a solution to manage machine learning deployment on Kubernetes. Also, Pachyderm manages data science workflows on Kubernetes.


While exciting, RAPIDS isn't mature yet. At the moment, RAPIDS is a set of libraries that need to be tied together to make a user-friendly experience. This is more challenging than typical software, as developers have to consider (1) distributed computing and (2) GPU computing, in addition to (3) machine learning algorithms. I hope this added complexity doesn't reduce the number of possible contributors.

Personally, I'd like to see more top-down efforts toward achieving the data scientist's dream of converged, distributed, and accelerated data science, by defining some interface(s) for performing data science (sklearn has emerged as an excellent model!)*. What I'm seeing so far is more bottom-up: Nvidia implementing some algorithms, Anaconda providing Dask for distributed computation, and Arrow providing the data structures. As a user, I shouldn't have to make software choices for different hardware scenarios: single CPU, multiple CPUs on one machine, CPUs on multiple machines, single GPU, multiple GPUs on one machine, GPUs on multiple machines, and even heterogeneous hardware situations (is this asking for too much?!).

The usability of RAPIDS is critical to its success. RAPIDS is supposed to enable 'rapid' iteration, and one can only rapidly iterate if the RAPIDS workflow is as easy to use as what has been developed for 'traditional' single-CPU workflows, keeping in mind that CPUs might still be better for some cases.

But even with an ideal data scientist user experience, GPU databases will still have their place as databases. However, I expect that the GPU databases will accommodate RAPIDS workflows by at least providing low-resistance data interchange with RAPIDS components via GPU dataframes.

* Maybe a shim can be made between a user and sklearn that intercepts calls and dispatches them to distributed, possibly GPU-equipped, systems if such a system is available and implements the requested call.
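The footnote's shim idea might look something like this in outline (all names here are hypothetical; detecting and talking to a real cluster is the hard part left out):

```python
class EstimatorShim:
    """Forward sklearn-style calls to a distributed backend if one is
    available, otherwise fall back to a local backend."""

    def __init__(self, local_backend, distributed_backend=None):
        self._backend = (distributed_backend
                         if distributed_backend is not None
                         else local_backend)

    def __getattr__(self, name):
        # Intercept fit/predict/etc. and dispatch to the chosen backend
        return getattr(self._backend, name)

class LocalMean:
    """Trivial local backend: 'fits' by averaging the targets."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

# No cluster detected, so the shim quietly uses the local backend
shim = EstimatorShim(local_backend=LocalMean())
shim.fit([[1], [2]], [3.0, 5.0])
```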

Sunday, June 11, 2017

HoloViews Allows for Rapid Data Exploration by Structuring Data

Problem: Data analysis and visualization are related. You have to set up a new visualization that makes sense every time you want to explore a set of variables. Furthermore, you have to deal with different data formats and plotting libraries.
Solution: Use HoloViews. It forces you to organize your data in such a way that it can be visualized automatically.


Making a choice about which plotting package to use in Python used to be simple: there was pretty much just matplotlib. Nowadays, there is a plethora of visualization packages for Python. Jake VanderPlas recently gave an excellent talk at PyCon 2017 highlighting this, summarized in the diagram below from the talk.

While choice is good, oftentimes I don't need so much choice when it comes to plotting itself as much as I need to be fancy with the process of setting up the plot in such a way that what I'm looking for is revealed in the plot.

This is where tools like HoloViews and Vega/Altair come in. Vega is a plot+data description language. But HoloViews goes a step further in abstraction: you specify relationships in your data and it does the hard work of presenting it in the best way with great flexibility; it's data-oriented, not plot-oriented.

I find explaining HoloViews difficult because almost everyone is used to an imperative style of visualization, whereas HoloViews can be considered declarative. Once you get past that, you have to distinguish HoloViews from Grammar-of-Graphics-inspired plotting packages like ggplot2, which are declarative as well. The difference is that ggplot declares plots while HoloViews declares data. In fact, to 'get' how HoloViews addresses the data analysis/visualization problem, I had to read the conference proceedings paper for it. One unique aspect of HoloViews that helps in understanding it is that it enforces a separation of data, plot rendering, plot type (given by an appropriate 'view' of the data), and plot style.

This approach to the problem translates into the following advantages:

  • rapid exploration of data (no, pandas doesn't cut it)
  • export of the HoloViews object as a self-contained html file with interactive plots
  • map to data in its original form like numpy, pandas, lists, as well as Blaze which itself accesses a variety of data sources including databases
  • choose your rendering backend: matplotlib, bokeh, plotly
  • memory-constrained analysis using DynamicMap
  • instantly switch between using interactive controls like sliders to explore variables and more static displays like an array of plots for each variable value

Show me one tool that does all that! Python vs R jab: it's been said for years that Python lags behind R in visualization. I would say parity was achieved once Python got some ggplot-like tools. But with HoloViews, I think the balance has now tipped in favor of Python!

Monday, September 12, 2016

'Query' meta-data on your data sets


Problem: Suppose you have meta-data on some data sets and you want to select data sets by certain attributes. That sounds a lot like a job for SQL. But the attributes are not strictly in table format, where something is filled in for every attribute. You probably haven't even decided (beforehand) what attributes you should have for every data set.

Solution: 'Convert' attribute (meta-)data into tables that SQL can query.

Note: The YAML part is just a convenience, since it's expected that the meta-data is persistently stored. The meta-data abstraction is just one level of nested dictionaries. I also hear a YAML reader can read JSON.
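A stdlib-only sketch of the idea (plain dicts instead of YAML, and a flattening into a key/value table that SQL can query; the table layout and names are my own illustration):

```python
import sqlite3

# One level of nesting: dataset name -> attribute dict (sparse; keys vary)
meta = {
    "sensors_2016": {"format": "csv", "rows": 1200},
    "survey":       {"format": "json", "anonymized": True},
}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE attrs (dataset TEXT, key TEXT, value TEXT)")

# 'Convert' the nested dicts into (dataset, key, value) rows;
# missing attributes simply produce no row
con.executemany(
    "INSERT INTO attrs VALUES (?, ?, ?)",
    [(ds, k, str(v)) for ds, attrs in meta.items() for k, v in attrs.items()],
)

# Now plain SQL can 'query' the meta-data
csv_sets = [r[0] for r in con.execute(
    "SELECT dataset FROM attrs WHERE key = 'format' AND value = 'csv'")]
# csv_sets == ['sensors_2016']
```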

Wednesday, September 30, 2015

Recurrent Neural Networks Can Detect Anomalies in Time Series

A recurrent neural network is trained on the blue line (some kind of physiologic signal). It has a pattern to it, except at t≈300, where it shows 'anomalous' behavior. The green line (not on the same scale) represents the error between the original signal and a version of it reconstructed by the neural network. Around t≈300, the network could not reconstruct the signal, so the error there becomes significantly higher.

Why is this cool??

  • unsupervised: I did not have to label data with anomalies vs. data without anomalies
  • trained with the anomaly in the data: as long as most of the data is normal, the algorithm seemed robust enough to learn the pattern of the data even with the anomaly in it
  • no domain knowledge applied: no expert in this kind of time series provided input on how to analyze this data

More details for the more technical people:
- training algo: RMSprop
- input noise added
- the network is an LSTM autoencoder
- it's a fairly small network
- code: theanets 
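The thesis used an LSTM autoencoder, but the core principle (reconstruction error spikes where the model can't reproduce the signal) can be sketched with a much simpler stand-in 'model', a moving-average reconstruction over synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(600)
signal = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(t.size)
signal[300] += 3.0   # inject an anomaly at t=300

# Crude stand-in for the autoencoder: reconstruct each point from its neighbors
kernel = np.ones(11) / 11
reconstruction = np.convolve(signal, kernel, mode="same")

# The reconstruction error spikes where the 'model' fails to reproduce the signal
error = np.abs(signal - reconstruction)
anomaly_at = int(np.argmax(error))   # 300
```

An actual autoencoder learns the normal pattern instead of assuming local smoothness, but the detection logic (threshold or argmax over reconstruction error) is the same.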

And that's my master's thesis in one graph!

update 12/2018:
This post has been getting much attention, which I appreciate. However, I feel obligated to point readers to the latest research, which obviates RNNs. Here is a great introduction to that research: The Fall of RNN / LSTM.