Sunday, June 11, 2017

HoloViews Allows for Rapid Data Exploration by Structuring Data

Data analysis and visualization go hand in hand, yet every time you want to explore a new set of variables you have to set up a new visualization that makes sense for them. Furthermore, you have to deal with different data formats and plotting libraries.
Use HoloViews. It forces you to organize your data in such a way that you can automatically visualize it.


Choosing a plotting package in Python used to be simple: there was pretty much just matplotlib. Nowadays, there is a plethora of visualization packages for Python. Jake VanderPlas recently gave an excellent talk at PyCon 2017 that highlights this, which is summarized in the diagram below from the talk.

While choice is good, I often don't need that much choice in the plotting itself. What I need is more power in setting up the plot so that it reveals what I'm looking for.

This is where tools like HoloViews and Vega/Altair come in. Vega is a plot+data description language. But HoloViews goes a step further in abstraction: you specify relationships in your data and it does the hard work of presenting it in the best way with great flexibility. It's data-oriented, not plot-oriented.

I find explaining HoloViews difficult because most everyone is used to an imperative style of visualization, whereas HoloViews can be considered declarative. Once you get past that, you have to distinguish HoloViews from Grammar-of-Graphics-inspired plotting packages like ggplot2, which are declarative as well. The difference is that ggplot declares plots while HoloViews declares data. In fact, to 'get' how HoloViews addresses the data analysis/visualization problem, I had to read the proceedings paper on it. One unique aspect of HoloViews that can help in understanding it is that it enforces a separation of data, plot rendering, plot type (given by an appropriate 'view' of the data), and plot style.

This approach to the problem translates into the following advantages:

  • rapid exploration of data (no, pandas doesn't cut it)
  • export of the HoloViews object as a self-contained html file with interactive plots
  • map to data in its original form like numpy, pandas, lists, as well as Blaze which itself accesses a variety of data sources including databases
  • choose your rendering backend: matplotlib, bokeh, plotly
  • memory-constrained analysis using DynamicMap
  • instantly switch between using interactive controls like sliders to explore variables and more static displays like an array of plots for each variable value

Show me one tool that does all that! Python vs R jab: It's been said for years that Python lags behind R in visualization. I would say parity was achieved once Python got some ggplot-like tools. But with HoloViews, I think the balance has now tipped in favor of Python as the better language for visualization!

Monday, September 12, 2016

'Query' meta-data on your data sets


Problem: Suppose you have meta-data on some data sets and you want to select data for certain attributes. That sounds a lot like a job for SQL. But the attributes are not strictly in a table format where you have something filled in for every attribute. You probably have not even decided (beforehand) what attributes you should have for every data set.

Solution: 'Convert' attribute (meta-)data into tables that SQL can query.

Note: The YAML part is just a convenience since it's expected that the meta-data is persistently stored. The meta-data abstraction is just one level of nested dictionaries. I also hear a YAML reader can read JSON.
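
As a rough illustration of the idea (not the original code), here is a minimal sketch using Python's built-in sqlite3. The data set names and attributes are hypothetical, the metadata_to_table helper is made up for this example, and a plain dictionary stands in for whatever a YAML reader would return. Attributes that a data set does not define simply become NULL.

```python
import sqlite3

# Hypothetical meta-data: one level of nested dictionaries,
# {data set name: {attribute: value}}. Not every data set
# defines every attribute.
metadata = {
    "run_a": {"subject": "s01", "sampling_hz": 250, "sensor": "ecg"},
    "run_b": {"subject": "s02", "sampling_hz": 500},
    "run_c": {"subject": "s01", "sensor": "eeg", "notes": "noisy"},
}


def metadata_to_table(conn, metadata, table="datasets"):
    """Create one table whose columns are the union of all attributes.

    Assumes attribute names are simple identifiers and that none of
    them is called 'name' (used here for the data set key).
    """
    columns = sorted({attr for attrs in metadata.values() for attr in attrs})
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" (name TEXT PRIMARY KEY, {col_defs})')
    placeholders = ", ".join("?" for _ in range(len(columns) + 1))
    for name, attrs in metadata.items():
        # attrs.get() yields None for missing attributes, stored as NULL.
        row = [name] + [attrs.get(c) for c in columns]
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
    return columns


conn = sqlite3.connect(":memory:")
metadata_to_table(conn, metadata)

# Select data sets for subject s01 that have a sensor attribute at all.
rows = conn.execute(
    "SELECT name FROM datasets "
    "WHERE subject = ? AND sensor IS NOT NULL ORDER BY name",
    ("s01",),
).fetchall()
selected = [r[0] for r in rows]
print(selected)
```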

Wednesday, September 30, 2015

Recurrent Neural Networks Can Detect Anomalies in Time Series

A recurrent neural network is trained on the blue line (which is some kind of physiologic signal). It has some kind of pattern to it except at t=~300 where it shows 'anomalous' behavior. The green line (not same scale) represents the error between the (original) signal and a reconstructed version of it from the neural network. At ~300, the network could not reconstruct the signal, so the error there becomes significantly higher.

Why is this cool??

  • unsupervised: I did not care about data with anomalies vs data without anomalies
  • trained with anomaly in the data: as long as most of the data is normal, the algorithm seemed robust enough to have learned the pattern of the data with the anomaly in it.
  • no domain knowledge applied: no expert in this kind of time series provided input on how to analyze this data

More details for the more technical people:
- training algo: RMSprop
- input noise added
- the network is an LSTM autoencoder
- it's a fairly small network
- code: theanets 

And that's my master's thesis in one graph!
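
The reconstruction-error idea itself can be sketched without any neural network. The snippet below is NOT the LSTM autoencoder from the thesis: it swaps in a deliberately simple stand-in 'model' (the per-phase mean of a periodic toy signal) just to show how the error between a signal and its reconstruction peaks at the anomaly. The signal, its period, the injected anomaly and the 3-sigma threshold are all assumptions made for this example.

```python
import math
import statistics

PERIOD = 25  # assumed period of the toy signal, in samples

# Toy periodic signal with an anomaly injected around t=300.
signal = [math.sin(2 * math.pi * t / PERIOD) for t in range(500)]
for t in range(300, 310):
    signal[t] += 1.5

# Stand-in for the trained network: 'learn' the typical value at each
# phase of the cycle from ALL the data, anomaly included.
phase_mean = [
    statistics.mean(signal[p::PERIOD]) for p in range(PERIOD)
]
reconstruction = [phase_mean[t % PERIOD] for t in range(len(signal))]

# Reconstruction error: large where the model could not reproduce the signal.
error = [abs(s - r) for s, r in zip(signal, reconstruction)]

# Flag points whose error is far above the typical error.
threshold = statistics.mean(error) + 3 * statistics.pstdev(error)
anomalies = [t for t, e in enumerate(error) if e > threshold]
print(anomalies)
```

Because most of the data is normal, the learned pattern is barely shifted by the anomaly, which is why only the anomalous stretch ends up above the threshold.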

Monday, August 24, 2015

Run CUDA applications on CoreOS


Use this Dockerfile to install NVIDIA drivers and CUDA on more recent versions of CoreOS. It works by installing the NVIDIA Linux kernel module using plain Linux kernel source (containers see the kernel of the host OS, not the kernel of the container OS).

There are other Dockerfiles that manage this, but they ask that you juggle two installations of the driver: one on the host and the other in the container. With the Dockerfile that I've developed, you only have one driver installation to worry about.


I find having to do this a bit hacky and against the containerization philosophy: the kernel module is loaded from a Dockerfile and, as a consequence, you cannot have multiple driver versions on the host. But maybe I'm asking too much from Docker's virtualization technique, as I don't think it was meant to virtualize such low-level functions of the operating system.

Still, it's not that bad. Being able to use other CUDA-enabled Dockerfiles with only slight modification is great. I can also load and unload the kernel module at will. You just can't have two versions of the module running at the same time, which isn't much of an issue with GPU computing since you're probably not going to leave enough resources for other GPU processes on the (same) host anyway.


Monday, August 10, 2015

"Personal Compute Cloud" Infrastructure Code


Problem: Automate computing infrastructure setup
Solution: Docker hosts on CoreOS machines provisioned with Ansible.

I've recently finished coding up a solution to tackle 'personal' distributed computing. I was bothered by the (apparent) lack of a framework to handle the coordination of setting up multiple machines. And shell scripts are messy. Once I learned Ansible, I was not bothered! (It will be the only systems automation tool I will be using in the foreseeable future! yah..Ansible is AWESOME!)

Catering to the Scientific Computing Workflow: However, mere automation was not my only concern. I wanted a seamless transition from what I'm working on locally to being able to bring in more computing power from remote machines. Unlike in (pure) software engineering, there isn't a 'development' environment and a 'production' environment. Now, there are a handful of codes out there that can help you provision CoreOS clusters, but they do not fit well with the scientific computing workflow.

Status: Most of the functionality that I had planned has been implemented. However, like all codes, it's a work-in-progress. I'll be adding functionality as needed by my priorities.

Try it out.

Tuesday, May 26, 2015

Use Vagrant FROM Ansible to Automate Hybrid Cloud Infrastructure


The Intro

This is NOT about having Vagrant provision with Ansible. This is about having Ansible treat Vagrant as a provider of hosts.

Building on my previous experience with the 'cloud', I still felt like I needed another tool to script and glue the process of getting my infrastructure up. I started out with shell scripts but they quickly got messy as the complexity increased. I knew about all the devops tools out there but I avoided them because I thought they would be too complex themselves for what I wanted to do, which is relatively simple. But I bit the bullet and went full-on devops with Ansible.

Ansible is GREAT! I found it suitable for (technically-minded) beginners. However, it still took me a few days to get the hang of it. I had to get a little bit under the hood since it did not do what I wanted it to do out of the box.

I want to set up something like a hybrid cloud where I run some services locally and just bring up high-performance compute nodes on demand and have them talk with my local services. I use Vagrant to set up local virtual machines. Vagrant is great for development environments, but when I want to manage and orchestrate several VMs locally (let alone on the cloud), things can get messy.

So, I (further) developed ansible-vagrant to interface with Vagrant from Ansible (solving cygwin problems along the way).

The Cream

You can, from Ansible:

  • Set state=(up|halt) for some VM
  • Get a Vagrant host inventory
  • Get a SSH config for a host
  • Destroy VMs
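
To give a feel for the inventory and SSH config items above, here is a minimal sketch of turning the kind of text printed by `vagrant ssh-config` into the JSON structure Ansible expects from a dynamic inventory. This is not how ansible-vagrant is implemented; the sample text, the host names, the 'vagrant' group name and the ssh_config_to_inventory helper are all made up for illustration.

```python
import json

# Hypothetical sample of `vagrant ssh-config` output for two VMs.
SAMPLE_SSH_CONFIG = """\
Host web
  HostName 127.0.0.1
  User vagrant
  Port 2222
  IdentityFile /home/me/.vagrant/machines/web/virtualbox/private_key

Host db
  HostName 127.0.0.1
  User vagrant
  Port 2200
  IdentityFile /home/me/.vagrant/machines/db/virtualbox/private_key
"""

# ssh-config keys mapped to Ansible connection variables.
KEY_MAP = {
    "HostName": "ansible_host",
    "User": "ansible_user",
    "Port": "ansible_port",
    "IdentityFile": "ansible_ssh_private_key_file",
}


def ssh_config_to_inventory(text, group="vagrant"):
    """Parse ssh-config style text into a dynamic-inventory dictionary."""
    hostvars = {}
    current = None
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        key, value = parts
        value = value.strip('"')
        if key == "Host":
            current = value
            hostvars[current] = {}
        elif current is not None and key in KEY_MAP:
            # Ports are stored as integers, everything else as strings.
            hostvars[current][KEY_MAP[key]] = int(value) if key == "Port" else value
    return {group: {"hosts": list(hostvars)}, "_meta": {"hostvars": hostvars}}


inventory = ssh_config_to_inventory(SAMPLE_SSH_CONFIG)
print(json.dumps(inventory, indent=2))
```

In a real inventory script the text would come from running Vagrant rather than from a hard-coded string.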