Wednesday, September 30, 2015

Recurrent Neural Networks Can Detect Anomalies in Time Series

A recurrent neural network is trained on the blue line (which is some kind of physiologic signal). It has some kind of pattern to it except at t=~300 where it shows 'anomalous' behavior. The green line (not same scale) represents the error between the (original) signal and a reconstructed version of it from the neural network. At ~300, the network could not reconstruct the signal, so the error there becomes significantly higher.

Why is this cool??

  • unsupervised: I did not care about data with anomalies vs data without anomalies
  • trained with anomaly in the data: as long as most of the data is normal, the algorithm seemed robust enough to have learned the pattern of the data with the anomaly in it.
  • no domain knowledge applied: no expert in this kind of time series provided input on how to analyze this data

More details for the more technical people:
- training algo: RMSprop
- input noise added
- the network is an LSTM autoencoder
- it's a fairly small network
- code: theanets 

And that's my master's thesis in one graph!

Monday, August 24, 2015

Run CUDA applications on CoreOS


Use this Dockerfile to install NVIDIA drivers and CUDA on more recent versions of CoreOS. It works by installing the NVIDIA Linux kernel module using plain Linux kernel source (containers see the kernel of the host OS, not the kernel of the container OS).

There are otheDockerfiles that manage this but they ask that you juggle two installations of the driver: one on the host and the other in the container. With the Dockerfile that I've developed, you only have one driver installation to worry about.


I find having to do this a bit hacky and against the containerization philosophy. Having the kernel module loaded from a Dockerfile and then, as a consequence, not being able to have multiple driver versions on the host. But maybe I'm asking too much from Docker's virtualization technique as I don't think it was meant to virtualize such low-level functions of the operating system.

Still, it's not that bad. Being able to use other CUDA-enabled Dockerfiles with only slight modification is great. I can also load and unload the kernel module at will. You just can't have two versions of the module running at the same time which isn't too much of an issue with GPU computing as you're probably going to not leave enough resources for other GPU processes on the (same) host.


Monday, August 10, 2015

"Personal Compute Cloud" Infrastructure Code


Problem: Automate computing infrastructure setup
Solution: Docker hosts on CoreOS machines provisioned with Ansible.

I've recently finished coding up a solution to tackle 'personal' distributed computing. I was bothered by the (apparent) lack of a framework to handle the coordination of setting up multiple machines. And shell scripts are messy. Once I learned Ansible, I was not bothered! (It will be the only systems automation tool I will be using in the foreseeable future! yah..Ansible is AWESOME!)

Catering to the Scientific Computing Workflow: However, mere automation was not my only concern. I wanted a seamless transition from what I'm working on locally to being able to bring more computing power from remote machines. Unlike (pure) software engineering there isn't a 'development' environment and a 'production' environment. Now there are a handful of codes out there that can help you provision CoreOS clusters, but that does not fit well with the scientific computing workflow.

Status: Most of the functionality that I had planned has been implemented. However, like all codes, it's a work-in-progress. I'll be adding functionality as needed by my priorities.

Try it out.

Tuesday, May 26, 2015

Use Vagrant FROM Ansible to Automate Hybrid Cloud Infrastructure


The Intro

This is NOT about having Vagrant provision with Ansible. This is about having Ansible treat Vagrant as a provider of hosts.

Building on my previous experience with the 'cloud', I still felt like I needed another tool to script and glue the process of getting my infrastructure up. I started out with shell scripts but they quickly got messy as the complexity increased. I knew about all the devops tools out there but I avoided them because I thought they would be too complex themselves for what I wanted to do which is relatively simple. But I bit the bullet on went full-on devops with Ansible.

Ansible is GREAT! I found it suitable for (technically-minded) beginners. However, it still took me a few days to get the hang of it. I had to get a little bit under the hood since it did not do what I wanted it to do out of the box.

I want to setup something like a hybrid cloud where I run some services locally and just bring up high-performance compute nodes on demand and have them talk with my local services. I use Vagrant to setup local virtual machines. Vagrant is great for development environments but when I want to manage and orchestrate several VMs locally (let alone on the cloud), things can get messy.

So, I (further) developed ansible-vagrant to interface with Vagrant from Ansible (solving cygwin problems along the way).

The Cream

You can, from Ansible

  • Set state=(up|halt) for some VM
  • Get a Vagrant host inventory
  • Get a SSH config for a host
  • Destroy VMs

Friday, May 1, 2015

CoreOS cloud-config Generator


After moving my workflow over to Docker I realized that it's not a complete solution. There is still the process that comes before Docker that must be addressed: namely, provisioning and managing machines on a (ugghh) "cloud" provider such as Amazon EC2, Digital Ocean, and even Vagrant. Also, CoreOS has become an important player in the Docker scene, providing just the minimum operating system needed to run Docker in a cluster environment. CoreOS also provides some commonality in the process of provisioning machines on different providers by making use of a cloud-config file; The same (or almost the same) cloud-config file can be used on different providers.

That's good news but it introduces problems like:
- I want to keep this piece of the cloud-config file out of source control
- I don't want to copy/paste between cloud-config files (DRY issues) :
    - this cloud-config file is just like this one but with a different password/user/hostname/IP address
    - this cloud-config file is just like this one but with an added section

So, I made a program to address these issues! Find out more by reading the README in the repository.

Comments are welcome.

Friday, February 20, 2015

Serving GTK3 Applications on the Web

....Scientific Python on the Web: not there yet!

It started with a simple goal: Put my interactive matplotlib desktop application on the web so that others could interact with it easily. I succeeded in the end but it was alot of work which pushed the boundaries of my technology knowledge and programming skill. Since it was alot of work, I'll organize the story of my journey into steps:

0. Try out different (existing) ways of putting matplotlib on the web.

I tried IPython errr Jupiter with different matplotlib HTML5 backends. While certainly cool, none of them implemented GUI functionality. That is, I would get an error because they did not respond to mouse clicks. I'm not starting with zero to be nerdy: Had I gotten IPython to work with mouse position clicks I would not have bothered with developing my solution explained in the following steps. This is also considering that IPython is a poor fit since it's a document while my program can be considered an 'app'.

1. Use a matplotlib backend that supports a GUI that can be displayed in a browser: GTK3

After some research, I settled on using GTK3 which comes with the broadway display server which can serve (individually) the application in an HTML5 page. After playing with GTK3 on Ubuntu Linux I managed to get my program to display in the browser as well as any other GTK3 program such as gEdit. It required a little change in matplotlib.

2. Serve the same program to any user who requests it

This was the most difficult part. The same application needed to be served on demand multiple times, perhaps simultaneously. Just pointing users to the display server is not sufficient. The (normal) overall process needs to be: 1. user requests application 2. display server with the application is started 3. monitor connection to disconnect and clean up after user exits.

I broke down the problem into 'modules' that represent each of these three activities though I worked on them in the order below:

  • 2) Display manager: Working on the display manger came naturally after being able to just display GTK3 applications (step 1). It manages the starting up and shutting down of the display servers and the application that run on them. This code is pretty independent of the other modules although I put in functionality keeping in mind what the other modules need to do.
  • 3) Connection manager: I used websockets to periodically send out a request for the user to confirm that he is active. This happens by simultaneously executing python on the server and javascript on the client. On disconnecting, the display and its associated application is stopped.
  • 1) Request handler: It starts the process of starting a display when the user requests it.

Of course, that's easier said than done. The challenge in coding the display manger was managing the Linux processes and cleanly killing them. The connection manager had me go into event-driven programming and websockets. The request handler was rather straightforward to implement. But I had to extract the javascript from the broadway server to integrate it with my the process. So the webpage is (actually) served from the request handler and not the broadway server. I used the Tornado web framework to program the request handler and the connection manager. While the display manager is associated with the request handler (in the same program). Making sure events were coordinated was difficult!

3. Deploy

Having heard about about how awesome Docker is, I decided to recreate my development environment in a docker container. It was a bit tricky to get the networking working but I'm impressed with what is possible with Docker. Currently, my container is living in while actually being served from AWS. Please volunteer to serve my application!

Reflections: Scientific Python on the Web

I have to say that I'm pretty satisfied with the result even though it's not ideal. As soon as as I got the task accomplished, I stopped working on it even though I could sink alot more time in it to improve quality, usability, and flexibility. Personally, I learned ALOT working on this project.

But I was disappointed that over in the R world you can create apps with GUIs on the web much more easily with shiny. They recently added mouse position clicks. I say it's disappointing because python is associated with multiple application domains including GUIs. Now in my research for solutions, I found ways to integrate matplotlib into GUIs but that would require some GUI expertise. There is no obvious solution for people used to the scientific python stack as to which GUI framework to use if web publishing is a concern. The people used to the scientific stack are not GUI experts nor are they web developers.

Having said that, what I like about my solution is that there is a path from matplotlib to a full GTK3 GUI application. So you can start with (simple) matplotlib elements and then if you decide you need the functionality of a real GUI you can integrate your work into the GTK3 framework. I've tried it. So you could have an app that runs on the desktop as well as the web. That is superior to shiny.

Some people have commented on the state of scientific python on the web. As part of the solution, I think somehow documents (html, IPython) and applications (GUIs) need to merge. The web has become a medium to deliver experiences.

Unfortunately, for delivering interactive python on the web, there is still alot of work to be done. But just by myself, I was able to deliver a product, albeit hacky, that serves python applications on the web using open source: Docker, Tornado, scipy/numpy, matplotlib, GTK3, Ubuntu, Linux...etc. Imagine what would happen if the open source community came together to work on this problem. Some components exist but now it has to come together. Hopefully, an open-source solution can be superior to a proprietary one.

Introduced at DC Python meetup.

Thursday, February 19, 2015

Bayesian Optimization Demo Game: Can you beat the 'computer'? interactive matplotlib (GTK3) app served on the web.
app (not working properly on Chrome): []

I recently made a notebook about Gaussian processes (GP). Now, Gaussian processes are used by Bayesian optimization. In Bayesian optimization, the goal is to optimize a function with as few (probably expensive) function evaluations as possible. Here's a good tutorial.

To see how it works, I implemented a basic optimizer following the mathematics introduced in the referenced tutorial (which wasn't as difficult as the math behind the GP!). I began from the code for the GP, to see how a Bayesian optimizer (BO) would optimize a toy 1-D problem.

But then, I wondered how the performance of a human would compare to the BO. So I made a game! In this game, the goal is to, of course, find the maximum of the function is as few tries as possible. Both players in each trial will attempt to find the maximum. A running average of the number of tries is kept.

After some experimentation playing the game, it seems there is a lower bound for the average number of tries needed. For the way I have my game set up (50 possible choices), the BO needed 10 point something +/-  point something tries. Furthermore, trying different so-called acquisition functions for the BO did not have much of an effect. This figure is based on thousands of trials (central limit theorem at work). And playing as the creator of my game, it was difficult for me to keep below 10 tries as an average. Of course, I can't play thousands of times.

These results imply that a human would not be able to optimize a function in three dimensions or more in less tries than a BO. There are simply too many variables to keep track of and it is very difficult if not impossible to visualize which is part of the point of using BOs.

Some features of the program:

  • For aesthetic reasons, the y-axis has to be fixed based on the range of values of the function. But then the visual clue of having a point close to the top edge of the plot gives the human player an unfair advantage. The solution is to add some random margin between the maximum and the edge. So, if you happen to get a point that is close to the top edge, choose its neighbors!
  • To generate a 'random' smooth function, I get some normally distributed points and sprinkle them on the domain. Then I use splines to connect them. It works surprisingly well!

Implementation notes:

The programming started in a functional style which is what you'd expect out of mathematical code and it's what I'm used to. However, once I got into matplotlib's GUI stuff, things started to get a little messy as a GUI requires reactive event-driven programming. Both styles of programming exist in my program and they intersect at while loops. You can run the game script with python on your desktop from the command line.

Now, the work involved in putting the game on the web, so that you, the reader, can easily engage in it, deserves its own post or two! Once I had the 'desktop' application running, I swam through oceans of technology until I got it to the form presented here. I should mention here that I came across a program similar to mine on Wolfram|Alpha but I can't find it anymore.

So here is the game served on the web. It's a Docker container managed for free, for now, by but hosted almost for free on AWS for a few more months as of the date of this post. If you can spare 1GB of HD space and 100MB of memory to host my program, let me know!

Monday, January 19, 2015

Learn About Gaussian Processes


Gaussian processes (GP's) are awesome! Not only can you approximate functions with data, but you also get confidence intervals! (Gaussian processes are an example of why I think some statistics knowledge is fundamental to data science.)

I was motivated to learn about GP's because I wanted to learn about Bayesian optimization because I wanted to optimize neural network hyperparameters because I wanted to reduce training time and find the best network configuration for my data.

However, despite being mostly trained in science and engineering, it took me a while to get an understanding of GP's partly because of my not so rigorous statistics training. Add to that that I'm used to analytical formulas being used to describe functions...not some statistical distribution. So the main point of this post is to share what I did to learn, at least the basics, of GP's in the hope it helps others.

I went through the following materials to learn the basics of GP.

Main material:
You might think that diving right into (textual) documents introducing GP's would help you understand GP's more efficiently. But, I found that starting with video lectures was a more productive use of my time (in the beginning). "mathematicalmonk"'s and Nando de Freita's videos are top video search results. While they are great videos to learn from, I found that David MacKay's video is better because he makes effective use of computers to learn the math and gives better insights!

Support material:
- linear algebra operations and numerical considerations
- statistics: REALLY make sure you understand the difference between marginal, conditional, and joint distributions in a more abstract sense. Then you should be able to follow a derivation of the conditional distribution of a multivariate Gaussian distribution.

Exercise: IPython errr Jupiter Notebook:
I took a script that calculates GP's from Nando de Freita's machine learning course and made it into a document attempting to explain details of the code. Please comment regarding my shortcomings in fully understanding the code.

Finally, a remark about NumPy (as well as other matrix-oriented code) after going through this code: Now I love NumPy but trying to decipher NumPy code that is anything but straightforward matrix operations can be stressful! Writing a perhaps more verbose but simpler code makes such code easier to follow. And when speed is an issue for this simpler code, you might want to reach for Julia or Numba...but that's a separate discussion!