Friday, February 20, 2015

Serving GTK3 Applications on the Web

....Scientific Python on the Web: not there yet!

It started with a simple goal: Put my interactive matplotlib desktop application on the web so that others could interact with it easily. I succeeded in the end, but it was a lot of work that pushed the boundaries of my technology knowledge and programming skill. Since it was so much work, I'll organize the story of my journey into steps:


0. Try out different (existing) ways of putting matplotlib on the web.

I tried IPython, errr, Jupyter with different matplotlib HTML5 backends. While certainly cool, none of them implemented GUI functionality; that is, I would get an error because they did not respond to mouse clicks. I'm not starting at zero just to be nerdy: had I gotten IPython to handle mouse clicks (and their positions), I would not have bothered developing the solution explained in the following steps. IPython is also a poor fit anyway, since a notebook is a document while my program is better thought of as an 'app'.


1. Use a matplotlib backend that supports a GUI that can be displayed in a browser: GTK3

After some research, I settled on GTK3, which ships with the Broadway display server; Broadway can serve an individual application as an HTML5 page. After playing with GTK3 on Ubuntu Linux, I managed to get my program to display in the browser, just like any other GTK3 program such as gEdit. It required a small change to matplotlib.
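To give a feel for the setup, here is a minimal sketch (not my actual application) of a matplotlib figure embedded in a GTK3 window; the same script can be shown on the desktop or, via Broadway, in a browser. The display and port numbers are just illustrative and may vary by GTK version.

    import gi
    gi.require_version('Gtk', '3.0')
    from gi.repository import Gtk
    from matplotlib.figure import Figure
    from matplotlib.backends.backend_gtk3agg import FigureCanvasGTK3Agg as FigureCanvas

    win = Gtk.Window(title="matplotlib in GTK3")
    win.connect("delete-event", Gtk.main_quit)

    fig = Figure(figsize=(5, 4))
    ax = fig.add_subplot(111)
    ax.plot([0, 1, 2, 3], [0, 1, 0, 1])

    canvas = FigureCanvas(fig)  # a Gtk.DrawingArea, so it drops into any GTK3 layout
    win.add(canvas)
    win.set_default_size(500, 400)
    win.show_all()
    Gtk.main()

    # To view it in a browser instead of on the desktop (roughly):
    #   broadwayd :5 &
    #   GDK_BACKEND=broadway BROADWAY_DISPLAY=:5 python this_script.py
    # then browse to the port broadwayd reports (8085 for display :5 by default).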


2. Serve the same program to any user who requests it

This was the most difficult part. The same application needed to be served on demand multiple times, perhaps simultaneously, so just pointing users to the display server is not sufficient. The overall process needs to be: (1) the user requests the application, (2) a display server running the application is started, and (3) the connection is monitored so that everything can be cleaned up after the user disconnects or exits.

I broke the problem down into 'modules', one for each of these three activities, though I worked on them in the order below:

  • 2) Display manager: Working on the display manager came naturally after being able to simply display GTK3 applications (step 1). It manages the starting up and shutting down of the display servers and the applications that run on them. This code is fairly independent of the other modules, although I designed its functionality with the other modules' needs in mind.
  • 3) Connection manager: I used websockets to periodically ask the user to confirm that they are still active. This happens by simultaneously executing Python on the server and JavaScript on the client. On disconnection, the display and its associated application are stopped.
  • 1) Request handler: It starts up a display (and the application on it) when the user requests the page.


Of course, that's easier said than done. The challenge in coding the display manager was managing the Linux processes and cleanly killing them. The connection manager pushed me into event-driven programming and websockets. The request handler was rather straightforward to implement, but I had to extract the JavaScript from the Broadway server to integrate it with my process, so the webpage is actually served from the request handler and not the Broadway server. I used the Tornado web framework to program the request handler and the connection manager, while the display manager lives in the same program as the request handler. Making sure events were coordinated was difficult! A rough sketch of how these pieces fit together follows.
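The sketch below uses Tornado. DisplayManager, start_display(), and stop_display() are hypothetical stand-ins for my process-management code, and the heartbeat interval and URLs are made up; this is only meant to show the shape of the request handler and connection manager.

    import tornado.ioloop
    import tornado.web
    import tornado.websocket

    class DisplayManager(object):
        """Hypothetical stand-in: starts/stops broadwayd plus the GTK3 app."""
        def start_display(self, session_id):
            pass  # launch broadwayd and the application as subprocesses

        def stop_display(self, session_id):
            pass  # kill the subprocesses and free the display number

    displays = DisplayManager()

    class AppRequestHandler(tornado.web.RequestHandler):
        def get(self):
            session_id = self.get_cookie("session") or "new-session"
            displays.start_display(session_id)
            # The page served here embeds Broadway's JavaScript client.
            self.write("<html><body>Broadway client goes here</body></html>")

    class ConnectionHandler(tornado.websocket.WebSocketHandler):
        def open(self):
            self.session_id = self.get_cookie("session")
            # Periodically ask the browser to confirm the user is still there.
            self.heartbeat = tornado.ioloop.PeriodicCallback(
                lambda: self.write_message("ping"), 5000)
            self.heartbeat.start()

        def on_message(self, message):
            pass  # "pong" replies land here; could update a last-seen timestamp

        def on_close(self):
            self.heartbeat.stop()
            displays.stop_display(self.session_id)  # clean up after the user leaves

    app = tornado.web.Application([
        (r"/", AppRequestHandler),
        (r"/ws", ConnectionHandler),
    ])

    if __name__ == "__main__":
        app.listen(8888)
        tornado.ioloop.IOLoop.current().start()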


3. Deploy

Having heard about how awesome Docker is, I decided to recreate my development environment in a Docker container. It was a bit tricky to get the networking working, but I'm impressed with what is possible with Docker. Currently, my container is living on tutum.co while actually being served from AWS. Please volunteer to serve my application!


Reflections: Scientific Python on the Web

I have to say that I'm pretty satisfied with the result even though it's not ideal. As soon as I got the task accomplished, I stopped working on it, even though I could sink a lot more time into improving quality, usability, and flexibility. Personally, I learned a LOT working on this project.

But I was disappointed that over in the R world you can create apps with GUIs on the web much more easily with Shiny, which recently added mouse click positions. I say it's disappointing because Python is associated with multiple application domains, including GUIs. In my research for solutions, I found ways to integrate matplotlib into GUIs, but that requires some GUI expertise. For people used to the scientific Python stack, there is no obvious answer as to which GUI framework to use if web publishing is a concern, and those people are generally neither GUI experts nor web developers.

Having said that, what I like about my solution is that there is a path from matplotlib to a full GTK3 GUI application. You can start with simple matplotlib elements and then, if you decide you need the functionality of a real GUI, integrate your work into the GTK3 framework. I've tried it. So you could have an app that runs on the desktop as well as on the web. That is superior to Shiny.

Some people have commented on the state of scientific Python on the web. As part of the solution, I think documents (HTML, IPython) and applications (GUIs) somehow need to merge. The web has become a medium to deliver experiences.

Unfortunately, for delivering interactive Python on the web, there is still a lot of work to be done. But just by myself, I was able to deliver a product, albeit a hacky one, that serves Python applications on the web using open source: Docker, Tornado, scipy/numpy, matplotlib, GTK3, Ubuntu, Linux...etc. Imagine what would happen if the open source community came together to work on this problem. Some of the components exist, but they have to come together. Hopefully, an open-source solution can be superior to a proprietary one.


---
Introduced at DC Python meetup.

Thursday, February 19, 2015

Bayesian Optimization Demo Game: Can you beat the 'computer'?

...an interactive matplotlib (GTK3) app served on the web.


I recently made a notebook about Gaussian processes (GP). Now, Gaussian processes are used by Bayesian optimization. In Bayesian optimization, the goal is to optimize a function with as few (probably expensive) function evaluations as possible. Here's a good tutorial.

To see how it works, I implemented a basic optimizer following the mathematics introduced in the referenced tutorial (which wasn't as difficult as the math behind the GP!). I started from the code for the GP and watched how a Bayesian optimizer (BO) would optimize a toy 1-D problem.
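For the curious, here is a minimal sketch of the Bayesian-optimization loop, not the game code itself. It assumes an RBF kernel, a noise-free GP, and an upper-confidence-bound acquisition function; the kernel length scale, kappa, and the toy objective are made-up values.

    import numpy as np

    def rbf_kernel(a, b, length_scale=0.1):
        d = a.reshape(-1, 1) - b.reshape(1, -1)
        return np.exp(-0.5 * (d / length_scale) ** 2)

    def gp_posterior(x_train, y_train, x_test, jitter=1e-8):
        K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
        Ks = rbf_kernel(x_train, x_test)
        Kss = rbf_kernel(x_test, x_test)
        mu = Ks.T @ np.linalg.solve(K, y_train)        # posterior mean
        cov = Kss - Ks.T @ np.linalg.solve(K, Ks)      # posterior covariance
        return mu, np.sqrt(np.clip(np.diag(cov), 0, None))

    def bayes_opt(f, candidates, n_tries=10, kappa=2.0, seed=0):
        rng = np.random.default_rng(seed)
        x_seen = [rng.choice(candidates)]              # first try is random
        y_seen = [f(x_seen[0])]
        for _ in range(n_tries - 1):
            mu, sigma = gp_posterior(np.array(x_seen), np.array(y_seen), candidates)
            acq = mu + kappa * sigma                   # upper confidence bound
            x_next = candidates[np.argmax(acq)]
            x_seen.append(x_next)
            y_seen.append(f(x_next))
        best = int(np.argmax(y_seen))
        return x_seen[best], y_seen[best]

    # Toy example: maximize a bumpy 1-D function over 50 candidate points.
    xs = np.linspace(0, 1, 50)
    f = lambda x: np.sin(10 * x) * x
    print(bayes_opt(f, xs))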

But then, I wondered how the performance of a human would compare to the BO. So I made a game! In this game, the goal is, of course, to find the maximum of the function in as few tries as possible. In each trial, both players attempt to find the maximum, and a running average of the number of tries is kept.

After some experimentation playing the game, it seems there is a lower bound on the average number of tries needed. For the way I have my game set up (50 possible choices), the BO needed 10 point something +/- point something tries. Furthermore, trying different so-called acquisition functions for the BO did not have much of an effect. This figure is based on thousands of trials (the central limit theorem at work). Even playing as the creator of the game, it was difficult for me to keep my average below 10 tries; of course, I can't play thousands of times.

These results imply that a human would not be able to optimize a function in three or more dimensions in fewer tries than a BO. There are simply too many variables to keep track of, and such functions are very difficult if not impossible to visualize, which is part of the point of using BOs.


Some features of the program:

  • For aesthetic reasons, the y-axis has to be fixed based on the range of values of the function. But then the visual cue of having a point close to the top edge of the plot gives the human player an unfair advantage. The solution is to add a random margin between the maximum and the edge (see the sketch after this list). So, if you happen to get a point that is close to the top edge, choose its neighbors!
  • To generate a 'random' smooth function, I take some normally distributed points, sprinkle them over the domain, and connect them with splines. It works surprisingly well!
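Here is a rough sketch of both tricks, not the actual game code; the knot count, margin sizes, and toy domain are made-up values.

    import numpy as np
    from scipy.interpolate import interp1d
    import matplotlib.pyplot as plt

    rng = np.random.default_rng()

    # 'Random' smooth function: normally distributed values at a few knots,
    # connected with a cubic spline.
    knots_x = np.linspace(0, 1, 8)
    knots_y = rng.normal(size=knots_x.size)
    f = interp1d(knots_x, knots_y, kind='cubic')

    xs = np.linspace(0, 1, 50)      # the 50 possible choices in the game
    ys = f(xs)

    fig, ax = plt.subplots()
    ax.plot(xs, ys)

    # Randomized headroom: pad the top of the y-axis by a random amount so a
    # point near the top edge doesn't give away that it is the maximum.
    span = ys.max() - ys.min()
    ax.set_ylim(ys.min() - 0.1 * span, ys.max() + rng.uniform(0.1, 0.5) * span)
    plt.show()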

Implementation notes:

The programming started in a functional style, which is what you'd expect from mathematical code and what I'm used to. However, once I got into matplotlib's GUI machinery, things got a little messy, since a GUI requires reactive, event-driven programming. Both styles exist in my program, and they intersect at while loops. You can run the game script with Python on your desktop from the command line.
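As a toy illustration of that intersection (not the game code): matplotlib delivers clicks through event callbacks, while a plain while loop drives the 'mathematical' side.

    import matplotlib.pyplot as plt

    clicks = []

    def on_click(event):
        if event.xdata is not None:             # ignore clicks outside the axes
            clicks.append((event.xdata, event.ydata))

    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    fig.canvas.mpl_connect('button_press_event', on_click)
    plt.show(block=False)

    # The functional/game side: wait for a handful of clicks, then score them.
    while len(clicks) < 5:
        plt.pause(0.1)                          # lets the GUI event loop run

    print("You picked:", clicks)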

Now, the work involved in putting the game on the web, so that you, the reader, can easily engage in it, deserves its own post or two! Once I had the 'desktop' application running, I swam through oceans of technology until I got it to the form presented here. I should mention here that I came across a program similar to mine on Wolfram|Alpha but I can't find it anymore.

So here is the game served on the web. It's a Docker container managed for free, for now, by tutum.co but hosted almost for free on AWS for a few more months as of the date of this post. If you can spare 1GB of HD space and 100MB of memory to host my program, let me know!



Monday, January 19, 2015

Learn About Gaussian Processes

Gaussian processes (GP's) are awesome! Not only can you approximate functions with data, but you also get confidence intervals! (Gaussian processes are an example of why I think some statistics knowledge is fundamental to data science.)

I was motivated to learn about GP's because I wanted to learn about Bayesian optimization because I wanted to optimize neural network hyperparameters because I wanted to reduce training time and find the best network configuration for my data.

However, despite being mostly trained in science and engineering, it took me a while to get an understanding of GP's, partly because of my not-so-rigorous statistics training. Add to that that I'm used to functions being described by analytical formulas...not by some statistical distribution. So the main point of this post is to share what I did to learn at least the basics of GP's, in the hope that it helps others.

I went through the following materials to learn the basics of GP.

Main material:
You might think that diving right into (textual) documents introducing GP's would help you understand GP's more efficiently. But I found that starting with video lectures was a more productive use of my time (in the beginning). "mathematicalmonk"'s and Nando de Freitas's videos are top video search results. While they are great videos to learn from, I found David MacKay's video better because he makes effective use of computers to work through the math and gives better insights!

Support material:
- linear algebra operations and numerical considerations
- statistics: REALLY make sure you understand the difference between marginal, conditional, and joint distributions in a more abstract sense. Then you should be able to follow a derivation of the conditional distribution of a multivariate Gaussian distribution.

Exercise: IPython errr Jupyter Notebook:
I took a script that calculates GP's from Nando de Freitas's machine learning course and made it into a document attempting to explain the details of the code. Please comment if you spot shortcomings in my understanding of the code.
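For readers who just want the core computation, here is a minimal GP-regression sketch in NumPy (not the notebook's code). It relies on the conditional-Gaussian formula mentioned in the support material above; the RBF kernel, length scale, and toy training points are arbitrary choices.

    import numpy as np

    def rbf(a, b, length_scale=1.0):
        d = a.reshape(-1, 1) - b.reshape(1, -1)
        return np.exp(-0.5 * (d / length_scale) ** 2)

    def gp_predict(x_train, y_train, x_test, noise=1e-6):
        # Conditional of a joint Gaussian:
        #   mean = K_s^T K^-1 y_train
        #   cov  = K_ss - K_s^T K^-1 K_s
        K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
        K_s = rbf(x_train, x_test)
        K_ss = rbf(x_test, x_test)
        mean = K_s.T @ np.linalg.solve(K, y_train)
        cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
        std = np.sqrt(np.clip(np.diag(cov), 0, None))
        return mean, std                        # std gives the confidence intervals

    x_train = np.array([-4.0, -1.0, 0.5, 2.0])
    y_train = np.sin(x_train)
    x_test = np.linspace(-5, 5, 100)
    mean, std = gp_predict(x_train, y_train, x_test)
    print(mean[:5], std[:5])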

Finally, a remark about NumPy (and other matrix-oriented code) after going through this code: I love NumPy, but trying to decipher NumPy code that is anything but straightforward matrix operations can be stressful! Writing more verbose but simpler code makes it easier to follow, and when speed becomes an issue for that simpler code, you might want to reach for Julia or Numba...but that's a separate discussion!

Thursday, October 30, 2014

Proprietary Mathematical Programming at STRATA Hadoop World '14

I had previously commented on the lack of analytics-focused companies at STRATA. Now, to my surprise, two of the three big "M's" in proprietary computer algebra systems made a showing: Mathematica and MATLAB (the third being Maple). With the strength of the open-source software movement in "big data" and technical computing, I wondered what value they had to offer.

After some discussions with them, the most important thing they offer is a consistent and tested user experience with high-quality documentation and technical support. This is something open source software needs to work on. I alluded to this in a post almost two years ago, and it is still applicable today; I don't see any major change in the user experience of open source software (hope it works on Windows, hope it compiles, hope the docs are good, hope I can email the author of the code, hope this package at this version number works with that package at that version number, hope it runs fast...etc.). I guess it's just a natural consequence of open source software: open source is flexible, cutting-edge, and cool, but not necessarily user-friendly.

On a related note, on the hardware side, Cray, a brand name in scientific computing, is leveraging its high-performance computing experience for "big data". This is in contrast to computing on commodity hardware, which is what Hadoop was intended for.

This is all in the general trend of technical computing (historically driven by scientific applications) merging with the "big data" world.

Friday, August 29, 2014

For data analysis teamwork: A trick to combine source control with data sharing

For analytical teamwork, we have, primarily, two things to manage to enable accountability and reproducibility: data and code. The fundamental question to answer is which (version of) source and data led to which results.

The problem of managing code is pretty much solved (or at least there are many people working on the problem); see this thing called git, for example. Managing data needs more work, but it's being worked on; dat is the only tool I know of that provides some kind of versioning of data akin to git. Keep in mind that some databases have snapshotting or timestamping capability, but in the view of dat, a database is just a storage backend, since you would do the versioning with dat.

Suppose you're not worried about versioning data; perhaps you've got your own way of versioning data, or the data you're working with is supposed to be fixed and its appropriate store is a filesystem. There are various systems to share the data files, but the data would be living in a separate world from the code. It's possible to store data in a source control system, but that is generally an inefficient way to deal with data, especially if it's large.

Now wouldn't it be nice to also have source controlled text files such as metadata or documentation interleaved with the data files? To accomplish this, two things need to happen:
  • The data needs to be in the scope of the (source) version control system, but the version control system must NOT manage the data files,
  • and the file sharing system needs to NOT share the source controlled files.

Let's see how to implement this with git for source control and BitTorrent Sync (BTSync) for file sharing. Your project directory structure will be like:
  • project_folder
    • .gitignore (root) (git)
    • src_dir1
    • src_dir2
    • data_store
      • .SyncIgnore (BTSync)
      • .gitignore (child) (git)
      • data_dir
        • readme.txt
        • hugefile.dat
      • data_dir2


Let's explain this setup by examining the *ignore files. By excluding certain files, the *ignore files achieve the two exclusions required.
  • .gitignore (root) is typical and left unchanged.
  • .SyncIgnore tells BTSync not to synchronize, for example, readme files that live in the data directory, since they are source controlled. (Not essential to the setup, but it might be preferable to only sync "input" data files and not files that are generated.)
  • .gitignore (child) works with .SyncIgnore. We tell git not to manage files in the data directory as a general rule, except for the source controlled files that we listed in .SyncIgnore. We also need to tell git not to manage BTSync's special files and folders. (Not essential to the setup, but we can take things further by source controlling .SyncIgnore.)
I'm sure there are various ways of dealing with the *ignore files, but what I've shown here should apply to many cases; a hypothetical example of the two files follows.
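For concreteness, here is a hypothetical example of what the two files might contain for the layout above. The exact patterns will depend on your project and your BTSync version, so treat this as a sketch rather than a recipe.

    data_store/.SyncIgnore (don't sync what git manages; optionally, also skip generated files):

        .gitignore
        readme.txt
        */readme.txt

    data_store/.gitignore (ignore everything in the data store except directories and the source
    controlled text files, and keep BTSync's bookkeeping files out of git):

        *
        !*/
        !readme.txt
        !.gitignore
        .Sync*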

With this setup, your team achieves better synchronization of project files! Please let me know if I need to explain the system better.

Thursday, July 24, 2014

Compiling Hadoop from source sucks!

As I've discovered, Hadoop is not easy to set up, let alone compile, properly. For some reason the Apache Hadoop distribution doesn't include native libraries for 64-bit Linux. Furthermore, the included 32-bit native library does not include the Snappy compression algorithm. If Hadoop does not find these libraries in native form, it falls back to, I guess, slow or slower Java implementations.

So, being the execution speed demon that I am, I went ahead and compiled Hadoop 2.4.1 from source on 64-bit Linux. It was a rough ride!

I generally followed this guide, but preferred to download and compile Snappy from source and to yum install java-1.7.0-openjdk-devel(1). I used RHEL7 64-bit(2).

After getting the prerequisites, the magic command to compile is:
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true (3)

You'll find the binaries in <your hadoop src>/hadoop-dist/target. I checked access to the native libraries by issuing hadoop checknative after exporting the appropriate environment variables (such as in .bashrc; refer to a Hadoop setup guide).


Non-obvious solutions to difficulties:
(1). yum install java-1.7.0-openjdk won't do! Yeah, openJD(evelopment)K-devel makes sense!
(2). Gave up on Ubuntu
(3). -Dcompile.native=true was responsible for including the native calls to snappy in libhadoop.so. I did not see this in any guide on building Hadoop! Also, my compile process ran out of memory making javadocs, so I skipped it with -Dmaven.javadoc.skip=true

On a personal note, I really got frustrated with trying different things out but I had a sense of satisfaction in the end. It took me 4 days to figure out the issues and I know a thing or two about Linux!
xkcd