Majid's Research: viz

Showing posts with label viz. Show all posts

Sunday, December 2, 2018

Plot entropy can be used to automatically select scatter plot transparency

Problem: As a data scientist, you make scatter plots to assess visually a distribution of points. However, many times the points are too dense which can give you a wrong impression about the point distribution. So, you might adjust the transparency setting (usually called alpha) a few times. Now, let's say you change the dataset. Again, repeat adjusting the transparency.

Solution: Optimize the plot image entropy because it quantifies the 'variety' of color in it.

Why?

When you adjust the transparency, you are eyeballing a measure of image color 'variety'. I went for an information-theortic measure of 'variety'/'dispersion' as opposed to a statistical one (like standard deviation). What I like about information theory is that I can be less mindful of specific statistical distributions and models.

The magic line to calculate color image entropy with scipy/numpy is:

entropy(histogramdd(img, bins=[arange(256)]*3)[0].flatten(), base=2)

where img is shape (width*height, 3) with 8-bit resolution for each color.

The below figures show a monochromatic and color version of a data set scatter plot with alpha (a) and entropy (s) annotated at the top right. The alpha was varied which results in different entropy values. The red annotation signifies the highest entropy.

I think the utility and quantification is easier to see in the monochromatic version perhaps due to it being less 'busy'. But the numbers don't lie!

I'd like to package this somehow if there is interest.

Sunday, June 11, 2017

Holoviews Allows for Rapid Data Exploration by Structuring Data

Problem:
Data analysis and visualization are related. You have to set up a new visualization that makes sense every time you want to explore a set of variables. Furthermore, you have to deal with different data formats and plotting libraries.
Solution:
Use HoloViews. It forces you to organize your data in such a way that you can automatically visualize it.

Context:

Making a choice about which plotting package to use in Python used to be simple. There was just matplotlib pretty much. Nowadays, there are a plethora of visualization packages for Python. Jake VanderPlas recently gave an excellent talk at PyCon 2017 that highlights this which is summarized in the below diagram from the talk.

While choice is good, oftentimes I don't need so much choice when it comes to plotting itself as much as I need to be fancy with the process of setting up the plot in such a way that what I'm looking for is revealed in the plot.

This is where tools like HoloViews and Vega/Altair come in. Vega is a plot+data description language. But HoloViews goes a step further in abstraction: you specify relationships in your data and it does the hard work of presenting it in the best way with great flexibility; It's data-oriented, not plot-oriented.

I find explaining Holoviews difficult because most everyone is used to an imperative style of visualization whereas HoloViews can be considered declarative. Once you get past that, you have to distiguish HoloViews from Grammar-of-Graphics-inspired plotting packages like ggplot2 which is declarative as well. The difference is that ggplot declares plots while HoloViews declares data. In fact, to 'get' how HoloViews addresses the data analysis/visualization problem, I had to read the proceeding for it. One unique aspect of HoloViews that can help understand it is that it enforces a separation of data, plot rendering, plot type (given by an appropriate 'view' of the data), and plot style.

This approach to the problem translates into the following advantages:

rapid exploration of data (no, pandas doesn't cut it)
export of the HoloViews object as a self-contained html file with interactive plots
map to data in its original form like numpy, pandas, lists, as well as Blaze which itself accesses a variety of data sources including databases
choose your rendering backend: matplotlib, bokeh, plotly
memory-constrained analysis using DynamicMap
instantly switch between using interactive controls like sliders to explore variables and more static displays like an array of plots for each variable value

Show me one tool that does all that! Python vs R jab: It's been said for years that Python lags behind R in visualization. I would say parity has been achieved once Python got some ggplot-like tools. But with holoviews, I think the better language for visualization has now tipped in favor of Python!

Thursday, February 19, 2015

Bayesian Optimization Demo Game: Can you beat the 'computer'?

...an interactive matplotlib (GTK3) app served on the web.
app (not working properly on Chrome): [http://bayeisan-optimization.25800bfd.svc.dockerapp.io:8000/bo]
[https://github.com/majidaldo/GTK3webserver]

I recently made a notebook about Gaussian processes (GP). Now, Gaussian processes are used by Bayesian optimization. In Bayesian optimization, the goal is to optimize a function with as few (probably expensive) function evaluations as possible. Here's a good tutorial.

To see how it works, I implemented a basic optimizer following the mathematics introduced in the referenced tutorial (which wasn't as difficult as the math behind the GP!). I began from the code for the GP, to see how a Bayesian optimizer (BO) would optimize a toy 1-D problem.

But then, I wondered how the performance of a human would compare to the BO. So I made a game! In this game, the goal is to, of course, find the maximum of the function is as few tries as possible. Both players in each trial will attempt to find the maximum. A running average of the number of tries is kept.

After some experimentation playing the game, it seems there is a lower bound for the average number of tries needed. For the way I have my game set up (50 possible choices), the BO needed 10 point something +/- point something tries. Furthermore, trying different so-called acquisition functions for the BO did not have much of an effect. This figure is based on thousands of trials (central limit theorem at work). And playing as the creator of my game, it was difficult for me to keep below 10 tries as an average. Of course, I can't play thousands of times.

These results imply that a human would not be able to optimize a function in three dimensions or more in less tries than a BO. There are simply too many variables to keep track of and it is very difficult if not impossible to visualize which is part of the point of using BOs.

Some features of the program:

For aesthetic reasons, the y-axis has to be fixed based on the range of values of the function. But then the visual clue of having a point close to the top edge of the plot gives the human player an unfair advantage. The solution is to add some random margin between the maximum and the edge. So, if you happen to get a point that is close to the top edge, choose its neighbors!
To generate a 'random' smooth function, I get some normally distributed points and sprinkle them on the domain. Then I use splines to connect them. It works surprisingly well!

Implementation notes:

The programming started in a functional style which is what you'd expect out of mathematical code and it's what I'm used to. However, once I got into matplotlib's GUI stuff, things started to get a little messy as a GUI requires reactive event-driven programming. Both styles of programming exist in my program and they intersect at while loops. You can run the game script with python on your desktop from the command line.

Now, the work involved in putting the game on the web, so that you, the reader, can easily engage in it, deserves its own post or two! Once I had the 'desktop' application running, I swam through oceans of technology until I got it to the form presented here. I should mention here that I came across a program similar to mine on Wolfram|Alpha but I can't find it anymore.

So here is the game served on the web. It's a Docker container managed for free, for now, by tutum.co but hosted almost for free on AWS for a few more months as of the date of this post. If you can spare 1GB of HD space and 100MB of memory to host my program, let me know!

Saturday, January 4, 2014

Extremely Efficient Visualization Makes Use of Color and Symbols

Not too long ago, I posted a 'heat map' visualization and I said it was an efficient visualization because the plot space was filled with color. But it represented only one quantity.

Well now I took that idea further and made a visualization that represents three (or four depending on how you count) quantities. The following picture represents flow around a cylinder with the following quantities:
- flow potential background color
- flow direction direction of arrows
- flow magnitude length of arrows
- pressure color of arrow
I even threw in a max and min pressure point in there too.

No need for multiple plots or a third dimension!

Thursday, September 12, 2013

Simple plots can reveal complex patterns

Visualization is a big topic on its own, which implies that you can get quite sophisticated in making plots. However, you can reveal complex information from simple plots.

I took a shot at visualizing power generation data from the Kaggle competition. My goal was just to make a "heat map" of the power generation data: for every <week, hour of week>, plot the power generated. Now, I had to rearrange the data a bit but the result was not only pretty, but more importantly, very revealing and efficient. The plot summarizes ~25k data points by revealing cycles over days and months over several years.

Enjoy. Courtesy of pandas for data munging and matplotlib for the plot.

Majid's Research