
Sunday, December 2, 2018

Plot entropy can be used to automatically select scatter plot transparency

Problem: As a data scientist, you make scatter plots to visually assess a distribution of points. Often, though, the points are so dense that the plot gives a misleading impression of the distribution, so you adjust the transparency setting (usually called alpha) a few times until it looks right. Then you change the dataset, and you have to adjust the transparency all over again.
Solution: Optimize the entropy of the plot image, since entropy quantifies the 'variety' of color in it.

Why?

When you adjust the transparency, you are eyeballing a measure of image color 'variety'. I went for an information-theoretic measure of 'variety'/'dispersion' as opposed to a statistical one (like standard deviation). What I like about information theory is that I can be less mindful of specific statistical distributions and models.

The magic line to calculate color image entropy with scipy/numpy is:

    from numpy import histogramdd, arange
    from scipy.stats import entropy

    entropy(histogramdd(img, bins=[arange(257)]*3)[0].flatten(), base=2)

where img has shape (width*height, 3) with 8-bit resolution for each color. (Note the 257 bin edges: they give 256 bins per channel, so the value 255 gets its own bin.)
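For concreteness, here is a minimal sketch of the whole procedure (not the packaged code hinted at below; the function names and the alpha grid are my own choices): render the scatter plot at several alpha values, score each rendering by its color entropy, and keep the alpha with the highest score.

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so the pixel buffer is available
    import matplotlib.pyplot as plt
    from scipy.stats import entropy

    def image_entropy(img):
        # img: (n_pixels, 3) array of 8-bit RGB values
        counts = np.histogramdd(img, bins=[np.arange(257)] * 3)[0]
        return entropy(counts.flatten(), base=2)

    def best_alpha(x, y, alphas=np.linspace(0.05, 1.0, 20)):
        scores = []
        for a in alphas:
            fig, ax = plt.subplots()
            ax.scatter(x, y, alpha=a)
            fig.canvas.draw()
            rgba = np.asarray(fig.canvas.buffer_rgba())  # (height, width, 4) uint8
            plt.close(fig)
            scores.append(image_entropy(rgba[..., :3].reshape(-1, 3)))
        return alphas[int(np.argmax(scores))]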

The figures below show a monochromatic and a color version of a scatter plot of one data set, with the alpha (a) and entropy (s) annotated at the top right. Varying the alpha results in different entropy values. The red annotation marks the highest entropy.

I think the utility and the quantification are easier to see in the monochromatic version, perhaps because it is less 'busy'. But the numbers don't lie!

I'd like to package this somehow if there is interest.

Tuesday, October 30, 2018

RAPIDS for data science signals potential maturation of (big) data science computing

The recent RAPIDS announcement by Nvidia was portrayed as 'data science on GPUs'. In my opinion, it's about the convergence of several trends in data science tools and computing that have been developing over at least 5 years. This convergence naturally materializes as 'data science on GPUs'. Nvidia pounced on the opportunity!

These trends address a dream that I have as a data scientist: I'd like to use pandas and sklearn without having to think about whether the data fits on one machine or not and to use GPUs if available. I also would like to use SQL without having to think about whether the system that I'm executing on is a database. In other words, I'd like to use my preferred programming language and associated libraries without regard to the system that executes the program.

What does this have to do with GPUs? You don't need GPUs to have such a system (*cough* Spark). But it seems that Nvidia, after it made deep learning practically possible, realized there was much more it could accelerate upstream in the data science pipeline, and that in doing so it could help achieve this ideal user scenario.

So, to get an up-close look at RAPIDS, I recently went to GTC DC. Pondering what I learned at the conference, I've realized that RAPIDS has a place in the following trends, which all lead to 'rapid' data science iteration.

> The distinctions between analytical databases and data processing systems are blurring.


Spark can do SQL. Spark can do dataframe operations. Some pandas operations resemble SQL operations and vice versa; a toy example of the overlap appears below. Functionally, the only thing a data processing framework needs in order to be considered an analytical database is data management. The best (only?) effort I've seen so far that bridges data processing and databases from the programmer's perspective is Ibis. On the execution side, executing SQL on GPU dataframes (a component of RAPIDS) is already here.
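To illustrate that overlap (the data here is made up), the same aggregation expressed both as a dataframe operation and as SQL:

    import pandas as pd

    df = pd.DataFrame({"city": ["a", "a", "b"], "sales": [1, 2, 3]})
    # dataframe style
    totals = df.groupby("city")["sales"].sum()
    # equivalent SQL: SELECT city, SUM(sales) FROM df GROUP BY city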


> High-performance computing meets data science.


HPC software and hardware address distributed computing and accelerated computing on GPUs as well as on CPUs. So what's the problem? Traditionally, HPC was less concerned with performing operations over massive amounts of data that would have to flow from disk through some I/O bottleneck. There was a mismatch between HPC hardware & software and data science workloads.

Nonetheless, today, 'deep learning' is a data science application that makes use of HPC capabilities. On the hardware side, the HPC community is more committed to building facilities that can support a variety of workloads including machine learning and 'big data analytics'.

> Data scientist tools are being pushed upstream the data science pipeline.


'Big data analytics', in the context of data science, refers to the steps before model training, which often involve relatively simple operations on large amounts of data (that don't fit on one machine), such as joining, selection, filtering, cleaning, and perhaps feature generation. These tasks are usually the responsibility of databases and 'big data' processing systems like Spark.

Unfortunately for data scientists, databases have rigid interfaces and are not easily programmable (SQL is not a programming language, ok?!). Spark offers decent programmability but runs on the Java VM, which is quite foreign to numerical programmers used to R, Python, C, Fortran, and now Julia.

But that didn't stop data science engineers from pushing their favorite wares into the 'big data' realm.

Currently, the impressive Dask library is pretty much the go-to tool for easy Python-based distributed computing (and it readily integrates with compiled code for execution on GPUs or CPUs). More recently, however, Ray has emerged as another library for distributed computation which, in my opinion, offers potentially better integration with RAPIDS than Dask does (but that's another subject).
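As a sketch of the 'distributed pandas' idea with Dask (the file pattern and column names are hypothetical): the dataframe is partitioned across workers and evaluated lazily, behind a pandas-like interface.

    import dask.dataframe as dd

    df = dd.read_csv("events-*.csv")  # partitioned; the data may be larger than RAM
    result = df.groupby("key")["value"].mean().compute()  # lazy until .compute()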

Another interesting work, if all you care about is Tensorflow, is Tensorflow Transform which is a framework to fully integrate the data science pipeline covering both training and serving situations in one swoop.

> Data science libraries are decoupling their interface from their execution.


It should be easy to argue that numpy, pandas, and sklearn have been successful. Unfortunately, the use of these tools is generally tied to a single CPU on a single machine. Nonetheless, because of their success, their interfaces have become models to emulate for distributed data science (ok, pandas not so much :/ ). For example, Ray and Dask have 'distributed pandas', and Nvidia's ML algorithms copy the sklearn interface. As another example, deep learning frameworks like keras and Tensorflow are just thin interfaces that talk to an execution engine.
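From the user's side, the decoupling looks something like this (a sketch; the cuml import path comes from RAPIDS and should be treated as illustrative): the estimator interface stays the same while the execution backend changes.

    import numpy as np
    from sklearn.cluster import KMeans      # CPU execution
    # from cuml.cluster import KMeans       # GPU execution behind the same interface

    X = np.random.randn(1000, 2)
    labels = KMeans(n_clusters=8).fit_predict(X)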

As a side benefit, this decoupling should allow one to use their favorite programming language to interact with these compute systems.

> Machine learning is becoming more automated.


You can imagine that, if you structure and parameterize your machine learning pipeline, model selection becomes an optimization problem that benefits greatly from being able to execute many configurations of the pipeline in order to find the best model. See this excellent survey on automated machine learning.
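A small sketch of that framing with sklearn (the pipeline and parameter grid are arbitrary choices): parameterize the pipeline, then let a search routine execute its configurations.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200)
    pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
    grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]}
    search = GridSearchCV(pipe, grid, cv=5).fit(X, y)  # each configuration is evaluated
    print(search.best_params_)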

> Distributed machine learning workflows are being developed.


These days, you can request a compute cluster from, say, Amazon and be productive within about 10 minutes. Acquiring and deploying software on distributed systems is easy now compared to just five years ago.

Kubernetes has been important in allowing developers to focus on building applications instead of managing a distributed system. Users shouldn't have to deal with Kubernetes directly; instead, they can be given a 'handle' to the cluster that details the resources Kubernetes provides.

As a result, Kubeflow was developed as a solution to manage machine learning deployment on Kubernetes. Also, Pachyderm manages data science workflows on Kubernetes.

Conclusions


While exciting, RAPIDS isn't mature yet. At the moment, RAPIDS is a set of libraries that need to be tied together into a user-friendly experience. This is more challenging than typical software development, as developers have to consider (1) distributed computing and (2) GPU computing in addition to (3) machine learning algorithms. I hope this added complexity doesn't reduce the number of possible contributors.

Personally, I'd like to see more top-down efforts toward the data scientist's dream of converged, distributed, and accelerated data science, by defining some interface(s) for performing data science (sklearn has emerged as an excellent model!)*. What I'm seeing so far is more bottom-up: Nvidia implementing some algorithms, Anaconda providing Dask for distributed computation, and Arrow providing the data structures. As a user, I shouldn't have to make software choices for different hardware scenarios: single CPU, multiple CPUs on one machine, CPUs on multiple machines, single GPU, multiple GPUs on one machine, GPUs on multiple machines, and even heterogeneous hardware situations (Is this asking for too much?!).

The usability of RAPIDS is critical to its success. RAPIDS is supposed to enable 'rapid' iteration, and one can only iterate rapidly if the RAPIDS workflow is as easy to use as what has been developed for 'traditional' single-CPU workflows, keeping in mind that CPUs might be better for some cases.

But even with an ideal data scientist user experience, GPU databases will still have their place as databases. However, I expect that the GPU databases will accommodate RAPIDS workflows by at least providing low-resistance data interchange with RAPIDS components via GPU dataframes.

---
* Maybe a shim can be made between the user and sklearn that intercepts calls and dispatches them to distributed, possibly GPU-equipped, systems if such a system is available and implements the requested call.
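A toy version of that shim idea (the backend detection and import paths are assumptions, not a real package):

    def KMeans(**params):
        # Dispatch to a GPU-equipped backend when one is available;
        # otherwise fall back to plain single-machine sklearn.
        try:
            from cuml.cluster import KMeans as impl   # hypothetical GPU path
        except ImportError:
            from sklearn.cluster import KMeans as impl
        return impl(**params)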

Thursday, February 19, 2015

Bayesian Optimization Demo Game: Can you beat the 'computer'?

...an interactive matplotlib (GTK3) app served on the web.
app (not working properly on Chrome): [http://bayeisan-optimization.25800bfd.svc.dockerapp.io:8000/bo]
[https://github.com/majidaldo/GTK3webserver]


I recently made a notebook about Gaussian processes (GP). Gaussian processes are used in Bayesian optimization, where the goal is to optimize a function with as few (probably expensive) function evaluations as possible. Here's a good tutorial.

To see how it works, I implemented a basic optimizer following the mathematics introduced in the referenced tutorial (which wasn't as difficult as the math behind the GP!). I started from my GP code to see how a Bayesian optimizer (BO) would optimize a toy 1-D problem.
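For a flavor of what such a basic optimizer looks like, here is a minimal sketch (not the game code; the RBF kernel, its length scale, and the upper-confidence-bound acquisition are my choices) over a discrete set of candidate points:

    import numpy as np

    def rbf(a, b, ell=5.0):
        # squared-exponential kernel between two 1-D arrays of points
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

    def gp_posterior(X, y, Xs, noise=1e-6):
        # zero-mean GP regression: posterior mean and std dev at points Xs
        K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
        Ks = rbf(X, Xs)
        mu = Ks.T @ K_inv @ y
        var = np.diag(rbf(Xs, Xs)) - np.sum((K_inv @ Ks) * Ks, axis=0)
        return mu, np.sqrt(np.maximum(var, 0.0))

    def bayes_opt(f, choices, n_tries=10, kappa=2.0):
        # choices: 1-D array of candidate x values (e.g., a 50-point grid)
        idx = [np.random.randint(len(choices))]
        ys = [f(choices[idx[0]])]
        for _ in range(n_tries - 1):
            mu, sd = gp_posterior(choices[idx], np.array(ys), choices)
            ucb = mu + kappa * sd   # upper-confidence-bound acquisition
            ucb[idx] = -np.inf      # don't revisit evaluated points
            idx.append(int(np.argmax(ucb)))
            ys.append(f(choices[idx[-1]]))
        best = int(np.argmax(ys))
        return choices[idx[best]], ys[best]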

But then I wondered how a human's performance would compare to the BO's. So I made a game! The goal, of course, is to find the maximum of the function in as few tries as possible. In each trial, both players attempt to find the maximum, and a running average of the number of tries is kept.

After some experimentation playing the game, it seems there is a lower bound on the average number of tries needed. For the way I have my game set up (50 possible choices), the BO needed 10 point something +/- point something tries. Furthermore, trying different so-called acquisition functions for the BO did not have much of an effect. This figure is based on thousands of trials (the central limit theorem at work). Playing as the creator of my own game, it was difficult for me to keep my average below 10 tries. And of course, I can't play thousands of times.

These results imply that a human would not be able to optimize a function in three or more dimensions in fewer tries than a BO. There are simply too many variables to keep track of, and such functions are very difficult if not impossible to visualize, which is part of the point of using BOs.


Some features of the program:

  • For aesthetic reasons, the y-axis has to be fixed based on the range of values of the function. But then the visual clue of having a point close to the top edge of the plot gives the human player an unfair advantage. The solution is to add a random margin between the maximum and the edge. So, if you happen to get a point that is close to the top edge, choose its neighbors!
  • To generate a 'random' smooth function, I take some normally distributed points, sprinkle them over the domain, and connect them with splines (see the sketch after this list). It works surprisingly well!
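A sketch of that smooth-function trick (the point counts are arbitrary):

    import numpy as np
    from scipy.interpolate import interp1d

    knots_x = np.linspace(0, 1, 12)               # where the sprinkled points sit
    knots_y = np.random.randn(12)                 # normally distributed values
    f = interp1d(knots_x, knots_y, kind="cubic")  # a smooth spline through them
    xs = np.linspace(0, 1, 50)                    # e.g., the game's 50 possible choices
    ys = f(xs)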

Implementation notes:

The programming started in a functional style, which is what you'd expect of mathematical code and is what I'm used to. However, once I got into matplotlib's GUI machinery, things started to get a little messy, as a GUI requires reactive, event-driven programming. Both styles exist in my program, and they intersect at while loops. You can run the game script with python on your desktop from the command line.

Now, the work involved in putting the game on the web, so that you, the reader, can easily play it, deserves its own post or two! Once I had the 'desktop' application running, I swam through oceans of technology until I got it into the form presented here. I should mention that I came across a program similar to mine on Wolfram|Alpha, but I can't find it anymore.

So here is the game served on the web. It's a Docker container managed for free (for now) by tutum.co but hosted almost for free on AWS for a few more months as of the date of this post. If you can spare 1GB of HD space and 100MB of memory to host my program, let me know!



Monday, January 19, 2015

Learn About Gaussian Processes

[http://nbviewer.ipython.org/github/majidaldo/demos/blob/master/gp.ipynb]

Gaussian processes (GP's) are awesome! Not only can you approximate functions with data, but you also get confidence intervals! (Gaussian processes are an example of why I think some statistics knowledge is fundamental to data science.)

I was motivated to learn about GP's because I wanted to learn about Bayesian optimization, because I wanted to optimize neural network hyperparameters, because I wanted to reduce training time and find the best network configuration for my data.

However, despite my training being mostly in science and engineering, it took me a while to get an understanding of GP's, partly because of my not-so-rigorous statistics training. Add to that that I'm used to functions being described by analytical formulas...not by some statistical distribution. So the main point of this post is to share what I did to learn at least the basics of GP's, in the hope that it helps others.

I went through the following materials to learn the basics of GP.

Main material:
You might think that diving right into (textual) documents introducing GP's would help you understand them most efficiently. But I found that starting with video lectures was a more productive use of my time (in the beginning). "mathematicalmonk"'s and Nando de Freitas's videos are top video search results. While they are great videos to learn from, I found David MacKay's video better because he makes effective use of computers to teach the math and gives better insights!

Support material:
- linear algebra operations and numerical considerations
- statistics: REALLY make sure you understand the difference between marginal, conditional, and joint distributions in an abstract sense. Then you should be able to follow a derivation of the conditional distribution of a multivariate Gaussian (the identity is written out below).
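For reference, the standard result that derivation arrives at (stated in LaTeX; this is textbook material, not from the notebook): for a jointly Gaussian vector partitioned into $x_1$ and $x_2$,

    $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)
    \implies
    x_1 \mid x_2 \sim \mathcal{N}\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$

In GP regression, $x_1$ plays the role of the unknown function values and $x_2$ the observed ones.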

Exercise: IPython errr Jupyter Notebook:
I took a script that calculates GP's from Nando de Freitas's machine learning course and made it into a document attempting to explain the details of the code. Please comment on any shortcomings in my understanding of the code.

Finally, a remark about NumPy (as well as other matrix-oriented code) after going through this code: I love NumPy, but trying to decipher NumPy code that is anything but straightforward matrix operations can be stressful! Writing perhaps more verbose but simpler code makes it easier to follow (see the example below). And when speed becomes an issue for the simpler code, you might want to reach for Julia or Numba...but that's a separate discussion!
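As a contrived illustration of the point (the computation, pairwise squared distances, is my example, not from the notebook):

    import numpy as np

    X = np.random.randn(5, 3)
    # compact but cryptic
    d1 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # verbose but easy to follow
    d2 = np.zeros((len(X), len(X)))
    for i in range(len(X)):
        for j in range(len(X)):
            d2[i, j] = ((X[i] - X[j]) ** 2).sum()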

Thursday, February 14, 2013

Mathematica vs. Maple vs. SAGE/scipy UX

This post comes from significant experience with Maple and numpy/scipy/python.

I've been using Maple and numpy for a few years. I stopped upgrading Maple at version 13 because the improvements didn't compel me to pay for a newer version for my use. I chose Maple as a happy medium between the numerically and matrix-focused MATLAB and the symbolically focused Mathematica. They can all do symbolics and matrix operations, but they differ in how elegantly these are performed. But as I gave Maple more complex mathematical tasks, such as the calculus involved in Fourier transforms and symbolic tensor & matrix operations, it became clear that it wasn't up to the task. I knew Mathematica was superior for those tasks, but I chugged along with Maple until I got to George Mason University, which has a site license for Mathematica.

Right off the bat, I could see a clear superiority in the user experience (mind that I didn't use Maple beyond version 13). In Mathematica, vectors and matrices are just lists. In Maple, there are separate objects called vectors, lists, tables, and matrices, and you sometimes have to import them from a module. In just this area, Mathematica is clearly superior. Mathematica also includes a lot of functionality out of the box; in Maple, I had to import modules for what I consider basic functionality, which makes the experience less consistent.

In comparison to other math environments:

  • SAGE: SAGE uses a web interface, which can't come close to a desktop application's functionality; the UI does not approach Mathematica's. Also, the open-source symbolic packages it uses are primitive compared to Maple, let alone Mathematica. Oh, and good luck installing it on Windows.
  • numpy/scipy: Numpy and scipy provide mathematical functionality, and fairly basic functionality at that, as opposed to being a math environment. But numpy's indexing coolness does exist in Mathematica too.
  • Mathematica has pattern matching...the others don't. Enough said.


For math-centric programming, Mathematica sets a high bar in terms of consistency, documentation, functionality, and general experience. My general gripe about open source is the lack of an umbrella vision that moves a project in a certain direction. SAGE is the best (the only?) hope, but it has a long way to go. Even in the best scenario, getting different python packages to work together seamlessly is doubtful (e.g., look at my sym2num function).

Conclusion: If all you need is basic functionality, Octave, SAGE, scipy, sympy, and maxima could suit you. For more advanced tasks, be prepared to pay up.

PS: This post will change as I learn more.
Support for my opinion.