
Sunday, December 2, 2018

Plot entropy can be used to automatically select scatter plot transparency

Problem: As a data scientist, you make scatter plots to visually assess a distribution of points. However, the points are often so dense that the plot gives you a wrong impression of the distribution. So you adjust the transparency setting (usually called alpha) a few times until it looks right. Then you change the dataset and have to repeat the adjustment all over again.
Solution: Optimize the plot image entropy because it quantifies the 'variety' of color in it.

Why?

When you adjust the transparency, you are eyeballing a measure of image color 'variety'. I went for an information-theoretic measure of 'variety'/'dispersion' as opposed to a statistical one (like standard deviation). What I like about information theory is that I can be less mindful of specific statistical distributions and models.

The magic line to calculate color image entropy with scipy/numpy is:
entropy(histogramdd(img, bins=[arange(256)]*3)[0].flatten(), base=2)
where img has shape (width*height, 3) with 8-bit resolution for each color channel.
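
Here's a minimal sketch of how the whole loop might look (assuming matplotlib's Agg backend and scipy; the helper names, the alpha grid, and the synthetic data are mine, not from the post, and I use 257 bin edges so that each 8-bit value gets its own bin):

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so we can grab the pixel buffer
    import matplotlib.pyplot as plt
    from scipy.stats import entropy

    def image_entropy(img):
        """Entropy (bits) of an (N, 3) array of 8-bit RGB pixels."""
        hist, _ = np.histogramdd(img, bins=[np.arange(257)] * 3)
        return entropy(hist.flatten(), base=2)

    def scatter_entropy(x, y, alpha):
        """Render a scatter plot at the given alpha and return its image entropy."""
        fig, ax = plt.subplots()
        ax.scatter(x, y, alpha=alpha)
        fig.canvas.draw()
        # Grab the rendered RGBA buffer and keep only the RGB channels
        img = np.asarray(fig.canvas.buffer_rgba())[..., :3].reshape(-1, 3)
        plt.close(fig)
        return image_entropy(img)

    # Pick the alpha with the highest plot entropy
    x, y = np.random.randn(2, 20000)
    alphas = np.linspace(0.01, 1.0, 20)
    best_alpha = max(alphas, key=lambda a: scatter_entropy(x, y, a))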

The figures below show a monochromatic and a color version of a scatter plot of one data set, with the alpha (a) and entropy (s) annotated at the top right. Varying the alpha produces different entropy values; the red annotation marks the highest entropy.

I think the utility and the quantification are easier to see in the monochromatic version, perhaps because it is less 'busy'. But the numbers don't lie!

I'd like to package this somehow if there is interest.

Tuesday, October 30, 2018

RAPIDS for data science signals potential maturation of (big) data science computing

The recent RAPIDS announcement by Nvidia was portrayed as 'data science on GPUs'. In my opinion, it's really about the convergence of several trends in data science tools and computing that have been developing for at least five years. This convergence naturally materializes as 'data science on GPUs', and Nvidia pounced on the opportunity!

These trends address a dream that I have as a data scientist: I'd like to use pandas and sklearn without having to think about whether the data fits on one machine or not and to use GPUs if available. I also would like to use SQL without having to think about whether the system that I'm executing on is a database. In other words, I'd like to use my preferred programming language and associated libraries without regard to the system that executes the program.

What does this have to do with GPUs? You don't need GPUs to have such a system (*cough* Spark). But it seems that Nvidia, after making deep learning practically possible, realized there was much more it could accelerate upstream in the data science pipeline, and that doing so would help achieve this ideal user scenario.

So, to get an up-close look at RAPIDS, I recently went to GTC DC. Pondering what I learned at the conference, I've realized that RAPIDS fits into the following trends, all of which lead to 'rapid' data science iteration.

> The distinctions between analytical databases and data processing systems are blurring.


Spark can do SQL. Spark can do dataframe operations. Some pandas operations resemble SQL operations and vice versa. Functionally, the only thing a data processing framework needs in order to be considered an analytical database is data management. The best (only?) effort I've seen so far that tries to bridge data processing and databases from a programmer's perspective is Ibis. On the execution side, executing SQL on GPU dataframes (a component of RAPIDS) is already here.
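
To make the overlap concrete, here's a toy illustration (assuming pandas; the table and column names are made up for the example):

    import pandas as pd

    orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

    # Dataframe form of a per-user total...
    totals = orders.groupby("user_id", as_index=False)["amount"].sum()

    # ...and the equivalent SQL:
    #   SELECT user_id, SUM(amount) AS amount
    #   FROM orders
    #   GROUP BY user_id;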


> High-performance computing meets data science.


HPC software and hardware address distributed computing and accelerated computing on GPUs as well as on CPUs. So what's the problem? Traditionally, HPC was less concerned with performing operations over massive amounts of data that would have to flow from disk through some I/O bus bottleneck. There was a mismatch between HPC hardware and software and data science workloads.

Nonetheless, deep learning is now a data science application that makes use of HPC capabilities. On the hardware side, the HPC community is increasingly committed to building facilities that can support a variety of workloads, including machine learning and 'big data analytics'.

> Data scientist tools are being pushed upstream the data science pipeline.


'Big data analytics', in the context of data science, refers to the steps before model training that often involve relatively simple operations on large amounts of data (data that doesn't fit on one machine): joining, selection, filtering, cleaning, and perhaps feature generation. These tasks are usually the responsibility of databases and 'big data' processing systems like Spark.

Unfortunately for data scientists, databases have rigid interfaces and are not easily programmable (SQL is not a programming language, ok?!). Spark offers decent programmability but runs on the Java VM, which is quite foreign to numerical programmers used to R, Python, C, Fortran, and now Julia.

But that didn't stop data science engineers from pushing their favorite wares into the 'big data' realm.

Currently, the impressive Dask library is pretty much the go-to tool for easy Python-based distributed computing (and it readily integrates with compiled code for execution on GPUs or CPUs). More recently, Ray has emerged as another library for distributed computation which, in my opinion, offers potentially better integration with RAPIDS than Dask (but that's another subject).

Another interesting piece of work, if all you care about is Tensorflow, is Tensorflow Transform, a framework that integrates the data science pipeline, covering both training and serving, in one swoop.

> Data science libraries are decoupling their interface from their execution.


It should be easy to argue that numpy, pandas, and sklearn have been successful. Unfortunately, the use of these tools is generally tied to a single CPU on a single machine. Nonetheless, thanks to their success, their interfaces have become models to emulate for distributed data science (ok, pandas not so much :/ ). For example, Ray and Dask have 'distributed pandas', and Nvidia's ML algorithms are copying the sklearn interface. As another example, deep learning frameworks like keras and Tensorflow are essentially thin interfaces that talk to an execution engine.

As a side benefit, this decoupling should allow one to use their favorite programming language to interact with these compute systems.
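
Here's a minimal sketch of the 'same interface, different execution' idea (assuming dask.dataframe; the tiny toy table is mine, just to make the example self-contained):

    import pandas as pd
    import dask.dataframe as dd

    # pandas: eager, in-memory, single CPU
    pdf = pd.DataFrame({"user_id": [1, 1, 2, 2], "value": [1.0, 3.0, 2.0, 4.0]})
    print(pdf.groupby("user_id")["value"].mean())

    # dask: the same calls build a lazy task graph that can run on many
    # cores or on a cluster; .compute() triggers execution
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(ddf.groupby("user_id")["value"].mean().compute())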

> Machine learning is becoming more automated.


You can imagine that, if you structure and parameterize your machine learning pipeline, model selection becomes an optimization problem that benefits greatly from being able to execute many configurations of the pipeline in order to find the best model. See this excellent survey on automated machine learning.
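
A small sketch of 'pipeline as optimization problem' (assuming scikit-learn; the dataset and parameter grid are illustrative, not from the survey):

    from sklearn.datasets import load_breast_cancer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # The parameterized pipeline...
    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])

    # ...and a search over its configurations; each candidate is an
    # independent fit, so this is exactly the kind of work that parallelizes
    search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)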

> Distributed machine learning workflows are being developed.


These days you can request a compute cluster from, say, Amazon, and be productive within about ten minutes. Acquiring and deploying software on distributed systems is far easier now than it was just five years ago.

Kubernetes has been important in letting developers focus on building applications instead of managing a distributed system. Users shouldn't have to deal with Kubernetes directly; they can simply be given a 'handle' to the cluster that describes the resources Kubernetes provides.

As a result, Kubeflow was developed as a solution to manage machine learning deployment on Kubernetes. Also, Pachyderm manages data science workflows on Kubernetes.

Conclusions


While exciting, RAPIDS isn't mature yet. At the moment, RAPIDS is a set of libraries that need to be tied together to make a user-friendly experience. This is more challenging than typical software because developers have to consider (1) distributed computing and (2) GPU computing in addition to (3) machine learning algorithms. I hope this added complexity doesn't reduce the number of potential contributors.

Personally, I'd like to see more top-down efforts toward achieving the data scientist's dream of converged, distributed, and accelerated data science by defining some interface(s) for performing data science (sklearn has emerged as an excellent model!)*. What I'm seeing so far is more bottom-up: Nvidia implementing some algorithms, Anaconda providing Dask for distributed computation, and Arrow providing the data structures. As a user, I shouldn't have to make software choices for each hardware scenario: single CPU, multiple CPUs on one machine, CPUs on multiple machines, single GPU, multiple GPUs on one machine, GPUs on multiple machines, and even heterogeneous hardware situations (Is this asking for too much?!).

The usability of RAPIDS is critical to its success. RAPIDS is supposed to enable 'rapid' iteration, and one can only iterate rapidly if the RAPIDS workflow is as easy to use as what has been developed for 'traditional' single-CPU workflows, bearing in mind that CPUs might still be better for some cases.

But even with an ideal data scientist user experience, GPU databases will still have their place as databases. I expect, however, that GPU databases will accommodate RAPIDS workflows by at least providing low-friction data interchange with RAPIDS components via GPU dataframes.

---
* Maybe a shim could be made between a user and sklearn that intercepts calls and dispatches them to distributed, possibly GPU-equipped, systems whenever such a system is available and implements the requested call.
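
A purely hypothetical sketch of that shim idea (none of this exists; the 'backend' object stands in for an imaginary distributed or GPU implementation):

    class Shim:
        """Forward sklearn-style calls to an accelerated backend when it
        implements them; otherwise fall back to the plain estimator."""

        def __init__(self, estimator, backend=None):
            self.estimator = estimator
            self.backend = backend  # e.g. a distributed/GPU implementation, if any

        def __getattr__(self, name):
            # Dispatch to the backend if it provides the requested call
            if self.backend is not None and hasattr(self.backend, name):
                return getattr(self.backend, name)
            return getattr(self.estimator, name)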

Thursday, October 30, 2014

Proprietary Mathematical Programming at STRATA Hadoop World '14

I had previously commented on the lack of analytics-focused companies at STRATA. Now, to my surprise, two of the three big "M's" in proprietary computer algebra systems made a showing: Mathematica and MATLAB (the third being Maple). Given the strength of the open-source software movement in "big data" and technical computing, I wondered what value they had to offer.

After having some discussions with them, I found that the most important thing they offer is a consistent and tested user experience with high-quality documentation and technical support. This is something open source software needs to work on. I alluded to this in a post almost two years ago, and it is still applicable today; I don't see any major change in the user experience of open source software (hope it works on Windows, hope it compiles, hope the docs are good, hope I can email the author of the code, hope this package at this version number works with that package at that version number, hope it runs fast... etc.). I guess it's just a natural consequence of open source software; open source is flexible, cutting-edge, and cool but not necessarily user-friendly.

On a related note, on the hardware side, Cray, a brand name in scientific computing, is leveraging its high-performance computing experience for "big data". This is in contrast to computing on commodity hardware, which is what Hadoop was intended for.

This is all in the general trend of technical computing (historically driven by scientific applications) merging with the "big data" world.

Friday, August 29, 2014

For data analysis teamwork: A trick to combine source control with data sharing

For analytical teamwork, we primarily have two things to manage to enable accountability and reproducibility: data and code. The fundamental question to answer is which (version of) source and data led to a given result.

The problem of managing code is pretty much solved (or at least many people are working on it); see this thing called git, for example. Managing data needs more work, but it's being worked on: dat is the only tool I know of that provides some kind of versioning of data akin to git. Keep in mind that some databases have snapshotting or timestamping capability, but in dat's view a database is just a storage backend, since you would do the versioning with dat.

Suppose you're not worried about versioning data; perhaps you've got your own way of versioning it, or the data you're working with is supposed to be fixed and its appropriate store is a filesystem. There are various systems for sharing the data files, but then the data lives in a separate world from the code. It's possible to store data in a source control system, but that is generally an inefficient way to deal with data, especially if it's large.

Now wouldn't it be nice to also have source-controlled text files, such as metadata or documentation, interleaved with the data files? To accomplish this, two things need to happen:
  • The data needs to be within the scope of the (source) version control system, which does NOT manage the data files,
  • and the file sharing system needs to NOT share the source-controlled files.

Let's see how to implement this with git for source control and BitTorrent Sync (BTSync) for file sharing. Your project directory structure will look like:
  • project_folder
    • .gitignore (root) (git)
    • src_dir1
    • src_dir2
    • data_store
      • .SyncIgnore (BTSync)
      • .gitignore (child) (git)
      • data_dir
        • readme.txt
        • hugefile.dat
      • data_dir2


Let's explain this setup by examining the *ignore files. Between them, they achieve the two exclusions required.
  • .gitignore (root) is typical and left unchanged.
  • .SyncIgnore tells BTSync not to synchronize, for example, the readme files that live in the data directory, since they are source controlled. (Not essential to the setup, but it might be preferable to only sync "input" data files and not files that are generated.)
  • .gitignore (child) works with .SyncIgnore. We tell git not to manage files in the data directory as a general rule, except for the source-controlled files that we excluded in .SyncIgnore. We also need to tell git not to manage BTSync's special files and folders. (Not essential to the setup, but we can take things further by source controlling .SyncIgnore itself.)
I'm sure there are various ways of dealing with the *ignore files but what I've shown here should apply to many cases.
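
To make this concrete, here is a rough sketch of what the two files inside data_store might contain. The patterns are illustrative (BTSync's .SyncIgnore syntax has its quirks and the BTSync bookkeeping file names vary by version), so treat them as a starting point rather than a recipe.

    # data_store/.gitignore (child)
    # Ignore everything in the data store by default...
    *
    # ...but let git descend into subdirectories,
    !*/
    # keep this ignore file itself,
    !.gitignore
    # and keep the docs/metadata we want under source control.
    !**/readme.txt
    # Never track BTSync's bookkeeping files.
    .Sync*

and, for .SyncIgnore, one pattern per line telling BTSync to skip the files git already manages:

    .gitignore
    data_dir/readme.txt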

With this setup, your team achieves better synchronization of project files! Please let me know if I need to explain the system better.

Wednesday, July 2, 2014

How the mathematically-trained can increase their programming skill

One-sentence summary for the window shoppers: The mathematically-trained need to implement sophistication in their code to improve their programming skill (level).


Computer science people, this post is not for you. Mathematical people who haven't developed programming skill, this post is for you. I'm aiming it at people who want to get involved in some kind of mathematical computing: scientific computing, statistical computing, or data science, where a crucial skill of the job is programming.

I was spurred to write this post by this tweet from Matt Davis, backed by personal experience. I don't have formal training in computer science as many software engineering professionals do. Yet I managed to become at least functional in programming and conversant with software engineering practice. So I'd like to share my story in the hope that it can benefit others.

When I first started programming, all I cared about was the results that would come out of some mathematical formulation that I wanted to implement (and that's how I was assessed as well). These exercises have pretty much always followed the workflow diagrammed below:

problem/question -> math -> code -> execute -> output -> analyze -> communicate results (feedback loops not shown)

You could get by implementing this workflow by writing quick and dirty code. While writing dirty code may be appropriate for one-time tasks, you are not realizing your full potential if you keep doing this.

In my case, the pursuit of efficiency, flexibility, and just the 'cool' factor led me down the path of actually becoming a better programmer instead of just someone who wrote a lot of little programs (using Python had much to do with increasing my skill, but that's another story). I attribute the increase in my skill to an interest in the following:

Improving program logic:

  • Generalization which leads to abstraction of code. How do I make my code apply to more general cases?
  • Code Expressiveness. How do I best represent the solution to my problem? What programming paradigm should I use: object-oriented? functional? This is related to generalization and abstraction; and is closely related to code readability and maintainability.
  • Robustness. How does it handle failure? How does it handle corner cases? (e.g., what happens if some function is passed an empty list? see the small sketch after this list)
  • Portability. So I got this code working on my Windows machine. Will it work on a Mac? Linux? A compute cluster? A 64-bit operating system?
  • Automation. Eliminate the need for human involvement in the process as much as possible. 
  • Modularity and separation of concerns. As your program gets bigger, you should notice that some units of logic have nothing to do with others. Put in the effort to separate your code into modules. This aspect is also related to code maintainability.
  • High performance. Can I make my code run faster? As this is a major concern for scientific computing and big data processing, you must understand some details of computer hardware, compilers, interpreters, parallel programming, operating systems, data structures, databases, and the differences between higher- and lower-level programming languages. This understanding will be reflected in the code that you write.
  • Do not repeat yourself (DRY). Sometimes a shortcut to delivering a program is to duplicate some piece of information (because you didn't structure your program in the best way). Resist this temptation and keep a single, central definition of each value so that a change propagates appropriately throughout your system.
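
As a tiny illustration of the corner-case point above (the function is hypothetical; the point is that you decide explicitly what should happen):

    def mean(values):
        """Average of a list of numbers; an empty list is a corner case to decide on."""
        if not values:
            raise ValueError("mean() of an empty list is undefined")
        return sum(values) / len(values)
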
Improving productivity:
  • Testing. As your program gets larger and more complex, you want to make sure its (modular!) components work as you develop your code. Test in an automated(!) fashion as well (see the sketch after this list).
  • Documentation. Expect that you'll come back to your code later to modify it. Save yourself, and others(!), some trouble down the road and document what all those functions do. 
  • Source Control. You need to be able to track versions of your code to help in debugging and accountability in teams. A keyword here is 'git'.
  • Debugging. Stop using print statements. They may be ok for quick diagnosis, but just "byte" the bullet and learn to use a debugger; you'll save yourself time in the long run. Nonetheless, you can minimize your need for a debugger by writing modular and robust code from the start.
  • Coding Environment. Integrated development environment vs. text editor. Vi vs. Emacs vs. Nano vs. Notepad++ vs. ...etc. Eclipse vs. Visual Studio vs. Spyder vs. ...etc. Read up on these issues.
  • Concentration. Don't underestimate the importance of sleep and uninterrupted blocks of time. I find that crafting quality code can be mentally taxing, so it requires my focus. Having a healthy lifestyle in general is relevant too. I like to code while listening to chillout music with a moderate intake of a caffeinated drink.
    At first I thought this entry was going to be a joke but on second thought it's really not, even though it's not a technical aspect of the work. See, I didn't boldface this point.
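
And a minimal automated test for the mean() sketch above (assuming pytest; the module name is hypothetical, wherever you put mean()):

    import pytest
    from mymodule import mean  # hypothetical module holding mean()

    def test_mean_simple():
        assert mean([1, 2, 3]) == 2

    def test_mean_empty_list_raises():
        with pytest.raises(ValueError):
            mean([])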

So I haven't revealed anything new here but I hope putting this list together has some value. Also, efforts like Software Carpentry can put you on the fast-track towards improving your skill.

But as with every profession, you must (have the discipline to) practice.

Friday, April 11, 2014

PCA and ICA 'Learn' Representations of Time-Series

PCA is usually demonstrated in a low-dimensional context. Time series, however, are high dimensional, and ICA might be a more natural technique for reducing their dimensionality.

For my current project, I'm not interested in dimensionality reduction per se; rather, I'm interested in how well, given a reduced representation learned from some base input time series, the algorithm can reproduce a new input. If the new input cannot be recreated well, then it is a candidate for being considered an anomaly.

I set up an experiment where I generated a set of even-frequency sine waves over the domain, plus a constant function, as (base/training) input to the algorithms. Then I try to reconstruct a slightly different even sine wave, an odd sine wave, and a constant.
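
Here's a minimal sketch of that experiment (assuming scikit-learn; the exact frequencies, number of components, and error measure are illustrative rather than taken from the notebook):

    import numpy as np
    from sklearn.decomposition import FastICA, PCA

    t = np.linspace(0, 2 * np.pi, 500)
    # Base/training inputs: even-frequency sine waves plus a constant function
    base = np.vstack([np.sin(k * t) for k in (2, 4, 6, 8)] + [np.ones_like(t)])

    def reconstruction_error(model, signal):
        """Fit on the base signals, reconstruct a new signal from its reduced
        representation, and report the mean squared error."""
        model.fit(base)
        recon = model.inverse_transform(model.transform(signal[None, :]))[0]
        return np.mean((signal - recon) ** 2)

    targets = [("even-ish (k=4.1)", np.sin(4.1 * t)),
               ("odd (k=5)",        np.sin(5 * t)),
               ("constant = 3",     3 * np.ones_like(t))]

    for name, target in targets:
        # Swap in PCA(n_components=4) to compare with PCA
        err = reconstruction_error(FastICA(n_components=4), target)
        print(f"{name}: MSE = {err:.3f}")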

The result is that the even sine wave and the constant are somewhat reconstructed while the odd sine wave is not. You can see this in the following graphs, where the blue line is the 'target' signal and the green line is the reconstructed signal. I get similar results using PCA.
(Figures: '4 waves', '5 waves FAIL', and 'constant = 3', each showing the target in blue and the reconstruction in green.)

There's plenty of mathematical rigor, and there are algorithmic parameters, that I haven't talked about, but this is meant to be a post that requires minimal time and technical knowledge to go through. You can dig into the details by examining the IPython notebook.