Friday, August 29, 2014

For data analysis teamwork: A trick to combine source control with data sharing

For analytical teamwork, we have, primarily, two things to manage to enable accountability and reproducibility: data and code. The fundamental question to answer is which (version of the) source and data led to a given result.

The problem of managing code is pretty much solved (or at least there are many people working on the problem). See this thing called git, for example. Managing data needs more work, but it's being worked on; I only know of dat, which can provide some kind of versioning of data akin to git. Keep in mind that some databases have snapshotting or timestamping capability. But in the view of dat, a database is just a storage backend, since you would version with dat.

Suppose you're not worried about versioning data; perhaps you've got your own way of versioning it, or the data that you're working with is supposed to be fixed and its appropriate store is a filesystem. There are various systems to share the data files, but the data would be living in a separate world from the code. It's possible to store data in a source control system, but that would generally be an inefficient way to deal with the data, especially if it's large.

Now wouldn't it be nice to also have source controlled text files such as metadata or documentation interleaved with the data files? To accomplish this, two things need to happen:
  • The data needs to be in the scope of the (source) version control system, but that system must NOT manage the data files
  • and the file sharing system needs to NOT share the source controlled files.

Let's see how to implement this with git for source control and BitTorrent Sync (BTSync) for file sharing. Your project directory structure will be like:
  • project_folder
    • .gitignore (root) (git)
    • src_dir1
    • src_dir2
    • data_store
      • .SyncIgnore (BTSync, right pane)
      • .gitignore (child) (git, left pane)
      • data_dir
        • readme.txt
        • hugefile.dat
      • data_dir2


Let's explain this setup by examining the *ignore files. By excluding certain files, the *ignore files achieve the two exclusions required.
  • .gitignore (root) is typical and left unchanged.
  • .SyncIgnore tells BTSync to not synchronize, for example, readme files that live in the data directory since they are source controlled (highlighted). (Not essential to the setup but it might be preferable to only sync "input" data files and not files that are generated)
  • .gitignore (child) works with .SyncIgnore. We tell git not to manage files in the data directory as a general rule, except the source controlled files (highlighted) that we specified in .SyncIgnore. We also need to tell git not to manage BTSync's special files and folders. (Not essential to the setup, but we can take things further by source controlling .SyncIgnore.)
I'm sure there are various ways of dealing with the *ignore files but what I've shown here should apply to many cases.
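
As a rough sketch of what the two files in data_store could contain, the short Python script below writes one possible pair. The pattern lists are illustrative assumptions, not the exact lists used in this project: readme.txt stands in for whatever text files you source control, .SyncArchive/ is assumed to be the name of a BTSync working folder, and you should check the BTSync documentation for how .SyncIgnore matches names inside subfolders.

```python
from pathlib import Path

# .SyncIgnore: BTSync should skip whatever git already manages.
# (Assumed file names, for illustration only.)
SYNC_IGNORE = ["readme.txt", ".gitignore"]

# .gitignore (child): ignore everything, then re-include the
# source controlled files named in SYNC_IGNORE.
GIT_IGNORE = [
    "*",              # by default, git manages nothing under data_store
    "!*/",            # ...but it still descends into sub-directories
    "!readme.txt",    # re-include the files excluded from syncing
    "!.gitignore",
    "!.SyncIgnore",   # optional: source control the sync rules too
    ".SyncArchive/",  # assumed name of a BTSync working folder
]


def write_ignore_files(data_store):
    """Create data_store/.SyncIgnore and data_store/.gitignore."""
    root = Path(data_store)
    root.mkdir(parents=True, exist_ok=True)
    (root / ".SyncIgnore").write_text("\n".join(SYNC_IGNORE) + "\n")
    (root / ".gitignore").write_text("\n".join(GIT_IGNORE) + "\n")
    return root
```

Calling write_ignore_files("project_folder/data_store") would create both files. The order of the git patterns matters: the "!" lines only work because they come after the catch-all "*".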

With this setup, your team achieves better synchronization of project files! Please let me know if I need to explain the system better.

Thursday, July 24, 2014

Compiling Hadoop from source sucks!

As I've discovered, Hadoop is not easy to set up, let alone compile properly. For some reason the Apache Hadoop distribution doesn't include native libraries for 64-bit Linux. Furthermore, the included 32-bit native library does not include the Snappy compression algorithm. If Hadoop does not find these libraries in native form, it falls back to, I guess, slow or slower Java implementations.

So, being the execution speed demon that I am, I went ahead and compiled Hadoop 2.4.1 from source on 64-bit Linux. It was a rough ride!

I generally followed this guide, but I preferred to download and compile Snappy from source and to yum install java-1.7.0-openjdk-devel (1). I used RHEL7 64-bit (2).

After getting the prerequisites, the magic command to compile is:
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true (3)

You'll find the binaries in <your hadoop src>/hadoop-dist/target. I checked access to native libraries by issuing hadoop checknative after exporting the appropriate environment variables (for example in .bashrc; refer to a Hadoop setup guide).
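
To make that check repeatable, the output of hadoop checknative can be parsed with a few lines of Python. This is only a sketch: it assumes each library is reported on a line of the form "name: true ..." or "name: false ...", and the sample text (including its paths) is made up for illustration rather than copied from the build described here.

```python
import subprocess


def parse_checknative(text):
    """Map each library name to True/False from `hadoop checknative` output."""
    status = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines that look like "<name>: true|false [details]".
        if sep and fields and fields[0] in ("true", "false"):
            status[name.strip()] = fields[0] == "true"
    return status


def run_checknative():
    """Run `hadoop checknative` (hadoop must be on PATH) and parse the result."""
    result = subprocess.run(["hadoop", "checknative"],
                            capture_output=True, text=True)
    return parse_checknative(result.stdout + result.stderr)


# Made-up sample output, for illustration only.
SAMPLE = """\
Native library checking:
hadoop: true /opt/hadoop/lib/native/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: false
"""

sample_status = parse_checknative(SAMPLE)
```

With a result like sample_status, a build script could stop early when the snappy entry is False instead of discovering the missing library at job run time.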


Non-obvious solutions to difficulties:
(1). yum install java-1.7.0-openjdk won't do! Yeah, openjD(evelopment)k-devel makes sense!
(2). Gave up on Ubuntu.
(3). -Dcompile.native=true was responsible for including the native calls to snappy in libhadoop.so. I did not see this in any guide on building Hadoop! Also, my compile process ran out of memory making javadocs, so I skipped them with -Dmaven.javadoc.skip=true.

On a personal note, I really got frustrated with trying different things out but I had a sense of satisfaction in the end. It took me 4 days to figure out the issues and I know a thing or two about Linux!

Wednesday, July 2, 2014

How the mathematically-trained can increase their programming skill

One-sentence summary for the window shoppers: The mathematically-trained need to implement sophistication in their code to improve their programming skill (level).


Computer science people, this post is not for you. Mathematical people who haven't developed programming skill, this post is for you. I'm aiming this post at people who want to get involved in some kind of mathematical computing: scientific computing, statistical computing, or data science, where a crucial skill of the job is programming.

I was spurred to write this post by this tweet from Matt Davis, backed by personal experience. I don't have formal training in computer science as many software engineering professionals do. Yet, I managed to be at least functional in programming and conversant with software engineering practice. So I'd like to share my story in the hope that it can benefit others.

When I first started programming, all I cared about was the results that would come out of some mathematical formulation that I wanted to implement (and that's how I was assessed as well). These exercises have pretty much always followed the workflow diagrammed below:

problem/question -> math -> code -> execute -> output -> analyze -> publish (feedback loops not shown)

You could get by implementing this workflow by writing quick and dirty code. While writing dirty code may be appropriate for one-time tasks, you are actually not realizing your full potential if you keep doing this.

In my case, the pursuit of efficiency, flexibility, and just the 'cool' factor led me down the path of actually becoming a better programmer instead of just someone who wrote a lot of little programs (using Python had much to do with increasing my skill but that's another story). I attribute the increase of my skill to interest in the following:

Improving program logic:

  • Generalization which leads to abstraction of code. How do I make my code apply to more general cases?
  • Code Expressiveness. How do I best represent the solution to my problem? What programming paradigm should I use: object-oriented? functional? This is related to generalization and abstraction; and is closely related to code readability and maintainability.
  • Robustness. How does it handle failure? How does it handle corner cases? (e.g., what happens if some function is passed an empty list?)
  • Portability. So I got this code working on my Windows machine. Will it work on a Mac? Linux? A compute cluster? A 64-bit operating system?
  • Automation. Eliminate the need for human involvement in the process as much as possible. 
  • Modularity and separation of concerns. As your program gets bigger, you should notice that some units of logic have nothing to do with others. Put in the effort to separate your code into modules. This aspect is also related to code maintainability.
  • High-performance. Can I make my code run faster? This is a major concern for scientific computing and big data processing, so you must understand some details of computer hardware, compilers, interpreters, parallel programming, operating systems, data structures, databases, and the differences between higher and lower-level programming languages. This understanding will be reflected in the code that you write.
  • Do not repeat yourself (DRY). Sometimes, a shortcut to deliver a program is to duplicate some piece of information (because you didn't structure your program in the best way). Resist this temptation and have a central location for the value of some variable so that a change in this variable propagates appropriately throughout your system.
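
To make the robustness point concrete, here is a small sketch of a function that decides explicitly what happens when it is passed an empty list, instead of failing with an obscure error deep inside a calculation. The function name and the choice to raise ValueError are illustrative assumptions; returning a default value would be another reasonable design.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers.

    Raises ValueError for an empty input rather than letting a
    ZeroDivisionError surface from the division below.
    """
    values = list(values)  # generalization: accept any iterable, e.g. a generator
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)
```

The corner case is now part of the function's documented behavior, so callers can handle it deliberately.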
Improving productivity:
  • Testing. As your program gets larger and more complex, you want to make sure its (modular!) components work as you develop your code. Test in an automated(!) fashion as well.
  • Documentation. Expect that you'll come back to your code later to modify it. Save yourself, and others(!), some trouble down the road and document what all those functions do. 
  • Source Control. You need to be able to track versions of your code to help in debugging and accountability in teams. A keyword here is 'git'.
  • Debugging. Stop using print statements. It may be OK for quick diagnosis, but just "byte" the bullet and learn to use a debugger. You'll save yourself time in the long run.
  • Coding Environment. Integrated development environment vs text editor. VI vs Emacs vs Nano vs Notepad++ vs ...etc. Eclipse vs Visual Studio vs Spyder vs ...etc. Read up about these issues.
  • Concentration. Don't underestimate the importance of sleep and uninterrupted blocks of time. I find that crafting quality code can be mentally taxing, so it requires my full focus. Having a healthy lifestyle in general is also relevant. I like to code while listening to chillout music with a moderate intake of a caffeinated drink.
    At first I thought this entry was going to be a joke but on second thought it's really not, even though it's not a technical aspect of the work. See, I didn't boldface this point.
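
Going back to the testing point in the list above, automated checks can be as small as a few unittest cases that run with one command. The sketch below uses only the Python standard library; the function under test, normalize, is a hypothetical example and not taken from any particular project.

```python
import unittest


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


class NormalizeTest(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 5])), 1.0)

    def test_keeps_proportions(self):
        self.assertEqual(normalize([1, 3]), [0.25, 0.75])

    def test_rejects_empty_input(self):
        # The corner case from the robustness bullet gets its own test.
        with self.assertRaises(ValueError):
            normalize([])


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() (or running python -m unittest on the file that holds this code) re-checks every case after each change, which is the automated part.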

So I haven't revealed anything new here but I hope putting this list together has some value. Also, efforts like Software Carpentry can put you on the fast-track towards improving your skill.

But as with every profession, you must (have the discipline to) practice.

Friday, April 11, 2014

PCA and ICA 'Learn' Representations of Time-Series

PCA is usually demonstrated in a low-dimensional context. Time-series, however, are high dimensional and ICA might be a more natural technique for reducing dimensionality.

For my current project, I'm not interested in dimensionality reduction per se; rather, I'm interested in how well the algorithm, given a reduced representation of some base input time-series, can reproduce a new input. If the new input cannot be recreated well, then it is a candidate for being considered an anomaly.

I've set up an experiment where I generated, as (base/training) input to the algorithms, a bunch of sine waves with an even number of waves in the domain, plus a constant function. Then I try to reconstruct a slightly different even sine wave, an odd sine wave, and a constant.

The result is that the even sine wave and constant are somewhat reconstructed while the odd sine wave is not. You can see this in the following graphs where the blue line is a 'target' signal and the green line is the reconstructed signal. I get similar results using PCA.
[Graphs: 4 waves; 5 waves (FAIL); constant = 3]

There is plenty of mathematical rigor, and there are algorithmic parameters, that I haven't talked about, but this is a post that requires minimal time and technical knowledge to go through. However, you can figure out the details if you examine the ipython notebook.
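
For a quick feel of the idea without opening the notebook, the PCA half of the experiment can be sketched with NumPy alone. This is an idealized, noise-free sketch and not the notebook's code: the grid size, the amplitudes, the training constants, and the assumption that the "slightly different" even wave differs only in amplitude are all choices made here for illustration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200, endpoint=False)  # the domain (assumed grid)


def sine(n_waves, amplitude=1.0):
    """Sine wave with n_waves full periods in the domain."""
    return amplitude * np.sin(2 * np.pi * n_waves * t)


def fit_pca(X, n_components):
    """Return the mean and the top principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def reconstruct(x, mean, components):
    """Project x onto the learned subspace and map it back."""
    scores = components @ (x - mean)
    return mean + components.T @ scores


def relative_error(x, x_hat):
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)


# Training set: even sine waves at a few amplitudes, plus constant functions.
training = [sine(k, a) for k in (2, 4, 6, 8) for a in (0.5, 1.0, 1.5)]
training += [np.full_like(t, c) for c in (1.0, 2.0)]
mean, components = fit_pca(np.vstack(training), n_components=5)

targets = {
    "4 waves": sine(4, amplitude=1.2),       # even, unseen amplitude
    "5 waves": sine(5),                      # odd, never seen in training
    "constant = 3": np.full_like(t, 3.0),
}
errors = {name: relative_error(x, reconstruct(x, mean, components))
          for name, x in targets.items()}

for name, err in errors.items():
    print(f"{name}: relative reconstruction error = {err:.3f}")
```

In this clean setting the odd wave is orthogonal to everything the model has seen, so its reconstruction error stays large while the other two are recovered; a large error is what would flag a new input as an anomaly candidate.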

Monday, February 24, 2014

Strata '14 Santa Clara: Lots of data, not so much analytics

While the quality of the conference was excellent overall, there were too many data pipeline platforms and database types showcased. As a quantitatively-oriented person, I really don't care that much for the latest in-memory database or being able to pipeline my data graphically. In my work, I do everything I can to abstract out data handling particulars. I just care that my algorithms work. I realize I do need to know something about the underlying infrastructure if I want to be able to maximize performance...but why should I??

Now some vendors do have some analytic capability in their platform, but why should I rely on their implementation? I should be able to easily apply an algorithm of my choosing which is built on some kind of abstraction and the vendor should support this abstraction. Furthermore, I should be able to examine analytic code (It's great that 0xdata recognizes that as it is open-source).

This is the 'big data' picture I'm seeing; and I'm not liking the silos each vendor tries to make and the (current?) focus on data platforms. The VC panel emphasized that the value from 'big data' is going to come from applications (which of course relies on analytics). Maybe the reported data scientist shortage has something to do with this?

Please inform me if I'm missing something.

This is in contrast to what I've seen at PyData, where perhaps the culture of attendees is more quantitative and technical, with a firm grasp of mathematics as well as computing issues. At that conference, infrastructure use was dictated by analytics in a top-down fashion.

Saturday, January 4, 2014

Extremely Efficient Visualization Makes Use of Color and Symbols

Not too long ago, I posted a 'heat map' visualization and I said it was an efficient visualization because the plot space was filled with color. But it represented only one quantity.

Well now I took that idea further and made a visualization that represents three (or four depending on how you count) quantities. The following picture represents flow around a cylinder with the following quantities:
- flow potential: background color
- flow direction: direction of arrows
- flow magnitude: length of arrows
- pressure: color of arrows
I even threw in a max and min pressure point in there too.

No need for multiple plots or a third dimension!
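
A plot with the same encoding can be sketched with NumPy and Matplotlib. The code below is not what produced the picture: it assumes ideal (potential) flow past a cylinder, uses the pressure coefficient as the pressure quantity, picks arbitrary grid sizes and colormaps, and leaves out the max and min pressure markers.

```python
import numpy as np


def cylinder_flow(x, y, speed=1.0, radius=1.0):
    """Ideal (potential) flow past a cylinder centred at the origin.

    Returns the flow potential, the velocity components (u, v) and the
    pressure coefficient at the points (x, y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = x**2 + y**2
    r2 = np.where(r2 == 0.0, np.nan, r2)  # avoid dividing by zero at the centre
    a2 = radius**2
    potential = speed * x * (1.0 + a2 / r2)
    u = speed * (1.0 - a2 * (x**2 - y**2) / r2**2)
    v = -2.0 * speed * a2 * x * y / r2**2
    pressure = 1.0 - (u**2 + v**2) / speed**2
    return potential, u, v, pressure


def plot_flow(path="cylinder_flow.png", speed=1.0, radius=1.0):
    """Draw all four quantities in one plot and save it to path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    x, y = np.meshgrid(np.linspace(-3, 3, 24), np.linspace(-2, 2, 16))
    fields = cylinder_flow(x, y, speed, radius)
    inside = x**2 + y**2 < radius**2  # hide the interior of the cylinder
    potential, u, v, pressure = (np.ma.masked_where(inside, f) for f in fields)

    fig, ax = plt.subplots()
    # Background color: flow potential.
    background = ax.contourf(x, y, potential, levels=20, cmap="Greys")
    # Arrow direction and length: velocity. Arrow color: pressure.
    arrows = ax.quiver(x, y, u, v, pressure, cmap="coolwarm")
    fig.colorbar(background, ax=ax, label="flow potential")
    fig.colorbar(arrows, ax=ax, label="pressure coefficient")
    ax.set_aspect("equal")
    fig.savefig(path)
    return path
```

Calling plot_flow() writes the figure to a file. Passing a fifth array to quiver is what lets the arrows carry an extra quantity as color on top of direction and length.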