Thursday, July 24, 2014

Compiling Hadoop from source sucks!

As I've discovered, Hadoop is not easy to set up, let alone compile properly. For some reason the Apache Hadoop distribution doesn't include native libraries for 64-bit Linux. Furthermore, the included 32-bit native library does not include the Snappy compression algorithm. If Hadoop does not find these libraries in native form, it falls back to, I guess, slower Java implementations.

So, being the execution speed demon that I am, I went ahead and compiled Hadoop 2.4.1 from source on 64-bit Linux. It was a rough ride!

I generally followed this guide, except that I preferred to download and compile Snappy from source and to yum install java-1.7.0-openjdk-devel (1). I used RHEL7 64-bit (2).

After getting the prerequisites, the magic command to compile is:
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true (3)

You'll find the binaries in <your hadoop src>/hadoop-dist/target. I checked access to native libraries by issuing hadoop checknative after exporting the appropriate environment variables (for example in .bashrc; refer to a Hadoop setup guide).

Non-obvious solutions to difficulties:
(1). yum install java-1.7.0-openjdk won't do! Yeah, openjD(evelopment)k-devel makes sense!
(2). Gave up with Ubuntu
(3). -Dcompile.native=true was responsible for including the native calls to Snappy. I did not see this in any guide on building Hadoop! Also, my compile process ran out of memory making javadocs, so I skipped that step with -Dmaven.javadoc.skip=true

On a personal note, I really got frustrated with trying different things out but I had a sense of satisfaction in the end. It took me 4 days to figure out the issues and I know a thing or two about Linux!

Wednesday, July 2, 2014

How the mathematically-trained can increase their programming skill

One-sentence summary for the window shoppers: The mathematically-trained need to implement sophistication in their code to improve their programming skill (level).

Computer science people, this post is not for you. Mathematical people who haven't developed programming skill, this post is for you. I'm aiming this post at people who want to get involved in some kind of mathematical computing: scientific computing, statistical computing, or data science, where a crucial skill of the job is programming.

I was spurred to write this post by this tweet from Matt Davis, backed by personal experience. I don't have formal training in computer science as many software engineering professionals do. Yet I managed to be at least functional in programming and conversant with software engineering practice. So I'd like to share my story in the hope that it can benefit others.

When I first started programming, all I cared about was the results that would come out of some mathematical formulation that I wanted to implement (and that's how I was assessed as well). These exercises have pretty much always followed the workflow diagrammed below:

problem/question -> math -> code -> execute -> output -> analyze -> publish (feedback loops not shown)

You could get by implementing this workflow by writing quick and dirty code. While writing dirty code may be appropriate for one-time tasks, you are not realizing your full potential if you keep doing this.

In my case, the pursuit of efficiency, flexibility, and just the 'cool' factor led me down the path of actually becoming a better programmer instead of just someone who wrote a lot of little programs (using Python had much to do with increasing my skill, but that's another story). I attribute the increase in my skill to an interest in the following:

Improving program logic:

  • Generalization which leads to abstraction of code. How do I make my code apply to more general cases?
  • Code Expressiveness. How do I best represent the solution to my problem? What programming paradigm should I use: object-oriented? functional? This is related to generalization and abstraction; and is closely related to code readability and maintainability.
  • Robustness. How does it handle failure? How does it handle corner cases? (e.g. what happens if some function is passed an empty list?)
  • Portability. So I got this code working on my Windows machine. Will it work on a Mac? Linux? A compute cluster? A 64-bit operating system?
  • Automation. Eliminate the need for human involvement in the process as much as possible. 
  • Modularity and separation of concerns. As your program gets bigger, you should notice that some units of logic have nothing to do with others. Put in the effort to separate your code into modules. This aspect is also related to code maintainability.
  • High performance. Can I make my code run faster? This is a major concern for scientific computing and big data processing, so you must understand some details of computer hardware, compilers, interpreters, parallel programming, operating systems, data structures, databases, and the differences between higher and lower-level programming languages. This understanding will be reflected in the code that you write.
  • Do not repeat yourself (DRY). Sometimes, a shortcut to deliver a program is to duplicate some piece of information (because you didn't structure your program in the best way). Resist this temptation and have a central location for the value of some variable so that a change in this variable propagates appropriately throughout your system.
Improving productivity:
  • Testing. As your program gets larger and more complex, you want to make sure its (modular!) components work as you develop your code. Test in an automated(!) fashion as well.
  • Documentation. Expect that you'll come back to your code later to modify it. Save yourself, and others(!), some trouble down the road and document what all those functions do. 
  • Source Control. You need to be able to track versions of your code to help in debugging and accountability in teams. A keyword here is 'git'.
  • Debugging. Stop using print statements. It may be ok for quick diagnosis but just "byte" the bullet and learn to use a debugger. You'll save yourself time in the long-run.
  • Coding Environment. Integrated development environment vs text editor. VI vs Emacs vs Nano vs Notepad++ vs ...etc. Eclipse vs Visual Studio vs Spyder vs ...etc. Read up about these issues.
  • Concentration. Don't underestimate the importance of sleep and uninterrupted blocks of time. I find that crafting quality code can be mentally taxing and requires my full focus. A healthy lifestyle in general is also relevant. I like to code while listening to chillout music with a moderate intake of a caffeinated drink.
    At first I thought this entry was going to be a joke but on second thought it's really not, even though it's not a technical aspect of the work. See, I didn't boldface this point.
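To make the robustness and testing points above a bit more concrete, here is a minimal sketch in Python. The function name mean and the test class are made up for illustration; the point is only that the empty-list corner case is handled deliberately and that an automated test pins that behavior down.

```python
import math
import unittest


def mean(values):
    """Return the arithmetic mean, failing loudly on an empty input."""
    values = list(values)
    if not values:
        # Corner case: an empty list has no mean, so say so explicitly
        # instead of letting a ZeroDivisionError surface somewhere else.
        raise ValueError("mean() requires at least one value")
    return math.fsum(values) / len(values)


class MeanTests(unittest.TestCase):
    """Automated checks that can be re-run after every change."""

    def test_typical_input(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_list_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])
```

Saved in a file, the tests can be run with python -m unittest followed by the file name, which is one easy way to get the "automated(!)" part.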

So I haven't revealed anything new here but I hope putting this list together has some value. Also, efforts like Software Carpentry can put you on the fast-track towards improving your skill.

But as with every profession, you must (have the discipline to) practice.

Friday, April 11, 2014

PCA and ICA 'Learn' Representations of Time-Series

PCA is usually demonstrated in a low-dimensional context. Time-series, however, are high dimensional and ICA might be a more natural technique for reducing dimensionality.

For my current project, I'm not interested in dimensionality reduction per se; rather, I'm interested in how well the algorithm can reproduce a new input, given a reduced representation of some base input time-series. If the new input cannot be recreated well, then it is a candidate for being considered an anomaly.

I've set up an experiment where I generated a bunch of sine waves with an even number of waves in the domain, plus a constant function, as (base/training) input to the algorithms. Then I try to reconstruct a slightly different even sine wave, an odd sine wave, and a constant.

The result is that the even sine wave and constant are somewhat reconstructed while the odd sine wave is not. You can see this in the following graphs where the blue line is a 'target' signal and the green line is the reconstructed signal. I get similar results using PCA.
[Figures: reconstruction of 4 waves; reconstruction of 5 waves (FAIL); reconstruction of constant = 3]

There is plenty of mathematical rigor and there are algorithmic parameters that I haven't talked about, but this is a post that requires minimal time and technical knowledge to go through. However, you can figure out the details if you examine the ipython notebook.
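If you'd rather try the reconstruction idea without opening the notebook, here is a minimal sketch using only NumPy. It is not the code from the notebook: it assumes an uncentered, PCA-like basis obtained from an SVD of the training signals, and the names fit_basis and reconstruction_error, the grid of 200 points, and the choice of 2, 4, 6, and 8 waves are all made up for illustration. Because the setup is idealized, the even wave and the constant come back almost exactly rather than "somewhat".

```python
import numpy as np


def fit_basis(train, n_components):
    """Orthonormal basis for the training signals (rows of `train`) via SVD.

    Assumption: no mean-centering, so this is PCA-like rather than textbook PCA.
    """
    _, _, vt = np.linalg.svd(train, full_matrices=False)
    return vt[:n_components]


def reconstruction_error(basis, signal):
    """Relative error after projecting `signal` onto the learned basis."""
    reconstructed = basis.T @ (basis @ signal)
    return np.linalg.norm(signal - reconstructed) / np.linalg.norm(signal)


# Training set: sine waves with an even number of waves, plus a constant.
x = np.linspace(0.0, 1.0, 200, endpoint=False)
even_waves = [np.sin(2 * np.pi * k * x) for k in (2, 4, 6, 8)]
train = np.vstack(even_waves + [np.ones_like(x)])
basis = fit_basis(train, n_components=5)

# New inputs: a rescaled even wave, an odd wave, and a different constant.
err_even = reconstruction_error(basis, 1.5 * np.sin(2 * np.pi * 4 * x))
err_odd = reconstruction_error(basis, np.sin(2 * np.pi * 5 * x))
err_const = reconstruction_error(basis, 3.0 * np.ones_like(x))

print(f"4 waves:      {err_even:.3f}")
print(f"5 waves:      {err_odd:.3f}")
print(f"constant = 3: {err_const:.3f}")
```

A large relative error is what would flag an input as an anomaly candidate; where to put the threshold is a judgment call that this sketch does not address.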

Monday, February 24, 2014

Strata '14 Santa Clara: Lots of data, not so much analytics

While the quality of the conference was excellent overall, there were too many data pipeline platforms and database types showcased. As a quantitatively-oriented person, I really don't care that much for the latest in-memory database or being able to pipeline my data graphically. In my work, I do everything I can to abstract out data handling particulars. I just care that my algorithms work. I realize I do need to know something about the underlying infrastructure if I want to be able to maximize performance...but why should I??

Now some vendors do have some analytic capability in their platform, but why should I rely on their implementation? I should be able to easily apply an algorithm of my choosing which is built on some kind of abstraction and the vendor should support this abstraction. Furthermore, I should be able to examine analytic code (It's great that 0xdata recognizes that as it is open-source).

This is the 'big data' picture I'm seeing; and I'm not liking the silos each vendor tries to make and the (current?) focus on data platforms. The VC panel emphasized that the value from 'big data' is going to come from applications (which of course relies on analytics). Maybe the reported data scientist shortage has something to do with this?

Please inform me if I'm missing something.

This is in contrast to what I've seen at PyData, where perhaps the culture of attendees is more quantitative and technical, with a firm grasp of mathematics as well as computing issues. At that conference, infrastructure use was dictated by analytics in a top-down fashion.

Saturday, January 4, 2014

Extremely Efficient Visualization Makes Use of Color and Symbols

Not too long ago, I posted a 'heat map' visualization and I said it was an efficient visualization because the plot space was filled with color. But it represented only one quantity.

Well now I took that idea further and made a visualization that represents three (or four depending on how you count) quantities. The following picture represents flow around a cylinder with the following quantities:
- flow potential: background color
- flow direction: direction of arrows
- flow magnitude: length of arrows
- pressure: color of arrows
I even threw in a max and min pressure point in there too.

No need for multiple plots or a third dimension!
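For anyone who wants to build a similar picture, here is a rough sketch of how the four quantities can be layered in one matplotlib plot. It is not the code behind my figure: it assumes ideal (incompressible, irrotational) flow past a cylinder at the origin, and the function names cylinder_flow and plot_flow, the grid sizes, and the colormaps are illustrative choices.

```python
import numpy as np


def cylinder_flow(x, y, speed=1.0, radius=1.0):
    """Ideal (potential) flow past a cylinder centred at the origin.

    Returns the potential, the velocity components (u, v) and the pressure
    coefficient; points inside the cylinder come back as NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = x**2 + y**2
    r2 = np.where(r2 < radius**2, np.nan, r2)  # blank out the cylinder interior
    potential = speed * x * (1.0 + radius**2 / r2)
    u = speed * (1.0 - radius**2 * (x**2 - y**2) / r2**2)
    v = -2.0 * speed * radius**2 * x * y / r2**2
    cp = 1.0 - (u**2 + v**2) / speed**2  # Bernoulli, relative to the free stream
    return potential, u, v, cp


def plot_flow(n_fine=200, n_arrows=25, extent=3.0):
    """Background colour = potential; arrows = direction, magnitude, pressure."""
    import matplotlib.pyplot as plt  # imported lazily: only needed for drawing

    # Fine grid for the background colour (flow potential).
    xf, yf = np.meshgrid(np.linspace(-extent, extent, n_fine),
                         np.linspace(-extent, extent, n_fine))
    potential, _, _, _ = cylinder_flow(xf, yf)

    # Coarse grid for the arrows; drop the points inside the cylinder.
    xa, ya = np.meshgrid(np.linspace(-extent, extent, n_arrows),
                         np.linspace(-extent, extent, n_arrows))
    _, u, v, cp = cylinder_flow(xa, ya)
    keep = np.isfinite(cp)

    fig, ax = plt.subplots()
    ax.contourf(xf, yf, np.ma.masked_invalid(potential), levels=30, cmap="Greys")
    # The fifth argument colours each arrow by the pressure coefficient.
    arrows = ax.quiver(xa[keep], ya[keep], u[keep], v[keep], cp[keep],
                       cmap="coolwarm")
    fig.colorbar(arrows, ax=ax, label="pressure coefficient")
    ax.set_aspect("equal")
    return fig
```

Calling plot_flow() and then saving or showing the returned figure gives the combined plot. Keeping the field computation separate from the drawing also makes it easy to check the numbers (for example, the stagnation point in front of the cylinder) without opening a window.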

Tuesday, December 24, 2013

PyData NYC 2013.. Breeding Ground for Highly-Innovative Data Science Tools

I recently attended the PyData conference in New York City. I'd like to give my general impression (as opposed to showing you details you can get by visiting the links on the website). The impression I want to express is that of the culture at the conference: a good way. That is, tools written by non-specialists to tackle problems facing the application domains in which they are the specialists. This is partly due to the flexibility of Python and partly due to the highly technical profile of the attendees of the conference.

This is just an impression, so it's very subjective, and I'm biased due to my enthusiasm for Python and my scientific computing background, which Python has a firm grip on. But try comparing PyData to Strata, which is a higher-level view of the data science world with R being the main analysis tool. The broader data science world is a collision of computer science and statistics, both in theory and in the tools used, while the narrower PyData world has its roots in the more harmonious scientific computing world where math meets computing, albeit with math that is generally less statistical.

Until data science curricula are developed, I believe the scientific computing background is a better foundation for data science than statistics alone or computer science on its own. Computational, data, and domain expertise are present in skilled scientific programmers, some of whom attended the conference. The caliber of talent at the conference was astounding. Attendees could, for example, talk about specific CPU and GPU architectures, database systems, compiled languages, distributed systems, and GUIs, as well as about Monte Carlo techniques, machine learning, and optimization. Such broad knowledge of all of these areas is important for implementing a speedy scientific workflow, which happens to be necessary for data science as well.

I'm also going to claim there are more tech-savvy scientists than there are tech-savvy statisticians. This isn't to diminish the importance of statisticians in data science, but the computational pride of statisticians is in the comprehensive but slow and inelegant R language. Meanwhile, scientific programmers know about, and have built ecosystems around, C/C++, Fortran, and Python and all the CS subjects associated with these languages, including parallel programming, compilation, and data structures. This is just the natural result of statisticians traditionally working with "small data" while scientific programmers often work with "big computation".

The mastery of these issues within the same community is what allows for innovative products such as scidb, bokeh, numba, and ipython and all the various small-scale hackery presented at the conference.

Perhaps I should hold off on making such claims until I go to the next Strata conference but this is a blog journal and I'm very late in writing about PyData!