Majid's Research: July 2014

As I've discovered, Hadoop is not easy to setup let alone compile properly. For some reason the Apache Hadoop distribution doesn't include native libraries for 64-bit Linux. Furthermore, the included 32-bit native library does not include the Snappy compression algorithm. If Hadoop does not find these libraries in native form, it falls back to, I guess, slow or slower java implementations.

So, being the execution speed demon that I am, I went ahead and compiled Hadoop 2.4.1 from source on 64-bit Linux. It was a rough ride!

I generally followed this guide but preferring to download and compile Snappy from source, and yum installing java-1.7.0-openjdk-devel(1). I Used RHEL7 64-bit(2).

After getting the prerequisites, the magic command to compile is:

mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true (3)

You'll find the binaries in <your hadoop src>/hadoop-dist/target. I checked access to native libraries by issuing hadoop checknative after exporting the appropriate environment variables (such as in .bashrc. Refer to a Hadoop setup guide).

Non-obvious solutions to difficulties:
(1). yum install java-1.7.0-openjdk won't do! yea openjD(evelopment)k-devel makes sense!
(2). Gave up with Ubuntu
(3). -Dcompile.native=true was responsible for including the native calls to snappy in libhadoop.so. I did not see this in any guide on building Hadoop! Also, my compile process ran out of memory making javadocs, so I skipped it with -Dmaven.javadoc.skip=true

On a personal note, I really got frustrated with trying different things out but I had a sense of satisfaction in the end. It took me 4 days to figure out the issues and I know a thing or two about Linux!
xkcd

One-sentence summary for the window shoppers: The mathematically-trained need to implement sophistication in their code to improve their programming skill (level).

Huge fraction of attendees at #pydata code for a living but not have CS degrees.
— Matt Davis (@jiffyclub) May 4, 2014

Computer science people this post is not for you. Mathematical people that haven't developed programming skill, this post is for you. I'm aiming this post at people who want to get involved in some kind of mathematical computing: scientific computing, statistical computing, or data science where a crucial skill of the job is programming.

I was spurred to write this post by this tweet from Matt Davis backed by personal experience. I don't have formal training in computer science as many software engineering professionals do. Yet, I managed to be at least be functional in programming and conversant with software engineering practice. So I'd like to share my story in the hope that it can benefit others.

When I first started programming, all I cared about was the results that would come out of some mathematical formulation that I wanted to implement (and that's how I was assessed as well). These exercises have pretty much always followed the workflow diagrammed below:

problem/question -> math -> code -> execute -> output -> analyze -> communicate results (feedback loops not shown)

You could get by implementing this workflow by writing quick and dirty code. While writing dirty code maybe appropriate for one-time tasks, you are actually not realizing your full potential if you keep doing this.

In my case, the pursuit of efficiency, flexibility, and just the 'cool' factor led me down the path of actually becoming a better programmer instead of just someone who wrote alot of little programs (using Python had much to do with increasing my skill but that's another story). I attribute the reasons for the increase of my skill to interest in the following:

Improving program logic:

Generalization which leads to abstraction of code. How do I make my code apply to more general cases?
Code Expressiveness. How do I best represent the solution to my problem? What programming paradigm should I use: object-oriented? functional? This is related to generalization and abstraction; and is closely related to code readability and maintainability.
Robustness. How does it handle failure? How does it handle corner cases? (eg. what happens if some function is passed an empty list?)
Portability. So I got this code working on my Windows machine. Will it work on a mac? linux? a compute cluster? 64-bit operating system?
Automation. Eliminate the need for human involvement in the process as much as possible.
Modularity and separation of concerns. As your program gets bigger, you should notice that some units of logic have nothing to do with others. Put in the effort to separate your code into modules. This aspect is also related to code maintainability.
High-performance. Can I make my code run faster? As a major concern for scientific computing and big data processing, you must understand some details of computer hardware, compilers, interpreters, parallel programming, operating systems, data structures, databases, and the differences between higher and lower-level programming languages. This understanding will be reflected in the code that you write.
Do not repeat yourself (DRY). Sometimes, a shortcut to deliver a program is to duplicate some piece of information (because you didn't structure your program in the best way). Resist this temptation and have a central location for the value of some variable so that a change in this variable propagates appropriately throughout your system.

Improving productivity:

Testing. As your program gets larger and more complex, you want to make sure its (modular!) components work as you develop your code. Test in an automated(!) fashion as well.
Documentation. Expect that you'll come back to your code later to modify it. Save yourself, and others(!), some trouble down the road and document what all those functions do.
Source Control. You need to be able to track versions of your code to help in debugging and accountability in teams. A keyword here is 'git'.
Debugging. Stop using print statements. It may be ok for quick diagnosis but just "byte" the bullet and learn to use a debugger. You'll save yourself time in the long-run. Nonetheless, you can minimize your use of a debugger by writing modular and robust code from the start.
Coding Environment. Integrated development environment vs text editor. VI vs Emacs vs Nano vs Notepad++ vs ...etc. Eclipse vs Visual Studio vs Spyder vs ...etc. Read up about these issues.
Concentration. Don't underestimate the importance of sleep and uninterrupted blocks of time. I find that crafting quality code can be mentally taxing thereby requiring my focus. Also, having a healthy lifestyle in general is also relevant. I like to code while listening to chillout music with a moderate intake of a caffeinated drink.
At first I thought this entry was going to be a joke but on second thought it's really not, even though it's not a technical aspect of the work. See, I didn't boldface this point.

So I haven't revealed anything new here but I hope putting this list together has some value. Also, efforts like Software Carpentry can put you on the fast-track towards improving your skill.

But as with every profession, you must (have the discipline to) practice.

Majid's Research

Thursday, July 24, 2014

Compiling Hadoop from source sucks!

Monday, July 14, 2014

DC Energy and Data Summit

Wednesday, July 2, 2014

How the mathematically-trained can increase their programming skill

Blog Archive

Labels