Thursday, July 24, 2014

Compiling Hadoop from source sucks!

As I've discovered, Hadoop is not easy to set up, let alone compile properly. For some reason the Apache Hadoop distribution doesn't include native libraries for 64-bit Linux. Furthermore, the included 32-bit native library does not include the Snappy compression algorithm. If Hadoop does not find these libraries in native form, it falls back to what I guess are slower Java implementations.

So, being the execution speed demon that I am, I went ahead and compiled Hadoop 2.4.1 from source on 64-bit Linux. It was a rough ride!

I generally followed this guide, but I preferred to download and compile Snappy from source and to yum install java-1.7.0-openjdk-devel(1). I used RHEL7 64-bit(2).
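Before kicking off a long build, it can save time to check that the build tools are actually on the PATH. The following is only a rough sketch: missing_tools is a hypothetical helper name, and the tool list is my assumption about what a native build needs (javac and mvn follow from the text above; cmake and protoc are guesses to adjust against whichever guide you follow).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# javac comes with java-1.7.0-openjdk-devel, not the plain openjdk package.
# cmake and protoc are assumed prerequisites; edit the list as needed.
missing=$(missing_tools javac mvn cmake protoc)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All build tools found"
fi
```

If javac shows up as missing even though Java is installed, that is usually the sign that only the runtime package, without -devel, is present.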

After getting the prerequisites, the magic command to compile is:
mvn clean install -Pdist,native,src -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Dcompile.native=true (3)

You'll find the binaries in <your hadoop src>/hadoop-dist/target. I checked access to the native libraries by running hadoop checknative after exporting the appropriate environment variables (for example in .bashrc; refer to a Hadoop setup guide).
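As a minimal sketch of that check: the install path below is hypothetical, and the exact set of variables depends on the setup guide you follow, so treat the exports as assumptions rather than the one true configuration.

```shell
#!/bin/sh
# Assumption: the built distribution was unpacked to this hypothetical path.
export HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-2.4.1}"
export PATH="$HADOOP_HOME/bin:$PATH"
# Assumption: the native libraries sit in lib/native inside the distribution.
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# Only run the check if the hadoop launcher is actually reachable.
if command -v hadoop >/dev/null 2>&1; then
    hadoop checknative || echo "checknative reported a problem"
else
    echo "hadoop not found on PATH; check HADOOP_HOME"
fi
```

Putting the three export lines in .bashrc makes them stick across sessions; the checknative output then shows which native libraries, Snappy included, were actually picked up.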

Non-obvious solutions to difficulties:
(1). yum install java-1.7.0-openjdk won't do! Yeah, openjD(evelopment)k-devel makes sense!
(2). Gave up on Ubuntu.
(3). -Dcompile.native=true was responsible for including the native calls to Snappy. I did not see this in any guide on building Hadoop! Also, my compile process ran out of memory while generating the javadocs, so I skipped them with -Dmaven.javadoc.skip=true.

On a personal note, I got really frustrated trying different things out, but I had a sense of satisfaction in the end. It took me 4 days to figure out the issues, and I know a thing or two about Linux!
