Anybody working with Hadoop has probably run into the same common issue: how do you add third-party libraries to your MapReduce job?

Add libjars option

The first solution, and perhaps the most common one, is to add libraries with the -libjars parameter on the CLI. For this to work, your class MyClass must use the GenericOptionsParser class. The easiest way is to implement the Hadoop Tool interface, as described in the post Hadoop: Implementing the Tool interface for MapReduce driver.

$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar /path/to/my.jar com.wordpress.hadoopi.MyClass -libjars ${LIBJARS} value

This obviously only works from the CLI, so how the heck can we add such external jar files when not using the CLI?

Add jar files to Hadoop classpath

You could certainly upload external jar files to each tasktracker and update HADOOP_CLASSPATH accordingly, but are you really willing to bother the Ops team each time you need to add a new jar? This works well on a single-node server, but are you going to upload the jar to all 10, 100 or even more Hadoop nodes? This approach does not scale at all!

Create a fat jar
Another approach is to create a fat jar, that is, a JAR that contains your own classes as well as the third-party classes (see this Cloudera blog post for more details). Be aware that this JAR will contain not only your classes but also all of your project's dependencies (including the Hadoop libraries) unless you explicitly exclude them (by giving them the provided scope in Maven).
Following a “mvn clean package” command, your fat JAR will be located in the Maven project’s target directory as follows:

drwxr-xr-x 2 antoine staff       68 Jun 10 09:30 archive-tmp
drwxr-xr-x 3 antoine staff      102 Jun 10 09:29 classes
drwxr-xr-x 3 antoine staff      102 Jun 10 09:29 generated-sources
drwxr-xr-x 3 antoine staff      102 Jun 10 09:29 generated-test-sources
drwxr-xr-x 3 antoine staff      102 Jun 10 09:29 maven-archiver
drwxr-xr-x 4 antoine staff      136 Jun 10 09:29 myproject-1.0-SNAPSHOT
-rw-r--r-- 1 antoine staff 63880020 Jun 10 09:30 myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
drwxr-xr-x 4 antoine staff      136 Jun 10 09:29 surefire-reports
drwxr-xr-x 4 antoine staff      136 Jun 10 09:29 test-classes
In the above example, note the actual size of your JAR file (61MB). Quite fat, isn’t it?
Use Distributed cache

This is the approach I always follow when using third-party libraries in my MapReduce jobs. Some would say it is not elegant, but it lets me work without annoying anyone from the Ops team :). I first create a directory “lib” in my HDFS home directory (“/user/hadoopi/”). You could even use “/tmp”; it does not matter. I then create a static method that
Simply add the following lines to some Utils class.
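The original listing did not survive on this page, so here is a minimal sketch of what such a utility could look like. It is not the author's code. It assumes Hadoop 2 or later (the org.apache.hadoop.mapreduce.Job API), and the class name JobLibs and the method name addJarsToClassPath are hypothetical names chosen for illustration. The method lists a directory and registers every .jar file it finds on the task classpath through the distributed cache.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

/** Hypothetical helper: class and method names are illustrative only. */
public final class JobLibs {

    private JobLibs() {
        // Static utility class, not meant to be instantiated.
    }

    /**
     * Adds every *.jar file found directly under libDir to the classpath
     * of the job's tasks, using the distributed cache.
     *
     * @param job    the job being configured (must not be submitted yet)
     * @param libDir directory holding the third-party JAR files
     * @return the number of JAR files that were added
     */
    public static int addJarsToClassPath(Job job, Path libDir) throws IOException {
        Configuration conf = job.getConfiguration();
        // Resolve the file system that owns libDir (HDFS on a real cluster).
        FileSystem fs = libDir.getFileSystem(conf);

        int added = 0;
        for (FileStatus status : fs.listStatus(libDir)) {
            Path file = status.getPath();
            // Skip sub-directories and anything that is not a JAR file.
            if (status.isFile() && file.getName().endsWith(".jar")) {
                job.addFileToClassPath(file);
                added++;
            }
        }
        return added;
    }
}
```

In a driver, a call such as JobLibs.addJarsToClassPath(job, new Path("/user/hadoopi/lib")) would go after the Job has been created and before job.waitForCompletion(true), assuming the JAR files have already been copied into that HDFS directory.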
The only thing you need to remember is to call this method before submitting the job…
Here you are, your MapReduce is now able to use any external JAR file.