  • Distribute third-party libraries to all nodes

    What is the correct way to distribute third-party jar files to all nodes running the Map-Reduce job? Should I use the 'libs' attribute in the job configuration, or the distributed cache? Also, more than one jar needs to be distributed (xalan, serializer, etc.).

    Also, is there a way to verify that either of these options has worked correctly, for example by checking a log or the job file? I have tried both and have not been able to run the job successfully, so I am not sure whether I am setting them up correctly.

    I configured the distributed cache as below; the jars are available at the indicated HDFS path:
    <hdp:cache create-symlink="true">
        <hdp:classpath value="/svjain/lib/xalan-2.7.1.jar" />
        <hdp:classpath value="/svjain/lib/serializer-2.7.1.jar" />
    </hdp:cache>

    When trying the job 'libs' option, I used the configuration below, where install.dir.win and library.path refer to a non-HDFS path (library.path=lib/*.jar):
    <hdp:job id="tempjob"
        input-path="xxx" output-path="xxx"
        mapper="xxx"
        jar-by-class="xxx"
        libs="${install.dir.win}/${library.path}"/>

  • #2
    xalan and serializer should not be needed, as the JRE/JDK already provides these libraries.
    Your configuration looks correct; however, note that when running from a Windows client, you need to change the JDK path separator.
    See http://static.springsource.org/sprin...tributed-cache for more information.
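
    To illustrate the symptom (a hypothetical job.xml excerpt using the paths from the configuration above; the exact value will differ): when the job is submitted from a Windows client, the cache classpath is joined with ';', which the Linux nodes cannot split:

    mapred.job.classpath.files=/svjain/lib/xalan-2.7.1.jar;/svjain/lib/serializer-2.7.1.jar

    whereas on a Linux cluster the expected form uses ':':

    mapred.job.classpath.files=/svjain/lib/xalan-2.7.1.jar:/svjain/lib/serializer-2.7.1.jar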

    • #3
      Thanks Costin. I wanted to use Apache Xalan rather than the libraries bundled with Java (the XSLs I am using don't compile with the Sun version). So, is the distributed cache the right approach, or should I use the 'libs' attribute in the job configuration? I have a feeling the jars are not getting copied to the nodes; is there any way to verify that?

      • #4
        The distributed cache helps with the classpath and, if you follow the links I've mentioned, the jars will be copied over to the nodes - you can double-check this by looking at the Hadoop logs.
        However, the classpath does not take precedence over the libraries included with the JVM, so in this case you need to use the endorsed mechanism [1].
        As far as I know, Hadoop doesn't provide any support for this out of the box, so you would have to manually ship the libraries to the dedicated JVM folder on each node.
        Basically, it's not so much a Hadoop problem as a JVM one...

        [1] http://docs.oracle.com/javase/6/docs...des/standards/
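
        As a rough sketch only (neither Hadoop nor Spring for Apache Hadoop sets this up for you, and the /usr/java/endorsed path is hypothetical): once the jars have been copied manually to a local directory on every task node, the endorsed mechanism can be pointed at it through the task JVM options, for example:

        <!-- hypothetical sketch: /usr/java/endorsed must already contain the xalan/serializer jars locally on every node -->
        <hdp:configuration>
            mapred.map.child.java.opts=-Djava.endorsed.dirs=/usr/java/endorsed
        </hdp:configuration>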

        • #5
          Thanks. I cannot control the Hadoop environment (the classpath in hadoop-env.sh) or the Java installation, since it is a shared Hadoop environment. However, to ensure that the Apache transformer is used rather than the Sun implementation, I am passing the following Java options to the map-reduce environment. This seems to work well, but I am struggling to get these jars onto the nodes.
          <hdp:configuration>
              fs.default.name=${fs.default.name}
              mapred.job.tracker=${mapred.job.tracker}
              mapred.map.child.java.opts=-Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl
          </hdp:configuration>

          • #6
            Then make sure the distributed cache actually works - if properly configured, the jars will be made available on your job classpath, at which point the mapred.map.child.java.opts setting should be picked up.

            There might be various reasons why the property is not picked up, so I recommend experimenting with it - for example, first do some small tests to see whether the xalan processor is available on the classpath. Then check that the properties you are setting are not marked final and are actually passed to the JVM.

            By the way, the distributed cache makes the jars available inside HDFS for Hadoop jobs. In your case, you want these jars to be available on the actual file system so the JVM process can find them when it starts. Without actually having the jars there, there may not be much you can do with DistributedCache...
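
            For example, a property marked final on the cluster side cannot be overridden from the client, so a value like this in the cluster's mapred-site.xml would silently discard your setting (values below are purely illustrative):

            <property>
                <name>mapred.map.child.java.opts</name>
                <value>-Xmx200m</value>
                <final>true</final>
            </property>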

            • #7
              Costin, thanks a lot. It seems the issue was related to the path separator you pointed out in your first reply. My Hadoop cluster runs on Linux, but I am triggering the job from the Spring Tool Suite IDE on Windows. I checked the job.xml configuration and could see ";" in the mapred.job.classpath.files property value.

              In the application context, I have now configured a script to set ":" as the path separator just before the distributed-cache configuration, and the job appears to be running fine now.
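
              A rough sketch of that kind of workaround, assuming the <hdp:script> element with inline JavaScript is available in the Spring for Apache Hadoop version in use (check the reference documentation linked earlier for the exact syntax):

              <!-- hypothetical sketch: declared before the <hdp:cache> definition from the first post -->
              <hdp:script language="javascript" run-at-startup="true">
                  // the cluster runs on Linux, so replace the Windows default ';' separator
                  java.lang.System.setProperty("path.separator", ":")
              </hdp:script>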

              • #8
                Great - I got bitten by that bug as well earlier in the process (hence the issue raised on the Hadoop tracker). The docs have also been updated to describe the workaround (not sure whether you've checked out the latest doc snapshot).
