Shipping multiple dependencies with Cascading job.

  • Shipping multiple dependencies with Cascading job.

    We're having an issue getting our dependencies shipped to the remote JobTracker that runs our Cascading job.

    Our job is a pretty straightforward Cascading job:

    Code:
    <bean id="flowDef" class="my.flow.factory.Class"
          factory-method="flowDef" c:_0="/input/path" c:_1="/output/path"/>
    
    <hdp:cascading-flow id="wc" definition-ref="flowDef" write-dot="dot/wc.dot"
                        jar-setup="true" jar-by-class="my.flow.factory.Class"/>
    <hdp:cascading-cascade id="my-cascade" flow-ref="wc"/>
    <hdp:cascading-tasklet id="my-cascade-tasklet"
                           unit-of-work-ref="my-cascade" wait-for-completion="true"/>
    
    <batch:job id="my-cascading-job">
        <batch:step id="cascade-step">
            <batch:tasklet ref="event-history-cascade-tasklet"/>
        </batch:step>
    </batch:job>
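    For reference, the flowDef factory method that the bean above points at looks roughly like this (a simplified sketch; MyFlowFactory stands in for my.flow.factory.Class, and MyCustomFunction is one of the custom operations mentioned below):

    Code:
    import cascading.flow.FlowDef;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    // Stands in for my.flow.factory.Class referenced by the flowDef bean.
    public class MyFlowFactory {

        // Static factory method wired up via factory-method="flowDef"; the two
        // c:_0/c:_1 values from the XML arrive here as inputPath/outputPath.
        public static FlowDef flowDef(String inputPath, String outputPath) {
            Tap source = new Hfs(new TextLine(), inputPath);
            Tap sink = new Hfs(new TextLine(), outputPath);

            Pipe pipe = new Pipe("wc");
            pipe = new Each(pipe, new MyCustomFunction()); // one of our custom operations

            return FlowDef.flowDef()
                    .setName("wc")
                    .addSource(pipe, source)
                    .addTailSink(pipe, sink);
        }
    }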
    We are using jar-by-class because we have some custom operations in our flow (one is sketched below). Before we added jar-by-class, we were getting a ClassNotFoundException for those operations.
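
    A trimmed-down sketch of one such operation (the class name is made up):

    Code:
    import cascading.flow.FlowProcess;
    import cascading.operation.BaseOperation;
    import cascading.operation.Function;
    import cascading.operation.FunctionCall;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;

    // Example custom operation; classes like this are why our own jar has to be shipped.
    public class MyCustomFunction extends BaseOperation implements Function {

        public MyCustomFunction() {
            super(1, new Fields("upper")); // one incoming argument, one declared output field
        }

        @Override
        public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
            String line = functionCall.getArguments().getTuple().getString(0);
            functionCall.getOutputCollector().add(new Tuple(line.toUpperCase()));
        }
    }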

    We no longer get a ClassNotFoundException for our custom operations, but we are now seeing one for a Cascading class:

    Code:
    java.io.IOException: Split class cascading.tap.hadoop.io.MultiInputSplit not found
      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:392)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
      at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.ClassNotFoundException: cascading.tap.hadoop.io.MultiInputSplit
      at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:247)
      at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:861)
      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:390)
      ... 7 more
    Is there a good way to ship multiple JARs and add them to the classpath? The relevant documentation doesn't mention this case:
    Note that no jar needs to be set up - the Cascading namespace (in particular cascading-flow, backed by FlowFactoryBean) tries to automatically set up the resulting job classpath. By default, it will automatically add the Cascading library and its dependencies to the Hadoop DistributedCache so that when the job runs inside the Hadoop cluster, the jars are properly found. When using custom jars (for example to add custom Cascading functions) or when running against a cluster that is already provisioned, one can customize this behaviour through the jar-setup, jar and jar-by-class attributes. For Cascading users, these settings are the equivalent of AppProps.setApplicationJarClass().

  • #2
    With Cascading this issue shows up fairly fast since the Cascading configs are actual Java classes. SHDP makes a best effort (as mentioned in the docs) to make the Cascading classes available, but as you've discovered, any user class will be ignored since the framework cannot know which one to use.
    We tried automatic scanning/class detection, but that tends to be slow and more often than not has unintended consequences, as either too many or too few classes end up being pulled in.
    So rather than reinventing the wheel, for such cases it's best to use the same approach that vanilla Cascading recommends: pack the classes that you need, along with Cascading (if it's not available on the cluster), into a jar and then point SHDP to it.
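    For example, assuming you build such an uber jar (the path below is made up), pointing SHDP at it comes down to using the jar attribute mentioned in the docs you quoted, rather than jar-by-class:

    Code:
    <!-- Sketch only: my-flow-with-deps.jar packs your custom operations
         plus the Cascading jars (since they are not on the cluster). -->
    <hdp:cascading-flow id="wc" definition-ref="flowDef"
                        jar="file:/path/to/my-flow-with-deps.jar"/>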

    This approach is somewhat inconvenient for those working inside an IDE, but as a framework there's only so much we can do to supplement the tooling.

    I've added an improvement in SHDP by refining the jar setup behavior of the Cascading flow factory.
    That is, one can now use add-cascading-jar (true by default) to have the Cascading jars added automatically to the classpath through the DistributedCache.
    In your case, this means you can simply point the configuration to your own jar without having to bundle Cascading as well - and if you do bundle it, you can just disable this behavior.
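    In other words, roughly (a sketch against the nightly; the jar path is made up):

    Code:
    <!-- Your jar only needs to contain your own operations; the Cascading jars
         get shipped to the cluster automatically via the DistributedCache.
         Set add-cascading-jar="false" if your jar already bundles Cascading. -->
    <hdp:cascading-flow id="wc" definition-ref="flowDef"
                        jar="file:/path/to/my-operations.jar"
                        add-cascading-jar="true"/>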

    Any build from #558 [1] onwards has this option - see the project homepage for how to get the nightly build (i.e. use our Maven repository).

    [1] https://build.springsource.org/brows...OP-NIGHTLY-558
