  • Specifying a JobJar in the Tool Tasklet.

    Hi everyone,

    I have a use case where I need to configure several Hadoop Tool jobs in my project, and the way I do it is with the following configuration in spring.cfg.xml:

    Code:
    <hdp:tool-tasklet id="testId" scope="step" configuration-ref="hadoop-configuration" tool-class="com.test.myClass">
        <!-- Some properties -->
    </hdp:tool-tasklet>
    The jar file that contains the Tool class is included as a dependency in my project and it works fine. However, there is a problem I am facing: I have several job JAR files with dependencies, each bundling its own versions of libraries, and since I have included all of these job JARs as dependencies of my project, there are a bunch of duplicate classes/libraries which can potentially be different versions.

    So here is my question: is there a way to run a Tool class by specifying the jar location, as is possible with Hadoop command-line arguments such as -files or -libjars?

    Can you suggest some other method of running Tool classes without loading the actual JAR file on the classpath and without using the tool-class attribute?

    P.S.: I am using spring-data-hadoop version 1.0.0.M1.

    Thanks in advance.


    Sincerely,
    David Gevorkyan

  • #2
    Hi David,

    We currently don't expose these parameters on the Tool namespace (as we do with streaming or job) - this looks like an omission. Can you please raise an issue on our tracker? Also, if you can, indicate what the command line looks like or what you would like to see in the namespace.

    Cheers,

    • #3
      Raised issue https://jira.springsource.org/browse/SHDP-49
      Feel free to use that to follow progress.

      • #4
        Hi Costin,

        Thanks for the quick reply.

        Actually, besides exposing the JAR file to the Tool namespace, we also need the "-files" parameter, since we have some use cases where we need to provide a properties file on the fly, dynamically.

        So our command line looks like this:

        Code:
        hadoop jar fullpath:myJar_withDependencies.jar -files fullpath:myProp.properties -Dprop1=value1 -Dprop2=value2 -Dconfig=myProp.properties
        So ideally I want to be able to specify any file (such as the properties file in the above example) to be uploaded to the cluster, and also to be able to specify the jar with dependencies to be uploaded to the server.
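
        For reference, a rough sketch of how such a Tool typically picks up these options (MyTool is a hypothetical class name; ToolRunner and Hadoop's GenericOptionsParser are what turn -D and -files into Configuration entries and distributed-cache files):

        Code:
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        // Hypothetical example - shows how the -D and -files options above reach the Tool.
        public class MyTool extends Configured implements Tool {

            @Override
            public int run(String[] args) throws Exception {
                Configuration conf = getConf();
                // ToolRunner (via GenericOptionsParser) has already copied -Dprop1=value1,
                // -Dprop2=value2 and -Dconfig=myProp.properties into the Configuration.
                String prop1 = conf.get("prop1");
                String configFile = conf.get("config"); // "myProp.properties"
                // The file passed with -files is shipped to the cluster and is readable
                // by the tasks under that name from their working directory.
                System.out.println(prop1 + " / " + configFile);
                return 0;
            }

            public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
            }
        }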

        So if you could expose the same parameters on the Tool namespace as you have done for the streaming job, that would be great - namely "file", "archive" and "lib".

        Sincerely,
        David

        • #5
          Hi David,

          I'm almost done with exposing the params (file/archive/lib) but I'm not sure about the "jar" param. The hadoop jar command currently just calls the Main class of the jar as a way to pass in configuration (the command-line arguments). That's not needed in a Spring app since it throws out any existing configuration (including the Hadoop one).

          With the upcoming improvements the command above would look like this:

          Code:
          <hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" configuration-ref="hadoop-configuration"
              properties-location="myProp.properties" files="myProp.properties">
              <hdp:arg value="data/in.txt"/>
              <hdp:arg value="data/out.txt"/>
              prop1=value1
              prop2=value2
          </hdp:tool-runner>
          Note that the Tool instance (which can be configured) or class is still required. That's because the Tool (which is just a glorified Main) is executed in-process - we don't create a separate JVM for it, so it needs to be available. If I understand your case correctly, you have a lot of dependencies, but that shouldn't be a problem since we only load the tool class and disregard the rest of the classes; as long as your tool does the same, there shouldn't be a problem.
          Let me know whether this solves your problem and, if not, why.
          Last edited by Costin Leau; Apr 12th, 2012, 11:38 AM.

          • #6
            Committed the updates to master - you can pick up the changes in the next snapshot.

            • #7
              Hi Costin,

              The issue is that we are attempting to replace our current shell script with Spring Batch. The shell script looks something like:

              hadoop -jobjar job1.jar ...
              ...
              hadoop -jobjar job2.jar ...
              .......
              hadoop -jobjar job10.jar ...


              These job jars have conflicting versions of libraries in them (for example Jackson 1.4 and Jackson 1.9.4), and even have different versions of Spring contained within them.

              How would you propose handling this case? We cannot simply put all 10 jars on the classpath. Perhaps a classloader approach would work?
              Last edited by davidgevorkyan; Apr 12th, 2012, 01:32 PM.

              • #8
                Out of curiosity, what does the jar contain? Do you specify a main class or use the MANIFEST.MF instead? And what does the "main" class do? Does it implement certain interfaces or contracts?

                Back to your use case, there are some problems here:

                a) Each of your commands forks a separate VM. In each one the jars are put on the classpath, but since each sits in a separate VM, there are no conflicts.
                b) Everything is command-line based. This means any configuration used needs to be passed through there (whether it's a bootstrapping property file or not).
                c) The Main class isn't application friendly - as far as it's concerned it's the only app running, so it tends to call System.exit(). We might be able to bypass that (bytecode instrumentation), but I'd like to avoid it if possible since there are a lot of subtleties involved.

                b) doesn't make a lot of sense in an app (whether it uses Spring or not) since it simply disregards the application context and only looks at the command line. a) and c) might be addressed by using a dedicated classloader.

                I'll try to come up with something; however, in the meantime you might be able to get around this by pointing directly to the job that the tool/main is setting up. It's not ideal, but it's worth a try. This can work since SHDP doesn't need or use the job, as we're taking care of the Hadoop setup only.

                • #9
                  The jar file contains every dependency of the Hadoop job. This is standard for older versions of Hadoop (newer ones do support something closer to a classpath). It's basically equivalent to jar -xf *.jar ; jar -cf job1.jar * with a little cleanup.

                  An example job looks something like:

                  Code:
                  public class Job implements Tool {
                      // ... run(String[] args) and the rest of the Tool implementation omitted ...

                      public static void main(String[] args) throws Exception {
                          Configuration config = new Configuration();
                          DateTime date = new DateTime();
                          config.setLong(JobConfFactory.CURRENT_DATE_IN_MILLS, date.getMillis());
                          System.exit(ToolRunner.run(config, new Job(), args));
                      }
                  }

                  I'm pretty sure that to support the Hadoop job-jar concept we need a tasklet that uploads the job jar to HDFS, creates a custom classloader, loads the local job jar into it, and then uses the existing tasklet code. This would prevent the classpath pollution of placing multiple job jars on the Spring Batch classpath.
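
                  A rough sketch of that classloading step, for illustration only (JobJarRunner is a hypothetical name, not SHDP's actual implementation; only the standard JDK URLClassLoader is assumed):

                  Code:
                  import java.io.File;
                  import java.net.URL;
                  import java.net.URLClassLoader;

                  // Hypothetical helper - loads a Tool class from a job jar in a dedicated
                  // classloader so the jar's bundled libraries don't leak into the Spring Batch classpath.
                  public class JobJarRunner {

                      public static Object loadTool(File jobJar, String toolClassName) throws Exception {
                          URL[] urls = { jobJar.toURI().toURL() };
                          // Parent-first delegation is shown; a child-first loader would be needed if the
                          // job jar has to override classes already present on the application classpath.
                          URLClassLoader loader = new URLClassLoader(urls, JobJarRunner.class.getClassLoader());
                          Class<?> toolClass = Class.forName(toolClassName, true, loader);
                          return toolClass.getDeclaredConstructor().newInstance();
                          // The returned instance could then be handed to the existing tasklet code
                          // (e.g. ToolRunner.run(conf, (Tool) tool, args)), ideally with the thread
                          // context classloader switched to 'loader' for the duration of the run.
                      }
                  }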
                  Last edited by davidgevorkyan; Apr 12th, 2012, 02:35 PM.

                  • #10
                    Thanks - the information is useful. There might be an easier solution than the one you mentioned, but testing will tell whether it works or not.
                    Out of curiosity, do you specify the class name or use the MANIFEST.MF instead?

                    • #11
                      I am actually specifying the classname.

                      • #12
                        Hi David,

                        I've updated the tool support so that a jar file (not available on the classpath) can now be specified - the loading is done in a separate classloader, so multiple library versions can be used:

                        Code:
                        <hdp:tool-runner id="tool-jar" tool-class="test.SomeTool" jar="some-tool.jar"/>
                        Note that currently we don't do any copying or unpacking of the jar, so things like nested /libs or /classes won't work - I'll add support for these (legacy) formats after (Orthodox) Easter. Feedback is welcome - knowing the structure of your jars would also be useful.

                        Cheers,

                        • #13
                          Hi Costin,

                          Thanks for looking into this.

                          Can I get the latest artifact from somewhere?

                          Regarding your question about our jar structure: it doesn't have any nested libs, so it only contains the META-INF directory and the compiled classes in their corresponding package directories.

                          Sincerely,
                          David

                          • #14
                            Of course, see [1]. Simply add the snapshot repo to your Gradle/Maven build and the SHDP version and its dependencies (including the non-SpringSource ones) will be downloaded from there.

                            [1] http://www.springsource.org/spring-data/hadoop#maven

                            • #15
                              Thanks Costin,

                              I downloaded the latest snapshot version and found one issue.

                              It seems that you have removed the scope attribute in the latest version.
                              This attribute is required for our cases, since we construct the arguments from the jobParameters, and those values are resolved only when the tasklet has scope="step".
                              See below for an example of a tasklet that uses "jobParameters".

                              Code:
                              <hdp:tool-tasklet id="taskletId" scope="step" configuration-ref="hadoop-configuration" tool-class="SomeClass">
                                      <hdp:arg value="${propertyVal1}#{jobParameters['RUN_ID']}${propertyVal2}"/>
                              </hdp:tool-tasklet>
                              Sincerely,
                              David
