  • Additional Reducers

    Hi!

    I want to have multiple reducers in a Hadoop job to recreate the sorting problem described in the following blog post:
    http://www.philippeadjiman.com/blog/...-partitioning/

    Unfortunately the number-reducers property of the Hadoop job does not work for me, and I also failed to set the property via the JobFactoryBean.

    I have two VMs running as master/slave. Maybe there are some restrictions I don't understand that cause only one reducer to be started.

    For a longer description of the problem and some configuration snippets, see my post on Stack Overflow:
    http://stackoverflow.com/questions/1...le-hadoop-node

  • #2
    You've already figured out how to set the number of reducers - you can go a step further and simplify the config using the namespace.
    As for your test, if you want to get the FactoryBean itself and not the object it produces, you should ask for it by name with "&" in front.
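    For example (a minimal sketch of the "&" convention - the bean id "wordcountJob" and the context file name are just placeholders for whatever your configuration declares):
    Code:
    import org.apache.hadoop.mapreduce.Job;
    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class BeanLookup {
        public static void main(String[] args) {
            ApplicationContext ctx =
                new ClassPathXmlApplicationContext("/META-INF/spring/context.xml");

            // A plain lookup returns what the factory produces (the Hadoop Job)...
            Job job = ctx.getBean("wordcountJob", Job.class);

            // ...while prefixing the name with "&" returns the FactoryBean itself.
            Object jobFactory = ctx.getBean("&wordcountJob");

            System.out.println(job.getClass() + " / " + jobFactory.getClass());
        }
    }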

    As for your main question, the number of reducers is driven by the number of input splits - the parameter is really just a hint. See http://wiki.apache.org/hadoop/HowManyMapsAndReduces

    • #3
      Hi!

      Do you know a good way to retrace Hadoop's decisions concerning the number of reducers? Are there appropriate log files? (e.g. we have a certain number of mappers and a certain number of machines, therefore n reducers are used on machine a, m reducers on machine b, ...)

      • #4
        Sorry, no - googling around and the Hadoop mailing list are probably the best options. There are multiple parameters for tweaking the Hadoop runtime (some of them even conflict). The fundamental issue here is tuning the parallelism of the job to fit the M/R algorithm: the input itself needs to be properly splittable, but one also has to take the storage layer into account (how big the block size is, whether the input split spans multiple DFS blocks, etc.).
        As an alternative, you can try playing around with an ad-hoc cluster (like Amazon EMR) that resembles the real-world scenario more closely than your own setup; that will give you more accurate results and let you see more clearly what impact your tweaks have and whether your setup works the way you want.

        Hope this helps

        • #5
          OK, I was able to set the number of mappers and reducers for a job with additional parameters in the hadoop call:

          hadoop jar /usr/lib/hadoop-0.20-cdh3u4-examples.jar wordcount -D mapred.reduce.tasks=8 -D mapred.map.tasks=4 /input /output

          How do I configure this in Spring Hadoop? I've tried to set the number of mappers and reducers for the job in the Hadoop context:
          Code:
          <configuration>
          <!-- The value after the question mark is the default value if another value for hd.fs is not provided -->
          fs.default.name=${hd.fs:hdfs://master:9000}
          mapred.reduce.tasks=8
          mapred.map.tasks=4
          </configuration>

          This has no effect.

          The web interface shows a Map Task Capacity of 2 and a Reduce Task Capacity of 2 on a cluster in pseudo-distributed mode, so the job should use more mappers and reducers.

          P.s.: thank you for the help :-)

          • #6
            That's the correct way of setting the property (you can also set the property per-job):
            Code:
            <hdp:job ...>
               mapred.reduce.tasks=8
               mapred.map.tasks=4
            </hdp:job>
            Make sure the configuration is used by your job. If you want to use the jar, you can use the <hdp:jar/> namespace.
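            One quick way to check that the properties actually reached the job is to inspect the Job bean before running it - a rough sketch (the bean id "wordcount-job" and the context file name are placeholders for whatever your config declares):
            Code:
            import org.apache.hadoop.mapreduce.Job;
            import org.springframework.context.ApplicationContext;
            import org.springframework.context.support.ClassPathXmlApplicationContext;

            public class CheckJobConfig {
                public static void main(String[] args) {
                    ApplicationContext ctx =
                        new ClassPathXmlApplicationContext("/META-INF/spring/context.xml");
                    // <hdp:job/> exposes an org.apache.hadoop.mapreduce.Job bean.
                    Job job = ctx.getBean("wordcount-job", Job.class);
                    System.out.println(job.getConfiguration().get("mapred.reduce.tasks"));
                    System.out.println(job.getConfiguration().get("mapred.map.tasks"));
                }
            }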

            • #7
              By the way, note that if you're using a local mapred.job.tracker then Hadoop will force the number of map tasks to 1.
              From JobClient.java:

              Code:
                public void init(JobConf conf) throws IOException {
                  String tracker = conf.get("mapred.job.tracker", "local");
                  tasklogtimeout = conf.getInt(
                    TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
                  this.ugi = UserGroupInformation.getCurrentUser();
                  if ("local".equals(tracker)) {
                    conf.setNumMapTasks(1);
                    this.jobSubmitClient = new LocalJobRunner(conf);
                  } else {
                    this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
                  }        
                }

              • #8
                Hi!

                I've tried to set the number of reducers in the job configuration.

                Code:
                        <job id="wordcount-job" input-path="${wordcount.input.path:input/wordcount/}"
                                output-path="${wordcount.output.path:output/wordcount/}"
                                mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper" reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"
                                validate-paths="false">
                                mapred.reduce.tasks=10
                                mapred.map.tasks=10
                        </job>
                I've uploaded the whole sample repository here, if you want to take a look at the rest (but it should be straightforward and follows the tutorial on the Spring website).

                In theory it should be the same as the command
                Code:
                hadoop jar /usr/lib/hadoop-0.20-cdh3u4-examples.jar wordcount -D mapred.reduce.tasks=8 -D mapred.map.tasks=4 /input /output
                and multiple output files should show up in HDFS after job execution.

                Another thing: if the jobs are executed from the Spring Hadoop project, they never show up in the job history or the web interfaces. Should I configure the JobTracker somewhere? This might be the missing element...
                Last edited by romedius; Oct 9th, 2012, 02:21 AM.

                • #9
                  Sorry for the late reply. Whatever arguments you pass on the command line through -D get translated into configuration properties.
                  I've double checked this through a test on a local and remote cluster using the following config:
                  Code:
                      <hdp:job id="ns-job" 
                  ...
                          libs="classpath:mini-hadoop-examples.jar" >
                          mapred.reduce.tasks=8
                          mapred.map.tasks=4
                      </hdp:job>
                  In both tests "mapred.reduce.tasks" is 8. The number of map tasks is:
                  - 1 when running with the local job tracker (no "mapred.job.tracker" specified in the Hadoop config),
                  - 4 when running against a proper Hadoop config.
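                  (For reference, it is the "mapred.job.tracker" property that switches submission from the LocalJobRunner to a real cluster - which is also why locally run jobs never appear in the JobTracker web interface. A minimal sketch, where master:9001 is only an assumed JobTracker address:)
                  Code:
                  <hdp:configuration>
                      fs.default.name=${hd.fs:hdfs://master:9000}
                      mapred.job.tracker=${hd.jt:master:9001}
                  </hdp:configuration>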

                  You can verify this yourself by checking the job configuration (even inside the job).
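                  For instance, a throwaway mapper can dump the values the job was actually submitted with (a sketch only - the class name and types are arbitrary):
                  Code:
                  import java.io.IOException;
                  import org.apache.hadoop.conf.Configuration;
                  import org.apache.hadoop.io.LongWritable;
                  import org.apache.hadoop.io.Text;
                  import org.apache.hadoop.mapreduce.Mapper;

                  public class ConfigDumpMapper extends Mapper<LongWritable, Text, Text, Text> {
                      @Override
                      protected void setup(Context context) throws IOException, InterruptedException {
                          // Runs once per map task and prints the effective job configuration.
                          Configuration conf = context.getConfiguration();
                          System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
                          System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
                      }
                  }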

                  As for your sample, make sure you use spring-data-hadoop 1.0.0.BUILD-SNAPSHOT instead of M2 since it provides a lot of improvements.

                  • #10
                    Is the mini-hadoop-examples.jar the jar built containing the example code?

                    • #11
                      Yes - you can find it in the repo. It's not relevant though (I just used it to delimit where the XML element ends and the text starts).
                      You can use any other jar, or skip it altogether if you have the examples in the Hadoop classpath.
