Status of spring-hadoop and reporting for batches

  • Status of spring-hadoop and reporting for batches

    Hi,

    While wandering on Github, I found the spring-hadoop project. SpringSource have done a terrible work with it. What is its status?

    I can't find any documentation. Is there any? The SpringSource section for spring-hadoop does not point to any (http://www.springsource.org/spring-data).

    I have been using Hadoop with Spring for almost a year at my current company, a French startup called Kadeal. We built our own wrapper around it. It lets us run jobs and configure them through a programmatic API. Input/output value types are checked at compile time using generics.

    Spring-hadoop promises a tighter integration of Hadoop with the Spring framework and could indeed be really useful.

    Interesting points are:
    * extension of Resource for HDFS
    * custom NamespaceHandler for configuration
    * support for the latest Hadoop API, i.e. o.a.h.mapreduce instead of o.a.h.mapred
    * extension of ConversionService for mapping simple types to simple Writable types
    * abstraction of the mapreduce framework, i.e. no direct Hadoop dependencies
    * JobTemplate and GenericJobRunner

    What I have not figured out yet:
    * how would you configure a Hadoop job run by GenericJobRunner?
    Let's say I want to change a business threshold, what would be the best way to do it? Of course, a custom configuration might not be a single value but many properties...

    * how is reporting handled?
    It might be a bad decision but we used Hadoop counters for reporting broad actions of our jobs. Let's say I want to process 300 000 items and that 100 000 have been discarded. I would like to know why, even though I would not want to read every reason for every item one by one.

    With spring-hadoop, there is no longer direct access to the Hadoop API within the mapreduce functions. How would you build crude reporting? What are the best practices?

    I have in fact the same question for spring-batch. We built a crude reporting system based on a core class which looks like Hadoop's counters.
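
    As an illustration of the counter-style reporting described above, here is a minimal sketch in plain Java. The names DiscardReason and DiscardReport are hypothetical; this is not Kadeal's code and it does not use the Hadoop or Spring Batch APIs. It only shows the idea of keeping one running total per discard reason instead of logging every item.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical discard reasons; a real job would define its own.
enum DiscardReason { MALFORMED, DUPLICATE, BELOW_THRESHOLD }

// Counter-style report: one running total per reason,
// similar in spirit to Hadoop counters (assumption: single-threaded use).
final class DiscardReport {
    private final Map<DiscardReason, Long> counts = new EnumMap<>(DiscardReason.class);
    private long processed;

    // Call once for every item that was kept.
    void accepted() {
        processed++;
    }

    // Call once for every discarded item, with the reason.
    void discarded(DiscardReason reason) {
        processed++;
        counts.merge(reason, 1L, Long::sum);
    }

    long processed() {
        return processed;
    }

    long count(DiscardReason reason) {
        return counts.getOrDefault(reason, 0L);
    }

    long totalDiscarded() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

    In a distributed job, each task would need its own instance, and the per-reason totals would have to be summed centrally, which is the part Hadoop counters handle for you.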

    Regards,
    Bertrand Dechoux

    PS: Spring-hadoop is officially a subproject of spring-data. But at the same time, Hadoop is a nice tool for batches... So posting on this forum (spring-batch) makes sense from my point of view.

  • #2
    Originally posted by bertrandDechoux
    While wandering on Github, I found the spring-hadoop project. SpringSource have done a terrible work with it. What is its status?
    I assume you mean "terrific work", despite the niggles? Status is under development, but very early stages so far, and all feedback and contributions will be hugely appreciated, especially from anyone who has been using Spring and Hadoop. I wouldn't use it in a real project unless you are prepared for some API changes, but it would be great if you tried it out and pushed its limits a bit.


    I can't find any documentation. Is there any?
    Not yet, sorry. We haven't had time, but you seem to have worked out quite a lot from the source code, so that's good.

    What I have not figured out yet:
    * how would you configure a Hadoop job run by GenericJobRunner?
    Let's say I want to change a business threshold, what would be the best way to do it? Of course, a custom configuration might not be a single value but many properties...
    You have all choices available to you that are available to any Spring application - e.g. PropertyPlaceholderConfigurer with Properties sourced from any location you like (including HDFS since it has a Resource abstraction). The typical pattern for that would be to pass in -D arguments and use those to locate environment specific configuration or properties files.
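
    A minimal sketch of the "-D argument locates the properties file" pattern, written in plain Java without Spring so it stays self-contained. Everything named here is an assumption for illustration: the env system property, the job-<env>.properties file naming, the discard.threshold key and its default. In a Spring application, PropertyPlaceholderConfigurer would take over the loading and injection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; not part of spring-hadoop or Spring.
final class JobSettings {

    // Pick the configuration file from a -Denv=... argument (default: dev).
    static String configLocation() {
        String env = System.getProperty("env", "dev");
        return "job-" + env + ".properties";
    }

    // Load properties from any character source (file, classpath, HDFS stream...).
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Read the business threshold, falling back to an assumed default of 0.5.
    static double threshold(Properties props) {
        return Double.parseDouble(props.getProperty("discard.threshold", "0.5"));
    }
}
```

    Launched with -Denv=prod, configLocation() would return job-prod.properties, and the threshold would come from whatever that file defines.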

    Spring Hadoop also aims to offer facilities for copying data from the Spring context into Hadoop Configuration (so it is transported to the remote nodes), and back again - look at various FactoryBean instances in the current codebase. I don't think this area is very polished yet, so feel free to make suggestions.

    GenericJobRunner is also in a very preliminary state, and it's quite likely that we will add some support for additional command line options. I don't want to complicate it too much though, so I'd be interested to hear what you would like to do exactly.

    * how is reporting handled?
    It might be a bad decision but we used Hadoop counters for reporting broad actions of our jobs. Let's say I want to process 300 000 items and that 100 000 have been discarded. I would like to know why, even though I would not want to read every reason for every item one by one.
    This area is completely undeveloped. You can definitely help if you have some code you can share. So far we have only exposed the obvious stuff from the MapReduce framework to our Spring POJO model (e.g. Mapper, Reducer, InputFormat, etc.), but we are open to suggestion for what to do next.

    I have in fact the same question for spring-batch. We built a crude reporting system based on a core class which looks like Hadoop's counters.
    Spring Batch has a lot of listener callback interfaces, e.g. SkipListener or Item*Listener would help with your failed items requirement. If you want stuff collected centrally from a distributed system there is the ExecutionContext (and maybe ExitStatus in the StepExecution and JobExecution).

    PS: Spring-hadoop is officially a subproject of spring-data. But at the same time, Hadoop is a nice tool for batches... So posting on this forum (spring-batch) makes sense from my point of view.
    Posting here is fine since I don't look at Spring Data forums much. I agree that there isn't an obvious connection, but I wouldn't get hung up on it. Spring Hadoop is also not specifically tied to Spring Batch, so it might in the end deserve its own product management efforts. For now it is camping with Spring Data because Mark Pollack offered to include it, but it might end up anywhere.



    • #3
      I assume you mean "terrific work", despite the niggles?
      I meant "formidably great". But since I am not a native English speaker, I sometimes use strange vocabulary.

      "Terrific" because after inspecting the code, I understand the approach (at least a part) and its usefulness, yet I would not have thought about it myself.

      Spring Hadoop also aims to offer facilities for copying data from the Spring context into Hadoop Configuration (so it is transported to the remote nodes), and back again - look at various FactoryBean instances in the current codebase. I don't think this area is very polished yet, so feel free to make suggestions.
      Thanks for the hint. I will definitely look at them.

      This area is completely undeveloped. You can definitely help if you have some code you can share. So far we have only exposed the obvious stuff from the MapReduce framework to our Spring POJO model (e.g. Mapper, Reducer, InputFormat, etc.), but we are open to suggestion for what to do next.
      This area definitely interests me. But before making any suggestion, I need to take the time to reflect on what we have built internally and what spring-hadoop currently offers. It is good to have many features, but it is even better when they sit within a good, simple design, as spring-hadoop shows.

      Spring Batch has a lot of listener callback interfaces, e.g. SkipListener or Item*Listener would help with your failed items requirement. If you want stuff collected centrally from a distributed system there is the ExecutionContext (and maybe ExitStatus in the StepExecution and JobExecution).
      Our reporter was built using those listeners, as they indeed provide good insight into the batch life cycle. I will look more closely at the ExecutionContext and ExitStatus.

      Thanks a lot for your feedback.



      • #4
        Hi,
        I am also very interested in this project. At my company, we are using spring batch for batch processing jobs, and I find it to be very limiting. Hadoop serves some of our purposes better.

        I am having issues with spring-hadoop though. The repositories on this page: http://www.springsource.org/spring-data/hadoop don't seem to work. I can't build the artifact locally due to MANIFEST.MF not being found.

        I added
        Code:
        private static final long serialVersionUID = 7707L;
        to HadoopException to help it compile. But I don't know how to fix the Manifest exception:
        Code:
        [ERROR] BUILD ERROR
        [INFO] ------------------------------------------------------------------------
        [INFO] Error assembling JAR
        
        Embedded error: Manifest file: /Users/rbanerjee/spring-hadoop/spring-hadoop-core/target/classes/META-INF/MANIFEST.MF does not exist.
        Any help would be much appreciated. I could fork and put my changes into a pull request if I can get this artifact to build.

        Thanks,
        Rajat



        • #5
          I can build from master (with warnings from bundlor about the MANIFEST), but I fixed that anyway and pushed the change, so can you try again? A pull request would be more than welcome. Make sure you fill in the contributor's agreement (link on the README) first. Thanks for the interest - all contributions are more than welcome including feature requests.



          • #6
            Cool, thank you. I have been able to build, install the jar, and use some of the dependencies in my project.

            My project was already working as a standalone Hadoop job. I was running up against a brick wall while trying to wire all of my beans manually and then get JDBC and Hibernate to work for my DAOs. Hopefully this stuff will help!

            I am trying to get my job running from the command line now. I'll post with my results. Thanks. I'll look for opportunities to add some code.
