
  • Spring Batch - database input, database output

    Hi,


    I need to build a batch process that reads records (in pages) from the database and feeds each record to a processing handler, which processes the input and stores some results back to the database. I would like to use a multithreaded pool where each thread processes one record; a rough sketch of the intended flow follows the questions below.
    My questions are:
    1) Can Spring Batch deal with such situations?
    2) Is there a Spring thread pool available?
    3) Assuming Spring Batch can provide this functionality, are there any limitations I have to keep in mind?
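
    To make the intent concrete, here is the rough sketch I mentioned above, in plain java.util.concurrent terms (RecordDao, RecordHandler and Record are placeholders for our own application types, and the pool and page sizes are arbitrary):

    Code:
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Placeholder sketch of the intended flow; RecordDao, RecordHandler
    // and Record stand in for our own application types.
    public class PagedProcessingSketch {

        public void run(RecordDao dao, final RecordHandler handler) {
            ExecutorService pool = Executors.newFixedThreadPool(10); // one thread per record in flight
            List<Record> page = dao.nextPage(100);                   // read records in pages
            while (!page.isEmpty()) {
                for (final Record record : page) {
                    pool.execute(new Runnable() {
                        public void run() {
                            handler.process(record); // results stored back to the database
                        }
                    });
                }
                page = dao.nextPage(100);
            }
            pool.shutdown();
        }
    }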

    Many thanks for helping with this,

    Stefan

  • #2
    You do have to be careful with restartability and synchronization of the input source. We recommend a "process indicator" pattern in the input data (or a staging table, as in the sample) - this is described in the reference guide (http://static.springframework.org/sp...ns.html#d0e573) and also in Wayne Lund's talk at TSE, which should be available on the website. See the parallelJob sample for an example. N.B. the best idiom for this kind of thing will change with the m4 release, when we start providing chunk-oriented processing.

    Also note that the thread pool model is not Spring Batch - we just use the TaskExecutor strategy from Spring Core (see the Spring Core documentation).
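
    For illustration only, a minimal sketch of wiring up such a pool (ThreadPoolTaskExecutor comes from Spring 2.0's scheduling support; the pool sizes here are arbitrary, not a recommendation):

    Code:
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

    // Sketch only: pool sizes are arbitrary, not a Spring Batch recommendation.
    public class PoolConfigSketch {

        public TaskExecutor createTaskExecutor() throws Exception {
            ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
            taskExecutor.setCorePoolSize(10);  // one thread per concurrently processed record
            taskExecutor.setMaxPoolSize(10);
            taskExecutor.afterPropertiesSet(); // required when created outside a Spring context
            return taskExecutor;
        }
    }

    In a typical deployment you would declare the executor as a Spring bean instead, so the container calls afterPropertiesSet for you.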



    • #3
      Does anybody know where to find a copy of Wayne's talk? I can't seem to locate it. Thanks.



      • #4
        I believe you need to have attended TSE to get the recorded presentations.

        All the 'process indicator' approach entails is creating a flag in the data that marks definitively whether or not a record has been processed. It requires an extra column in your input, but it doesn't require any extra information to be persisted about what has been processed, and it is easy to restart by adding a simple where clause to your SQL statement (WHERE process_indicator != 'Y').
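
        For example, a minimal sketch with Spring's JdbcTemplate (the staging table and its column names are made up for illustration):

        Code:
        import java.util.List;
        import javax.sql.DataSource;
        import org.springframework.jdbc.core.JdbcTemplate;

        // Illustrative only: staging(id, payload, processed) is a made-up table.
        public class ProcessIndicatorSketch {

            private final JdbcTemplate jdbcTemplate;

            public ProcessIndicatorSketch(DataSource dataSource) {
                this.jdbcTemplate = new JdbcTemplate(dataSource);
            }

            // Restart is free: the where clause skips anything already processed.
            public List findUnprocessed() {
                return jdbcTemplate.queryForList(
                        "SELECT id, payload FROM staging WHERE processed != 'Y'");
            }

            // Flip the flag in the same transaction that stores the record's results.
            public void markProcessed(long id) {
                jdbcTemplate.update(
                        "UPDATE staging SET processed = 'Y' WHERE id = ?",
                        new Object[] { new Long(id) });
            }
        }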



        • #5
          Yes, I am aware of that technique and have used it many times in the past. Thanks for the response. When I looked up the description of the talk (by Wayne Lund), it looked like it had a lot of good information in general.



          • #6
            Send me a PM with your email address and I'll send you a copy of the presentation.



            • #7
              Originally posted by Dave Syer View Post
              You do have to be careful with restartability and synchronization of the input source. We recommend a "process indicator" pattern in the input data (or a staging table, as in the sample) - this is described in the reference guide (http://static.springframework.org/sp...ns.html#d0e573) and also in Wayne Lund's talk at TSE, which should be available on the website. See the parallelJob sample for an example. N.B. the best idiom for this kind of thing will change with the m4 release, when we start providing chunk-oriented processing.

              Also note that the thread pool model is not Spring Batch - we just use the TaskExecutor strategy from Spring Core (see the Spring Core documentation).
              Hi Dave,

              Thanks for your invaluable input. Indeed, our envisioned architecture will dump raw data files into corresponding staging tables as a prerequisite for the Spring Batch processing. Our tables will have a column called ProcessedFlag (just a suggestion) which will be set according to the outcome of processing each table row. Now, from reading about the parallel Spring Batch processing pattern, my understanding is that I would have to run multiple single-threaded Java processes, each dealing with a pre-defined data range.
              This is not what I have in mind. My proposal is to use a single multithreaded Java process (which can scale out to many multithreaded processes) to process the data from the staging area - validate, transform, and amalgamate it, and finally store it in the system's application database.
              We will use the thread pool provided by Spring Core, which I assume can be seamlessly integrated with Spring Batch.

              My best regards,

              Stefan



              • #8
                That sounds like the parallelJob sample from Spring Batch. Did you look at that? I think we might provide more than just a sample at some point, but for now you can adapt the sample to your needs quite easily, by the sounds of it.



                • #9
                  Originally posted by Dave Syer View Post
                  That sounds like the parallelJob sample from Spring Batch. Did you look at that? I think we might provide more than just a sample at some point, but for now you can adapt the sample to your needs quite easily, by the sounds of it.
                  I cannot find this example in the "spring-batch-1.0.0.m3-with-dependencies.zip" which I downloaded.
                  Should I look under http://springframework.svn.sourcefor.../spring-batch/ ?

                  Also, I am planning to use Spring Core v2.0.6 to integrate with Spring Batch. Do you see any problem with this? It is very important to me to use Spring Core v2.0.6, as this is our supported enterprise Spring version.

                  Thanks a lot for your help,

                  Regards,

                  Stefan
                  Last edited by phanae; Jan 20th, 2008, 10:08 AM.



                  • #10
                    Sorry, I forgot - the parallelJob sample was added just after m3. You can get it from SVN or from the snapshot builds (backporting to m3 should be trivial at this point).

                    As far as 2.0.x goes, we haven't started testing yet, but we will, and I know there are projects using 2.0.x. With x=6 I think you should be OK, but we are only going to test against the latest release (currently x=8). If you need help just ask on the forum.



                    • #11
                      Originally posted by Dave Syer View Post
                      Sorry, I forgot - the parallelJob sample was added just after m3. You can get it from SVN or from the snapshot builds (backporting to m3 should be trivial at this point).

                      As far as 2.0.x goes, we haven't started testing yet, but we will, and I know there are projects using 2.0.x. With x=6 I think you should be OK, but we are only going to test against the latest release (currently x=8). If you need help just ask on the forum.
                      Thanks Dave,

                      We will try it and keep you posted.

                      Regards,

                      Stefan



                      • #12
                        Hi, thanks for the reply!

                        I have a query returning a list of records; for each record in that list, there's a subquery returning child records. The subqueries are parameterized with data returned from the master record, and all of this data will end up in the text file.

                        The generated file will look like this:

                        Master Record 1
                        Child Record 1 from Master Record 1
                        Child Record 2 from Master Record 1
                        Child Record 3 from Master Record 1
                        ...
                        Master Record 2
                        Child Record 1 from Master Record 2
                        Child Record 2 from Master Record 2
                        Child Record 3 from Master Record 2
                        Child Record 4 from Master Record 2
                        ...

                        and so on...


                        So, I guess, no, I can't return all the data in the same query, because the records aren't all of the same type.

                        Using multiple steps, one approach would be to read all master records into a staging table and, in another step, read the child records into another staging table. But I don't think that solves the issue, because I would still need to maintain two cursors: one for the staging table and one for the subquery.



                        • #13
                          One way to do it would be to have your ItemReader query for all of the Master records. Then you could write your ItemWriter so that, for each Master it receives, it queries for all of the Child records and then writes everything. Something like:

                          Code:
                          public void write(List<? extends Master> items) throws Exception {
                              for (Master item : items) {
                                  // run the child subquery for this master record
                                  List<Child> children = queryForChildren(item);
                                  // write the Master record followed by all of its Child records
                              }
                          }
                          Alternatively, you could do the same thing in an ItemProcessor and then inject the list of Child records onto the Master.
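
                          For example, a sketch of that alternative (Master, Child, MasterDao and setChildren are the hypothetical types and helpers from this thread, not Spring Batch API):

                          Code:
                          import java.util.List;
                          import org.springframework.batch.item.ItemProcessor;

                          // Sketch only: Master, Child and MasterDao are hypothetical
                          // application types from this thread, not Spring Batch API.
                          public class ChildEnrichingProcessor implements ItemProcessor<Master, Master> {

                              private final MasterDao masterDao;

                              public ChildEnrichingProcessor(MasterDao masterDao) {
                                  this.masterDao = masterDao;
                              }

                              public Master process(Master item) throws Exception {
                                  // the child subquery is parameterized with data from the master record
                                  List<Child> children = masterDao.queryForChildren(item);
                                  item.setChildren(children); // the writer can then emit Master plus children
                                  return item;
                              }
                          }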

