TaskExecutor Transaction Management

  • TaskExecutor Transaction Management

    I think it would be very common for a batch to use a TaskExecutor where each thread has its own transaction.

    How can this be achieved?

    The tasks submitted have a common type and implement a common execution method. So I tried to define a <tx:advice> around that method to start the transaction. But when my test case gets to that method, the execution just stops without returning any exception or printing anything via System.out.println()...

  • #2
    We agree that this is a common use case in batch processing, which we have called 'Batch Partitioning'. As you said, there would be one transaction per 'partition' (i.e. thread). The reason we think of it as a partition is that each partition is associated with a range of data. For example, in a data set of 100,000 records, there could be 10 partitions of 10,000 records each. Each partition can then be launched in its own thread and manage its transactions separately. To take the scalability even further, partitions could be 'sprayed' across multiple Java EE application servers. Those who have seen our presentations at JavaOne or SpringOne may be familiar with the diagram of scaling from a single JVM/single thread all the way up to multiple threads in multiple JVMs as the workload increases.

    This functionality will be encapsulated in a separate 'BatchContainer' that will come sometime after the first release. The reason for a separate container is that there are a lot of additional concerns with partitioning that would require additional functionality. For example, status would need to be handled differently, since you would need to know the status of the overall step and of each partition within that step. Exception handling is another concern. If a partition fails, do you fail the whole job? Do you allow the other partitions to finish processing, but then list the job as failed? If restarting, do you recalculate the partitions? Of course, much of this should be strategized, but you can see all the additional concerns that necessitate the additional container.
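The range-based partitioning described above (100,000 records split into 10 partitions of 10,000) can be sketched in plain Java. This is only an illustration of the idea; the class and method names here are hypothetical, not Spring Batch API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of range-based partitioning: split a contiguous key space into
// fixed-size partitions, each of which a thread could process in its own
// transaction.
public class RangePartitioner {

    // A half-open range [low, high) of primary keys assigned to one partition.
    public record KeyRange(long low, long high) {}

    public static List<KeyRange> partition(long minKey, long maxKey, int partitions) {
        List<KeyRange> ranges = new ArrayList<>();
        long total = maxKey - minKey;
        long size = (total + partitions - 1) / partitions; // ceiling division
        for (long low = minKey; low < maxKey; low += size) {
            ranges.add(new KeyRange(low, Math.min(low + size, maxKey)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 100,000 records split into 10 partitions of 10,000 each.
        for (KeyRange r : partition(0, 100_000, 10)) {
            System.out.println(r);
        }
    }
}
```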

    There are also possible asynchronous approaches using queues in a more producer/consumer model, although each thread wouldn't necessarily have its own transaction.

    I know this doesn't necessarily answer your question of 'How can this be achieved?', but I wanted to share our view of partitioning with you, since it has been successful on many projects in the past.



    • #3
      There are also possible asynchronous approaches using queues in a more producer/consumer model, although each thread wouldn't necessarily have its own transaction.
      Thanks for sharing!

      If I partition the records and divide them among the VMs, one VM may finish processing earlier than the others because of differences in processing power, and the overall batch processing time will depend on the VM that takes the longest.

      I'm thinking about one way of running batches in multiple JVMs: a database table stores the records to be processed by the batches. The producers in the different VMs get records from the table and pass them to their consumers for processing. The batch ends when no more outstanding records are found in the table. It seems we can achieve better load balancing this way.

      However, the synchronization required to ensure that a record won't be processed by two different VMs would be quite complex.

      Is there such a model in Spring Batch? Can you share your view?



      • #4
        Keep in mind that you wouldn't really be splitting the data in half, with each half going to a different virtual machine. In most cases you would split the data into fairly small partitions, but not necessarily work on all of them at the same time. For example, you could have 20 partitions, but process only 4 on one JVM and 4 on another. Once a partition finishes, another would start from the thread pool. If one JVM is faster, it might end up processing more partitions than another, so there isn't a huge concern about load balancing.
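The "20 partitions, 4 at a time per JVM" idea maps directly onto a fixed-size thread pool. A minimal stdlib-only sketch (no Spring Batch classes; the partition "work" is simulated by a counter):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// 20 partitions, but only 4 processed at a time on this JVM. As each
// partition finishes, the pool pulls the next one, so a faster JVM would
// simply end up processing more partitions.
public class PartitionPool {

    public static int processAll(int partitionCount, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        for (int p = 0; p < partitionCount; p++) {
            pool.submit(() -> {
                // In a real job, each partition would open its own transaction
                // here and work through its assigned key range.
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(20, 4) + " partitions processed");
    }
}
```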

        The producer/consumer model you mentioned has been discussed as another option. Most of our discussions have centered on using a queue in the middle to deal with some of the synchronization concerns you raised. For example, an ItemProvider reads in data (it doesn't have to be from a database, although for heavy processing it's a good idea to load into a database first) and maps it into a complete object that represents the input. This object can then be put onto a queue, and various ItemProcessors can pull items from the queue for processing. Care also needs to be taken to throttle the input if the queue begins to get too full; otherwise there could be a lot of issues with the ItemProviders moving much faster than the processors.
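The throttling concern above is exactly what a bounded queue gives you for free: `put()` blocks when the queue is full, so the producer can never run far ahead of the consumers. A stdlib-only sketch (the "provider"/"processor" roles here are plain stand-ins, not framework classes):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Producer/consumer with a bounded queue in the middle. The small capacity
// throttles the producer automatically: put() blocks when the queue is full.
public class QueueModel {
    private static final int POISON = -1; // sentinel telling a consumer to stop

    public static int run(int items, int capacity, int consumers) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger processed = new AtomicInteger();

        // Consumers: pull items until they see the poison pill.
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int c = 0; c < consumers; c++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        int item = queue.take();
                        if (item == POISON) break;
                        processed.incrementAndGet(); // real work (own transaction) goes here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: blocks on put() whenever the consumers fall behind.
        for (int i = 0; i < items; i++) queue.put(i);
        for (int c = 0; c < consumers; c++) queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(1000, 10, 4) + " items processed");
    }
}
```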

        In general, our view is that both are viable approaches for concurrent processing in batch, but that different scenarios may make one approach more attractive than another.



        • #5
          For the producer/consumer model using a queue between multiple JVMs on different machines, does it mean that I need to use some external queue (like WebSphere MQ) and use JMS to read it?



          • #6
            That's correct.



            • #7
              Originally posted by lucasward View Post
              Once a partition finishes, another would start from the thread pool. If one JVM is faster, it might end up processing more partitions than another, so there isn't a huge concern about load balancing.
              To handle the above between multiple app servers, is a queue also required?

              I always feel that setting up MQ is a complex thing; is there any way we can use the database to solve the synchronization problem?



              • #8
                No, the first scenario I mentioned was the range approach. Usually, that is implemented with a database. Rather than trying to synchronize access to individual records, you assign each thread a range and let it spin through until finished, which effectively turns each thread into a mini batch job. This is a tried and true technique that has been used for a large number of batch jobs. In the past, it was achieved by splitting the jobs up at the scheduler level and giving them different ranges to operate on, by creating partitions in the database and passing them in as arguments. With Spring-batch we aim to move this into the architecture, which removes the burden on the scheduler/ops team and adds a lot of great scalability and load balancing options, and has been proven to be a successful approach on newer projects.

                In terms of the producer/consumer batch model, we've been targeting using a queue in between, but you could use a database if you used a 'process indicator flag' on your table and made sure that the row stayed locked until done processing, so that you could update its process indicator once finished.
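The essential property of the process-indicator flag is that each record can be claimed by at most one worker. In a real database you would lock the row (e.g. SELECT ... FOR UPDATE) and update an indicator column when done; the in-memory analogue below uses `putIfAbsent` purely to illustrate the claim semantics, and is not how a multi-JVM setup would actually share state.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory analogue of the 'process indicator flag': a record may be
// claimed by exactly one worker. putIfAbsent is atomic, so two workers
// racing for the same record cannot both win.
public class ProcessIndicator {
    private final ConcurrentMap<Long, String> claims = new ConcurrentHashMap<>();

    // Returns true only for the single worker that wins the claim.
    public boolean claim(long recordId, String workerId) {
        return claims.putIfAbsent(recordId, workerId) == null;
    }

    public static void main(String[] args) {
        ProcessIndicator flags = new ProcessIndicator();
        System.out.println(flags.claim(42L, "vm-1")); // true: first claim wins
        System.out.println(flags.claim(42L, "vm-2")); // false: already claimed
    }
}
```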

                However, if you prefer to use a database, I recommend using the range approach.



                • #9
                  With Spring-batch we aim to move this into the architecture, which removes the burden on the scheduler/ops team and adds a lot of great scalability and load balancing options, and has been proven to be a successful approach on newer projects.
                  In the case of multiple J2EE servers: if this is moved into the architecture, what is the mechanism to partition the range in the database and assign it to each app server? How can each server know what partitions the others have taken, so that they won't be redoing each other's work?



                  • #10
                    As Lucas said, "you assign each thread a range" of primary keys. The framework probably has to extend its domain to include the concept of a Partition as a first-class entity (that's the way our prototype worked). The point is that the application developer is unaware of the details.
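What a first-class Partition entity might carry is easy to imagine: an identifier, the assigned key range, and its own status, tracked independently of the overall step. The class below is purely a hypothetical sketch of that domain concept, not the prototype mentioned above and not Spring Batch API.

```java
// Hypothetical sketch of a Partition as a first-class domain entity:
// an id, a half-open key range [lowKey, highKey), and its own status.
public class Partition {
    public enum Status { PENDING, STARTED, COMPLETED, FAILED }

    private final long id;
    private final long lowKey;   // inclusive lower bound of the key range
    private final long highKey;  // exclusive upper bound of the key range
    private Status status = Status.PENDING;

    public Partition(long id, long lowKey, long highKey) {
        this.id = id;
        this.lowKey = lowKey;
        this.highKey = highKey;
    }

    public long getId() { return id; }
    public long getLowKey() { return lowKey; }
    public long getHighKey() { return highKey; }
    public Status getStatus() { return status; }
    public void setStatus(Status status) { this.status = status; }

    public static void main(String[] args) {
        Partition p = new Partition(1, 0, 10_000);
        p.setStatus(Status.STARTED);
        System.out.println(p.getId() + " [" + p.getLowKey() + "," + p.getHighKey() + ") " + p.getStatus());
    }
}
```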



                    • #11
                      There are also a number of other partitioning strategies and approaches that are documented within the "4.1 Partitioning Approaches" section of the Batch Processing Strategies page that can be found at: http://static.springframework.org/sp...trategies.html.



                      • #12
                        Trying to answer your question

                        Hi ballsuen,

                        I will try to answer your first question. How have you configured the StepExecutor? We are using the SimpleStepExecutor and setting the StepOperations to a TaskExecutorRepeatTemplate to achieve multi-threading (within the same VM, though). This class will create the threads for you and run the ChunkOperations in each configured thread. So it seems that you should try to add the transaction-handling aspect to the RepeatTemplate assigned to the ChunkOperations.

                        Was that the approach that you tried and did not work?

                        Still, I do not know how to do partitioning (we are processing files, so I have synchronized the 'next()' call that fetches the next file in the ItemProvider, which works for all the threads in the same VM).

                        Hope this helps.
                        Regards
                        Andres B.
