  • Calling webservice as batch task


    I need to create a service that reads commands with parameters from an input flat file, calls the corresponding web service with those parameters, and saves the command results to an output flat file.

    The number of commands in the input file is around 10^6, and the number of results in the output file is the same. There are only a few types of commands, and all command signatures are well defined.

    The web service method interfaces are designed to support bulk/chunk processing. So there will be (10^6 / commit_interval) remote calls. For marshaling and unmarshaling remote method parameters JAXB is used.

    I created a sample application with Spring Batch as follows:

    The first step loads commands with FlatFileItemReader and maps each row into a JAXB object (the JAXB structure for the web service request). The objects are written into a staging table.
    I am considering two serialization mechanisms:
    1. a custom row mapper for each JAXB request structure (~10 mappers)
    2. serializing each row as XML (one generic JAXB mapper)
    What do you suggest?

    I also don't know how to ensure that an input file is not processed twice. Which part of the whole system is typically responsible for deleting the input file? When can the file be deleted safely: after the step has finished successfully, or after the whole job?

    The second step reads the saved commands from the staging table, calls the web service method with WebServiceTemplate (from the Spring WS project), and saves the results (responses). I tried to use a processor for calling the web service, but then I was unable to use the bulk/chunk functionality of the web service, because a processor handles only one item per call. Only writers are designed to process items in chunks. So my step uses only a reader for reading commands from the staging table, and a writer for calling the web service and also saving the results. The results are saved into a staging table. Each result is a JAXB structure received from the web service.
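    A minimal sketch of the chunked-call idea described above, written in plain Java rather than against the Spring Batch or Spring WS APIs. The class `ChunkedBulkCaller`, the method `callInChunks`, and the `bulkService` function are hypothetical names used only for illustration; `bulkService` stands in for the remote bulk method. It shows that N commands with a chunk size equal to the commit interval cost roughly N / commit_interval remote calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Spring Batch or Spring WS.
class ChunkedBulkCaller {

    // Splits the commands into chunks of at most chunkSize and makes one
    // bulk call per chunk, so N commands cost ceil(N / chunkSize) calls.
    static <C, R> List<R> callInChunks(List<C> commands, int chunkSize,
                                       Function<List<C>, List<R>> bulkService) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<R> results = new ArrayList<>(commands.size());
        for (int from = 0; from < commands.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, commands.size());
            List<C> chunk = commands.subList(from, to);
            // Assumption: the service returns exactly one response per command.
            List<R> responses = bulkService.apply(chunk);
            if (responses.size() != chunk.size()) {
                throw new IllegalStateException("expected one response per command");
            }
            results.addAll(responses);
        }
        return results;
    }

    public static void main(String[] args) {
        int[] remoteCalls = {0};
        // Fake bulk service that counts how often it is invoked.
        Function<List<String>, List<String>> fakeService = chunk -> {
            remoteCalls[0]++;
            List<String> out = new ArrayList<>();
            for (String command : chunk) {
                out.add("OK:" + command);
            }
            return out;
        };
        List<String> results = callInChunks(List.of("a", "b", "c", "d", "e"), 2, fakeService);
        System.out.println(results + " in " + remoteCalls[0] + " calls");
    }
}
```

    In a Spring Batch step the chunking itself is done by the framework, so only the body of the loop (one bulk call per list of items) would end up in the writer.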

    What is the best way to design a Spring Batch step for this use case (read requests from staging, call the service, save the responses)?

    The last step reads the saved results from the staging table and exports them to the flat file. The mechanism is analogous to the first step.

    I'm not sure about this job and step design. Are there any traps I have to consider?

    Could you describe how to manage the staging table? How and when should processed rows be removed from the staging table, and how many staging tables should I use (or maybe one for all jobs)? How many incrementers do I need?

    Thanks if you can help.

  • #2
    Personally, I don't think I would load the file results into a staging table, rather, I would put them on a JMS queue during the batch process, and use a producer/consumer model to pull them off and make the WS calls in a much more concurrent fashion. Making a remote call via a batch process is unperformant enough as it is, but doing it synchronously is even worse.
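    A rough sketch of that producer/consumer shape, using an in-memory `BlockingQueue` as a stand-in for the JMS queue (this is not JMS, and it has none of its persistence or transaction guarantees). `QueueWorkers`, `process`, and the `service` function are hypothetical names; `service` stands in for the web service call. Error handling for a failing consumer is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: an in-memory queue stands in for the JMS queue.
class QueueWorkers {

    // Sentinel compared by identity; one is queued per consumer to stop it.
    private static final String STOP = new String("__STOP__");

    static List<String> process(List<String> commands, int consumers,
                                Function<String, String> service) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100); // bounded buffer
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        // Consumers: take commands off the queue and call the service concurrently.
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String command = queue.take();
                        if (command == STOP) {
                            return;
                        }
                        results.add(service.apply(command));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: the batch side only enqueues; put() blocks while the queue is full.
        try {
            for (String command : commands) {
                queue.put(command);
            }
            for (int i = 0; i < consumers; i++) {
                queue.put(STOP);
            }
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(results); // completion order, not input order
    }

    public static void main(String[] args) {
        List<String> out = process(List.of("a", "b", "c"), 2, c -> "OK:" + c);
        System.out.println(out.size() + " results");
    }
}
```

    Note that the results come back in completion order, so each response would need to carry an identifier of its request if the output file has to match the input order.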

    Regarding the output file, I would put the results from the web service call into a table and use a batch process to write them out once you're sure everything has been processed. How to do the latter depends upon many more details about your solution that I don't have.

    Regarding when to delete the file, that also depends upon your scheduling solution (if there is one) and integration architecture sophistication. You could easily have a step after the output step in SB to delete the file, or you could have a shell script be called by your scheduling solution to do it as well. Either way, using the completion of the job to determine when to delete is probably good, assuming you have an archive process. (I'm not sure it's ever wise to completely delete anything)


    • #3
      Hi Lucas

      Thank you for your help. You are right, synchronous web service calls (even in chunks) could be a performance bottleneck.

      Let me describe the system architecture a bit, as it will help us discuss the details. The web service server runs on JBoss in a clustered environment behind a round-robin load balancer (with sticky sessions). The batch service also runs on JBoss AS and is scheduled with Quartz, but it should be possible to run batch jobs outside the container (from the command line). The batch service stores its state in a clustered Oracle database. The cluster is transparent to the database clients; the application treats the db cluster as a single database instance.

      You proposed distributing the processing with JMS, but I'd like to avoid using any JEE resources (even the db connection pool is defined as a local data source in the Spring context).

      What do you think about the following architecture?

      1. On every cluster node there are inbox/outbox directories for input/output files.
      2. On each cluster node there is an independent job that processes the node's own input files.
      3. Each job has the following steps:
      STEP1 reads commands from the input file and calls the web service in chunks with a multi-threaded executor. The load balancer delegates the web service requests to all nodes in the cluster. Then the step saves the results into the staging table. I'm not sure how to handle the request/response ordering in parallel processing. Any hint?
      STEP2 reads the results from the database and exports them as a flat file to the outbox.
      STEP3 removes/archives the input file, and removes/archives the staging table. The staging table can be partitioned by date (for example).

      It is not as sophisticated as the JMS solution, but it should scale well. This is only my own thinking, though; I don't have much experience with batch processing :-( Could you comment on my proposal, please?
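      One possible way to keep request/response ordering with a multi-threaded executor, sketched in plain Java (not Spring Batch API): submit the chunks in input order, keep the futures in that same order, and collect the results by walking the futures in order. `OrderedParallelCalls`, `callPreservingOrder`, and `bulkService` are hypothetical names; the sketch assumes the service returns the responses of one chunk in the order of that chunk's requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: names are illustrative, not part of any framework.
class OrderedParallelCalls {

    // Chunks run concurrently, but results are collected in submission order,
    // so the returned list lines up with the input list.
    static <C, R> List<R> callPreservingOrder(List<C> commands, int chunkSize, int threads,
                                              Function<List<C>, List<R>> bulkService) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (int from = 0; from < commands.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, commands.size());
                List<C> chunk = commands.subList(from, to);
                Callable<List<R>> task = () -> bulkService.apply(chunk);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>(commands.size());
            for (Future<List<R>> future : futures) {
                // get() blocks until this chunk is done, whatever finished first.
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> out = callPreservingOrder(List.of(1, 2, 3, 4, 5), 2, 3, chunk -> {
            List<Integer> r = new ArrayList<>();
            for (Integer c : chunk) {
                r.add(c * 10);
            }
            return r;
        });
        System.out.println(out);
    }
}
```

      For 10^6 commands the futures would have to be drained in windows rather than all held in memory; an alternative is to store a sequence number with every row in the staging table and sort on it when exporting.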


      • #4
        Just because you use JMS doesn't mean you have to be tied to an application server. Dave wrote a great article about using ActiveMQ with a database as its persistence store, thus negating the need for JTA, since it would essentially all be tied to the same local transaction. I can't seem to find the link right now, but I'll see if I can track it down later. You can do the same thing with many databases. I really think that's a better model. As you note in your example, handling the request/response ordering will be tricky. It will likely work, but I think your performance will be somewhat limited, along with some other potential issues.


        • #5
          Originally posted by mkuthan
          I'm not sure how to handle the request/response ordering in parallel processing. Any hint?
          TaskExecutorPartitionHandler is actually pretty good for this use case. You have to be able to split the input data, so that might lead to more work and/or another staging table for the commands. But it wouldn't require JMS.
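          To illustrate what "splitting the input data" could look like, here is a plain Java sketch that divides a range of staging-table row ids into contiguous partitions, each of which could then be handled by its own worker. It does not use the Spring Batch partitioning API; `RangePartitioner` and `partition` are hypothetical names, and it assumes the staged commands have a numeric, reasonably dense id column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: computes id ranges only; it does not run any steps.
class RangePartitioner {

    // Splits the inclusive id range [minId, maxId] into at most gridSize
    // contiguous sub-ranges, each returned as {firstId, lastId}.
    static List<long[]> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or grid size");
        }
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        List<long[]> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += size) {
            long end = Math.min(start + size - 1, maxId);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] range : partition(1, 1_000_000, 4)) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

          Because every partition covers a contiguous id range, the results can be stored under the same ids and exported in id order afterwards, which also addresses the ordering question.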

          The link that Lucas mentioned is here:
          Last edited by Dave Syer; Apr 6th, 2009, 02:27 AM. Reason: spelling


          • #6
            I went through David's article (an excellent publication), and the "shared resource" pattern seems to fit my case. I already use it to perform transactional processing with a single data source (one Oracle DB, with Hibernate and JDBC data access code in the application). If I understood the article correctly, it will also work for ActiveMQ persistence, won't it?

            Could you confirm my thinking about how JMS state/messages propagate in the cluster with the "shared resource" pattern? I'll bundle ActiveMQ into the application, the connection factory will be configured as "vm://localhost", and the broker factory will use the shared database. The application can be deployed in the clustered environment, and JMS will behave like a configuration with one shared queue/topic managed by the AS. Each cluster node will have its own local JMS, and the state will propagate through the shared database.

            Do I need to configure ActiveMQ in some special way? E.g. define a polling interval for how often each ActiveMQ instance should check the state in the shared database? But that is rather a question for the ActiveMQ forum.


            • #7
              Your assumptions about the shared state in the cluster are correct. More detailed questions about AMQ and/or Spring JMS would be better off in those forums, but note that the "shared resource" pattern with ActiveMQ is not mainstream (yet), despite the advantages.

              If I were you (sorry Lucas) I probably wouldn't use JMS for this use case unless you already had it in your environment, or needed it for something else. You should at least have a look at the partitioning approach first. If you are clustered then how do the nodes share out work amongst input file(s) anyway? You can't be working from a single file?


              • #8
                Thank you for the valuable discussion. I'll look at both approaches, JMS and partitioning.

                JMS seems easier to develop with. But I'm worried about performance with the database as a shared resource, and the configuration can be tricky.

                Partitioning and concurrent processing are more complex, but the runtime environment is easier to set up.