  • Thread Safety While Running Multiple Instances Of The Same Job

    I have a simple job which reads from one database and writes into another.
    The reader is a JdbcCursorItemReader which reads only records that have not been processed yet (i.e. processed = 0), and the writer executes an insert statement.
    My chunk size is 100. I mark each record in the source table as processed = 1
    in the writer's beforeWrite() method. This processed-flag update happens during writing and within the transaction, so if anything fails it will be rolled back.

    I tested this job by running only one instance, and it works fine without any problems.

    Now my question is: I want to run multiple instances of the same job at the same time, changing only the version parameter. If I start, say, 3 instances of the same job, they will all be working on the same set of data.
    Is there a possibility that different instances will pick up the same chunk? If 2 instances pick up the same record, they will insert duplicate records into the destination table, as there are no constraints on the destination table.

    Correct me if I am thinking along the wrong path.
    Let's say the first instance picks up the first 100 records and starts processing them. It has processed and marked 80 records, but the commit interval is 100, so nothing is committed to the database yet. If a second instance of the same job kicks in, will it pick up the same first 100 records as a chunk?

    Please help, because we are planning to run multiple instances of the same job in production for faster and better performance. Is it advisable to run multiple instances of the same job for faster performance, or should I do multithreading within the job itself?
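
    For reference, a minimal sketch of the process-indicator pattern described above, assuming hypothetical table and column names (SRC_RECORD, DEST_RECORD, ID, PAYLOAD, PROCESSED) and the pre-5.0 ItemWriter contract; it folds the flag update into the writer itself rather than a beforeWrite() listener, so both statements run inside the chunk transaction:

    // Sketch only: hypothetical table/column names; assumes the reader uses a
    // ColumnMapRowMapper so each item is a Map<String, Object>.
    import java.util.List;
    import java.util.Map;
    import javax.sql.DataSource;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class ProcessIndicatorWriter implements ItemWriter<Map<String, Object>> {

        private final JdbcTemplate jdbcTemplate;

        public ProcessIndicatorWriter(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        @Override
        public void write(List<? extends Map<String, Object>> items) {
            for (Map<String, Object> row : items) {
                // Copy the row into the destination table (hypothetical columns).
                jdbcTemplate.update(
                        "INSERT INTO DEST_RECORD (ID, PAYLOAD) VALUES (?, ?)",
                        row.get("ID"), row.get("PAYLOAD"));
                // Mark the source row as processed within the same chunk transaction,
                // so a failure rolls back both the insert and the flag update.
                jdbcTemplate.update(
                        "UPDATE SRC_RECORD SET PROCESSED = 1 WHERE ID = ?",
                        row.get("ID"));
            }
        }
    }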

  • #2
    That's a very interesting question, and I am in a similar situation and would like to add the following specific question:

    Actually I even want to use the StaxEventItemReader in several Jobs that have to run at the same time, and according to the Javadoc it's not thread-safe. What exactly could happen in such a scenario if I registered all jobs using something like a TaskExecutorLauncher (from the adhoc job sample)?

  • #3
    Kmisaal: you are doing the right things with your process indicator flag. But it really doesn't make sense to run more than one job concurrently using the same indicator. I guess you need a separate indicator column for each job?

    dmarzi: your question is only partly related. You just need to prevent two jobs from using the same *instance* of the writer (as well as making sure that the instances are writing to different files, of course). This is easy to do - just create two bean instances. Or put them in different application contexts to be on the safe side (that's how all the samples work).
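
    For dmarzi's case, a minimal sketch of the separate-application-context approach, assuming two hypothetical context files (job1-context.xml, job2-context.xml) that each define their own Job, reader/writer and JobLauncher beans; with the default synchronous JobLauncher the two runs below execute one after the other, so an asynchronous TaskExecutor would have to be configured on the launchers to actually run them at the same time:

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.launch.JobLauncher;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class IsolatedJobLaunch {

        public static void main(String[] args) throws Exception {
            // Each context builds its own StaxEventItemReader/writer beans,
            // so the two jobs share no reader state.
            ClassPathXmlApplicationContext ctx1 = new ClassPathXmlApplicationContext("job1-context.xml");
            ClassPathXmlApplicationContext ctx2 = new ClassPathXmlApplicationContext("job2-context.xml");

            // Assumes each context defines exactly one Job and one JobLauncher bean.
            JobLauncher launcher1 = ctx1.getBean(JobLauncher.class);
            JobLauncher launcher2 = ctx2.getBean(JobLauncher.class);

            launcher1.run(ctx1.getBean(Job.class), new JobParameters());
            launcher2.run(ctx2.getBean(Job.class), new JobParameters());
        }
    }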

  • #4
    That's simple indeed. Thanks.

  • #5
    Oh, and it seems the original post was expanded significantly after I made mine. Or I wasn't concentrating when I read it.

  • #6
    Hi Dave,
    Creating and maintaining a separate processed indicator per job instance seems like extra overhead and may hurt performance.

    Can you suggest some configuration or multi-threading best practices within the job? Is there a rule of thumb for deciding the chunk size for faster performance? Do we have any control over JDBC batch updates?

    Will it improve performance if I merge the functionality of the reader and writer into a single tasklet step by merging the select and insert queries (i.e. something like INSERT INTO XYZ SELECT * FROM PQR)? This would eliminate the row mapper.

    Please share your views on how to use the Spring Batch framework efficiently.
    Thanks for your prompt reply.

  • #7
    Can you suggest some configuration or multi-threading best practices within the job? Is there a rule of thumb for deciding the chunk size for faster performance? Do we have any control over JDBC batch updates?
    The optimal chunk size for throughput varies according to the data and the database configuration. Usually it will be in the region of 100 or so. This, however, has nothing to do with threading. Batch updates can be done using the BatchSqlUpdateItemWriter.
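
    As an aside, a sketch of what a hand-rolled chunk-batched insert could look like with plain JdbcTemplate, assuming the hypothetical DEST_RECORD table from earlier and the pre-5.0 ItemWriter contract; BatchSqlUpdateItemWriter mentioned above is the 1.x class, and later releases provide JdbcBatchItemWriter for the same purpose:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import javax.sql.DataSource;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class BatchingInsertWriter implements ItemWriter<Map<String, Object>> {

        private final JdbcTemplate jdbcTemplate;

        public BatchingInsertWriter(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        @Override
        public void write(List<? extends Map<String, Object>> items) {
            // One JDBC batch for the whole chunk instead of one statement per item.
            List<Object[]> args = new ArrayList<>();
            for (Map<String, Object> row : items) {
                args.add(new Object[] { row.get("ID"), row.get("PAYLOAD") });
            }
            jdbcTemplate.batchUpdate(
                    "INSERT INTO DEST_RECORD (ID, PAYLOAD) VALUES (?, ?)", args);
        }
    }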

    Will it improve performance if I merge the functionality of the reader and writer into a single tasklet step by merging the select and insert queries (i.e. something like INSERT INTO XYZ SELECT * FROM PQR)? This would eliminate the row mapper.
    Probably. If you can do the whole step in a single SQL statement, that is likely to be much more efficient.

  • #8
    Thanks Dave.

    I can combine everything into one single SQL statement and execute it in a tasklet.
    My question is: if I execute the whole query in a tasklet, will I still get the benefit of chunk processing? I think it will process everything in one go and we will lose the advantage of chunks.

    Secondly, I could execute the combined query in an ItemWriter. However, an ItemWriter requires an ItemReader; in this case I could have a blank reader which does nothing.

    Please suggest which is the best way, and how we can keep the advantage of chunk processing if we merge the query.

  • #9
    I'm not sure there is any advantage in chunk processing if you can do your update in a single query. A blank ItemReader would behave identically to a Tasklet, so please use the latter.
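
    A minimal sketch of the single-statement step as a Tasklet, assuming the hypothetical SRC_RECORD/DEST_RECORD tables from earlier and the Spring Batch 2.x+ Tasklet API; it also assumes no other process inserts unprocessed rows while the step runs:

    import javax.sql.DataSource;
    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class CopyUnprocessedTasklet implements Tasklet {

        private final JdbcTemplate jdbcTemplate;

        public CopyUnprocessedTasklet(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            // The whole step is one INSERT ... SELECT inside the step's transaction:
            // it either copies everything or nothing, with no chunk-level commits.
            jdbcTemplate.update(
                    "INSERT INTO DEST_RECORD (ID, PAYLOAD) "
                    + "SELECT ID, PAYLOAD FROM SRC_RECORD WHERE PROCESSED = 0");
            // Flag the same rows as processed; assumes no new unprocessed rows
            // appear between the two statements.
            jdbcTemplate.update("UPDATE SRC_RECORD SET PROCESSED = 1 WHERE PROCESSED = 0");
            return RepeatStatus.FINISHED;
        }
    }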

  • #10
    Hey Dave,

    I tried merging everything into one query and executed it in the writer.
    It works, but there is no chunk processing - there is one huge query that runs for 2 hrs.

    In this case, if something goes wrong while processing the last record, it will roll back all the successfully processed records. That is a big loss if we don't do chunk processing.

    Secondly, I tried analyzing where the bad performance is in the code and identified that updating the processed flag for every record in the afterWrite() method is killing us. Is there a way I can issue only one update statement for all the records in the chunk once the chunk is processed successfully? i.e. if the chunk size is 100, I want to issue only one update after the chunk, which marks all 100 records as processed, rather than 100 update statements.

    Please suggest.

  • #11
    Interesting. I was aware of this issue but no-one raised it before - 2.0 will have a different listener / writer contract so it will be easier there. What you have to do for now is store up the items (if multithreaded, use a transaction resource like in the BatchSqlUpdateItemWriter) in an ItemReadListener, and then flush them in one query in a ChunkListener. Remember you can write one class that implements both interfaces.
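
    A sketch of the one-class/two-interfaces idea, written against a newer Spring Batch API than the 1.x/2.0 versions discussed here (ChunkListener later gained ChunkContext arguments), assuming a single-threaded step and the hypothetical SRC_RECORD table; note that in recent versions afterChunk() runs after the chunk transaction has committed, so the flag update below is issued in its own transaction:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import javax.sql.DataSource;
    import org.springframework.batch.core.ChunkListener;
    import org.springframework.batch.core.ItemReadListener;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.jdbc.core.JdbcTemplate;

    /** Collects the IDs read in the current chunk and marks them processed in one go. */
    public class ChunkFlagUpdater implements ItemReadListener<Map<String, Object>>, ChunkListener {

        private final JdbcTemplate jdbcTemplate;
        // Single-threaded step assumed; a multithreaded step would need a
        // transaction-bound resource instead of a plain field.
        private final List<Object[]> idsInChunk = new ArrayList<>();

        public ChunkFlagUpdater(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        @Override
        public void beforeRead() {
        }

        @Override
        public void afterRead(Map<String, Object> item) {
            // Remember each record read in this chunk.
            idsInChunk.add(new Object[] { item.get("ID") });
        }

        @Override
        public void onReadError(Exception ex) {
        }

        @Override
        public void beforeChunk(ChunkContext context) {
            idsInChunk.clear();
        }

        @Override
        public void afterChunk(ChunkContext context) {
            // One batched UPDATE per chunk instead of one UPDATE per record.
            jdbcTemplate.batchUpdate("UPDATE SRC_RECORD SET PROCESSED = 1 WHERE ID = ?", idsInChunk);
            idsInChunk.clear();
        }

        @Override
        public void afterChunkError(ChunkContext context) {
            idsInChunk.clear();
        }
    }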

  • #12
    Partitioning the source data.

    Hey Dave,

    I tried a scenario like this.

    As I already mentioned, I am using a processed flag.
    Before starting any job I partition the source table data into 3 parts:
    I set the processed flag to 'JOB1' for the first part, 'JOB2' for the second part and 'JOB3' for the third part. Each part contains, say, 10K records. My chunk size is 100.

    Now I run 3 instances of my job, passing the job parameter as job=JOB1 for the first instance, job=JOB2 for the second and job=JOB3 for the third.
    In the reader query I use this job parameter in the where clause:

    select * from table1 where processed = <job parameter value>

    This ensures that each job instance works on a logically separate data set. I run these 3 instances of the same job concurrently, with only one job configuration file. Each job sets processed=1 once it has successfully processed a row.
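
    A sketch of how that job-parameter-driven where clause could be wired up with a later Spring Batch Java configuration (step-scoped late binding did not exist in this form in the version discussed here); table1, the processed column and the 'job' parameter name are taken from the post above:

    import java.util.Map;
    import javax.sql.DataSource;
    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.ColumnMapRowMapper;

    @Configuration
    public class PartitionedReaderConfig {

        // Step-scoped so the 'job' parameter (JOB1/JOB2/JOB3) is bound per job instance.
        @Bean
        @StepScope
        public JdbcCursorItemReader<Map<String, Object>> partitionedReader(
                DataSource dataSource,
                @Value("#{jobParameters['job']}") String jobPartition) {
            JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
            reader.setDataSource(dataSource);
            reader.setSql("select * from table1 where processed = ?");
            reader.setPreparedStatementSetter(ps -> ps.setString(1, jobPartition));
            reader.setRowMapper(new ColumnMapRowMapper());
            return reader;
        }
    }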

    This works well and shows a good performance gain. Up to this point everything is good ;-)

    The problem I faced is that when one instance fails, the status of the other instances is not updated correctly in the Spring Batch tables. Even after 2 instances complete and one fails, the status for a completed instance is shown as "STARTED" in the table.

    I hope you got my scenario and question. Thank you for being patient and reading it completely. Please tell me, am I doing the right thing?

  • #13
    That's pretty cool actually. Nice work.

    I don't understand the question though. You are running three jobs concurrently. What is it that goes wrong? You are interrupting one of them, and the others are being affected. Are you sure they are not sharing state? Set them up each with a separate ApplicationContext to be on the safe side (e.g. use ClassPathXmlApplicationContextJobFactory).
