  • CSV file process - multi-thread; which class to use?

    Here is what I need to do:

    - Receive a file (a large CSV file)
    - Import records using one TX but multiple threads (split the file?)
    - The import process uses existing Spring/Hibernate objects.
    - Commit or roll back depending on success
    - Restart is NOT needed.

    QUESTION:
    Which Spring Batch class/method/strategy is best for this?
    Please advise.

    Thanks!

  • #2
    Before diving into the specific points, I'm curious why multiple threads are needed to load this file. What is the average file size? Is the long run time due to processing? If so, using a staging table would be a good approach. In general, I would try loading the file using spring-batch without multi-threading (but with a huge commit interval, assuming the data is fairly clean) and, if that isn't performing, start thinking about splitting the file, etc.

    To move on to the specific points, you could kick off multiple threads from the ItemProvider; there has been some discussion about this, but we don't have any concrete examples to refer you to. A TaskExecutorRepeatTemplate could be used at the chunk level, or a CompositeItemProvider could be used, but there would be issues synchronizing the file with the transaction, since Spring's TransactionSynchronizationManager stores the classes it notifies in a thread local.
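
    To make that concrete, here is a minimal sketch of a chunk-level TaskExecutorRepeatTemplate, written against the current repeat API (the executor, throttle limit, and commit interval are illustrative, and the counter stands in for real reading and processing):

    Code:
    import java.util.concurrent.atomic.AtomicInteger;

    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.batch.repeat.policy.SimpleCompletionPolicy;
    import org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    public class ParallelChunkSketch {
        public static void main(String[] args) {
            // Chunk-level repeat template that hands each iteration to a
            // TaskExecutor, so several items are processed concurrently.
            TaskExecutorRepeatTemplate chunkOperations = new TaskExecutorRepeatTemplate();
            chunkOperations.setTaskExecutor(new SimpleAsyncTaskExecutor());
            chunkOperations.setThrottleLimit(4);                                  // cap concurrent iterations
            chunkOperations.setCompletionPolicy(new SimpleCompletionPolicy(100)); // "commit interval"

            AtomicInteger record = new AtomicInteger();
            chunkOperations.iterate(context ->
                    // stand-in for reading and processing one CSV record
                    record.incrementAndGet() < 1000
                            ? RepeatStatus.CONTINUABLE : RepeatStatus.FINISHED);
        }
    }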

    • #3
      Processing time, in answer to your question above.
      If I split the files, why run them serially?
      Prior to your reply I was considering using a queue and letting several threads work on it.

      Thanks for your thoughts on this.

      • #4
        Are the files arriving split, or do you have to split them? If they're already split, then I agree that it makes sense to try to process them in parallel. If you want to do so within one job, you could use a queue between the provider and the processor, but there would still be issues synchronizing the disparate file input sources with the transaction.

        If it's the processing that takes a while, and not the I/O, I would still recommend loading the file directly into a staging table, then doing whatever processing you need once the data is in the database.
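
        As a rough illustration of that first loading step, here is a plain-JDBC sketch; the connection URL, table name, and columns are all placeholders:

        Code:
        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;

        public class StagingLoadSketch {
            public static void main(String[] args) throws Exception {
                try (Connection con = DriverManager.getConnection("jdbc:yourdb:..."); // placeholder URL
                     BufferedReader in = new BufferedReader(new FileReader("input.csv"));
                     PreparedStatement ps = con.prepareStatement(
                             "INSERT INTO CSV_STAGE (COL_A, COL_B) VALUES (?, ?)")) {
                    con.setAutoCommit(false);
                    String line;
                    int count = 0;
                    while ((line = in.readLine()) != null) {
                        String[] fields = line.split(",");
                        ps.setString(1, fields[0]);
                        ps.setString(2, fields[1]);
                        ps.addBatch();
                        if (++count % 1000 == 0) {
                            ps.executeBatch(); // push rows to the database in bulk
                        }
                    }
                    ps.executeBatch();
                    con.commit(); // one transaction for the whole load
                }
            }
        }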

        • #5
          The file arrives unsplit.

          If each process/thread can have its own transaction, then which approach do you see as best? And what is the advantage of the staging table?

          I apologize for the questions; there is just a myriad of classes in the API, and I need to understand the intended use of each of them.

          thanks!

          • #6
            Can each thread in a chunkOperations have its own transaction out of the box? I'm no expert, but from my (limited) knowledge you may need to make sure the SimpleStepExecutor's transaction manager is a "dummy" and add transaction support around each repeat iteration (this could be done with a RepeatInterceptor or with AOP around the chunkOperations Tasklet).
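
            A rough sketch of that idea, using the RepeatListener callback (the current name for what was called RepeatInterceptor); the attribute key and transaction definition are illustrative, and the thread-local caveat from #2 still applies:

            Code:
            import org.springframework.batch.repeat.RepeatContext;
            import org.springframework.batch.repeat.RepeatListener;
            import org.springframework.batch.repeat.RepeatStatus;
            import org.springframework.transaction.PlatformTransactionManager;
            import org.springframework.transaction.TransactionStatus;
            import org.springframework.transaction.support.DefaultTransactionDefinition;

            // Starts a transaction before each repeat iteration, commits it after,
            // and rolls it back if the iteration throws.
            public class TransactionPerIterationListener implements RepeatListener {

                private final PlatformTransactionManager txManager;

                public TransactionPerIterationListener(PlatformTransactionManager txManager) {
                    this.txManager = txManager;
                }

                public void before(RepeatContext context) {
                    TransactionStatus tx = txManager.getTransaction(new DefaultTransactionDefinition());
                    context.setAttribute("tx", tx);
                }

                public void after(RepeatContext context, RepeatStatus result) {
                    txManager.commit((TransactionStatus) context.getAttribute("tx"));
                }

                public void onError(RepeatContext context, Throwable e) {
                    txManager.rollback((TransactionStatus) context.getAttribute("tx"));
                }
            }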

            Just my 2 cents.
            Regards
            AB

            • #7
              staging table

              Originally posted by lucasward:
              If so, using a staging table would be a good approach.
              A staging table is a good idea if transactional integrity is required.
              You might want to use database temporary tables for staging. Here's one article about this approach for DB2:
              High performance inserts using JDBC Type 4 in a constrained environment: Leverage DB2 declared global temporary tables
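
              For instance, a session-scoped staging table can be declared over plain JDBC like this (the table layout is a made-up example):

              Code:
              import java.sql.Connection;
              import java.sql.SQLException;
              import java.sql.Statement;

              public class Db2TempTableSketch {
                  // Declares a DB2 temporary table visible only to this session;
                  // rows survive commits but disappear when the session ends.
                  static void declareStagingTable(Connection con) throws SQLException {
                      try (Statement st = con.createStatement()) {
                          st.execute("DECLARE GLOBAL TEMPORARY TABLE SESSION.CSV_STAGE "
                                  + "(COL_A VARCHAR(64), COL_B VARCHAR(64)) "
                                  + "ON COMMIT PRESERVE ROWS NOT LOGGED");
                      }
                  }
              }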

              • #8
                OK, I am on board with that line of thinking.
                I am now thinking through the pros and cons of using a Queue as the repository (i.e., in place of the database).

                • #9
                  Originally posted by epleisman:
                  I am thinking through the pros/cons of using a Queue as the repository (aka database).
                  Do you mean using a Queue as your 'staging table'?

                  • #10
                    Yes - Queue as staging table.

                    • #11
                      There have been discussions about this type of approach, but I have yet to hear of an actual implementation. Staging with a database table requires only a staging step first, then a second step that processes using the staging table as input. With a queue, by contrast, you would need to write a special tasklet that takes each item returned from the provider and puts it on a queue, and each ItemProcessor would need to get its items from that queue. You would also need some way to throttle the producer (the ItemProvider) so that it doesn't add data faster than it is consumed and overflow the queue.
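
                      To make the shape of that concrete, here is a minimal sketch with a bounded java.util.concurrent queue; the capacity, thread count, and end-of-input marker are all illustrative. Note that put() blocks when the queue is full, which gives you the producer throttling for free:

                      Code:
                      import java.util.concurrent.ArrayBlockingQueue;
                      import java.util.concurrent.BlockingQueue;
                      import java.util.concurrent.ExecutorService;
                      import java.util.concurrent.Executors;

                      public class QueueHandoffSketch {
                          private static final String EOF = "EOF"; // end-of-input marker

                          public static void main(String[] args) throws Exception {
                              // Bounded queue: put() blocks when full, throttling the producer.
                              BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
                              ExecutorService workers = Executors.newFixedThreadPool(4);

                              for (int i = 0; i < 4; i++) {
                                  workers.submit(() -> {
                                      try {
                                          String item;
                                          while (!EOF.equals(item = queue.take())) {
                                              // process one item here (the "ItemProcessor" side)
                                          }
                                          queue.put(EOF); // pass the marker on to the next worker
                                      } catch (InterruptedException e) {
                                          Thread.currentThread().interrupt();
                                      }
                                  });
                              }

                              // Producer (the "ItemProvider" side): read records and enqueue them.
                              for (int i = 0; i < 10000; i++) {
                                  queue.put("record-" + i); // blocks if consumers fall behind
                              }
                              queue.put(EOF);
                              workers.shutdown();
                          }
                      }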

                      Again, I would only try this approach if you absolutely have to because of performance issues. You could easily write an ItemProvider and ItemProcessor that stay the same regardless of the solution, and first try them without any additional threads.

                      • #12
                        We implemented our batch using only files (XML files, one record to process per file) and it is working fine. We did this in part because the existing business processes already handled the transaction (and it was not easy to handle it from the batch right now) and because uploading XML to the database was not straightforward. In any case, if the transaction is controlled in the service and not in the batch, the choice doesn't make much difference.

                        We created some classes to manipulate files (renaming, moving, and the like) and we manage the state of a file by adding to or changing its name. The (big) problem with this approach is that the process may commit and the batch VM may fail before renaming the file being processed, leaving an inconsistent state (though it is easily recoverable).
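
                        For illustration, the rename-based state handling might look roughly like this; the suffixes are invented for the example, and the crash window described above sits between the commit and the second rename:

                        Code:
                        import java.io.File;

                        public class FileStateSketch {
                            // Claims a file for processing by renaming it; the rename doubles
                            // as a lock against other batch VMs picking up the same file.
                            static File markProcessing(File input) {
                                File processing = new File(input.getPath() + ".processing");
                                if (!input.renameTo(processing)) {
                                    throw new IllegalStateException("could not claim " + input);
                                }
                                return processing;
                            }

                            // Marks the file finished after the transaction commits. If the VM
                            // dies between the commit and this rename, the file looks unprocessed.
                            static void markDone(File processing) {
                                File done = new File(processing.getPath().replace(".processing", ".done"));
                                if (!processing.renameTo(done)) {
                                    throw new IllegalStateException("could not mark done: " + processing);
                                }
                            }
                        }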

                        We are planning on contributing these classes for manipulating files.

                        Just wanted to show another scenario where the staging table may not be as useful.

                        Lucas, do we have an example with multiple chunks? How would the stepOperations and ItemProvider be configured?

                        • #13
                          Originally posted by sotretus:
                          because uploading XML to the database was not straightforward.
                          I'm assuming this means importing XML files with spring-batch. If so, keep an eye out for some upcoming changes to XMLInputSources that should be committed tomorrow; hopefully they make that scenario a little more straightforward and extensible.

                          Originally posted by sotretus:
                          We created some classes to manipulate files (renaming, moving, and the like) and we manage the state of a file by adding to or changing its name. The (big) problem with this approach is that the process may commit and the batch VM may fail before renaming the file being processed, leaving an inconsistent state (though it is easily recoverable).

                          We are planning on contributing these classes for manipulating files.
                          I would still not recommend manipulating files within your batch processes unless you absolutely have to. Instead, an EAI tool that renames, moves, or uploads a file once it is complete would be a better solution. I say this because, in my experience, file moving can cause a lot of issues that needlessly hold up a batch stream, even though it generally has nothing to do with whether or not processing was actually successful.


                          Originally posted by sotretus:
                          Lucas, do we have an example with multiple chunks? How would the stepOperations and ItemProvider be configured?
                          I'm not sure I understand what you mean here. Do you mean an example of kicking off an ItemProvider in multiple threads? If so, we don't have an example yet; it's still strictly theoretical.

                          • #14
                            Lucas

                            Thanks for all the suggestions; we will make sure to take them into account.
                            What I meant by the "multiple chunks" question is whether we have an example where a chunkOperations RepeatTemplate is executed several times by the stepOperations RepeatTemplate. In a more functional view, this is when you want to commit several records together in chunks, but have multiple chunks of records to commit.

                            For example: I need to transform and update 1000 rows, but I want to process them 100 at a time.
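
                            For reference, here is a minimal sketch of that nesting against the current repeat API; the counter stands in for real reading and processing, and in a real step each chunk would run inside one transaction:

                            Code:
                            import java.util.concurrent.atomic.AtomicInteger;

                            import org.springframework.batch.repeat.RepeatStatus;
                            import org.springframework.batch.repeat.policy.SimpleCompletionPolicy;
                            import org.springframework.batch.repeat.support.RepeatTemplate;

                            public class NestedChunkSketch {
                                public static void main(String[] args) {
                                    // Inner "chunk" loop: completes after 100 iterations,
                                    // i.e. a commit interval of 100.
                                    RepeatTemplate chunkOperations = new RepeatTemplate();
                                    chunkOperations.setCompletionPolicy(new SimpleCompletionPolicy(100));

                                    // Outer "step" loop: runs one chunk per iteration until
                                    // the input (here, 1000 fake rows) is exhausted.
                                    RepeatTemplate stepOperations = new RepeatTemplate();

                                    AtomicInteger row = new AtomicInteger();
                                    int total = 1000;

                                    stepOperations.iterate(stepContext -> {
                                        chunkOperations.iterate(chunkContext ->
                                                row.incrementAndGet() < total
                                                        ? RepeatStatus.CONTINUABLE : RepeatStatus.FINISHED);
                                        return row.get() < total ? RepeatStatus.CONTINUABLE : RepeatStatus.FINISHED;
                                    });
                                }
                            }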
