
  • Multi-threaded FlatFileItemReader

    I am processing a huge CSV file as input, where the staging functionality may be a performance killer since the actual processing is very simple (a SQL insert).

    I wanted to partition the input so that each partition would stream part of the file (line 0 to line 9999, line 10000 to line 19999, etc.). Restartability of each partition could be guaranteed by storing the chunk boundary in the execution context and streaming the file again upon restart (start + read.count).

    If it's not built into Spring Batch, I'd like to understand what's wrong with my reasoning, or whether it is simply a valid feature that is not implemented.

    Thanks!
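The partitioning scheme proposed above can be sketched in plain Java. This is a hypothetical illustration (class and key names are mine, not from the thread or the Spring Batch API); in Spring Batch the same arithmetic would live in a Partitioner implementation that stores each range in its partition's ExecutionContext:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the proposed scheme: split a file of `totalLines` lines into
// `gridSize` contiguous line ranges. Each partition would later be handed
// its [startLine, endLine] bounds via an ExecutionContext so that a restart
// can resume from startLine + read.count.
public class LineRangePartitioner {
    public static Map<String, long[]> partition(long totalLines, int gridSize) {
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long linesPerPartition = totalLines / gridSize;
        long start = 0;
        for (int i = 0; i < gridSize; i++) {
            // the last partition absorbs any remainder
            long end = (i == gridSize - 1) ? totalLines - 1 : start + linesPerPartition - 1;
            partitions.put("partition" + i, new long[] { start, end });
            start = end + 1;
        }
        return partitions;
    }
}
```

For example, one million lines with gridSize=100 yields partition0 = lines 0–9999, partition1 = lines 10000–19999, and so on, matching the ranges described in the question.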

  • #2
    It should work as you describe just fine, but we never provided explicit support for this in the framework because I'm not convinced it helps. Especially if the output side of the transaction is quite simple, I'm not sure partitioning the file this way will help, because all partitions will have to read the file up to the point where they can start processing anyway. If you try it and it helps, let me know.
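The caveat above is that a line-oriented reader cannot seek directly to line N: it has to stream all earlier lines first. A minimal sketch of that skip cost (hypothetical helper, not Spring Batch code, though the framework's count-based restart behaves analogously):

```java
import java.io.BufferedReader;
import java.io.IOException;

// Illustrates the seek cost: to return the line at index `startLine`,
// every earlier line must still be read from the underlying stream, so
// each partition pays I/O proportional to its starting offset.
public class LineSkipper {
    public static String readLineAt(BufferedReader reader, long startLine) throws IOException {
        for (long i = 0; i < startLine; i++) {
            reader.readLine(); // skipped lines are still pulled through the reader
        }
        return reader.readLine();
    }
}
```

With gridSize partitions over the same file, the last partition skips almost the whole file before producing its first item, which is why the I/O saturation reported later in the thread occurs up front.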



    • #3
      Originally posted by Dave Syer (post #2 above)
      Hi Dave,

      We did implement the multi-threaded file reader and the related partitioner to give it a shot. The partitioner is restricted to one line = one item, but it can be extended quite easily.

      Here are the results with a CSV file containing one million entries. The job only reads each item, parses it into a VO, and passes it to a JPA DAO backed by Hibernate.

      • Single thread (standard SB reader), commit-interval=5: 15 min 54 sec
      • Single thread (standard SB reader), commit-interval=50: 7 min 44 sec
      • Single thread (standard SB reader), commit-interval=100: 7 min 06 sec
      • Multi-reader with a partition, gridSize=10, commit-interval=50: 6 min 18 sec
      • Multi-reader with a partition, gridSize=4, commit-interval=50: 3 min 47 sec
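The speed-up implied by these figures can be checked with quick arithmetic (times converted to seconds, using the numbers reported above):

```java
// Quick arithmetic on the benchmark figures: convert each run to seconds
// and compute the speed-up over the single-threaded commit-interval=5 baseline.
public class Speedup {
    public static int seconds(int min, int sec) { return min * 60 + sec; }
    public static double factor(int baselineSec, int runSec) {
        return (double) baselineSec / runSec;
    }
    public static void main(String[] args) {
        int baseline = seconds(15, 54); // single thread, commit-interval=5
        System.out.printf("commit-interval=100 alone: %.1fx%n", factor(baseline, seconds(7, 6)));
        System.out.printf("gridSize=4 partitioned:    %.1fx%n", factor(baseline, seconds(3, 47)));
    }
}
```

So raising the commit interval alone gives roughly a 2.2x improvement, and the gridSize=4 partitioned run is about 4.2x faster than the baseline (about 2x faster than the best single-threaded run).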


      You are right: I/O is completely saturated while the threads move the cursor to the right spot. Once that is done, processing is obviously much faster.

      If you are interested in the code for the reader and the partitioner, my company would be happy to contribute it back.

      Best,
      Stéphane



      • #4
        Interesting data. Can you make a JIRA and post your code there?



        • #5
          Here you go:
          https://jira.springframework.org/browse/BATCH-1613

          This was on a MacBook Pro with an SSD drive, by the way, but we also have a benchmarking infrastructure with non-SSD drives. The setup is slightly different, but the performance improvement factor is similar.
