Splitting large files into smaller files

  • Splitting large files into smaller files

    Hi,
    I am using Spring Integration along with Spring Batch to process batch files. Spring Integration polls a directory and, if a file matching a particular name pattern is found, uses spring-batch-integration and the JobLauncher to launch a job. We have been facing performance issues with Spring Batch when processing huge files of more than 5.5 million rows; such a job runs for about 9 hours. To improve performance, we decided to use partitioning in Spring Batch and split the incoming file into smaller chunks that can then be processed in parallel. Due to time constraints, we wrote a Bourne shell script, invoked from a Java class, that does the split after validating the record count in the footer. However, I wanted to know whether there is a way in Spring Integration to specify the number of files the master file should be split into and the folder where the split files should be placed. Also, are there any performance hits in doing it this way versus using shell scripts? I understand that shell scripts would be faster, and I am prepared to live with a minor performance decrease as long as I can centralize my logic in one place instead of maintaining a separate shell script, with the additional burden of having to find another way on Windows servers. I would be very interested in your experience and suggestions on this. Could there be a FileSplitter component that serves this purpose if there is no easy way of doing it at present?


  • #2
    Spring Batch can pick up files as input based on an expression, so as long as the resulting files have predictable names, Batch already handles this use case. With regard to the splitting of the file itself, you could implement a Partitioner to do the division of the file into multiple files. We don't currently have this component for the reason he mentions... most people do it via a script of some kind for performance reasons (split is going to be much faster than piping it through a job). The closest we currently come is the MultiResourcePartitioner, which assumes that the file division has already occurred.

    An alternative, Spring Batch-only approach would be to create a step that pipes the input from a FlatFileItemReader to a MultiResourceItemWriter. The MultiResourceItemWriter creates a new file based on the number of records written. This would allow the breaking up of the files to be independent of the step processing them, and it would use 100% off-the-shelf components.
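
    To make the rollover idea concrete, here is a minimal plain-Java sketch of the same behaviour outside Spring Batch: it copies lines to an output file and starts a new file after a fixed number of records. It is not the MultiResourceItemWriter itself; the class name LineCountSplitter, the method split, and the "<input name>.<n>" naming of the parts are assumptions made purely for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Spring Batch class): rolls over to a new file every maxLinesPerFile lines. */
class LineCountSplitter {

    /** Splits input into parts under outputDir and returns the parts in the order they were written. */
    static List<Path> split(Path input, Path outputDir, int maxLinesPerFile) throws IOException {
        if (maxLinesPerFile <= 0) {
            throw new IllegalArgumentException("maxLinesPerFile must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> parts = new ArrayList<>();
        BufferedWriter writer = null;
        int linesInCurrentPart = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Open the first part, or roll over once the current part is full.
                if (writer == null || linesInCurrentPart == maxLinesPerFile) {
                    if (writer != null) {
                        writer.close();
                    }
                    // Assumed naming: original file name plus a 1-based numeric suffix.
                    Path part = outputDir.resolve(input.getFileName() + "." + (parts.size() + 1));
                    parts.add(part);
                    writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    linesInCurrentPart = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInCurrentPart++;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return parts;
    }
}
```

    A call such as LineCountSplitter.split(input, outputDir, 500000) would leave files whose predictable names a later, file-per-partition step could match with a pattern. Header and footer records are not treated specially in this sketch.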


    • #3
      Hi Michael,
      I agree about the other parts of Spring Batch for handling partitioned files; they work just fine. I tested them and am happy with how they work. My only gripe was about handling the splitting externally. It would have been great if Spring Integration had a pre-built component, something that extends the splitter functionality but is specific to files, which we could leverage to split the file before launching the job itself. I only wanted to know whether Mark and the team think this is a component worth developing, since it fits nicely into the spring-batch-integration project, or maybe even into the core spring-integration components, and could help drive traffic for parallel batch processing.



      • #4

        It does sound like a useful feature. For now, you should be able to use <splitter ref="fileSplitter" .../> and then have a "fileSplitter" bean that takes a path and returns the files/sub-paths/whatever (basically, single object in, list/collection/array out)... you could return Message instances if you also want to add headers there (or do that with a simple header-enricher upstream), such as the 'totalFileCount'.
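
        As a rough illustration, a "fileSplitter" bean of that shape might look like the plain-Java sketch below: one File in, a List of part files out. Everything named here is an assumption rather than an existing Spring Integration component: the class FileSplitter, its constructor arguments, the ".partNNN" naming, and the choice to split into contiguous blocks of lines after a first pass that counts them. The wiring is not shown, and the splitter's method attribute may need to point at split.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical POJO for use behind a splitter endpoint: single object in, list out. */
class FileSplitter {

    private final File outputDir;
    private final int splitCount;

    FileSplitter(File outputDir, int splitCount) {
        if (splitCount <= 0) {
            throw new IllegalArgumentException("splitCount must be positive");
        }
        this.outputDir = outputDir;
        this.splitCount = splitCount;
    }

    /** Splits the input into at most splitCount files of contiguous lines. */
    List<File> split(File input) throws IOException {
        // First pass: count lines so each part can get an even share.
        long totalLines;
        try (Stream<String> lines = Files.lines(input.toPath(), StandardCharsets.UTF_8)) {
            totalLines = lines.count();
        }
        // Ceiling division; at least one line per part.
        long linesPerPart = Math.max(1, (totalLines + splitCount - 1) / splitCount);

        Files.createDirectories(outputDir.toPath());
        List<File> parts = new ArrayList<>();
        // Second pass: copy blocks of linesPerPart lines into numbered part files.
        try (BufferedReader reader = Files.newBufferedReader(input.toPath(), StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            while (line != null) {
                File part = new File(outputDir,
                        String.format("%s.part%03d", input.getName(), parts.size() + 1));
                parts.add(part);
                try (BufferedWriter writer = Files.newBufferedWriter(part.toPath(), StandardCharsets.UTF_8)) {
                    for (long written = 0; written < linesPerPart && line != null; written++) {
                        writer.write(line);
                        writer.newLine();
                        line = reader.readLine();
                    }
                }
            }
        }
        return parts;
    }
}
```

        The size of the returned list is the value that could be carried in a header such as 'totalFileCount'. Reading the file twice keeps the sketch simple; if a trusted record count is already available (for example from a validated footer), the first pass could be skipped.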



        • #5
          Thanks Mark. Yes, that will do. Eventually I was hoping to use something like this:
          <file-splitter input-path="" output-path="" naming-scheme="" file-split-number="" file-split-size=""/>
          1. input-path would be the path to the file if the splitter is the first element in the flow; otherwise it could be an expression that takes the path from the file poller payload.
          2. output-path would be where the split files need to be put.
          3. naming-scheme would be how the split files need to be named.
          4. file-split-number would be the number of split files required.
          5. file-split-size could be used instead of file-split-number if we need to split by size.

          All of the above could also be based on expressions or bean references.

          I just thought this up; it may not be perfect or something you would agree with. I just wanted to throw it out there in case it helps in some way if you go down that direction.