Spring Batch: multi file processing using multi threads

  • Spring Batch: multi file processing using multi threads

    I have a requirement where I have to deal with multiple files (say, 300 CSV files).

    I need to read --> process --> write each individual file, as I need to apply some transformation logic to the data.

    For each input file there would be a corresponding output file, so for 300 input files we would have 300 output files. At the end, all 300 output files need to be merged into a single file, which would be compressed and then transferred to a remote location over FTP/SFTP.

    Say every hour we have to deal with a new set of 300 files that require the above processing, so we would schedule the job to run hourly.

    1. How can multi-file processing be handled in the above scenario using Spring Batch?
    2. How can the above processing be made to happen in multiple threads?

    Please suggest. Thanks in advance.

  • #2
    We can read multiple input files using MultiResourceItemReader with FlatFileItemReader set as its delegate.

    But the question yet to be answered is: is there an ItemWriter which, when used with MultiResourceItemReader, writes a separate output file for each individual input file being read?

    The target is that if we have 10 input files, then 10 corresponding output files should be written.

    It would be really helpful if someone could give some insight into this.
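
    For reference, here is a minimal sketch of the reader side described above. The bean names and the input path pattern are placeholders, not taken from any sample:

    Code:
        <bean id="multiResourceReader" class="org.springframework.batch.item.file.MultiResourceItemReader">
            <!-- placeholder pattern: resolves to every CSV file in the input directory -->
            <property name="resources" value="file:input/*.csv" />
            <property name="delegate" ref="csvReader" />
        </bean>

        <bean id="csvReader" class="org.springframework.batch.item.file.FlatFileItemReader">
            <property name="lineMapper">
                <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
                    <property name="lineTokenizer">
                        <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer" />
                    </property>
                    <property name="fieldSetMapper">
                        <bean class="org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper" />
                    </property>
                </bean>
            </property>
        </bean>
    Note that the delegate gets no resource property of its own; MultiResourceItemReader sets the current resource on it as it walks the file list. This also means the files are read as one continuous stream, which is exactly why the one-output-file-per-input-file question above remains open.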



    • #3
      I found a solution to this in the partitionFileJob example from the spring-batch source, spring-batch-samples/src/main/resources/jobs/partitionFileJob.xml. You can download the source from http://static.springsource.org/sprin...downloads.html.
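
      For anyone who cannot pull the sample down right now, the core of that approach is a partitioned step driven by a MultiResourcePartitioner, roughly along these lines. The bean names, paths, and grid size here are placeholders, not copied from the sample:

      Code:
          <job id="multiFileJob" xmlns="http://www.springframework.org/schema/batch">
              <step id="partitionedStep">
                  <partition step="processFile" partitioner="partitioner">
                      <handler grid-size="10" task-executor="taskExecutor" />
                  </partition>
              </step>
          </job>

          <!-- creates one partition per matching file and stores each file's URL
               in that partition's step execution context under the key "fileName" -->
          <bean id="partitioner" class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
              <property name="resources" value="file:input/*.csv" />
          </bean>

          <!-- each partition binds its own input file via late binding;
               lineMapper configuration omitted for brevity -->
          <bean id="itemReader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">
              <property name="resource" value="#{stepExecutionContext['fileName']}" />
          </bean>
      The processFile step itself is an ordinary chunk-oriented step using that reader; every partition runs its own copy of it on its own file.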



      • #4
        Thanks Bruce... partitionFileJob is the answer I was looking for, and it worked. I had actually found it last week but could not post it here until now.
        It also has the capability to launch multiple threads for multiple files.
        Thanks once again.



        • #5
          Glad it worked for you. In the "multiple threads for multiple files" scenario, can you tell if each thread gets its own partition? Or are there multiple threads in the same partition?



          • #6
            I wrote a quick little test where I put specific ids in three different files: File 1 got ids starting with 1, File 2 got ids starting with 2, and File 3 got ids starting with 3.

            I wrote a FlatFileItemWriter subclass that prints the thread and the ids that are being written.

            Code:
                <job id="widgetLogProcessorJob" xmlns="http://www.springframework.org/schema/batch">
                    <step id="start">
                        <partition step="partitionWidgetLogs" partitioner="partitioner">
                            <handler grid-size="7" task-executor="taskExecutor" />
                        </partition>
                    </step>
                </job>

                <bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor">
                    <property name="concurrencyLimit" value="7" />
                </bean>
            Code:
                import java.util.List;

                import org.springframework.batch.item.file.FlatFileItemWriter;

                // Writer subclass that reports the current thread before delegating
                // to the standard FlatFileItemWriter.
                class MyFlatFileItemWriter<T> extends FlatFileItemWriter<T> {

                    @Override
                    public void write(List<? extends T> items) throws Exception {
                        // ThreadUtils is my own helper; it prints the thread name
                        // and the ids of the items being written
                        ThreadUtils.writeThreadExecutionMessage("write", items);
                        super.write(items);
                    }
                }
            So it looks like each thread works on its own file.

            Code:
            thread SimpleAsyncTaskExecutor-3 - write widgetItems(s) with id(s) #200000000001 #200000000002 #200000000003 #200000000004 #200000000005 #200000000006
            thread SimpleAsyncTaskExecutor-1 - write widgetItems(s) with id(s) #100000000001 #100000000002 #100000000003 #100000000004 #100000000005 #100000000006
            thread SimpleAsyncTaskExecutor-2 - write widgetItems(s) with id(s) #300000000001 #300000000002 #300000000003 #300000000004 #300000000005 #300000000006
            There was some concern that FlatFileItemWriter and FlatFileItemReader are not thread-safe, and now we have multiple threads. But it looks like step scope ensures that each thread works on its own file.



            • #7
              Yes, the scope="step" attribute on the respective beans in the context file does the trick.
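
              For concreteness, this is the shape of a step-scoped writer that gives each partition its own output file. The outputFile key is hypothetical: it assumes a custom Partitioner that stores an output name alongside each fileName, which MultiResourcePartitioner does not do out of the box:

              Code:
                  <bean id="itemWriter" class="org.springframework.batch.item.file.FlatFileItemWriter" scope="step">
                      <!-- 'outputFile' is a hypothetical key that a custom Partitioner
                           would have to put into each partition's execution context -->
                      <property name="resource" value="file:output/#{stepExecutionContext['outputFile']}" />
                      <property name="lineAggregator">
                          <bean class="org.springframework.batch.item.file.transform.PassThroughLineAggregator" />
                      </property>
                  </bean>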

              I have another query here:
              If for 10 input files I create a single output file containing the data from all 10 input files, I have to give every input file the same output file name and set appendAllowed="true" on the FlatFileItemWriter in the context XML.
              I also intend to add some header lines to the output file, which I am achieving with a headerCallback on the FlatFileItemWriter.

              But the issue is that I am still using the multiFilePartition job, and I want the header to be written only once, when the output file is first created; after that it should just append the records from the multiple input files to the output file.

              Please provide your inputs on how I should achieve this.
              I am searching for the same and will update this thread if I find a solution.



              • #8
                I figured I couldn't have it both ways: either one file or ten files. If ten threads were writing to the same file, something would need to coordinate the writes. I like the performance benefits of the partitioned jobs, so I just combine the files into one outside of Spring Batch.
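
                A sketch of that outside-the-job merge, using only plain JDK I/O. Writing the header here, exactly once, also sidesteps the headerCallback problem from #7. The class name, file names, and header text are all made up:

                Code:
                    import java.io.File;
                    import java.io.FileInputStream;
                    import java.io.FileOutputStream;
                    import java.io.IOException;
                    import java.io.OutputStream;
                    import java.util.zip.GZIPOutputStream;

                    public class OutputFileMerger {

                        /**
                         * Concatenates the partition output files into a single
                         * gzipped file, writing the header exactly once at the top.
                         */
                        public static void merge(File[] parts, File target, String header) throws IOException {
                            OutputStream out = new GZIPOutputStream(new FileOutputStream(target));
                            try {
                                out.write((header + "\n").getBytes("UTF-8"));
                                byte[] buffer = new byte[8192];
                                for (File part : parts) {
                                    FileInputStream in = new FileInputStream(part);
                                    try {
                                        int read;
                                        while ((read = in.read(buffer)) != -1) {
                                            out.write(buffer, 0, read);
                                        }
                                    } finally {
                                        in.close();
                                    }
                                }
                            } finally {
                                out.close();
                            }
                        }
                    }
                The resulting single compressed file can then be handed to whatever FTP/SFTP transfer the original poster already has in place.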
