
  • Large files handling

    We use Spring Batch as the core of our batch processing module, so I tried to use Spring Batch with large XML files (> 5 MB), but it crashes with an OutOfMemory exception. I would like to know the limit on file size that Spring Batch can handle, and how to handle XML files that are bigger than 10 megabytes.

    Thank you.

  • #2
    There shouldn't be any limit on the file size imposed by Spring Batch itself - can you provide more details about the problem?


    • #3
      I have personally worked with clients whose files were much larger than 5 MB, and I know we performance-tested the XML reader a couple of months ago with 100+ MB files with no OutOfMemory issues. Is there any other logic within the job? I'm also curious what your commit interval is and whether you are using any skip or retry logic. Also, what version of the framework are you using?


      • #4
        Here are the details:

        - I used the StaxEventItemReader with UnmarshallingEventReaderDeserializer and XStreamMarshaller, as illustrated in the StAX sample.
        - I wrote code to read the XML fragments using this item reader.
        - The declared fragment holds 3 string nodes, which map to a domain class with 3 string fields.
        - Spring Batch crashes after consuming about 4 MB of data from a 10 MB file.

        Here is the code of the step execution method:

        ((ItemStream) itemReader).open(stepExecution.getExecutionContext());
        try {
            Object emp = itemReader.read();
            int i = 0;
            while (emp != null) {
                i++;
                emp = null;
                emp = itemReader.read();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            ((ItemStream) itemReader).close(stepExecution.getExecutionContext());
        }



        • #5
          I'm extremely curious about this while loop: why not just return the emp from the reader? Was it some kind of performance test? Did you run this stand-alone or as part of a step?


          • #6
            - We need to read the XML file and store all the data in the database, so we will have a step that performs this task: it reads the XML file using the XML item reader, then writes the data to the database using the Hibernate cursor item writer.

            - The while loop is there to read all the XML fragments.

            - I didn't understand what you meant by "why not just return the emp from the reader?"


            • #7
              You shouldn't be trying to read in the entire file at once; attempting to do so is what is causing your error. Instead, read one 'item' (i.e. an emp) from the file, then write it to the database (by simply returning it from the reader), then read the next one, write it, and so on. This is the essence of batch processing. Even if you used StAX directly with an XML binding framework (i.e. without our reader), you would likely get the same error. This is because the parser will create a lot of objects (one for each event), as will the binding framework, and since you're spinning through the whole file, the GC can't keep up.

              Furthermore, using the read-one/write-one approach allows you to take advantage of advanced framework features. For example, you can use the framework to configure how many items will be processed per chunk, that is, how many items you will read and write in each transaction (LUW). You can also use restart: before committing a transaction, the framework will store the current state of your reader, so that in the case of a failure it can start processing again from where it left off. If you read in the whole dataset at once, there's no way to accomplish this.
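The read-one/write-one chunking described above can be sketched in plain Java. This is only an illustration of the pattern, not the Spring Batch API itself: the `ItemReader`/`ItemWriter` interfaces and the `run` method here are illustrative stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of chunk-oriented processing: read one item at a time,
// buffer up to commitInterval items, then flush the buffer to the writer as
// one logical unit of work (one transaction).
public class ChunkLoop {

    interface ItemReader<T> { T read(); }                        // null signals end of input
    interface ItemWriter<T> { void write(List<? extends T> items); }

    // Returns the number of chunks written.
    static <T> int run(ItemReader<T> reader, ItemWriter<T> writer, int commitInterval) {
        int chunks = 0;
        List<T> buffer = new ArrayList<>();
        T item;
        while ((item = reader.read()) != null) {
            buffer.add(item);
            if (buffer.size() == commitInterval) {
                writer.write(buffer);   // commit point: reader state can be saved here
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {        // final, possibly partial, chunk
            writer.write(buffer);
            chunks++;
        }
        return chunks;
    }
}
```

At no point are more than `commitInterval` items held in memory at once, which is why this shape of loop scales to arbitrarily large input files while the read-everything-first loop does not.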


              • #8
                Thank you very much.
                I will try this approach and I will be back.


                • #9
                  I was just reminded of another reason your while loop is causing an issue: in order to support rollback, we buffer the events. So, when you call mark() (which happens by default on the first read, even if you don't call it), we buffer all of the events that occur until the next mark, so that when you call reset(), we can regenerate the items. There is some talk about potentially buffering the items themselves, but either way the framework has to hold on to some items in memory until the next call to mark().
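The memory effect described above can be sketched with a toy reader wrapper. This is not the actual framework class, just an illustration of why everything read since the last mark() must stay in memory: if mark() is never advanced (e.g. a loop that reads the whole file in one go), the buffer grows without bound.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Toy reader supporting mark()/reset(): every item read since the last mark
// is retained so that reset() can replay it after a rollback.
public class BufferingReader<T> {
    private final Iterator<T> source;
    private final Deque<T> buffer = new ArrayDeque<>(); // items since last mark()
    private final Deque<T> replay = new ArrayDeque<>(); // items queued for re-reading

    public BufferingReader(Iterator<T> source) { this.source = source; }

    public T read() {
        T item = !replay.isEmpty() ? replay.poll()
                : (source.hasNext() ? source.next() : null);
        if (item != null) buffer.add(item);             // retained until the next mark()
        return item;
    }

    public void mark() { buffer.clear(); }              // commit: buffered items released

    public void reset() {                               // rollback: replay since last mark()
        replay.addAll(buffer);
        buffer.clear();
    }

    public int buffered() { return buffer.size(); }
}
```

With a sensible commit interval, mark() is called every few items and the buffer stays small; reading the whole file inside one transaction keeps every item buffered at once.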