
  • Handling of a complex input file structure. Looking for advice.

    I've been prototyping a simple solution to extract an EQUENS CLIEOP payments file using Spring Batch. It worked very well for me, but I'm struggling with a more complex input format.

    A CLIEOP file is a flat file with fixed-length records. The structure of the file is:
    1 x File header (single-line record)
    n x Batch header (multi-line record)
    m x Item (multi-line record)
    n x Batch trailer (single-line record)
    1 x File trailer (single-line record)

    As you can see, the structure is fairly complex, and both n and m can be very large (hundreds of thousands).

    Can you please give me some tips on how to use Spring Batch to efficiently transform the file into a domain model (POJO beans) consisting of File, Batch and Item?
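    For context, fixed-length records like these are typically parsed by slicing each line at known character offsets. A minimal plain-Java sketch; the record-type position and field offsets below are invented for illustration, not the real CLIEOP layout:

```java
// Illustrative fixed-length record parsing; offsets are made up, not CLIEOP's.
class FixedLengthParser {

    // Assume the first 4 characters of every line carry the record type code.
    static String recordType(String line) {
        return line.substring(0, 4);
    }

    // Slice a field out of the line by start offset and length, trimming padding.
    static String field(String line, int start, int length) {
        return line.substring(start, start + length).trim();
    }
}
```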

  • #2
    It depends a bit on what you want to do with the domain objects. If you really have two different item types in the same file and you need to process/write them independently, then I would probably consider processing the file twice, in order to avoid holding large amounts of data in memory. It might also be advisable to put the records in a database, especially if the two item types have dependencies or associations between items. The file-level data can be dealt with in an ItemReader that is both a delegate to a file reader and a step listener, so it can stash the file metadata for the duration of the step or job.
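    As a rough plain-Java illustration of that delegate-plus-listener idea (the class and method names are simplified stand-ins, not the real Spring Batch interfaces):

```java
import java.util.Iterator;

// Hypothetical reader that peels the file header off before delegating,
// stashing it so the rest of the step can consult the file metadata.
class HeaderStashingReader {
    private final Iterator<String> lines;
    private String fileHeader; // the stashed file metadata

    HeaderStashingReader(Iterator<String> lines) {
        this.lines = lines;
    }

    // Analogous to a step listener's beforeStep(): capture the header
    // before any items flow through the step.
    void open() {
        if (lines.hasNext()) {
            fileHeader = lines.next();
        }
    }

    String getFileHeader() {
        return fileHeader;
    }

    // Analogous to read(): hand out the remaining lines one by one.
    String read() {
        return lines.hasNext() ? lines.next() : null;
    }
}
```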


    • #3
      Processing the same file twice or more may not be necessary. You can prepare an IteratorItemReader which serves one object after another (e.g. FileHeader, BatchHeader, Item, [...], BatchTrailer, [...], FileTrailer) and then an item processor that delegates based on the type of the object. That way you can keep the FileHeader and the last BatchHeader as state in your ItemProcessor (for use while processing items).
      To improve performance, a splitter based on batches should be considered.
      With CLIEOP files, most of the problems will be in correctly parsing the file, which can have optional lines with varying cardinality.
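      A plain-Java sketch of such a type-dispatching, state-keeping processor (the class names are invented; a real version would implement Spring Batch's ItemProcessor interface):

```java
// Invented record types; in a real job these would be the parsed CLIEOP POJOs.
class FileHeader { final String id; FileHeader(String id) { this.id = id; } }
class BatchHeader { final String id; BatchHeader(String id) { this.id = id; } }
class Item { final String payload; Item(String payload) { this.payload = payload; } }

// Processor that remembers the last seen headers and dispatches on item type,
// so every Item can be resolved against its enclosing file and batch.
class DelegatingProcessor {
    private FileHeader currentFile;
    private BatchHeader currentBatch;

    // Headers only update state (returning null filters them out);
    // items are enriched with their current context.
    String process(Object record) {
        if (record instanceof FileHeader) {
            currentFile = (FileHeader) record;
            return null;
        }
        if (record instanceof BatchHeader) {
            currentBatch = (BatchHeader) record;
            return null;
        }
        Item item = (Item) record;
        return currentFile.id + "/" + currentBatch.id + "/" + item.payload;
    }
}
```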


      • #4
        BTW: IteratorItemReader and DelegatorItemProcessor should be common Spring Batch classes.
        Maybe they are worth discussing and including in the code base.


        • #5
          Probably a little late to be chiming in, but isn't this file layout sort of the premise of the AggregateItemReader?

          Not sure how well that will work for you with the file header and file footer in the mix; their sample file didn't have those, and I haven't looked at the base implementation of the class or used it enough to know whether it will work. Some derivation of it might, though.


          • #6
            I can think of two approaches to validating the file structure. Not sure which is preferred/correct.

            1. Create a new step 0. This step rifles through the file and validates that there is one header and one trailer. It can also validate header record counts, and any trailer sums if your file has such a thing.

            2. Read the file only once. Implement an ItemProcessListener, override the beforeProcess() method, and build out a mini state machine (*not truly a state machine, since I don't transition states, but a similar concept).

            The state machine would look something like the following:

            a) Encounter header - set the file header flag to true. Verify the detail flag and file trailer flag are both false.

            b) Encounter detail - verify the file header flag is true and the file trailer flag is false.

            c) Encounter trailer - verify the file header flag is true; the detail flag may be true or false (that depends on the requirement; 1 header / 0 details / 1 trailer seems valid to me).
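            The flag checks above could be sketched like this (purely illustrative; record-type detection is reduced to a string tag):

```java
// Illustrative structure validator built on the boolean flags described above.
class FileStructureValidator {
    private boolean headerSeen, detailSeen, trailerSeen;

    void accept(String recordType) {
        switch (recordType) {
            case "HEADER":
                if (headerSeen || detailSeen || trailerSeen)
                    throw new IllegalStateException("unexpected header");
                headerSeen = true;
                break;
            case "DETAIL":
                if (!headerSeen || trailerSeen)
                    throw new IllegalStateException("detail outside header/trailer");
                detailSeen = true;
                break;
            case "TRAILER":
                if (!headerSeen || trailerSeen)
                    throw new IllegalStateException("unexpected trailer");
                trailerSeen = true;
                break;
            default:
                throw new IllegalStateException("unknown record type: " + recordType);
        }
    }

    // A structurally valid file has exactly one header and one trailer;
    // details are optional (1 header / 0 details / 1 trailer is accepted).
    boolean complete() {
        return headerSeen && trailerSeen;
    }
}
```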

            I am confident either approach will work; I am more concerned with which is the better one. The first approach could have performance concerns: I'm not sure how fast it is to rifle through 100k rows and tokenize, build objects and validate structure. The second approach complicates the job logic considerably for something that should be relatively trivial.

            As far as the original poster is concerned, your file is a little more complex than mine, since the middle piece is a record set and not individual records. Assuming you have your reading strategy worked out, I would do it in an afterProcess() method rather than beforeProcess(), so that you can validate your domain objects and use logic similar to the above (I am assuming you will consume that record set and map it to a single POJO).
            Last edited by Trunks; Dec 8th, 2009, 01:40 PM.


            • #7
              Dave and all,
              Any final suggestions for processing this type of file format? I am also looking at handling the same type of file. "tblachowicz", which one did you use in the end? Any input?

              This type of format is very common in the financial domain, so I would love to hear the best solution from you guys; a sample in batch-samples would also be great in the future.

              "Jul" - Looks like IteratorItemReader is part of code base now but not DelegatorItemProcessor.



              • #8
                Hi skdian,
                In the end I used Artix Data Services to parse the file:
                - I created a model and a transformation to POJOs
                - I created an Iterator implementation to read objects from the ADS transformation (the iterator is injected into IteratorItemReader)
                - I prepared an ItemProcessor which keeps the state (in job scope) of the last header (file/batch)

                A DelegatorItemProcessor might simplify the code a bit, but I haven't had a chance to prepare something like that. Maybe I will do it later.

                Two remarks:
                - Items in the file are ordered, so in terms of F&R (failure and restart) we could save the state (currently IteratorItemReader doesn't), but that will work correctly only for single-threaded processing and may cause problems when any skip policy is used, so please analyse carefully; the other way is to reprocess the whole file.
                - I usually process files like this in a few steps: the first step is responsible for loading business objects from the file into a staging DB area, the second is validation, which checks the consistency of the staged data, and the third and subsequent steps process the staged data.
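                A toy plain-Java sketch of that staged layout (the in-memory list stands in for the staging DB, and the trivial validation is illustrative; real steps would be chunk-oriented Spring Batch steps):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the load -> validate -> process staging pipeline.
class StagedPipeline {
    private final List<String> staging = new ArrayList<>(); // stand-in for the staging DB

    // Step 1: load parsed business objects into the staging area.
    void load(List<String> parsedRecords) {
        staging.addAll(parsedRecords);
    }

    // Step 2: check consistency of the staged data (trivially non-empty here).
    boolean validate() {
        return !staging.isEmpty();
    }

    // Step 3 and onwards: process the staged data.
    List<String> process() {
        List<String> out = new ArrayList<>();
        for (String s : staging) {
            out.add("processed:" + s);
        }
        return out;
    }
}
```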