
  • File sorting in batch

    Hello All,

    In our COBOL mainframe program, which we are migrating to Java, there is a batch job that takes an input file (in text format) and sorts it in a particular order (same as ORDER BY in an SQL query).

    Sometimes it orders by the first three characters, then by characters 24-25, and then by characters 50-55 of each line (multiple-column ordering), so the whole file is reordered into a particular format.
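This multi-column, fixed-position ordering can be expressed in plain Java with a chained Comparator. This is only a sketch - the column positions used here (1-3 and 18-20) and the sample records are illustrative stand-ins for the real 24-25 / 50-55 layout:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FixedWidthSort {
    // Extract a 1-based, inclusive character range from a fixed-width record.
    static String key(String line, int from, int to) {
        return line.substring(from - 1, Math.min(to, line.length()));
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>(List.of(
                "BBB-rest-of-record-x",
                "AAA-rest-of-record-y",
                "AAA-rest-of-record-a"));

        // Order by chars 1-3 first, then by chars 18-20 (illustrative positions,
        // standing in for the 24-25 / 50-55 columns of the real layout).
        Comparator<String> byColumns = Comparator
                .comparing((String r) -> key(r, 1, 3))
                .thenComparing(r -> key(r, 18, 20));

        records.sort(byColumns);
        records.forEach(System.out::println);
    }
}
```

Each `thenComparing` adds one more sort column, so any number of fixed-width columns can be chained the same way.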

    I do not know whether this is a Spring Batch related question, but I wanted to check whether the gurus out there are aware of a way to do this using Spring Batch. This needs to be a separate job, so it has to be written in Spring Batch.

    Or is the best mechanism to load each record in the file into a database, fire an ORDER BY query, and write the result to a new file? I am worried about the performance here.

    Thanks!

  • #2
    Before diving into possible solutions, I'm extremely curious as to what the use case is that requires you to order the flat file?



    • #3
      Good question

      Hello lucasward,

      I do not have a good answer to that question! Our current batch programmers (mainframe/COBOL) do all of the sorting and deletion of records in the file itself, without any database interaction.

      To my knowledge, the best way to do it is to read all the data into a database and then do the processing, but I am curious whether there is any possibility of doing it in the file itself - are there any fast utilities or methods available?

      If not, then I believe it is better not to spend much time on it, and to load the data into the database instead.

      Thanks!!



      • #4
        Loading into the database first is certainly the easiest solution, especially in Spring Batch, since you could take advantage of its ItemReaders and ItemWriters. You'd hardly have to write any code (assuming you don't have any business logic to apply, which it sounds like you don't).

        However, it depends on the size of the file and where this process fits in your batch solution. If the file is reasonably small (i.e. not 40 gigs), it wouldn't really matter that you've put the data in a database first; even if you're immediately going back to a file, it would probably be fast enough regardless. Even if the file is larger, it depends on how the result is going to be used. If the file is just sorted and then uploaded to another system, you might think about using something like the unix sort command and uploading the result. However, if sorting the file is the first step in a large 'stream' of jobs that work off the same file, you might gain an advantage by loading all the data into the database and working on it completely from there (assuming you're able to rewrite the other jobs as well).

        Sorry to answer your question with 'it depends'. There isn't any file-sorting utility in Spring Batch, or in Java generally, that I'm aware of. I personally like to get data into the database as fast as possible. In my experience supporting batch applications, operations that work with files tend to be the most brittle. The majority of the times I was paged at 3 in the morning, it was because some job dealing with files had an error. Although it still depends on the situation, and at times you have to be pragmatic.



        • #5
          Thanks!

          Hello Lucasward,

          Thanks a lot for your input!! Since we will be doing a lot more with the data that is being read, I believe I will go with reading the file data into the database first and then processing it.

          If it is a huge performance hit, then we might look for other alternatives.

          I completely agree with your point that updating data in the file while we read it will be a very delicate and brittle operation, and it is better not to get called at 3:00 in the morning.

          Thanks again for your help!



          • #6
            Sorting and filtering files

            For what it's worth, I have found that sorting and filtering files can speed up the process even if you choose to stage the data into database tables for processing. I know of one project that used SyncSort - http://www.syncsort.com/products/ss/home.htm - and another that I was on simply used the unix sort facilities. However, I don't believe the unix sort utility will filter records out of the dataset, which would make SyncSort your better option. Regardless, sorting the files before loading them into the database significantly improved performance.

            Wayne



            • #7
              Hi.

              We have been having a similar concern about file sorting, and we have to cope with the fact that the host can sort 8 million records (say, 300 GB) in about a minute.

              Assuming we set out to find an approach to this problem, we are more than certain that we'd have to engage in parallel processing of the file. We're not actually dealing with it yet, though.
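For files that size, one classic technique is an external merge sort: sort fixed-size chunks in memory, spill each sorted chunk to a temporary "run" file, then k-way merge the runs with a priority queue. A minimal plain-Java sketch (natural line ordering is assumed here; a real job would plug in a record-key Comparator and tune the chunk size):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Stream;

public class ExternalSort {

    // Sort one chunk in memory and spill it to a temporary run file.
    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("sort-run-", ".txt");
        run.toFile().deleteOnExit();
        Files.write(run, chunk);
        return run;
    }

    public static void sort(Path in, Path out, int chunkSize) throws IOException {
        // Phase 1: read fixed-size chunks, sort each, spill to temp files.
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>(chunkSize);
        try (Stream<String> lines = Files.lines(in)) {
            Iterator<String> it = lines.iterator();
            while (it.hasNext()) {
                buffer.add(it.next());
                if (buffer.size() == chunkSize) {
                    runs.add(spill(buffer));
                    buffer = new ArrayList<>(chunkSize);
                }
            }
        }
        if (!buffer.isEmpty()) runs.add(spill(buffer));

        // Phase 2: k-way merge of the sorted runs via a priority queue.
        record Head(String line, BufferedReader reader) {}
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparing(Head::line));
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heads.add(new Head(first, r));
            }
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                while (!heads.isEmpty()) {
                    Head h = heads.poll(); // smallest remaining line across all runs
                    w.write(h.line());
                    w.newLine();
                    String next = h.reader().readLine();
                    if (next != null) heads.add(new Head(next, h.reader()));
                }
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }
}
```

Memory use is bounded by one chunk plus one line per run, which is what makes this workable for multi-gigabyte files; the chunk-sorting phase is also the natural place to parallelize.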



              • #8
                Did anyone find a Spring Batch based solution for the sorting? I have a similar sort that needs to be done on the fly to checksum and compare two files.

                Thanks
                Raees



                • #9
                  In my past experience, when we had to sort a file (which we avoided as best we could using a number of techniques), the SyncSort product mentioned above was the best option. Otherwise, the easiest approach in a purely Spring Batch option would be to import it into a database table and then generate a new file.

                  Obviously you could write a tasklet that sorts a file but you'd be responsible for reading the entire file into memory, sorting it, etc.
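The read-it-all-into-memory approach mentioned above is only a few lines of plain Java; the cost is that the whole file must fit in the heap. A sketch (names are hypothetical; in Spring Batch this body could live inside a Tasklet's execute method):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InMemoryFileSort {
    // Naive whole-file sort: reads every line into memory, so it is only
    // suitable when the file comfortably fits in the heap.
    public static void sortFile(Path in, Path out) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(in));
        Collections.sort(lines); // natural order; use a record-key Comparator for column-based sorts
        Files.write(out, lines);
    }
}
```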



                  • #10
                    Originally posted by mminella
                    In my past experience, when we had to sort a file (which we avoided as best we could using a number of techniques), the syncsort product mentioned above was the best option. Otherwise, the easiest approach in a purely spring batch option would be to import it to a database table then generate a new file.

                    Obviously you could write a tasklet that sorts a file but you'd be responsible for reading the entire file into memory, sorting it, etc.
                    I'm trying to get my head around how I might use Spring Batch to replace Informatica, a GUI-centric Extract/Transform/Load tool commonly used to load a data warehouse. In Informatica, a mapping contains one or more sources (similar to ItemReaders) and one or more targets (similar to ItemWriters). In between the source(s) and target(s) are zero or more transformations, similar to ItemProcessors. Common transformations include sorting, aggregating, joining, filtering, lookup, and routing (there are a number of others). I see how an ItemProcessor could be used to implement filtering and routing. A lookup transformation (given a natural key, go find the dimension key) is also pretty straightforward, using a key/value db.

                    I'm struggling with how Spring Batch would model the transformations that have to operate on sets of data, such as sorting and aggregating. The suggested solution is to send the set to the database, do the sort and write the results to a flat file. That sure seems like a lot of network traffic, especially when the sets contain millions of records. Would it be a better use of Spring Batch just to get all of the sources into the DB and then write some stored procedures to do the Transformations and load the target tables? This way, once the data is in the DB, it stays in the DB.

                    Thoughts?



                    • #11
                      While Spring Batch can definitely do the kind of ETL processing that is done in Informatica, the components will not be a one-to-one mapping. There is a key reason for that: Informatica deals with the entire data set at once (it will actually attempt to load the entire dataset into memory if possible), whereas Spring Batch is item based. Sorting is a data-set-focused activity, so it is not going to be very efficient to do the sort itself in Spring Batch.

                      Without knowing what your transformations are, I would say that doing things like sorting via SQL in a database will be a better approach than piping it through Spring Batch and having a processor cache and sort the items (which may not even be possible depending on the amount of data).

