  • RecursiveLeafOnlyDirectoryScanner for vast files

    Hi, I am looking for a file scanner similar to RecursiveLeafOnlyDirectoryScanner that will scan through a huge file list. As per the documentation, this scanner is not suitable for a folder with a vast number of files. What is the alternative to it?

  • #2
    Are you saying you have a big tree of folders, or do you have lots of files in a single directory with a flat structure?
    Can you explain your use case and provide some figures, such as the number of files you expect to see in a directory, the folder structure, etc.?

    • #3
      Just to add some more details: a RecursiveLeafOnlyDirectoryScanner is relevant only when you have a nested directory structure you want to read.
      As far as scanning a large number of files is concerned, even the default implementation, DefaultDirectoryScanner, might cause problems over time, because the AcceptOnceFileListFilter keeps the java.io.File instances it has processed in memory, and that set grows over time. Also, all the files already processed will be reprocessed on restart. You also have the option of providing your own custom implementation of DirectoryScanner. But of course the best approach can only be decided once you share the information asked for above.
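
      For illustration, a minimal sketch of that extension point follows. It assumes Spring Integration's file module, where DefaultDirectoryScanner exposes listEligibleFiles(File) as the method to override (RecursiveLeafOnlyDirectoryScanner is built the same way); the per-scan cap is a made-up strategy, purely to show the hook:

          import java.io.File;
          import java.util.Arrays;

          import org.springframework.integration.file.DefaultDirectoryScanner;

          // Hypothetical scanner that caps how many files a single scan returns,
          // so one poll never pulls an unbounded file list into memory.
          public class CappedDirectoryScanner extends DefaultDirectoryScanner {

              private final int maxFilesPerScan;

              public CappedDirectoryScanner(int maxFilesPerScan) {
                  this.maxFilesPerScan = maxFilesPerScan;
              }

              @Override
              protected File[] listEligibleFiles(File directory) {
                  File[] files = directory.listFiles();
                  if (files == null || files.length <= this.maxFilesPerScan) {
                      return files;
                  }
                  // Return only the first N; later polls pick up the rest.
                  return Arrays.copyOfRange(files, 0, this.maxFilesPerScan);
              }
          }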

      • #4
        OK, my requirement is similar to this. I have to scan through a folder with the structure:

        <date>/<stock_category>/<stock_sector>/<stock_code>.xml

        So I have individual stock data which is added to a folder categorized in the above manner. I have to run a scanner at the date level; if the <date> folder matches the current date, I have to recursively pick up all the files (*.xml) residing under it and process them. Again, I have the requirement not to process already processed files (refer to the thread http://forum.springsource.org/showth...ent-duplicates).

        Now, coming to the load: I will have anywhere around 3000 XML files coming in daily, and the scanner will be continuously scanning for any new (unprocessed) files.

        Please help me with this use case.

        • #5
          A correction to my previous post: the volume is expected to be around 1,000,000 files per day.

          • #6
            Will there be a burst of files, or will they come in over a period of 24 hours?

            In such use cases it is common practice to read a file, process it, and on completion move the file to a processed directory.

            So to fit these adapters to your use case, you will need an inbound-channel-adapter with a service activator downstream; on completion, the service activator sends a message to an outbound-channel-adapter that moves the processed file to an output directory for processed files, with the delete-source-file attribute set to true.
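
            As a rough Java sketch of the two endpoints (the element names above map onto these classes; the directory paths here are hypothetical, and the poller and service activator wiring between them is omitted):

                import java.io.File;

                import org.springframework.context.annotation.Bean;
                import org.springframework.context.annotation.Configuration;
                import org.springframework.integration.file.FileReadingMessageSource;
                import org.springframework.integration.file.FileWritingMessageHandler;

                @Configuration
                public class StockFileFlowConfig {

                    // Backs <file:inbound-channel-adapter>: a poller calls
                    // receive() on this source to pick up new files.
                    @Bean
                    public FileReadingMessageSource inboundFiles() {
                        FileReadingMessageSource source = new FileReadingMessageSource();
                        source.setDirectory(new File("/data/inbound")); // hypothetical path
                        return source;
                    }

                    // Backs <file:outbound-channel-adapter delete-source-file="true">:
                    // writes the processed file out and deletes the original.
                    @Bean
                    public FileWritingMessageHandler processedFileMover() {
                        FileWritingMessageHandler mover =
                                new FileWritingMessageHandler(new File("/data/processed")); // hypothetical path
                        mover.setDeleteSourceFiles(true);
                        return mover;
                    }
                }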

            Now, considering your volumes, the AcceptOnceFileListFilter doesn't seem to be a good fit, as the processed java.io.File instances will be held in memory. You can perhaps provide your own filter implementation, similar to AcceptOnceFileListFilter, but one that holds a record of the processed files in a persistent store rather than in memory.
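
            A sketch of such a filter follows, assuming a JDBC store with a hypothetical processed_files table whose primary key is the file path; FileListFilter is the same interface AcceptOnceFileListFilter implements:

                import java.io.File;
                import java.util.ArrayList;
                import java.util.List;

                import org.springframework.dao.DuplicateKeyException;
                import org.springframework.integration.file.filters.FileListFilter;
                import org.springframework.jdbc.core.JdbcTemplate;

                // Accept-once filter that records processed files in a database
                // instead of holding java.io.File instances in memory.
                public class PersistentAcceptOnceFilter implements FileListFilter<File> {

                    private final JdbcTemplate jdbc;

                    public PersistentAcceptOnceFilter(JdbcTemplate jdbc) {
                        this.jdbc = jdbc;
                    }

                    public List<File> filterFiles(File[] files) {
                        List<File> accepted = new ArrayList<File>();
                        for (File file : files) {
                            try {
                                // The primary key on 'path' makes a second insert for
                                // the same file fail, so duplicates are filtered out.
                                this.jdbc.update("INSERT INTO processed_files (path) VALUES (?)",
                                        file.getAbsolutePath());
                                accepted.add(file);
                            }
                            catch (DuplicateKeyException e) {
                                // Already seen in an earlier poll (or before a restart): skip.
                            }
                        }
                        return accepted;
                    }
                }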

            • #7
              The files will not come all of a sudden; they will be added at short but non-uniform intervals during the daytime (market hours).
              Also, we will be reading the files from a mounted NFS drive, so we don't have the luxury of moving the files to another location, deleting them from the original location, etc.
              Coming back to writing our own PersistFilter: will it be a good option? It would mean checking the persistent store (maybe a database or the file system) before every save, and also updating that store for every processed file. That adds to the processing time for each transaction.
              Let me know if I am going wrong here.

              • #8
                "...we don't have the luxury of moving the files to another location, deleting them from the original location, etc."
                The problem with this is that whenever you list the files in a directory, you get all the java.io.File instances into memory, even those that were already processed (many of which you will not be processing again).

                Also, is it possible to have multiple inbound adapters, one for each stock category (provided you know the categories statically), and schedule them at different intervals? This can effectively partition the total number of files to be processed.
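
                Purely as an illustration of that partitioning idea, a plain-Java stand-in is below; the categories, paths, and intervals are made up, and in a real configuration each directory would sit behind its own inbound adapter with its own poller:

                    import java.io.File;
                    import java.util.concurrent.Executors;
                    import java.util.concurrent.ScheduledExecutorService;
                    import java.util.concurrent.TimeUnit;

                    public class PartitionedPollers {

                        public static void main(String[] args) {
                            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(3);
                            // One poller per statically known category, each on its own interval.
                            String[] categories = {"equities", "bonds", "futures"}; // hypothetical
                            long delaySeconds = 10;
                            for (final String category : categories) {
                                final File dir = new File("/data/inbound/" + category); // hypothetical
                                scheduler.scheduleWithFixedDelay(new Runnable() {
                                    public void run() {
                                        File[] files = dir.listFiles();
                                        int count = (files == null) ? 0 : files.length;
                                        System.out.println(category + ": " + count + " files to consider");
                                    }
                                }, 0, delaySeconds, TimeUnit.SECONDS);
                                delaySeconds += 10; // stagger the intervals across categories
                            }
                        }
                    }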

                "Coming back to writing our own PersistFilter: will it be a good option? It would mean checking the persistent store (maybe a database or the file system) before every save, and also updating that store for every processed file."
                This shouldn't be a problem, keeping in mind that
                a) the number of files is huge and keeping them in memory is not a good option, and
                b) in case of a crash, all the files would otherwise be reprocessed, which is not desirable.

                To keep the data in the table under control, you can perhaps have another batch job that deletes from the table the records for files processed, say, N - 1 days ago or earlier. The store need not be a relational database; it could be a NoSQL solution that gives you fast read performance.
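
                For example, a sketch of such a cleanup job against the same hypothetical processed_files table, assuming it also has a processed_on date column (the exact date arithmetic varies by database):

                    import org.springframework.jdbc.core.JdbcTemplate;
                    import org.springframework.scheduling.annotation.Scheduled;

                    public class ProcessedFilesCleaner {

                        private final JdbcTemplate jdbc;

                        public ProcessedFilesCleaner(JdbcTemplate jdbc) {
                            this.jdbc = jdbc;
                        }

                        // Nightly job: drop records for files processed before today,
                        // keeping the table small enough for fast duplicate checks.
                        @Scheduled(cron = "0 0 2 * * *") // every day at 02:00
                        public void purgeOldEntries() {
                            int removed = this.jdbc.update(
                                    "DELETE FROM processed_files WHERE processed_on < CURRENT_DATE");
                            System.out.println("purged " + removed + " processed-file records");
                        }
                    }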

                "That adds to the processing time for each transaction."
                This looks more like a batch job to me, where the processing time for a file is far higher than the time needed to check the store for whether the file has already been processed, so it should be fine.
