Parallel downloading of files from FTP servers

  • Parallel downloading of files from FTP servers

    Hi,
    I am new to Spring Integration. We currently use commons-vfs to download files from FTP servers, and sometimes we need to download more than 20 files. Right now the downloads happen sequentially, in a single thread. Does Spring Integration provide any feature for downloading them in multiple threads? It would drastically reduce the time taken by our batch jobs.

    regards,
    Ram

  • #2
    If you configure concurrency for the poller on an FtpSource, this should work. Each receive() call blocks until a download has finished and then gives you a Message<File>. There is no reason why this shouldn't work if you call receive() from multiple threads. Please try it out and let me know how it goes. We're currently busy refactoring FtpSource, so your input might come at just the right time.
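
As a rough illustration of the idea of blocking downloads driven from several threads, here is a minimal sketch in plain java.util.concurrent. It does not use the Spring Integration or commons-vfs APIs: the class name, `downloadAll`, and the `download` function are hypothetical stand-ins for whatever actually fetches a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: not part of Spring Integration or commons-vfs.
class ParallelDownloadSketch {

    // Runs `download` for every remote name on a pool of `threads` workers
    // and returns the results in the same order as the input list.
    static <T> List<T> downloadAll(List<String> remoteNames, int threads,
                                   Function<String, T> download) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String name : remoteNames) {
                Callable<T> task = () -> download.apply(name);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that download finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while downloading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("download failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real FTP fetch: it only maps a remote name to a local path.
        List<String> local = downloadAll(List.of("a.csv", "b.csv", "c.csv"), 2,
                name -> "local/" + name);
        System.out.println(local);
    }
}
```

The pool size bounds how many downloads run at once, so for 20 files a pool of 4 or 5 threads may already cut the batch time noticeably without opening 20 connections.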


    • #3
      I have a similar use case, in which about 50 FTP servers are polled in parallel. I read all the required FTP servers from a database and configure them dynamically. So far I have only tested 2 in parallel, which works fine.
      However, every FtpSource processes only 1 file on each poll, so the poll interval has to be short, and a new connection is established for every file poll. (To me this smells like a bug; I expected the source to keep polling until no more messages are processed. I'll open a JIRA entry for this.)

      Regards,
      Maarten


      • #4
        Funny you mention this: my changes (which have not been checked in yet) actually remove the disconnect from poll, so that the connection stays open. We have been discussing using a connection pool of some sort.

        Does it make sense to always close the connection when there are no more messages? This would still cause overhead when polling an empty FTP directory, though.

        I'd love to hear your thoughts on this.


        • #5
          If all remaining messages are polled on every poll interval, the polling interval of an FTP source will, in the usual use case, probably be much longer than it is now (e.g. 1 hour or even more). So the connection should be closed when there are no more messages. (Of course, this could be configurable behaviour.)

          It would be very good to have some control over the maximum number of open FTP connections, e.g. through something like an FTP connection pool. In our use case we may configure about 50 FTP sources that all connect to the same FTP server with different usernames/passwords. At some point a firewall could register this as a denial-of-service attack. Therefore some control over the maximum number of open FTP connections would be nice.
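
One way to picture such a cap, short of a full connection pool, is a counting semaphore shared by all sources that talk to the same server. The sketch below is only an assumption about how this could look: `ConnectionLimiterSketch`, `withConnection` and `runDemo` are hypothetical names, and the "connection" is simulated with a short sleep.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: caps how many "connections" are open at the same time.
class ConnectionLimiterSketch {
    private final Semaphore permits;
    private final AtomicInteger open = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    ConnectionLimiterSketch(int maxOpenConnections) {
        this.permits = new Semaphore(maxOpenConnections);
    }

    // Blocks until a permit is free, runs the work, then frees the permit.
    void withConnection(Runnable work) {
        permits.acquireUninterruptibly();
        int now = open.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // record the high-water mark
        try {
            work.run();
        } finally {
            open.decrementAndGet();
            permits.release();
        }
    }

    int peakOpenConnections() {
        return peak.get();
    }

    // Simulates `sources` pollers hitting one server; returns the peak observed.
    static int runDemo(int sources, int maxOpenConnections) {
        ConnectionLimiterSketch limiter = new ConnectionLimiterSketch(maxOpenConnections);
        ExecutorService pool = Executors.newFixedThreadPool(sources);
        for (int i = 0; i < sources; i++) {
            pool.execute(() -> limiter.withConnection(() -> sleepQuietly(5)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return limiter.peakOpenConnections();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 50 sources, at most 5 simultaneous connections to the same server.
        System.out.println("peak open connections: " + runDemo(50, 5));
    }
}
```

With 50 sources and a limit of 5, the reported peak never exceeds 5, however many pollers fire at the same moment.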


          • #6
            Originally posted by mdond:
            "If all remaining messages are polled on every poll interval, the polling interval of an FTP source will, in the usual use case, probably be much longer than it is now (e.g. 1 hour or even more). So the connection should be closed when there are no more messages. (Of course, this could be configurable behaviour.)"

            I've modified FtpSource to extend AbstractDirectorySource<List<File>>, meaning that it will return messages that contain lists of files. You can configure the batch size by setting maxFilesPerPayload on the FtpSource; the default is -1 (take all).

            There are some concerns with multithreading, though.

            "It would be very good to have some control over the maximum number of open FTP connections, e.g. through something like an FTP connection pool. In our use case we may configure about 50 FTP sources that all connect to the same FTP server with different usernames/passwords. At some point a firewall could register this as a denial-of-service attack. Therefore some control over the maximum number of open FTP connections would be nice."

            Currently, multithreaded downloading doesn't work properly yet, but you can already try it out if you check out the head and build it yourself.
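
To make the batching rule concrete (a positive maxFilesPerPayload caps the batch, -1 takes everything), here is a small stand-alone sketch. It is not the FtpSource code: `BatchingSketch` and `nextPayload` are hypothetical, and plain strings stand in for File objects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of payload batching; not the actual FtpSource code.
class BatchingSketch {

    // Removes up to maxFilesPerPayload entries from `pending` and returns them.
    // A negative value (the -1 default) means "take all".
    static List<String> nextPayload(Queue<String> pending, int maxFilesPerPayload) {
        int limit = maxFilesPerPayload < 0
                ? pending.size()
                : Math.min(maxFilesPerPayload, pending.size());
        List<String> payload = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            payload.add(pending.remove());
        }
        return payload;
    }

    public static void main(String[] args) {
        Queue<String> pending = new ArrayDeque<>(List.of("f1", "f2", "f3", "f4", "f5"));
        System.out.println(nextPayload(pending, 2));  // first two files
        System.out.println(nextPayload(pending, -1)); // everything that is left
    }
}
```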


            • #7
              "There are some concerns with multithreading, though."
              I didn't notice any multithreading problems with FtpSource. Do the concerns apply only to the new List<File> version, or also to FtpSource in 1.0M5?

              I'll have a look at the new version.


              • #8
                The multithreading problem now is that if you call receive() before the onSend() belonging to the previous receive() (if that makes sense), you'll get the same list of messages. I don't have an integration test that exposes this in real life, but it is theoretically possible from the FtpSource's point of view. You can see this if you move the onSend call in FtpSourceTests.retrieveMaxFilesPerPayload(). That is, if I have committed that in the meantime.
                Last edited by iwein; Jul 30th, 2008, 11:07 AM. Reason: premature posting


                • #9
                  "The multithreading problem now is that if you call receive before onSend relating to the previous receive (if that makes sense) you'll get the same list of messages."
                  It doesn't make sense to poll again before the last poll has finished, does it? So this shouldn't happen.

                  Can it happen at the moment? SourceEndpoint.poll() isn't synchronized, but it probably doesn't have to be: presumably one thread is dedicated to polling the SourceEndpoint and won't poll again until the last poll has finished. I hope it is implemented this way?

                  I don't think it makes sense to solve this in FtpSource or SourceEndpoint. If the polling interval is too short for a poll to complete and more than one thread starts polling, the result is a growing number of threads waiting for each other, which ends in disaster anyway.
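
The "never poll again before the last poll has finished" behaviour can be sketched with a simple guard that skips an overlapping poll instead of letting threads queue up. This is an illustration only, not how SourceEndpoint is implemented; `NonOverlappingPollSketch` and `pollIfIdle` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard: an overlapping poll is skipped rather than queued.
class NonOverlappingPollSketch {
    private final AtomicBoolean polling = new AtomicBoolean(false);

    // Runs `poll` only if no other poll is in progress.
    // Returns true if the poll ran, false if it was skipped.
    boolean pollIfIdle(Runnable poll) {
        if (!polling.compareAndSet(false, true)) {
            return false; // a previous poll has not finished yet
        }
        try {
            poll.run();
            return true;
        } finally {
            polling.set(false);
        }
    }

    public static void main(String[] args) {
        NonOverlappingPollSketch guard = new NonOverlappingPollSketch();
        boolean ran = guard.pollIfIdle(() -> {
            // An attempt to poll while the outer poll is still running is skipped.
            boolean nested = guard.pollIfIdle(() -> { });
            System.out.println("nested poll ran: " + nested);
        });
        System.out.println("outer poll ran: " + ran);
    }
}
```

Because a skipped poll returns immediately, a polling interval that is too short costs some wasted wake-ups but does not pile up waiting threads.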


                  • #10
                    Currently the source is intended to be used from a single thread. Since the poller is going to support concurrent scenarios, it is theoretically possible to hook up concurrent workers to the same source. This might be the desired behaviour if there are many small files in the remote directory and you want to download them in parallel to optimize network usage.

                    So we can't keep it single-threaded in general; we need to make the receive method thread-safe somehow. Reading your comment, though, I'm wondering whether just synchronising receive() would be good enough for now.


                    • #11
                      "Reading your comment though I'm wondering if maybe just synchronising receive() would be good enough for now?"
                      This would serialize the downloading of the many small files, so it would behave like downloading with a single thread. The concurrent poller scenario for many small files then wouldn't optimize network usage; it would only require more threads without any benefit. Is this concurrent poller scenario really useful? In most cases different FtpSources probably have to be handled, and these should be polled concurrently, but I don't see a real case for polling the same FtpSource concurrently. I am not sure whether this was what ramkris intended.
                      Last edited by mdond; Jul 31st, 2008, 08:04 AM.


                      • #12
                        Well, the concurrent poller scenario applies to all sources; it has its uses in other areas. I fully agree that if I synchronize receive() there will be no benefit in concurrent polling on the source side. I interpreted ramkris' post as a request for downloading in multiple threads within the same source.

                        @ramkris, are you still following this? (Now would be a good time to nudge things your way.)


                        • #13
                          However, it wouldn't be enough to synchronise receive(). It's the combination of receive() and onSend() that must be synchronised.


                          • #14
                            You're absolutely right, we need a monitor of some sort.
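
A minimal sketch of what such a monitor could look like, assuming a simplified model in which receive() hands out files from a directory listing and onSend() marks them as done. The class below is hypothetical and does not reproduce the real FtpSource signatures; the point is only that both methods share one lock and that files handed out but not yet sent are tracked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a source whose receive() and onSend() share one monitor.
class GuardedSourceSketch {
    private final Object monitor = new Object();
    private final List<String> remoteListing; // stand-in for the remote directory
    private final Set<String> inFlight = new HashSet<>(); // received, not yet sent
    private final Set<String> sent = new HashSet<>();     // fully processed

    GuardedSourceSketch(List<String> remoteListing) {
        this.remoteListing = new ArrayList<>(remoteListing);
    }

    // Hands out up to maxFiles files that nobody else holds (negative = all).
    List<String> receive(int maxFiles) {
        synchronized (monitor) {
            List<String> batch = new ArrayList<>();
            for (String file : remoteListing) {
                if (maxFiles >= 0 && batch.size() >= maxFiles) {
                    break;
                }
                if (!sent.contains(file) && !inFlight.contains(file)) {
                    batch.add(file);
                }
            }
            inFlight.addAll(batch);
            return batch;
        }
    }

    // Marks a batch as processed so that it is never handed out again.
    void onSend(List<String> batch) {
        synchronized (monitor) {
            inFlight.removeAll(batch);
            sent.addAll(batch);
        }
    }

    public static void main(String[] args) {
        GuardedSourceSketch source = new GuardedSourceSketch(List.of("a", "b", "c", "d"));
        List<String> first = source.receive(2);
        List<String> second = source.receive(2); // called before onSend(first)
        System.out.println(first + " " + second);
        source.onSend(first);
        source.onSend(second);
    }
}
```

Only the bookkeeping happens under the lock. In this model the actual download would run between receive() and onSend(), outside the monitor, so concurrent workers on the same source would still transfer files in parallel.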
