  • Large File Upload using Hadoop + Spring MVC?

    I've been using Spring 3 MVC to build a web application. One of the features of this web application is uploading a large file. The problem is that the files are very big - up to 8 GB!
    Can I use Hadoop with Spring 3? What I need to do is:
    1) upload a file (or files, or a directory of files) of up to 8 GB using Hadoop
    2) store the uploaded files on our local file server


    FYI, our current system uses the open-source Valums Ajax file uploader, but it has many issues, especially performance problems and the case where the network is disconnected while uploading a large file. So I'm surveying other options to handle these issues:
    1) using another open-source application for the file upload
    2) using Spring Batch?
    3) Hadoop?
    4) implementing it in Java from scratch, which would take too much time and effort

    Any suggestions?

  • #2
    You seem to have multiple issues here. Regarding the upload, I think you first need to decide on a strategy, since 8 GB does take some time even on a fast connection. Using a background task (with resume/retry) seems like a good solution. Potentially you can try chunking the file into smaller pieces so it's easy to pick things up if something fails.
    This is not something that Spring for Hadoop addresses - potentially you can use Spring Batch. Note, however, that these are frameworks - toolkits that you have to build on to accomplish your goals.
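
    To illustrate the chunking idea, here is a minimal sketch of a Spring 3 MVC controller that accepts file pieces at a byte offset and reports how many bytes have already arrived, so a client can resume after a disconnect. The endpoint paths, parameter names, and upload directory are made up for the example, and a real version would have to validate the file name:

    import java.io.File;
    import java.io.RandomAccessFile;

    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.ResponseBody;
    import org.springframework.web.multipart.MultipartFile;

    @Controller
    public class ChunkUploadController {

        private static final File UPLOAD_DIR = new File("/data/uploads"); // hypothetical location

        // The client splits the file and POSTs each piece with its byte offset,
        // so an interrupted transfer can continue from the last confirmed chunk.
        @RequestMapping(value = "/upload/chunk", method = RequestMethod.POST)
        @ResponseBody
        public String uploadChunk(@RequestParam("file") MultipartFile chunk,
                                  @RequestParam("fileName") String fileName,
                                  @RequestParam("offset") long offset) throws Exception {
            RandomAccessFile raf = new RandomAccessFile(new File(UPLOAD_DIR, fileName), "rw");
            try {
                raf.seek(offset);              // position at this piece's offset
                raf.write(chunk.getBytes());   // write the piece in place
            } finally {
                raf.close();
            }
            return "ok";
        }

        // Before resuming, the client asks how much of the file already arrived.
        @RequestMapping(value = "/upload/status", method = RequestMethod.GET)
        @ResponseBody
        public String uploadedBytes(@RequestParam("fileName") String fileName) {
            File target = new File(UPLOAD_DIR, fileName);
            return String.valueOf(target.exists() ? target.length() : 0);
        }
    }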

    You could try Hadoop as a store, that is, use HDFS to store and read the data. If that's the case, then Spring for Apache Hadoop gives you plenty of access and flexibility (through either the fs shell or the fs API, whether you want to use Java or other JVM alternatives).
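
    For example, a minimal sketch of copying a local file into HDFS through the plain Hadoop FileSystem API - the namenode address and both paths are made up and depend entirely on your cluster setup:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsUploadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode URI; use your cluster's actual address.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);
            InputStream in = new FileInputStream("/data/incoming/big-file.bin");
            OutputStream out = fs.create(new Path("/uploads/big-file.bin"));
            try {
                // Streams the file in 4 KB buffers; HDFS handles block replication.
                IOUtils.copyBytes(in, out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
                fs.close();
            }
        }
    }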

    Hope this helps,



    • #3
      Thanks for the fast reply.
      I'm the only software developer in my group and there's no one else to discuss this with,
      so I have to decide everything myself. Your opinion is really a ton of help.

      Could you clarify a little bit more? One of the requirements of my project is as follows:
      when you upload/download a large file, the network might be disconnected, or the browser
      might be closed intentionally or by accident. In that case, I need to be able to resume/retry
      the transfer next time, picking up where it left off.

      So your opinion is that Spring Batch might meet this requirement,
      since (according to my brief survey of it) Spring Batch basically chunks the file into small pieces?

      Actually, what confuses me about Hadoop is whether an uploaded file/data is stored
      on a server outside my company (like Amazon, Google, or some other provider)
      or whether it can be stored on one of the servers inside my company.
      I googled it, but most Hadoop documents only explain that you can upload data/files to a server
      and then have multiple servers analyze the massive data in parallel.

      So is Hadoop HDFS only good for fast data analysis,
      and not for fast, reliable transfer of a large file to a particular server on one of my machines?

      The key point for my project is speed - fast, reliable upload/download.

      Thanks



      • #4
        Sounds like you need to do various PoCs to figure out what works and what doesn't for your case. Spring Batch is not focused on file uploads or anything like that - it provides a generic framework for batch operations (upload being one of them). Check out the site and the samples to get a better understanding of it. I'm not saying it doesn't work, rather that you don't just download it and you're done - you still need to figure out how to handle the upload and the chunking strategy and put that into Spring Batch.
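
        To make that concrete, here is a minimal sketch of a chunk-oriented Spring Batch job that copies a file in restartable chunks. It assumes Spring Batch's Java configuration (2.2 or later); the file paths and bean names are hypothetical, and the line-based reader/writer is only to keep the example short:

        import org.springframework.batch.core.Job;
        import org.springframework.batch.core.Step;
        import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
        import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
        import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
        import org.springframework.batch.item.file.FlatFileItemReader;
        import org.springframework.batch.item.file.FlatFileItemWriter;
        import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
        import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
        import org.springframework.beans.factory.annotation.Autowired;
        import org.springframework.context.annotation.Bean;
        import org.springframework.context.annotation.Configuration;
        import org.springframework.core.io.FileSystemResource;

        @Configuration
        @EnableBatchProcessing
        public class FileCopyJobConfig {

            @Autowired private JobBuilderFactory jobs;
            @Autowired private StepBuilderFactory steps;

            // The reader saves its position in the ExecutionContext, which is
            // what makes a failed run restartable from the last committed chunk.
            @Bean
            public FlatFileItemReader<String> reader() {
                FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
                reader.setResource(new FileSystemResource("/data/incoming/big-file.txt")); // hypothetical
                reader.setLineMapper(new PassThroughLineMapper());
                return reader;
            }

            @Bean
            public FlatFileItemWriter<String> writer() {
                FlatFileItemWriter<String> writer = new FlatFileItemWriter<String>();
                writer.setResource(new FileSystemResource("/data/store/big-file.txt")); // hypothetical
                writer.setLineAggregator(new PassThroughLineAggregator<String>());
                return writer;
            }

            // Chunk-oriented step: 1000 items are read and written per transaction.
            @Bean
            public Step copyStep() {
                return steps.get("copyStep")
                        .<String, String>chunk(1000)
                        .reader(reader())
                        .writer(writer())
                        .build();
            }

            @Bean
            public Job copyJob() {
                return jobs.get("copyJob").start(copyStep()).build();
            }
        }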

        If you don't know what Hadoop is, then it's probably best to postpone that discovery until you have the upload part done. In a nutshell, Hadoop provides the infrastructure for setting up a cluster of machines to store data and to search the data stored. The two go hand in hand: in order to store a lot of data you need a lot of machines, and in order to look at that data you need a lot of processing power. Whether you use Hadoop just for storage or not is up to you.
