Spring Batch: Design question

  • Spring Batch: Design question

    Hello there! We are starting a new project, and I'm considering Spring Batch as the core of our architecture.
    I've just read the docs, and now have a basic understanding of it. There are some design questions that I would like to discuss here, and get some feedback from the community.

    Our main bottleneck is network latency. We need to crawl data from several thousand sites, and we have a time window (10 min) to do that. Today, doing this in parallel across several machines would not be a problem: we could spawn 50-100 threads on each machine and get the results in a couple of minutes.

    The work that comes afterwards is what drove us toward Spring Batch. We need consistency, to make sure that each job (a crawl task) has been executed. If it has failed, it must be retried a few times before we decide to log it to an audit location to be looked at. All this seems to be provided by Spring Batch.

    Well, given this scenario, my question is about the best job flow for this. Should I write an ItemReader that spawns the many threads, or create parallel jobs (one for each site)? Partitioning would also be a must, since our time window is short. But I'll deal with that later and just start with the basics right now.

    Any suggestions for this?


  • #2
    I don't think parallel jobs would get you much benefit; a single job should work. That way partitioning can help you scale across multiple nodes in a cluster. Your main problem will be writing a stateful reader that can be used by multiple threads in a single step. It might work just fine if the reader gets the site details from a shared database and marks each one as processed as it finishes the crawl (without a transactional datastore somewhere, it is hard to apply the process indicator pattern, which is quite convenient for multithreaded, stateful, restartable steps). It would be interesting to hear more detail about the use case.
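
    To make the process indicator idea above more concrete, here is a minimal sketch in plain Java. It is not Spring Batch code: `SiteTable`, `claimNext`, `markProcessed`, and `ProcessIndicatorDemo` are hypothetical names, and an in-memory map stands in for the shared transactional database. The part that matters is that each worker thread atomically claims a pending site and flags it as processed once the crawl finishes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a shared site table; a real setup would use a
// transactional database so the status flag survives a restart.
class SiteTable {
    enum Status { PENDING, IN_PROGRESS, PROCESSED }

    private final Map<String, Status> rows = new ConcurrentHashMap<>();

    SiteTable(List<String> sites) {
        for (String site : sites) {
            rows.put(site, Status.PENDING);
        }
    }

    // Atomically claim one pending site. Returns null when nothing is left,
    // which is also how an ItemReader signals the end of its input.
    String claimNext() {
        for (String site : rows.keySet()) {
            // replace(key, expected, new) succeeds for exactly one thread.
            if (rows.replace(site, Status.PENDING, Status.IN_PROGRESS)) {
                return site;
            }
        }
        return null;
    }

    // The process indicator: flag the row once its crawl has finished.
    void markProcessed(String site) {
        rows.put(site, Status.PROCESSED);
    }

    long count(Status status) {
        return rows.values().stream().filter(s -> s == status).count();
    }
}

class ProcessIndicatorDemo {
    // Drains the table with a fixed number of worker threads and returns
    // the sites that were crawled.
    static List<String> crawlAll(SiteTable table, int threads) {
        List<String> crawled = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                String site;
                while ((site = table.claimNext()) != null) {
                    crawled.add(site);          // placeholder for the real crawl
                    table.markProcessed(site);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return crawled;
    }

    public static void main(String[] args) {
        SiteTable table = new SiteTable(Arrays.asList("site-a", "site-b", "site-c"));
        List<String> crawled = crawlAll(table, 2);
        System.out.println("crawled=" + crawled.size()
                + " processed=" + table.count(SiteTable.Status.PROCESSED));
    }
}
```

    On a restart, rows still flagged as in progress would need to be reset to pending (an assumption of this sketch), so that an interrupted crawl is picked up again instead of being skipped.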