
  • Batch High Availability and Failover

    I would like to know what mechanisms the Spring Batch framework provides to support high availability and failover for batch jobs. For example, most J2EE servers provide high availability and failover via clusters, load balancers, etc. The Quartz framework can be configured to provide H/A and failover for scheduled jobs via a database. So how can Spring Batch jobs be configured to provide failover throughout the job lifecycle (scheduling, execution, etc.)?

  • #2
    "High Availability" and "Failover" are concepts more relevant to the data that a job might be using, rather than the job itself. There are of course many mechanisms to ensure those features in a clustered environment (e.g. JEE application servers, data grids, hardware configurations).

    In a batch job the relevant concepts are re-startability (can a failed job be restarted and pick up where it left off?) and idempotence (can a job be re-run and not need to pick up where it left off?). We provide various mechanisms to help with restartability. And idempotence can be ensured by using common patterns in the data sources (process indicator pattern - see user reference guide for some more description, or the TSE2007 batch presentation slides). Idempotence is called "re-runnability" in one of the TSE2007 slides (it's a term used internally by some folks at Accenture).
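As a rough illustration of the process indicator pattern mentioned above, here is a minimal plain-Java sketch. It does not use any Spring Batch API: the `Row` class, the in-memory list standing in for a database table, and the `failAtId` parameter are all assumptions made up for the example. In a real data source the flag update would normally be committed in the same transaction as the business work.

```java
import java.util.ArrayList;
import java.util.List;

class ProcessIndicatorSketch {
    // Hypothetical input row: an id plus a "processed" flag (the process indicator).
    static final class Row {
        final int id;
        boolean processed;
        Row(int id) { this.id = id; }
    }

    // Handles every row that is not yet flagged and flags it afterwards.
    // Throws when it reaches failAtId, to simulate the lights going out mid-run.
    // Returns the ids handled during this run.
    static List<Integer> run(List<Row> table, int failAtId) {
        List<Integer> handled = new ArrayList<>();
        for (Row row : table) {
            if (row.processed) {
                continue; // already done in an earlier run, so skip it
            }
            if (row.id == failAtId) {
                throw new IllegalStateException("simulated failure at row " + row.id);
            }
            handled.add(row.id);  // stand-in for the real business work
            row.processed = true; // mark the row so a re-run will not repeat it
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            table.add(new Row(i));
        }
        try {
            run(table, 3); // first run fails part-way through
        } catch (IllegalStateException e) {
            System.out.println("first run failed: " + e.getMessage());
        }
        // Simply running the job again picks up only the unflagged rows.
        System.out.println("second run handled: " + run(table, -1));
    }
}
```

Because the flag lives with the data, the job needs no memory of its own: running it again after a failure touches only the rows that were never marked.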

    If the lights go out on a job, and it is restartable or idempotent, all you need to do is run it again. So your reliability comes from the scheduling tool (we don't do scheduling or triggering in Spring Batch).
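To make the "run it again" idea concrete, here is a small sketch of what an external scheduling or triggering tool might do. It is not part of Spring Batch; the class name, the `runWithRetries` method, and the use of a `Callable<Boolean>` to represent a restartable or idempotent job are assumptions for illustration only.

```java
import java.util.concurrent.Callable;

class RerunUntilDone {
    // Calls the job until it reports success or maxAttempts is used up.
    // Returns the attempt number that succeeded, or -1 if none did.
    // This is only safe when the job is restartable or idempotent.
    static int runWithRetries(Callable<Boolean> job, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (job.call()) {
                    return attempt;
                }
            } catch (Exception e) {
                // The job failed; fall through and run it again.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical job that fails twice and then completes.
        Callable<Boolean> flakyJob = () -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("simulated outage");
            }
            return true;
        };
        System.out.println("succeeded on attempt " + runWithRetries(flakyJob, 5));
    }
}
```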


    • #3
      It's late to reply, but there is a way to implement high availability for a batch job. In my view, HA for batch means that when the primary is down, a backup automatically handles the next job. What if the backup fails while the primary is also down? You can run multiple backups, or use a watchdog that rolls back the work if it is transactional. Alternatively, implement an auditing process that tracks where a job started and where it failed, and report failures through logs or an email notification. It still depends on how the application is architected. I don't rely on a JEE container to handle clustering, especially when running low-latency code that doesn't require one.
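The primary/backup idea in this reply could be sketched roughly as follows. This is only one possible reading of it, not a tested design: the `Node` class, the `runOnFirstAvailable` method, and the in-memory audit list (standing in for real logs or an email notification) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FailoverSketch {
    // Hypothetical description of a machine that can run the job.
    static final class Node {
        final String name;
        final boolean up;
        Node(String name, boolean up) { this.name = name; this.up = up; }
    }

    // Tries the nodes in order (primary first, then backups) and records
    // every decision in the audit trail. Returns the name of the node that
    // took the job, or null when no node was available.
    static String runOnFirstAvailable(List<Node> nodes, List<String> audit) {
        for (Node node : nodes) {
            if (node.up) {
                audit.add("job started on " + node.name);
                return node.name;
            }
            audit.add(node.name + " is down, trying next node");
        }
        audit.add("no node available, notify operator");
        return null;
    }

    public static void main(String[] args) {
        List<String> audit = new ArrayList<>();
        List<Node> nodes = Arrays.asList(
                new Node("primary", false),
                new Node("backup-1", true),
                new Node("backup-2", true));
        System.out.println("job ran on: " + runOnFirstAvailable(nodes, audit));
        audit.forEach(System.out::println);
    }
}
```

A real setup would also need a health check for `up` and a shared record of job state so that the backup knows where the primary stopped.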