Problems with JobExecution.isRunning ()

  • Problems with JobExecution.isRunning ()

    Since 1.0.0.m4, a new check has been added when re-executing a failed batch.

    The new test lies in org.springframework.batch.core.domain.JobExecution class.
    When you restart a failed batch, it checks whether the JobExecution is running or not.
    I've checked that 1.0.0.m3 didn't do this test.

    Here is the method:

    /**
     * Test if this {@link JobExecution} indicates that it is running. It should
     * be noted that this does not necessarily mean that it has been persisted
     * as such yet.
     * @return true if the end time is null
     */
    public boolean isRunning() {
        return endTime == null;
    }


    What's the issue?

    Consider the following scenario:
    - You start your batch.
    - It creates a new job instance and job execution.
    - You simulate a hardware failure by killing the batch while it runs.
    => The job execution state is still STARTED with NO END DATE, because the batch had no chance to update the DB before it was killed.
    - You restart the same batch.
    => It finds the same JobInstance and JobExecution again, but sees that JobExecution.isRunning() is true, so it returns an error:
    org.springframework.batch.core.repository.JobExecutionAlreadyRunningException: A job execution for this job is already running
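The scenario above can be reproduced in miniature. This is only a sketch under assumptions: `FakeJobExecution` and `restart` are hypothetical stand-ins, not the real Spring Batch classes, and they copy nothing but the `endTime == null` logic quoted earlier.

```java
import java.util.Date;

/**
 * Simplified stand-in for the restart check described in the post.
 * These are NOT the real Spring Batch classes; all names are hypothetical.
 */
public class RestartCheckSketch {

    /** Mimics only the endTime-based isRunning() logic quoted above. */
    static class FakeJobExecution {
        Date startTime;
        Date endTime; // stays null if the process dies before it can be updated

        boolean isRunning() {
            return endTime == null;
        }
    }

    /** Hypothetical restart guard: refuses to start while the last execution looks running. */
    static void restart(FakeJobExecution last) {
        if (last != null && last.isRunning()) {
            throw new IllegalStateException("A job execution for this job is already running");
        }
    }

    public static void main(String[] args) {
        FakeJobExecution killed = new FakeJobExecution();
        killed.startTime = new Date(); // batch started, then the JVM was killed

        try {
            restart(killed);
        } catch (IllegalStateException e) {
            System.out.println("Restart refused: " + e.getMessage());
        }

        killed.endTime = new Date(); // an end time finally gets recorded
        restart(killed);             // no exception this time
        System.out.println("Restart allowed once an end time is set");
    }
}
```

The killed execution and a genuinely running one look identical to this check, which is the whole problem: a null end time cannot distinguish the two.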

    Expected behavior (?)

    My point of view is that a killed batch should be restartable without errors.

    Should I create a JIRA bug ?

    Gérard COLLIN

  • #2
    Feel free to create a Jira issue. You've hit on something we've gone back and forth on for quite some time. There really is no clean way to figure out if a JobExecution truly is running. At one point we talked about holding on to a lock on the database, which would be released if everything crashed, but I don't think we'll dive any further into this until after release 1. For now, if everything dies, it requires a manual change to add an end time, or a new JobInstance. I would love to hear any ideas you have for a more elegant solution, though.



    • #3
      Sorry, I didn't see this statement at the end:

      My point of view is that a killed batch should be restartable without errors.
      I agree, but we also need to be able to keep only one execution per job instance running at any given time. You can run as many different JobInstances as you like simultaneously, but the same JobInstance can only have one execution running at a time. Again, I think this is more an issue of being able to reliably tell when an execution is running than it is about restart in general.



      • #4
        Bug created and proposals

        http://jira.springframework.org/browse/BATCH-453

        Alas, I have no solutions other than these:

        - Two columns in BATCH_STEP_EXECUTION: one flagging running yes/no, the other holding the Oracle (or other database) session ID used.

        When checking whether the execution is running, it checks in Oracle whether that session ID is still alive, because Oracle closes sessions when the client is killed.

        => I'm aware this is a rather ugly DB-specific solution, but we've used it in another application and it works well.

        - While running, the batch creates a temporary file (the kind deleted when the JVM exits) with a carefully designed name.
        When checking whether the execution is running, it can check whether the file exists.

        => This solution only prevents two instances of the same job from running on the same machine, though, not on distinct ones.
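The temporary-file idea could look roughly like the sketch below. Everything here is an assumption for illustration: the class name, the `batch-<jobKey>.lock` naming scheme, and the use of the system temp directory are hypothetical, and `jobKey` is assumed to be safe to use in a file name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical marker-file guard; not part of Spring Batch. */
public class LockFileSketch {

    /** Builds the marker file for a job in the system temp directory (naming scheme is made up). */
    static File lockFileFor(String jobKey) {
        return new File(System.getProperty("java.io.tmpdir"), "batch-" + jobKey + ".lock");
    }

    /** Returns true if the marker was created (we may run), false if it already exists. */
    static boolean tryAcquire(String jobKey) {
        File lock = lockFileFor(jobKey);
        try {
            // createNewFile() is atomic: it returns false if the file already exists.
            boolean created = lock.createNewFile();
            if (created) {
                // Deletion is attempted only on normal JVM termination.
                lock.deleteOnExit();
            }
            return created;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes the marker when the job finishes normally. */
    static void release(String jobKey) {
        lockFileFor(jobKey).delete();
    }

    public static void main(String[] args) {
        String jobKey = "nightlyImport"; // hypothetical job identifier
        System.out.println("first attempt acquired:  " + tryAcquire(jobKey));
        System.out.println("second attempt acquired: " + tryAcquire(jobKey));
        release(jobKey);
    }
}
```

One caveat worth keeping in mind: `File.deleteOnExit()` only takes effect on normal termination of the virtual machine, so a hard kill of the process would likely leave the marker behind, and the stale-state problem would reappear in a different place.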


        Gérard COLLIN



        • #5
          We've discussed the second option before, but it's not something you would really want to do if you were running in an application server. And of course, we can't do something that would be database-specific either. The database lock is similar to the second solution, only using the database. I imagine that is how we'll end up solving it.

