
  • Deadlock updating StepExecution at the end of processing a chunk

    I am seeing deadlocks with DB2 when I run a partitioned job. The job is configured to use retry in the usual way, and I have verified that the retry is working correctly.

    The DeadlockLoserDataAccessException is thrown at the very last step of processing, in TaskletStep$ChunkTransactionCallback.doInTransaction(TransactionStatus), where it is updating the step execution. Throwing at this point causes processing of that partition/step to be terminated in an UNKNOWN state. Since the retry logic is applied within the chunk at the read/process/write stages, there doesn't seem to be any hope for a retry at this level.

    I did see this thread, which deals with handling deadlocks in other contexts, but it is clearly not the same situation.

    I am not seeing the deadlocks when running with Oracle, MySQL or SQL Server as the backend. When the job is partitioned into more parallel threads (or the commit interval is made smaller), the deadlocks happen more frequently.

    I'd be interested to hear if anyone else has seen this issue. I suspect that DB2 configuration might mitigate the problem (e.g., increasing locklist and maxlocks to avoid lock escalation). It does seem to be something of a weak spot in Spring Batch's fault tolerance.
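    For reference, the two DB2 parameters mentioned can be raised from the DB2 command line; a sketch only, where the database name MYDB and both values are placeholders, and the right sizes depend on the workload:

    ```shell
    # Raise lock-list memory and the per-application lock share to make
    # lock escalation (row locks promoted to table locks) less likely.
    # MYDB, 8192, and 60 are illustrative placeholders.
    db2 connect to MYDB
    db2 update db cfg for MYDB using LOCKLIST 8192 MAXLOCKS 60

    # Verify the new values
    db2 get db cfg for MYDB | grep -i -e locklist -e maxlocks
    ```

    Lock escalation would explain row-level reads suddenly conflicting at a coarser granularity, but as discussed below it turned out not to be the root cause here.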

  • #2
    As mentioned in the thread, you need to have retry both at the step/chunk level and a retry interceptor around the jobRepository.

    I think you already have the retry at the step/chunk level; a retry interceptor around the jobRepository would take care of deadlocks outside the step/chunk.

    You will have to reset the StepExecution id if a deadlock happens during the save operation; otherwise you will end up getting 'Flow execution ended unexpectedly'.
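    A minimal XML sketch of such an interceptor, assuming spring-retry's RetryOperationsInterceptor (in older Spring Batch versions the class lives under org.springframework.batch.retry.interceptor instead) and illustrative bean ids:

    ```xml
    <!-- Sketch only: bean ids are illustrative; adjust the package to your version. -->
    <bean id="jobRepositoryRetryAdvice"
          class="org.springframework.retry.interceptor.RetryOperationsInterceptor"/>

    <aop:config>
        <!-- Retry every JobRepository method on transient failures such as deadlocks -->
        <aop:pointcut id="jobRepositoryCalls"
                      expression="execution(* org.springframework.batch.core.repository.JobRepository.*(..))"/>
        <aop:advisor pointcut-ref="jobRepositoryCalls" advice-ref="jobRepositoryRetryAdvice"/>
    </aop:config>
    ```

    This only helps for jobRepository calls made outside an already-active chunk transaction, which is exactly the limitation discussed below.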


    • #3
      I have retry configured at the chunk level (via the step/tasklet/chunk/@retry-policy attribute), and I have also added a retry interceptor around the jobRepository, configured the way described above.

      I don't think that either of these retry cases covers my problem. The DeadlockLoserDataAccessException is being thrown in TaskletStep$ChunkTransactionCallback.doInTransaction, where the StepExecution is being updated. This is at a level higher than where the chunk retries happen. The retry interceptor around the jobRepository does not apply at this point, since a transaction is already active and the call is updating a StepExecution (as described in an earlier thread).

      The thread that the step is running on gets interrupted and the step ends in an UNKNOWN state. I don't see any opportunity to retry the step execution. Am I missing something?


      • #4
        I've managed to figure out why the deadlock is possible. It isn't clear why I'm only seeing this happen with DB2.

        The scenario is that a partitioned job is running. The main thread is polling, looking at the status of the StepExecutions for each of the partitions.

        Periodically, it is calling jobExplorer.getStepExecution(jobExecutionId, stepExecutionId), which includes the following steps:

        [1] Select the StepExecution row from BATCH_STEP_EXECUTION
        [2] Select the associated row from BATCH_STEP_EXECUTION_CONTEXT

        Meanwhile, on a separate thread where one of the partitions is being processed, in TaskletStep$ChunkTransactionCallback.doInTransaction, the following calls are executed immediately before the transaction commits:

        [3] getJobRepository().updateExecutionContext(stepExecution);
        [4] getJobRepository().update(stepExecution);

        The ordering is significant:

        [1] The polling thread gets an S lock on some row in BATCH_STEP_EXECUTION
        [3] The partition thread gets an X lock on some row in BATCH_STEP_EXECUTION_CONTEXT
        [2] The polling thread blocks waiting for an S lock on the row of BATCH_STEP_EXECUTION_CONTEXT locked in [3]
        [4] The partition thread blocks waiting for an X lock on the row of BATCH_STEP_EXECUTION locked in [1]

        Each thread now holds the lock the other needs: a classic lock-ordering deadlock.

        In my case, the solution is to change the polling loop in my main thread so that it does not call JobExplorer.getStepExecution(Long, Long) while one or more partitions are still executing. That is, I use the processing grid's status to determine when all partitions have finished before retrieving the StepExecutions to determine the batch outcome for each partition.
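        The workaround can be sketched as follows; the Grid interface here is a hypothetical stand-in for whatever status API the processing grid exposes, not a Spring Batch type:

        ```java
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        // Sketch of the workaround: poll the grid's own status, and only read the
        // batch metadata tables (via JobExplorer) once every partition is finished.
        public class SafePolling {
            // Hypothetical stand-in for the processing grid's status API
            interface Grid { boolean allPartitionsFinished(); }

            static String awaitPartitions(Grid grid) throws InterruptedException {
                while (!grid.allPartitionsFinished()) {
                    // While workers are still running, their chunk transactions may
                    // hold row locks on BATCH_STEP_EXECUTION(_CONTEXT); don't read
                    // those tables yet.
                    TimeUnit.MILLISECONDS.sleep(10);
                }
                // Safe point: no partition thread is committing chunk transactions,
                // so jobExplorer.getStepExecution(...) cannot join the deadlock cycle.
                return "fetch StepExecutions";
            }

            public static void main(String[] args) throws InterruptedException {
                AtomicInteger polls = new AtomicInteger();
                // Fake grid that reports finished on the third poll
                String next = awaitPartitions(() -> polls.incrementAndGet() >= 3);
                System.out.println(next + " after " + polls.get() + " polls");
            }
        }
        ```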

        This could be made safe in the framework by reordering the update operations in TaskletStep$ChunkTransactionCallback.doInTransaction so that the StepExecution is updated before the StepExecutionContext. That is, make sure that the two resources are locked in the same order on both code paths.
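        The principle behind that reordering can be illustrated with plain JVM locks standing in for the row locks on the two tables; this is an analogy only, not Spring Batch code:

        ```java
        import java.util.concurrent.locks.ReentrantLock;

        // Demonstrates consistent lock ordering: both threads take the
        // BATCH_STEP_EXECUTION "row lock" first, then the
        // BATCH_STEP_EXECUTION_CONTEXT one, so no deadlock cycle can form.
        public class LockOrderDemo {
            static final ReentrantLock stepExecutionRow = new ReentrantLock();
            static final ReentrantLock executionContextRow = new ReentrantLock();

            static void touchInOrder(String name) {
                stepExecutionRow.lock();           // always first
                try {
                    executionContextRow.lock();    // always second
                    try {
                        System.out.println(name + " done");
                    } finally {
                        executionContextRow.unlock();
                    }
                } finally {
                    stepExecutionRow.unlock();
                }
            }

            public static void main(String[] args) throws InterruptedException {
                Thread poller = new Thread(() -> touchInOrder("poller"));
                Thread worker = new Thread(() -> touchInOrder("worker"));
                poller.start();
                worker.start();
                poller.join();
                worker.join();
                System.out.println("no deadlock");
            }
        }
        ```

        With the original opposite-order acquisition, the two threads could each take their first lock and then wait forever on the other's; with a single global order, one thread simply waits briefly and both complete.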


        • #5
          I tested this solution, and it does solve the deadlock problem.