
  • resume the job after power failure

    Hi, is there any method available for us to resume a job after a power failure?

    I found that if I manually hang the server, restart it, and then try to start the job again with the same job and parameters, it says the job is running and I hit a running exception. Does anyone have any idea how to solve this kind of problem? Thanks.

  • #2
    In your case the framework didn't get a chance to update the metadata with 'FAILED' status. You can do that manually and the job will restart happily, but it's up to you to decide whether the data is in a consistent state; the framework can give no correctness guarantees in case of a power failure.



    • #3
      I don't think it is a good idea for us to manually change the status to failed unless we can determine, when we start the JBoss server, that there has been a power failure. I use the scheduler to start the job, and it always checks whether the job is running or not.



      • #4
        Unfortunately, it's the best option we have right now for 1.1. It's something we will be addressing in 2.0, however.



        • #5
          OK. Hoping 2.0 fixes this issue! Thanks for the reply.



          • #6
            The solution for restarting batches after hard stops

            Originally posted by lucasward
            Unfortunately, it's the best option we have right now for 1.1. It's something we will be addressing in 2.0, however.
            We are planning to use Spring Batch as a complement to our large number of COBOL batches. We are aiming to run Java batches in the very same way as we run our COBOL batches, which includes using the same scheduler and the same skilled operators that we have today. There is no chance that we could have the operators manually update the repository tables for a large number of batches in the rare case of a hard stop.

            For us, it's a requirement to have this solved by the framework before we can use it on a larger scale. Therefore, we are very interested in getting to know what the solution will look like in 2.0. Is it possible for you to share information with us on this subject at this point-in-time?

            BTW, what is the plan for releasing 2.0?

            Thanks in advance, Len...



            • #7
              Just an idea!

              Maybe you could verify the persisted data at server startup.
              Any jobs still marked as running in the database would be changed to failed status at startup.
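
The startup check suggested above could look roughly like the sketch below. It is plain Java over an in-memory list: `StartupSweep`, `ExecutionRecord`, and `failOrphans` are hypothetical names for illustration, not Spring Batch classes, and a real version would read and update the BATCH_JOB_EXECUTION rows instead. It also assumes nothing else is running jobs against the same repository while the server starts.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class StartupSweep {
    // Hypothetical stand-in for one job-execution metadata row; not a Spring Batch class.
    static class ExecutionRecord {
        final long id;
        String status;   // e.g. "STARTED", "COMPLETED", "FAILED"
        Instant endTime; // null while the execution is believed to be running

        ExecutionRecord(long id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    // Marks every execution still recorded as STARTED as FAILED and returns the ids it changed.
    // Assumption: at startup no job can genuinely be running, so STARTED means "abandoned".
    static List<Long> failOrphans(List<ExecutionRecord> records, Instant now) {
        List<Long> changed = new ArrayList<>();
        for (ExecutionRecord r : records) {
            if ("STARTED".equals(r.status)) {
                r.status = "FAILED";
                r.endTime = now;
                changed.add(r.id);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<ExecutionRecord> records = new ArrayList<>();
        records.add(new ExecutionRecord(130388L, "STARTED"));
        records.add(new ExecutionRecord(130389L, "COMPLETED"));
        System.out.println("Marked as FAILED: " + failOrphans(records, Instant.now()));
    }
}
```

Whether the data written before the crash is consistent is still something only the application can judge; the sweep only clears the stale status.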



              • #8
                Originally posted by lenhen
                Is it possible for you to share information with us on this subject at this point-in-time?
                2.0 has a JobExplorer interface that allows you to pull out the JobExecution and stop it, then save back to database with the JobRepository. JobOperator is also available as a wrapper for those operations using primitives. It should be easy for you to provide your operators with a UI for carrying out that operation.

                BTW, what is the plan for releasing 2.0?
                The schedule is in JIRA: http://jira.springframework.org/browse/BATCH. We don't anticipate any changes right now, but you never can be sure.



                • #9
                  Where to find JobExplorer documented?

                  Originally posted by Dave Syer
                  2.0 has a JobExplorer interface that allows you to pull out the JobExecution and stop it, then save back to database with the JobRepository. JobOperator is also available as a wrapper for those operations using primitives. It should be easy for you to provide your operators with a UI for carrying out that operation.
                  Thanks for your reply. I downloaded the User's Guide and the only change I could find was that "1.0" was changed to "2.0". And I can't find any information about the JobExplorer interface in the User's Guide. Where can I read about JobExplorer?

                  Furthermore, I don't think the solution is to provide a UI where operators should change values in the JobExecution object. In our current batch environment (COBOL) we have a flag in our own repository that has two values: either the job instance has completed ('OK') or it has not completed ('NC'). The 'NC' status means that the job is either still running or has been abnormally terminated for some reason, for example a power outage. The responsibility then lies with the scheduler to determine whether the job should be rescheduled or not.

                  In Spring Batch the status can be "running", which must be interpreted as meaning either that the job IS running or that it has been abnormally terminated. Someone then has to determine which of the interpretations is correct and, if the job has been abnormally terminated, change the status manually and then reschedule the job.

                  In a large batch environment (we have thousands of batch jobs) all these manual interventions would not be a feasible way to get the batch jobs running again after, for example, a power outage.

                  So, we hope for a change in SpringBatch to make it optional to be able to restart jobs without the need to clear the status manually.

                  /Len...



                  • #10
                    The User Guide is not up to date, but the Javadocs are, and the interfaces for the JobExplorer and JobOperator are quite self-explanatory (I hope).

                    I'm not sure how you expect to achieve a recovery from a power failure without changing the status of the existing JobExecutions. Surely you would need to provide your operators with some tools to signal to the Batch system that there had been an abnormal and catastrophic event (either manually or automatically)? The system on its own can't figure out that the existing executions are not still running - someone has to send a signal to something to say that all those RUNNING status values are not actually valid. I don't think the number of jobs is relevant - after a power failure you would either know that all the existing jobs were running or not. Perhaps you could implement a timeout (of your choosing) - if you haven't heard from a job for x hours then consider it dead. The low level APIs to implement those kinds of features are basically in place in 2.0 (suggestions for tweaks welcome).

                    If you have some concrete suggestions for improvements, features, or use cases that we could implement, time is running out for 2.0, so please tell us what is needed.
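
As a rough illustration of the timeout idea in this post ("if you haven't heard from a job for x hours then consider it dead"), here is a minimal plain-Java sketch. `StaleExecutionCheck` and `isPresumedDead` are hypothetical names, and the "last heard from" timestamp is assumed to be some heartbeat that the application itself records; none of this is Spring Batch API.

```java
import java.time.Duration;
import java.time.Instant;

class StaleExecutionCheck {
    // Returns true if an execution recorded as STARTED has shown no sign of life
    // for longer than the chosen timeout. lastHeardFrom is an assumed heartbeat
    // timestamp maintained by the application, not something the framework promises.
    static boolean isPresumedDead(String status, Instant lastHeardFrom, Instant now, Duration timeout) {
        if (!"STARTED".equals(status)) {
            return false; // only executions still marked as running can be stale
        }
        return Duration.between(lastHeardFrom, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2009-01-11T12:00:00Z");
        Instant lastSeen = Instant.parse("2009-01-11T06:03:00Z");
        // Nearly six hours of silence against a four-hour timeout.
        System.out.println(isPresumedDead("STARTED", lastSeen, now, Duration.ofHours(4)));
    }
}
```

The timeout has to be longer than the longest gap a healthy job can leave between updates, which depends on the application and cannot be picked generically.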



                    • #11
                      Are the steps to update the database to resolve this issue documented? We had a power failure and I have two jobs that are now failing to start with the message "instance is already running: A job execution for this job is already running: JobInstance: id=124110"

                      I found both job instances in the tables but I'm not sure what to do with them to resolve the issue. Should I just delete all rows related to these two job instances, or can I modify the status columns to get the framework to realize they failed and recover?

                      Code:
                      SELECT *
                      FROM BATCH_JOB_INSTANCE BJI
                          LEFT JOIN BATCH_JOB_EXECUTION BJE ON BJI.JOB_INSTANCE_ID = BJE.JOB_INSTANCE_ID
                          LEFT JOIN BATCH_STEP_EXECUTION BSE ON BJE.JOB_EXECUTION_ID = BSE.JOB_EXECUTION_ID
                      WHERE BJI.JOB_INSTANCE_ID IN (124108, 124110)
                      Code:
                      JOB_INSTANCE_ID  VERSION  JOB_NAME           JOB_KEY                                    JOB_EXECUTION_ID  VERSION  JOB_INSTANCE_ID  CREATE_TIME              START_TIME               END_TIME  STATUS   CONTINUABLE  EXIT_CODE  EXIT_MESSAGE  STEP_EXECUTION_ID  VERSION  STEP_NAME                                              JOB_EXECUTION_ID  START_TIME               END_TIME  STATUS   COMMIT_COUNT  ITEM_COUNT  READ_SKIP_COUNT  WRITE_SKIP_COUNT  ROLLBACK_COUNT  CONTINUABLE  EXIT_CODE    EXIT_MESSAGE
                      124108           0        statsPurgingJob    STORED_DATE=Sun Jan 11 06:00:35 CST 2009;  130388            2        124108           2009-01-11 06:01:30.124  2009-01-11 06:01:30.135  <null>    STARTED  N            UNKNOWN    <null>        151300             0        org.jasig.portal.stats.purge.StatsPurgingStep#1eb2c1b  130388            2009-01-11 06:01:32.022  <null>    STARTED  0             0           0                0                 0               Y            CONTINUABLE  <null>
                      124110           0        statsAggregateJob  STORED_DATE=Sun Jan 11 05:51:52 CST 2009;  130390            2        124110           2009-01-11 06:03:00.118  2009-01-11 06:03:00.126  <null>    STARTED  N            UNKNOWN    <null>        151302             0        StatsAggregatingStep                                   130390            2009-01-11 06:03:00.892  <null>    STARTED  0             0           0                0                 0               Y            CONTINUABLE  <null>



                      • #12
                        I'm not sure if this is the appropriate solution but after some trial & error I found the following SQL fixed my problem. I'd still love to get feedback on if this is the correct approach.

                        Code:
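                        -- Mark the abandoned job executions as FAILED so a restart is accepted.
                        -- SYSDATE as written is Oracle syntax.
                        UPDATE BATCH_JOB_EXECUTION BJE
                        SET STATUS='FAILED', EXIT_CODE='FAILED', END_TIME=SYSDATE
                        WHERE BJE.JOB_EXECUTION_ID IN (130388, 130390);

                        -- Do the same for the step executions that belong to those job executions.
                        UPDATE BATCH_STEP_EXECUTION BSE
                        SET STATUS='FAILED', EXIT_CODE='FAILED', END_TIME=SYSDATE
                        WHERE BSE.JOB_EXECUTION_ID IN (130388, 130390);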



                        • #13
                          Originally posted by edalquist
                          I'm not sure if this is the appropriate solution but after some trial & error I found the following SQL fixed my problem. I'd still love to get feedback on if this is the correct approach.

                          Code:
                          UPDATE BATCH_JOB_EXECUTION BJE
                          SET STATUS='FAILED', EXIT_CODE='FAILED', END_TIME=SYSDATE
                          WHERE BJE.JOB_EXECUTION_ID IN (130388, 130390);
                          
                          UPDATE BATCH_STEP_EXECUTION BSE
                          SET STATUS='FAILED', EXIT_CODE='FAILED', END_TIME=SYSDATE
                          WHERE BSE.JOB_EXECUTION_ID IN (130388, 130390);

                          I haven't had a chance to dive into the inner workings of how Spring Batch decides when a job can be restarted, but it does seem that you need to set both the status and the exit code to FAILED for it to accept a new job with the same parameters. For anyone using the 1.x release, I guess you have to write some sort of bootstrap code that examines all jobs with a running status, sets them to failed, pulls out the job and job parameters, and passes them back into the JobLauncher to properly restart them?
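
The bootstrap step described in this post could be sketched as below. This is plain Java with hypothetical names (`BootstrapRestart`, `Abandoned`, `recoverAndRelaunch`), and the launcher is just a callback standing in for whatever actually starts the job, so it shows the order of operations rather than the framework's API. Job parameters are kept as an opaque string purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

class BootstrapRestart {
    // Hypothetical view of one execution row plus the identity needed to relaunch it.
    static class Abandoned {
        final String jobName;
        final String parameters;
        String status;
        String exitCode;

        Abandoned(String jobName, String parameters, String status, String exitCode) {
            this.jobName = jobName;
            this.parameters = parameters;
            this.status = status;
            this.exitCode = exitCode;
        }
    }

    // For each execution still marked STARTED: set both status and exit code to FAILED,
    // then pass the same job name and parameters to the launcher callback.
    // Returns the number of jobs handed to the launcher.
    static int recoverAndRelaunch(List<Abandoned> executions, BiConsumer<String, String> launcher) {
        int relaunched = 0;
        for (Abandoned e : executions) {
            if (!"STARTED".equals(e.status)) {
                continue; // finished executions are left alone
            }
            e.status = "FAILED";
            e.exitCode = "FAILED";
            launcher.accept(e.jobName, e.parameters);
            relaunched++;
        }
        return relaunched;
    }

    public static void main(String[] args) {
        List<Abandoned> executions = new ArrayList<>();
        executions.add(new Abandoned("statsPurgingJob", "STORED_DATE=Sun Jan 11 06:00:35 CST 2009;", "STARTED", "UNKNOWN"));
        int count = recoverAndRelaunch(executions,
                (name, params) -> System.out.println("Relaunching " + name + " with " + params));
        System.out.println("Relaunched: " + count);
    }
}
```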



                          • #14
                            You're basically correct. We have no way of knowing whether a job with a status other than 'FAILED' or 'STOPPED' is still running or not. At some point (I think in 2.0) we added the last_updated column, which gives a little more detail. However, since Spring Batch has no knowledge about the application or what the commit intervals might be, we leave it to the user to examine the executions and set them to failed. If you want to automate this, the new interfaces in 2.0 make it much easier to query for these before starting your job.



                            • #15
                              Property to tell Spring Batch that it shouldn't care about status when restarting

                              Originally posted by Dave Syer
                              If you have some concrete suggestions for improvements, features, or use cases that we could implement, time is running out for 2.0, so please tell us what is needed.
                              We are running our jobs in a mature batch environment that uses a scheduler to run batch jobs in sequences and with dependencies at predefined times. The jobs are handled by a job entry subsystem that has full control over the jobs and knows whether they are running or have been running. The jobs are numbered, and there will never be two or more copies of the same job running at the same time. The output from the jobs is stored in a safe place.

                              As I mentioned earlier in this thread, we have different statuses in our home-grown repository. We have only two: one is Not Completed, which implies that the job is running or that it ended prematurely. The other is Completed, which means that the job has completed(!). So, if the status after a power outage is Not Completed, the job can be restarted without any problems, since we know outside of Spring Batch that the job is not running. On the other hand, if the status is Not Completed we will never restart the job as long as we know that the job is still running, and we know that outside of Spring Batch.

                              Our proposal is that there should be a way to configure Spring Batch to tell it that we have control over the jobs outside of Spring Batch. In that way we could restart the jobs after a power failure or any other hard stop of the job. I.e., we would like a property to tell Spring Batch to not care about the status so that we can restart jobs without having to fiddle with the status flag in the repository - neither by hand nor by using an application programming interface.

                              Any views on this?

                              /Len...
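
To make the proposal above concrete, here is a purely hypothetical sketch of what such a switch might mean for a launch check. `LaunchGuard` and `statusManagedExternally` are invented names for illustration; nothing here is an existing or planned Spring Batch property, and the status strings are simply the ones seen earlier in the thread.

```java
class LaunchGuard {
    // Hypothetical switch: true means an external scheduler guarantees that
    // no second copy of the job is running, so a stale STARTED status is ignored.
    final boolean statusManagedExternally;

    LaunchGuard(boolean statusManagedExternally) {
        this.statusManagedExternally = statusManagedExternally;
    }

    // Decides whether a launch may proceed, given the status of the last
    // execution of the same job instance (null if there was none).
    boolean mayLaunch(String lastStatus) {
        if ("COMPLETED".equals(lastStatus)) {
            return false; // a completed instance is not run again
        }
        if ("STARTED".equals(lastStatus)) {
            return statusManagedExternally; // trust the scheduler only when told to
        }
        return true; // FAILED, STOPPED, or no previous execution
    }

    public static void main(String[] args) {
        System.out.println("default: " + new LaunchGuard(false).mayLaunch("STARTED"));
        System.out.println("externally managed: " + new LaunchGuard(true).mayLaunch("STARTED"));
    }
}
```

With the switch off, the sketch behaves like the situation described in this thread, where a leftover running status blocks the restart.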
