
  • Massive memory use with FlowJob, repeating Steps

    We've set up a Spring Batch email-sender job that does the following:
    1. Read incoming email details from a JMS Queue.
    2. Don't do any processing, just pass them through...
    3. Turn the details into emails and "write" them to SMTP
    4. Repeat ad infinitum

    This generally works fine, but we've discovered a massive memory leak. After 9 hours or so, we can see a single FlowJob, but 748 JobExecutions (all 'STARTED'), each holding 778 (!) StepExecution instances. All in all, 900 MB of stuff.

    Here's the Spring config (Spring 3.1, Spring Batch 2.1.9):
        <!-- (some class attributes were lost from the original post) -->
        <bean id="jobRepository"
            class="" />

        <bean id="jobLauncher" class="">
            <property name="jobRepository" ref="jobRepository" />
            <property name="taskExecutor">
                <bean class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
            </property>
        </bean>

        <bean id="jmsEmailFetcher" class="org.springframework.batch.item.jms.JmsItemReader">
            <property name="jmsTemplate" ref="batchEmailJmsTemplate" />
        </bean>

        <bean id="passthroughProcessor" class="" />

        <bean id="transactionManager"
            class="" />

        <!-- The Spring Batch *Limiter/Decider* -->
        <bean id="ourLimitDecider" class="our.special.InfiniteThrottledExecutionDecider" />

        <!-- The Spring Batch *Job* -->
        <batch:job id="fetchEmailsJobBatch" restartable="true">
            <batch:step id="fetchEmailsStep" next="limitDecision">
                <batch:tasklet throttle-limit="10">
                    <batch:chunk reader="jmsEmailFetcher" processor="passthroughProcessor"
                                 writer="batchEmailService.javaMailItemWriter" commit-interval="100" skip-limit="999999999">  <!-- Might get *lots* of MQ-not-up-yet fails -->
                        <batch:skippable-exception-classes>
                            <batch:include class="org.springframework.jms.JmsException" />       <!-- Generally MQ-not-up-yet ConnectionException, or Session-closed (in tests) -->
                            <batch:include class="java.lang.IllegalStateException" />            <!-- Yuk, usually/presumably a test SMTP server isn't yet connected -->
                            <batch:include class="org.springframework.mail.MailSendException" /> <!-- SMTP down... -->
                        </batch:skippable-exception-classes>
                    </batch:chunk>
                </batch:tasklet>
            </batch:step>
            <batch:decision id="limitDecision" decider="ourLimitDecider">
                <batch:next on="CONTINUE" to="fetchEmailsStep" />
                <batch:end on="COMPLETED" />
            </batch:decision>
        </batch:job>
    Our InfiniteThrottledExecutionDecider basically returns new FlowExecutionStatus("CONTINUE") every time, to ensure that the fetchEmailsStep gets executed at the end of the flow, and that the Job never completes - at least not until we're ready to stop it ourselves.
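    For reference, a minimal sketch of what that decider presumably looks like (the class body here is an assumption; the thread only states the always-"CONTINUE" behaviour):

```java
package our.special;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

// Hedged sketch: always answer "CONTINUE" so the flow loops back into
// fetchEmailsStep and the job never reaches a terminal state on its own.
public class InfiniteThrottledExecutionDecider implements JobExecutionDecider {

    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // A real shutdown/throttle condition would return
        // FlowExecutionStatus.COMPLETED here; until then, keep looping.
        return new FlowExecutionStatus("CONTINUE");
    }
}
```

    (Requires spring-batch-core on the classpath; shown only to make the looping mechanism concrete.)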

    We're not using a database, so we expect some stuff to be held in memory, but not a complete record of everything that's ever been run...

    Is there something wrong with our configuration? Or our approach?

    * * * * *

    Here's a bit from our log file for what we're told is Step #778 and what should be the one and only Job instance.
        23:58:18,782 - INFO  (org.springframework.batch.core.job.SimpleStepHandler) - Duplicate step [fetchEmailsStep] detected in execution of job=[fetchEmailsJobBatch]. If either step fails, both will be executed again on restart.
        23:59:52,257 - INFO  (org.springframework.batch.core.job.SimpleStepHandler) - Executing step: [fetchEmailsStep]
        23:59:52,257 - DEBUG (org.springframework.batch.core.step.AbstractStep) - Executing: id=778
        23:59:52,259 - DEBUG ( - Starting repeat context.
        23:59:52,259 - DEBUG ( - Repeat operation about to start at count=1
        23:59:52,259 - DEBUG (org.springframework.batch.core.scope.context.StepContextRepeatCallback) - Preparing chunk execution for StepContext: [email protected]
        23:59:52,259 - DEBUG (org.springframework.batch.core.scope.context.StepContextRepeatCallback) - Chunk execution starting: queue size=0
        23:59:52,259 - DEBUG ( - Starting repeat context.
        23:59:52,259 - DEBUG ( - Repeat operation about to start at count=1
        ... 5 second JMS timeout ...
        23:59:57,716 - DEBUG ( - Repeat is complete according to policy and result value.
        23:59:57,716 - DEBUG (org.springframework.batch.core.step.item.ChunkOrientedTasklet) - Inputs not busy, ended: true
        23:59:57,716 - DEBUG (org.springframework.batch.core.step.tasklet.TaskletStep) - Applying contribution: [StepContribution: read=0, written=0, filtered=0, readSkips=0, writeSkips=0, processSkips=0, exitStatus=EXECUTING]
        23:59:57,719 - DEBUG (org.springframework.batch.core.step.tasklet.TaskletStep) - Saving step execution before commit: StepExecution: id=778, version=1, name=fetchEmailsStep, status=STARTED, exitStatus=EXECUTING, readCount=0, filterCount=0, writeCount=0 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0, exitDescription=
        23:59:57,721 - DEBUG ( - Repeat is complete according to policy and result value.
        23:59:57,721 - DEBUG (org.springframework.batch.core.step.AbstractStep) - Step execution success: id=778
        23:59:57,722 - DEBUG (org.springframework.batch.core.step.AbstractStep) - Step execution complete: StepExecution: id=778, version=3, name=fetchEmailsStep, status=COMPLETED, exitStatus=COMPLETED, readCount=0, filterCount=0, writeCount=0 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0
        23:59:57,723 - DEBUG ( - Completed state=fetchEmailsJobBatch.fetchEmailsStep with status=COMPLETED
        23:59:57,723 - DEBUG ( - Handling state=fetchEmailsJobBatch.limitDecision100
        23:59:57,723 - DEBUG ( - Completed state=fetchEmailsJobBatch.limitDecision100 with status=CONTINUE
        23:59:57,723 - DEBUG ( - Handling state=fetchEmailsJobBatch.fetchEmailsStep
    Problem is, the heap dump shows 748 JobExecutions, and 581,911 StepExecutions. Where are they all coming from?

    * * * * *

    We've tried upping the JMS timeout to a much higher number (30 minutes), and obviously that massively reduces churn in periods of inactivity, like the above scenario. But as activity increases, we're back where we started.

  • #2
    Ouch. I see two issues here:
    1. Even though you're not using a database, you are still using a JobRepository (a map-based, in-memory one in this case), so the same data is still being stored. It will continue to grow for as long as steps are executed.
    2. It's bad practice to do the type of step looping you're doing (in fact, I'd consider it a bug that we even allow it).

    If your goal is to run this infinitely, why leave the fetchEmailsStep step at all? What triggers the transition to the decider, and how often is that happening?
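    For reference, the map-based repository mentioned in point 1 is typically wired up like this (a sketch; the class attributes were lost from the config in the original post, so this is an assumption about what was there):

```xml
<!-- Typical in-memory Spring Batch 2.1 wiring (sketch). The Map-based
     factory keeps every JobExecution/StepExecution in memory for the
     life of the JVM, which is why a never-ending looping job
     accumulates them without bound. -->
<bean id="transactionManager"
      class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />

<bean id="jobRepository"
      class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager" />
</bean>
```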


    • #3
      Thanks for the reply. What confused me was the explosion in the number of JobExecutions. If we've only got one continuously-running Job, I'd expect one JobExecution and 750-ish StepExecutions, not 750 JobExecutions and 750-squared-ish StepExecutions...

      My original solution did indeed just have a single Step, no flow, but I couldn't find a way to prevent the Job completing when the JMS timeout expired... which I guess should have made me think "make the JMS timeout infinite". Would that solve our problem?

      (Re: the looping design, I found my way to the Loop Flow Sample via some Google searches and SO articles - maybe there should be a bit of documentation to say how this technique shouldn't be misused.)

      Only two slight objections to the infinite timeout: (a) we get bean errors if we ask Batch to use a JmsTemplate without a "receiveTimeout" property, and "-1" didn't work; (b) because we can't use Batch Admin in our solution, we desperately want to avoid a situation where the single Job/Step completes prematurely and we have no way to restart the process.

      Presumably setting "receiveTimeout" to something like 60 mins (we originally used 5 seconds - a newbie error!) would/should minimise Job-history build-up while reassuring us that the Job will never reach COMPLETED.


      • #4
        I would be interested in seeing a test case (if you could post it) that recreates the scenario with as many StepExecutions as you were getting.

        With regards to the JMS timeout and how to handle it in general: to be honest, the JmsItemReader was meant as an extension point more than a fully implemented ItemReader, since it has no way of indicating that the input has been exhausted (it will read forever and never return null). The default for the JmsTemplate is to not time out (a receiveTimeout of 0, i.e. block indefinitely). Your configuration above doesn't include the configuration for your JmsTemplate, so I can't say what it doesn't like about it.


        • #5
          I'll see what I can do about the test case - it might take a few days to get ready, though. In the meantime, I've got heap dumps, and the attached image shows the whole 900 MB hanging off an ArrayList within a Runnable (presumably the one in SimpleJobLauncher) kicked off within the SimpleAsyncTaskExecutor (I haven't tried the Sync one). Each of the 778 StepExecutions holds a JobExecution that itself holds 778 separate StepExecutions.
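          That cross-referencing squares with the heap-dump figures quoted earlier; a quick sanity check (plain arithmetic, nothing Spring-specific):

```java
// Sanity check on the heap-dump figures: 748 JobExecutions, each
// referencing roughly 778 StepExecutions, lands in the right ballpark
// of the 581,911 StepExecutions observed in the dump.
public class HeapMath {
    public static void main(String[] args) {
        long jobExecutions = 748L;
        long stepExecutionsPerJob = 778L;
        long total = jobExecutions * stepExecutionsPerJob;
        System.out.println(total); // prints 581944, close to the observed 581,911
    }
}
```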


          Here's the missing bit of config, currently setting the JMS receive timeout to 30 mins. Removing that line falls foul of line 58 in JmsItemReader, which deliberately stops the JmsTemplate from using the default/indefinite timeout value of 0.

          	<bean id="batchEmailJmsTemplate" class="org.springframework.jms.core.JmsTemplate">
          		<property name="connectionFactory" ref="springBatchJmsCachingConnF"/>
          		<property name="defaultDestination" ref="springBatchIncomingEmailQueue" />
          		<property name="receiveTimeout" value="1800000" />	<!-- 30 minutes -->
          	</bean>

          	<bean id="springBatchJmsCachingConnF" class="org.springframework.jms.connection.CachingConnectionFactory">
          		<property name="targetConnectionFactory" ref="srcConnectionFactory"/>
          		<property name="reconnectOnException" value="true"/>
          	</bean>

          	<jee:jndi-lookup id="srcConnectionFactory" jndi-name="${jndi.jms.connfactory}"/>
          	<jee:jndi-lookup id="springBatchIncomingEmailQueue" jndi-name="springBatchIncomingEmailQueue"/>
          Last edited by aregan; Nov 20th, 2012, 11:49 AM.


          • #6
            As I noted before, the JmsItemReader was meant as an extension point, and after a bit of digging I can see that this is an example of where you may not like the default behavior. The JmsItemReader validates that the receiveTimeout has been set so that the job eventually completes (otherwise it would block forever and the job would never complete). This was added in response to a Jira issue.

            For your case, you really have two options:
            1. Subclass the JmsItemReader and override the setJmsTemplate method so that it does not perform those checks.
            2. Wrap the JmsTemplate in another class (the checks are only applied to instances of JmsTemplate, not to other implementations of JmsOperations).
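            A sketch of option 1 (this assumes the jmsTemplate field in JmsItemReader is visible to subclasses, which appears to be the case in the 2.1 source; verify against your version before relying on it):

```java
import org.springframework.batch.item.jms.JmsItemReader;
import org.springframework.jms.core.JmsOperations;

// Hedged sketch of option 1: a subclass whose setJmsTemplate skips the
// receiveTimeout/defaultDestination assertions that the stock
// JmsItemReader applies to JmsTemplate instances.
// Assumes JmsItemReader's jmsTemplate field is protected (as in the
// Spring Batch 2.1 source).
public class UncheckedJmsItemReader<T> extends JmsItemReader<T> {

    @Override
    public void setJmsTemplate(JmsOperations jmsTemplate) {
        // Deliberately bypass super.setJmsTemplate() and its checks:
        // accepting an indefinite receiveTimeout means the job will
        // never complete on its own and must be stopped externally.
        this.jmsTemplate = jmsTemplate;
    }
}
```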