  • 1 million records take 130 minutes to process, how to improve?

    Hello,
    I have written a Spring Batch application that reads records from one database and writes them to another.
    I would like to baseline (i.e. measure) the performance of the batch job, to see whether it can be improved further.

    Total records read: 1.2 million (from an Oracle database)
    Total records written: 1.2 million (to an Informix database)

    Time taken to load them all: 130 minutes.

    The following components have been configured in the application:

    Code:
    <batch:job id="updateDetailsJob" restartable="true" job-repository="jobRepository">
    <!--  STEP 1 : Truncate Table --> 
    <batch:step id="truncateStep" next="stepX">
      <batch:tasklet ref="truncateTasklet" />
    </batch:step>
    <!--  STEP 2 : Retrieve Unique Record --> 
    <batch:step id="stepX">
     <batch:partition step="retrieveUniqueOriginalRecordStep" partitioner="partitioner">				
      <batch:handler grid-size="5" task-executor="taskExecutor"/>			
     </batch:partition>		
    </batch:step>
    </batch:job>
    <batch:step id="retrieveUniqueOriginalRecordStep" >
    <batch:tasklet>
      <batch:chunk reader="uniqueReader"  writer="uniqueWriter" commit-interval="5000">
      </batch:chunk>
    <batch:listeners merge="true">
      <batch:listener>
          <bean class="com...listener.ItemFailureLoggerListener"/>
      </batch:listener>
    </batch:listeners>
    </batch:tasklet>
    </batch:step>
    		
    <bean id="truncateTasklet"  class="com...tasklet.TruncateTasklet">
      <property name="dao" ref="truncateDao"></property>
    </bean>
    
    <bean id="uniqueReader" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step">
     <property name="dataSource" ref="aimsysDataSource" />
     <property name="queryProvider">
     <bean class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <property name="selectClause" value="SELECT item, board, rank" />
        <property name="fromClause" value="FROM examDetails" />
        <property name="whereClause" value="ROWNUM &gt;= :minId and ROWNUM &lt;= :maxId" />
        <property name="sortKey" value="item" />
     </bean>
     </property>
     <property name="pageSize" value="5000"/>
      <property name="parameterValues">
        <map>
          <entry key="minId" value="#{stepExecutionContext[minValue]}" />
          <entry key="maxId" value="#{stepExecutionContext[maxValue]}" />
        </map>
       </property>
      <property name="rowMapper" ref="examDetailMapper" />
    </bean>
    <bean id="uniqueWriter" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
     <property name="dataSource" ref="dataSource2" />
     <property name="assertUpdates" value="true" />
     <property name="sql" value=" INSERT INTO ..... VALUES (...);" />
     <property name="itemSqlParameterSourceProvider">
       <bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
     </property>
    </bean>
      
    <bean id="examDetailMapper" class="com...domain.ExamDetailsRowMapper"/>
    
    <bean id="partitioner" class="com...partition.ColumnRangePartitioner">		
     <property name="dataSource" ref="dataSource" />		
     <property name="table" value="examDetails" />		
     <property name="column" value="ROWNUM" />	
    </bean>
    
    <bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
    </beans>
    Any help is greatly appreciated, as the business says the job is not expected to run this long.
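
    For reference, the ColumnRangePartitioner wired in above is never shown in the thread. Below is a minimal sketch of what such a partitioner usually looks like (modelled on the Spring Batch samples; the package name and details are illustrative, not the poster's actual class). Each partition's minValue/maxValue entries are what the #{stepExecutionContext[minValue]} / #{stepExecutionContext[maxValue]} expressions feeding uniqueReader pick up.

    Code:
    package example.partition;

    import java.util.HashMap;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class ColumnRangePartitioner implements Partitioner {

        private JdbcTemplate jdbcTemplate;
        private String table;
        private String column;

        public void setDataSource(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        public void setTable(String table) {
            this.table = table;
        }

        public void setColumn(String column) {
            this.column = column;
        }

        // Splits [min, max] of the configured column into gridSize ranges and exposes
        // each range as minValue/maxValue in a step execution context.
        public Map<String, ExecutionContext> partition(int gridSize) {
            long min = jdbcTemplate.queryForObject("SELECT MIN(" + column + ") FROM " + table, Long.class);
            long max = jdbcTemplate.queryForObject("SELECT MAX(" + column + ") FROM " + table, Long.class);
            long targetSize = (max - min) / gridSize + 1;

            Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
            long start = min;
            long end = start + targetSize - 1;
            int partitionNumber = 0;
            while (start <= max) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("minValue", start);
                context.putLong("maxValue", Math.min(end, max));
                partitions.put("partition" + partitionNumber, context);
                start += targetSize;
                end += targetSize;
                partitionNumber++;
            }
            return partitions;
        }
    }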

  • #2
    Paging with Oracle is slow, especially with a large number of records (it selects all the records and then does a sub-select on them). A cursor instead of paging might improve things a little (at least for the reading).

    For writing, you might try a different itemSqlParameterSourceProvider, as the current one uses reflection on the item to retrieve its properties, which might not be the most efficient.
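
    One concrete way to avoid that reflection is the ItemPreparedStatementSetter route the poster ends up taking later in the thread (configured as UniquePreparedStatementSetter, which is never shown). A rough sketch, with an assumed ExamDetail item type and column order:

    Code:
    package example.item;

    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.springframework.batch.item.database.ItemPreparedStatementSetter;

    // Binds the item's fields straight onto the writer's "INSERT ... VALUES (?, ?, ?)"
    // statement, avoiding the per-item reflection done by
    // BeanPropertyItemSqlParameterSourceProvider. ExamDetail and its getters are assumed.
    public class ExamDetailPreparedStatementSetter implements ItemPreparedStatementSetter<ExamDetail> {

        public void setValues(ExamDetail item, PreparedStatement ps) throws SQLException {
            ps.setString(1, item.getItem());
            ps.setString(2, item.getBoard());
            ps.setInt(3, item.getRank());
        }
    }
    Note that when an itemPreparedStatementSetter is used, the JdbcBatchItemWriter's SQL must use ? placeholders rather than named parameters.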



    • #3
      Originally posted by Marten Deinum (see #2 above)
      Thank you Marten for the areas to look at.
      I have modified the configuration and run it on a subset of the data (13K records):
      1. Changed from the paging reader to the cursor reader.
      2. Implemented a custom prepared statement setter and configured it on the JdbcBatchItemWriter.
      3. Changed the DataSource configuration: previously it was DriverManagerDataSource, now it is the DBCP BasicDataSource.

      Below are the stats:
      ended RUNTIME [61280] ms Status [Success] - Existing
      ended RUNTIME [60868] ms Status [Success] - Changed Reader DataSource and Paging JDBCReader used
      ended RUNTIME [60006] ms Status [Success] - Changed Writer DataSource and Used PreparedStatementSetter

      Improvement: roughly 1.3 seconds for 13K records (61.3 s down to 60.0 s)

      Based on the above, I have submitted the batch to run against the complete set of data.

      Anything else I should look at? Please share your views.



      • #4
        Hello there,
        The partitioner approach was rejected by the team to avoid complexity, but by going ahead with the other suggested changes I have still achieved a 20% improvement.
        Posting the latest configuration:
        Code:
        <batch:job id="updateDetailsJob" restartable="true" job-repository="jobRepository">
        <!--  STEP 1 : Truncate Table --> 
        <batch:step id="truncateStep" next="retrieveUniqueOriginalRecordStep">
          <batch:tasklet ref="truncateTasklet" />
        </batch:step>
        <!--  STEP 2 : Retrieve Unique Record --> 
        <batch:step id="retrieveUniqueOriginalRecordStep" >
        <batch:tasklet>
          <batch:chunk reader="uniqueReader"  writer="uniqueWriter" commit-interval="100000">
          </batch:chunk>
        <batch:listeners merge="true">
          <batch:listener>
              <bean class="com...listener.ItemFailureLoggerListener"/>
          </batch:listener>
        </batch:listeners>
        </batch:tasklet>
        </batch:step>
        
        </batch:job>
        	
        <bean id="truncateTasklet"  class="com...tasklet.TruncateTasklet">
          <property name="dao" ref="truncateDao"></property>
        </bean>
        	
        <bean id="uniqueReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
         <property name="dataSource" ref="dataSource" />
        <!--Demonstrative purpose only actual query not displayed due to business requirement-->
         <property name="sql" value="SELECT item, board, rank FROM examDetails" /> 
         <property name="rowMapper" ref="examDetailMapper" />
        </bean>
        
        <bean id="uniqueWriter" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
         <property name="dataSource" ref="dataSource2" />
         <property name="assertUpdates" value="true" />
         <property name="sql" value=" INSERT INTO ..... VALUES (...);" />
         <property name="itemPreparedStatementSetter" ref="uniqueParameterSetter"/>
        </bean>
        
        <bean id="uniqueParameterSetter" class="com...item.UniquePreparedStatementSetter"/>  
        
        <bean id="examDetailMapper" class="com...domain.ExamDetailsRowMapper"/>
        
        </beans>

        DataSource configuration file:

        Code:
        <!-- OLD Configuration -->
        <bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource" >
        <property name="driverClassName" value="${sys.jdbc.driverClass}" />
        <property name="url" value="${sys.jdbc.url}" />
        <property name="username" value="${sys.jdbc.user}" />
        <property name="password" value="${sys.jdbc.password}" />
        </bean>
        
        <!-- New Configuration -->	
        <bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
         <property name="driverClassName" value="${sys.jdbc.driverClass}"/>
         <property name="url" value="${sys.jdbc.url}"/>
         <property name="username" value="${sys.jdbc.user}"/>
         <property name="password" value="${sys.jdbc.password}"/>
         <property name="maxIdle" value="10"/>
         <property name="maxActive" value="100"/>
         <property name="maxWait" value="10000"/>
         <property name="validationQuery" value="select 1 from dual"/>
         <property name="testOnBorrow" value="false"/>
         <property name="testWhileIdle" value="true"/>
         <property name="timeBetweenEvictionRunsMillis" value="1200000"/>
         <property name="minEvictableIdleTimeMillis" value="1800000"/>
         <property name="numTestsPerEvictionRun" value="5"/>
         <property name="defaultAutoCommit" value="false"/>
        </bean>
        Please let me know any other areas I can look at to improve this further. The job still runs for more than 1.5 hours. Your help is greatly appreciated.



        • #5
          Disable indexes, enable/recreate after writing ... Use a ConnectionPool for writing (or maybe a SingleConnectionDataSource).

          Figure out, using profiling tools, what is taking the most time and see if you can fix that.
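
          Before reaching for a full profiler, one cheap way to see how the runtime splits between reading and writing is to wrap the writer in a timing delegate and compare its cumulative time against the step's total. A rough sketch (the class name and its wiring are assumptions, not something from the thread):

          Code:
          package example.profiling;

          import java.util.List;
          import java.util.concurrent.atomic.AtomicLong;

          import org.springframework.batch.item.ItemWriter;

          public class TimingItemWriter<T> implements ItemWriter<T> {

              private ItemWriter<T> delegate;
              private final AtomicLong totalWriteMillis = new AtomicLong();

              public void setDelegate(ItemWriter<T> delegate) {
                  this.delegate = delegate;
              }

              // Delegates to the real writer and accumulates the time spent writing.
              public void write(List<? extends T> items) throws Exception {
                  long start = System.currentTimeMillis();
                  try {
                      delegate.write(items);
                  } finally {
                      long elapsed = System.currentTimeMillis() - start;
                      long total = totalWriteMillis.addAndGet(elapsed);
                      System.out.println("wrote " + items.size() + " items in " + elapsed
                              + " ms (cumulative write time " + total + " ms)");
                  }
              }
          }
          Wire it as the step's writer with the existing uniqueWriter injected as the delegate; whatever is left after subtracting the cumulative write time from the step duration is roughly read/mapping time.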



          • #6
            Hi Marten,

            Option 1: Ruled out by the team, as the database tables are already in production and we are not allowed to alter them, except for truncating.
            Option 2: Will definitely introduce a connection pool for writing now. SingleConnectionDataSource: tried and tested, not much gain compared to BasicDataSource.
            Option 3: Can you list any profiling tools for finding the bottleneck in a Spring Batch job?

            Thanks



            • #7
              You don't have to alter the tables for the indexes; with Oracle you can simply disable them and later enable them by issuing a query. This could be done at the start and the end of the job. Indexes (unique or just for querying) are a real performance killer, as each insert triggers an update of the index.
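
              A tasklet along these lines could run as an extra step before and after the load. A minimal sketch (the SQL is an assumption, and ALTER INDEX ... UNUSABLE / REBUILD is Oracle syntax; the equivalent statements for the Informix target would differ):

              Code:
              package example.tasklet;

              import javax.sql.DataSource;

              import org.springframework.batch.core.StepContribution;
              import org.springframework.batch.core.scope.context.ChunkContext;
              import org.springframework.batch.core.step.tasklet.Tasklet;
              import org.springframework.batch.repeat.RepeatStatus;
              import org.springframework.jdbc.core.JdbcTemplate;

              public class ToggleIndexTasklet implements Tasklet {

                  private JdbcTemplate jdbcTemplate;
                  // e.g. "ALTER INDEX exam_details_idx UNUSABLE" before the load and
                  // "ALTER INDEX exam_details_idx REBUILD" after it (index name is hypothetical).
                  private String sql;

                  public void setDataSource(DataSource dataSource) {
                      this.jdbcTemplate = new JdbcTemplate(dataSource);
                  }

                  public void setSql(String sql) {
                      this.sql = sql;
                  }

                  // Runs the single DDL statement and finishes, so one instance can be wired
                  // as a step before the load step and another as a step after it.
                  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
                      jdbcTemplate.execute(sql);
                      return RepeatStatus.FINISHED;
                  }
              }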

              Regarding option 3, just use your normal tools to identify problems (YourKit, etc.).



              • #8
                In a previous post, you noted that partitioning was not allowed because of complexity (I'm not sure I'd agree). What about just a multithreaded step using the JdbcPagingItemReader?
