  • Loading a file into the database in 1.0.?

    Hi,
    We have a huge file which needs to be uploaded into a table. We did this using a customized ItemProviderProcessTasklet in m3, and I don't know how to migrate it to the current 1.0.0 release.

    In m3 we just ask the user to configure the file path, a line filter (to skip unwanted lines, not just the header and footer), the table name, isBatchUpload, batchUploadSize, etc.

    Based on the configuration above, we used to create the ItemProvider and ItemProcessor in the afterPropertiesSet() method. If isBatchUpload is true, we write the contents into a temporary staging file in the format the table needs before uploading it into the MySQL table using the command "LOAD DATA LOCAL INFILE c:\equity.g INTO TABLE equity". When the processed item count reaches batchUploadSize, we upload the staged data into the table and reset the ItemProcessor in execute() so that records from [batchUploadSize+1] onwards are written to a fresh staging file. This continues until all the items have been read from the ItemProvider.

    This was the code in m3:

    public class FileLoaderTasklet extends ItemProviderProcessTasklet {
        ...
        ...
        @Override
        public ExitStatus execute() throws Exception {
            // On the very first call, clear the target table if requested.
            if (count++ == 0) {
                if (truncateTable) {
                    jdbcTemplate.update("delete from " + tableName);
                }
            }

            ExitStatus status = ExitStatus.CONTINUABLE;

            // Handle exceptions from the ItemProvider gracefully.
            try {
                status = super.execute();
            }
            catch (DataAccessResourceFailureException eofException) {
                // Rethrow the exception raised while parsing an XML file of size zero.
                if (eofException.getRootCause() instanceof WstxEOFException) {
                    throw eofException;
                }
            }
            catch (Exception e) {
                logger.error("Error in parsing/loading: " + e);
                handleProviderException(e);
            }

            if (!status.isContinuable()) {
                // Finalize: load whatever is left in the staging file and reset the counter.
                if (useBatchUpload) {
                    loadFileIntoDB(tempStagingFile, tableName);
                }
                count = 0;
            }
            else if (useBatchUpload && count % batchUploadSize == 0) {
                // One batch is complete: bulk-load the staging file into the table...
                jdbcTemplate.update("LOAD DATA LOCAL INFILE 'c:\\equity.g' INTO TABLE equity");
                // ...then point the item processor at a fresh staging file.
                ItemWriterItemProcessor itemProcessor = new ItemWriterItemProcessor();
                FlatFileItemWriter writer = new FlatFileItemWriter();
                tempStagingFile = File.createTempFile("BFL", "data");
                writer.setResource(new FileSystemResource(tempStagingFile));
                String[] columns = BatchJdbcUtils.getColumnNames(jdbcTemplate, tableName);
                writer.setConverter(new FieldSetToDBStringConvertor(columns));
                itemProcessor.setItemWriter(writer);
                // Reset the item processor.
                setItemProcessor(itemProcessor);
            }

            return status;
        }
        ...
        ...
    }

    I am not sure how to do this using the current ItemOrientedStep approach.

    Any hints? Please let me know if you need more information.

    thanks in advance,
    ramkris

  • #2
    It's actually much more natural to implement this scenario in 1.0. You can simply use a StepExecutionListener and implement your logic in the afterStep method. The following section in the reference documentation should help to explain it:

    http://static.springframework.org/sp...n.html#d0e3618
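
    As a rough sketch, something along these lines (it assumes the 1.0-style
    StepExecutionListenerSupport/afterStep callback and omits imports, as in the snippet
    above; jdbcTemplate, stagingFile and tableName are placeholder properties you would
    configure yourself, not framework names):

    public class StagingFileLoadListener extends StepExecutionListenerSupport {

        private JdbcTemplate jdbcTemplate;
        private String stagingFile; // placeholder: path of the file written by the step
        private String tableName;   // placeholder: target table

        @Override
        public ExitStatus afterStep(StepExecution stepExecution) {
            // Bulk-load the staging file once the step has finished writing it.
            jdbcTemplate.execute("LOAD DATA LOCAL INFILE '" + stagingFile
                    + "' INTO TABLE " + tableName);
            return stepExecution.getExitStatus();
        }

        public void setJdbcTemplate(JdbcTemplate jdbcTemplate) { this.jdbcTemplate = jdbcTemplate; }
        public void setStagingFile(String stagingFile) { this.stagingFile = stagingFile; }
        public void setTableName(String tableName) { this.tableName = tableName; }
    }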



    • #3
      Thanks lucas for your quick reply.

      But implementing a StepExecutionListener will not solve all the issues. Here is my question: let's say I have 100,000 records in the file. I don't want to upload all the records at once; I want to do it in batches, say 10,000 records at a time. How do I achieve this using ItemOrientedStep? If you look at the code above, I do exactly that and create a new FlatFileItemWriter for every 10,000 records.


      However, I found a new class, BatchSqlUpdateItemWriter, in the 1.0.0 release. Should I use this class instead of the above approach? Also, when will flush() be called in this case?

      thanks,
      ramkris



      • #4
        If your data is very clean, using the BatchSqlUpdateItemWriter is a good option. However, if you expect the data might cause a lot of write failures, you might want to rethink using it. Flush is called just before the transaction is committed (when the commit interval is complete). If you have an interval of, say, 10,000, and one record causes, say, a PK violation, you have absolutely no way of knowing which record caused the error in order to skip it. In that case, the writer simply writes each one to the database one at a time in an attempt to figure out which record was bad.
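
        Roughly, the behaviour works out to something like this (an illustration of the
        idea only, not the framework's actual code; in reality the transaction is rolled
        back and the chunk is re-processed item by item rather than caught inside flush):

        public class BufferedBatchWriter {

            private final JdbcTemplate jdbcTemplate;
            private final String insertSql; // e.g. "insert into staging (col1, col2) values (?, ?)"
            private final List<Object[]> buffer = new ArrayList<Object[]>(); // items since the last flush

            public BufferedBatchWriter(JdbcTemplate jdbcTemplate, String insertSql) {
                this.jdbcTemplate = jdbcTemplate;
                this.insertSql = insertSql;
            }

            public void write(Object[] row) {
                buffer.add(row); // nothing touches the database yet
            }

            public void flush() {
                try {
                    // One JDBC batch for the whole commit interval: fast, but a failure
                    // here cannot be traced back to a single item.
                    jdbcTemplate.batchUpdate(insertSql, new BatchPreparedStatementSetter() {
                        public void setValues(PreparedStatement ps, int i) throws SQLException {
                            Object[] row = buffer.get(i);
                            for (int col = 0; col < row.length; col++) {
                                ps.setObject(col + 1, row[col]);
                            }
                        }
                        public int getBatchSize() {
                            return buffer.size();
                        }
                    });
                }
                catch (DataAccessException batchFailure) {
                    // Fallback: write one row at a time so the bad record can be
                    // identified and skipped. With dirty data this happens often and
                    // the speed advantage of batching is lost.
                    for (Object[] row : buffer) {
                        try {
                            jdbcTemplate.update(insertSql, row);
                        }
                        catch (DataAccessException itemFailure) {
                            // the framework would apply its skip logic here
                        }
                    }
                }
                finally {
                    buffer.clear();
                }
            }
        }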



        • #5
          OK. Can you explain in detail how the writer identifies the bad record?

          Also, what is the best way to upload a file into the database, considering that the file might contain bad records?



          • #6
            It's not very complex. When you call write on the writer, it stores a list of items to be written. When flush is called, just before committing the business transaction, it attempts to write them all out using JDBC batch mode. This means that if your commit interval is 100, there will be 100 items written out during this flush. If there's an error, there's no way for the framework to know which item caused it, so after rolling back the transaction and starting from the beginning, it will 'flush' after each item, meaning it will just write each one out to the database individually. In this way the framework can know that a specific item caused the failure, and use its normal skip logic to mark the record as bad. As you can imagine, if your data is fairly dirty, it will have to do a lot of rollbacks to find and skip the bad records, destroying any speed advantage you gained by using the batched update.

            To be honest, I wouldn't use the BatchSqlUpdateItemWriter at first for any job. Instead I would just use a normal DAO and write each record out as usual. Once you get the job working you can start playing around with the skip limit and commit interval to see how long an average job takes to process. If it's within acceptable limits, there's really no reason to go any further. If not, and you think the data is reasonably clean, you could try the BatchSqlUpdateItemWriter in conjunction with increasing the commit interval to see if you get any speed boost. You could also try dumping the data into a staging table, which shouldn't have any issues with write failures (keep in mind that read failures won't cause you any issues either way, since they don't cause a rollback). Once it's in a staging table there are lots of things you could try, including partitioning the data and processing it with multiple jobs, or many other techniques. It all really depends upon your data and your processing requirements.
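
            The DAO approach really is as plain as it sounds; something like the sketch
            below, where StagingDao, the equity_staging table and the Equity item class
            are all made-up names, and the step's ItemWriter would simply delegate its
            write() calls to insert():

            public class StagingDao {

                private JdbcTemplate jdbcTemplate;

                public void setJdbcTemplate(JdbcTemplate jdbcTemplate) {
                    this.jdbcTemplate = jdbcTemplate;
                }

                // One straightforward insert per item: no buffering and no batch mode,
                // so a failure always points at exactly the record being written.
                public void insert(Equity item) {
                    jdbcTemplate.update(
                            "insert into equity_staging (symbol, price, trade_date) values (?, ?, ?)",
                            new Object[] { item.getSymbol(), item.getPrice(), item.getTradeDate() });
                }
            }

            Once that version of the job runs end to end, the commit interval and skip
            limit give you the knobs to tune from there.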



            • #7
              Thanks lucas for the insights on that.

              Originally posted by lucasward
              You could also try dumping the data into a staging table, which shouldn't have any issues with write failures (keep in mind that read failures won't cause you any issues either way, since they don't cause a rollback). Once it's in a staging table there are lots of things you could try, including partitioning the data and processing it with multiple jobs, or many other techniques. It all really depends upon your data and your processing requirements.
              It would be great if you could tell me how to dump the file into the staging table using the current 1.0.0 release. I was able to do this in the m3 release by extending ItemProviderProcessTasklet: as soon as the read count reached the configured batchUploadSize, I would dump the staging file into the table (using the MySQL LOAD DATA LOCAL INFILE command) and reset the FlatFileItemWriter, which writes the items to a temporary staging file in a table-friendly format. I would continue doing this as long as the ExitStatus returned from ItemProviderProcessTasklet was continuable.

              thanks,
              Ramkris



              • #8
                I just meant as two separate steps, one that loads the file into a staging table, and another that reads and processes from that table.



                • #9
                  I agree, but I am not sure how to dump the file into the staging table in the first step. Reading a single item from the file and writing it into the staging table would be a performance issue. Is there any other way?

                  thanks,
                  Ramkris



                  • #10
                    Well, if you're going into a staging table, couldn't you up the commit interval to be extremely high and churn through it pretty quickly? In performance tests I've seen the FlatFileItemReader churn through over a million records in a file in around 5 minutes or so. The only real time sink would be how long committing takes, but you could probably get around that by setting the interval to be very high. If you allow skipping records, there shouldn't be any rollbacks, since exceptions on read don't cause them.



                    • #11
                      Thanks Lucas. Which ItemWriter should I use to write the records? A simple JDBC ItemWriter that uses JdbcTemplate to insert the records into the staging table? Is that enough?

                      thanks,
                      Ramkris



                      • #12
                        I would try that at first, since it's the simplest solution. If you have access to the SimpleJdbcInsert that was added in Spring 2.5, you can use that for a configuration-only option that's quite nice as well. I would also make sure the step loading your data into the database is built with the SkipLimitStepFactoryBean, so that you can set a skip limit for any parsing errors you get while reading in from the file. Hopefully, if your data is fairly clean of write failures (which it should be on a simple insert), then you should be able to use a fairly high commit interval.
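
                        For the insert itself, something along these lines (a sketch only:
                        the equity_staging table, its columns and the Equity item class are
                        invented, and SimpleJdbcInsert is Spring 2.5's
                        org.springframework.jdbc.core.simple.SimpleJdbcInsert, which reads
                        the column metadata from the table so only the table name needs
                        configuring):

                        public class StagingTableWriter {

                            private SimpleJdbcInsert insert;

                            public void setDataSource(DataSource dataSource) {
                                this.insert = new SimpleJdbcInsert(dataSource).withTableName("equity_staging");
                            }

                            // Called once per item by the step's writer logic.
                            public void write(Object item) {
                                Equity equity = (Equity) item; // invented domain class
                                Map<String, Object> row = new HashMap<String, Object>();
                                row.put("symbol", equity.getSymbol());
                                row.put("price", equity.getPrice());
                                row.put("trade_date", equity.getTradeDate());
                                insert.execute(row);
                            }
                        }

                        Plug that in behind the step's ItemWriter, configure the step through
                        the SkipLimitStepFactoryBean with a skip limit for parse errors and a
                        high commit interval, and you have the simple version to measure first.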
