  • Wishlist / Coding Examples for the following...

    I've started evaluating Spring Batch and it looks promising. I've worked on several projects that do a fair amount of file batch processing. From my experience, there are some features that the framework needs to support before I can comfortably recommend that our group use it.


    Specifically, does the framework support:
    • native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"? For us it's fine if this breaks the transaction boundary demarcations.
    • multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
    • optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)
    • optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)
    • field padding (left vs right and padding char)
    • field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)
    • field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)
    • On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)? Perhaps the FieldSet class could contain an attribute for that info?
    • Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?
    • What if you want to do multiple passes over the file - one to validate it (especially useful for files containing multiple record types), then one to process it?


    I'm very familiar with Spring, but Spring Batch is totally new to me, so perhaps it does support what I'm asking and I just overlooked how to accomplish what I'd like.

    Can someone please point me to an example of how to do some of the things I asked about?

  • #2
    A few quick pointers:

    native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"?
    No, the framework does not provide support for vendor-specific database features.

    multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
    Take a look at the multilineOrderJob sample.

    On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)?
    You'll want to implement a custom LineTokenizer.

    What if you want to do multiple passes over the file?
    Consider making each pass over the file a separate step in the job.
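To sketch what such a custom LineTokenizer might look like, here is a minimal plain-Java version that splits on any of a configurable set of single-character delimiters. The class name is made up, and a real implementation would implement the framework's LineTokenizer interface and return a FieldSet rather than a List:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a line on any of a configurable set of
// single-character delimiters. A real Spring Batch LineTokenizer would
// wrap this logic and return a FieldSet instead of a List.
public class MultiDelimiterTokenizer {

    private final String delimiters; // each char is treated as a field separator

    public MultiDelimiterTokenizer(String delimiters) {
        this.delimiters = delimiters;
    }

    public List<String> tokenize(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (delimiters.indexOf(c) >= 0) {
                fields.add(current.toString()); // hit a separator: close the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing separator
        return fields;
    }
}
```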



    • #3
      native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"?
      A tasklet that executes a system command might be what you are looking for: http://jira.springframework.org/browse/BATCH-152
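As a rough illustration of that approach, the sketch below assembles and runs a Sybase bcp command line from Java. The class and method names (and the table/file arguments in the test) are invented; a real version would implement Spring Batch's Tasklet interface and translate the exit code into an ExitStatus:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a tasklet-style step that shells out to a native
// bulk loader. Not a framework class; names are made up for illustration.
public class BulkLoadCommandTasklet {

    private final String table;
    private final String dataFile;

    public BulkLoadCommandTasklet(String table, String dataFile) {
        this.table = table;
        this.dataFile = dataFile;
    }

    // Assemble the bcp command line: bcp <table> in <file> -c
    public List<String> buildCommand() {
        return Arrays.asList("bcp", table, "in", dataFile, "-c");
    }

    // Run the command and return its exit code (0 means success).
    public int execute() throws Exception {
        Process process = new ProcessBuilder(buildCommand()).inheritIO().start();
        return process.waitFor();
    }
}
```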



      • #4
        native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"? For us it's fine if this breaks the transaction boundary demarcations.
        As Robert mentioned, there aren't platform-specific item readers; however, there is no reason why you can't call an Oracle-specific class. I've worked with multiple clients that have done so easily.

        multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
        There is a sample job for this (multi-line job). All you need is to define a LineTokenizer for each record type.

        optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)
        This is something we've discussed, and it is definitely possible with the FixedLengthTokenizer by not including a range within the column definition. However, it isn't possible with the DelimitedLineTokenizer. You could avoid mapping a field to a particular object, but with 'automapping' there would be issues. It should be added as an issue in Jira - can you add one with an example business case where you use optional fields?

        optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)
        By default, if you use the PrefixMatchingCompositeLineTokenizer, every record type would be optional. However, you could easily write your own LineTokenizer that knows which record types are optional or required.
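A minimal sketch of that idea - routing lines to record types by prefix, with a notion of required types bolted on - might look like this in plain Java. All names here are invented; the real PrefixMatchingCompositeLineTokenizer only does the routing part:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of prefix-based record-type routing with required-type
// tracking. Illustration only, not a framework class.
public class RecordTypeRouter {

    private final Map<String, String> prefixToType = new HashMap<String, String>();
    private final Set<String> requiredTypes = new HashSet<String>();
    private final Set<String> seenTypes = new HashSet<String>();

    public void register(String prefix, String type, boolean required) {
        prefixToType.put(prefix, type);
        if (required) {
            requiredTypes.add(type);
        }
    }

    // Resolve the record type for a line by its prefix.
    public String route(String line) {
        for (Map.Entry<String, String> e : prefixToType.entrySet()) {
            if (line.startsWith(e.getKey())) {
                seenTypes.add(e.getValue());
                return e.getValue();
            }
        }
        throw new IllegalArgumentException("No record type registered for line: " + line);
    }

    // After the file is read, report which required record types never appeared.
    public Set<String> missingRequiredTypes() {
        Set<String> missing = new HashSet<String>(requiredTypes);
        missing.removeAll(seenTypes);
        return missing;
    }
}
```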

        field padding (left vs right and padding char)
        Padding should work for input (see BATCH-261), and there are setters for padding of fields in the FixedLengthAggregator. However, it should probably be more fine-grained than it is currently.

        field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)
        Supported

        field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)
        Tokenizers don't do this by default, although a FieldSetMapper that you write could easily do it.
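For example, a hand-written FieldSetMapper could funnel every field through a small helper like the one below before populating the domain object (the helper is illustrative, not a framework class):

```java
// Hypothetical sketch of field-by-field defaulting as a custom FieldSetMapper
// might do it: the mapper, not the tokenizer, substitutes a default value
// whenever a field is empty or blank.
public class DefaultingMapper {

    // Return the raw value, or the supplied default when it is null/empty/blank.
    public static String withDefault(String raw, String defaultValue) {
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        return raw;
    }
}
```

A mapper could pass e.g. today's date as the default for a blank date field.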

        On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)? Perhaps the FieldSet class could contain an attribute for that info?
        There is a setter for the delimiter in the DelimitedLineTokenizer; however, it will be used for every field in the file. I'm curious what the use case would be for having multiple delimiters per file?

        Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?
        There is a RecordSeparatorPolicy as part of the FlatFileItemReader.

        What if you want to do multi passes of the file - one to validate it (especially useful for files containing multiple record types), then one to process it?
        You could easily have multiple steps that correspond to these 'passes'.



        • #5
          Robert and Lucas,

          Thanks for all the info - that's very helpful. I'll start to try out your suggestions a bit tonight and will try to open that Jira ticket for the optional/required fields in the next day or two.

          As far as Lucas's question regarding:

          There is a setter for the delimiter type in the DelimitedTokenizer, however, it will be used for every field in the file. I'm curious what the use-case would be for having multiple delimiters per file?
          I guess a use case would be: we have a file where we use pipe delimiters, except for delimiting a few of the fields in the record, since those fields might themselves contain pipes. It's not a great example, since one could argue that we should pick a delimiter like x00 that we're guaranteed never to encounter in any of our fields, but unfortunately we're limited to the chars that the system outputting the data file can generate. I've also used the batch processing tool Ab Initio (look it up on Wikipedia if you're not familiar with it), and it's able to handle various chars for delimiting each field.

          I'll let you know how I make out.



          • #6
            Interesting, I've seen a lot of projects pick pipe over comma delimited because of the likelihood of commas being part of the data, but usually pipes are relatively safe. It could be added to the Tokenizer, but it seems like a minority use case and probably out of scope for Release 1. However, please add it to JIRA, and if a lot of others need the feature, it could be moved up.

            Also, if a reliable delimiter can't be chosen, is using a fixed-length format a possibility?



            • #7
              Originally posted by lucasward View Post
              Interesting, I've seen a lot of projects pick pipe over comma delimited because of the likelihood of commas being part of the data, but usually pipes are relatively safe. It could be added to the Tokenizer, but it seems like a minority use case and probably out of scope for Release 1. However, please add it to JIRA, and if a lot of others need the feature, it could be moved up.

              Also, if a reliable delimiter can't be chosen, is using a fixed-length format a possibility?
              Then why not support escaping, like "\\" in Java (i.e. a single delimiter is a delimiter, a doubled delimiter is a literal value of a single delimiter)? Normally it is not complicated to double delimiters inside the fields on output - no more complicated than using different delimiters for different fields. And this solution is 100% safe.

              Regards,
              Oleksandr
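A minimal sketch of the doubled-delimiter convention proposed above - not an existing framework class - shows that only one character of look-ahead is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the doubled-delimiter convention: a single delimiter
// separates fields, a doubled delimiter stands for a literal delimiter
// character. Parsing needs only one character of look-ahead.
public class DoublingDelimiterTokenizer {

    public static List<String> tokenize(String line, char delimiter) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == delimiter) {
                if (i + 1 < line.length() && line.charAt(i + 1) == delimiter) {
                    current.append(delimiter); // doubled: literal delimiter
                    i += 2;
                } else {
                    fields.add(current.toString()); // single: field boundary
                    current.setLength(0);
                    i++;
                }
            } else {
                current.append(c);
                i++;
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```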



              • #8
                N.B. The DelimitedLineTokenizer adopts the Microsoft-inspired convention that a field containing a delimiter (line or field delimiter) can be escaped by quoting it. Inside such a field a quote character is escaped by repeating it. This is what you get from Excel (for instance) when you do Save As... -> CSV, so it covers a large constituency already. The Javadocs mention this behaviour in the setter for the quote character (which defaults to ").
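For readers unfamiliar with the convention, the following standalone sketch parses a line the same way. It is an illustration of the quoting rules, not the DelimitedLineTokenizer source:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Microsoft/Excel CSV quoting convention: a field containing
// the delimiter is wrapped in quotes, and a quote character inside such a
// field is escaped by repeating it.
public class QuotedCsvParser {

    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: literal quote
                        i += 2;
                        continue;
                    }
                    inQuotes = false; // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
            i++;
        }
        fields.add(current.toString());
        return fields;
    }
}
```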



                • #9
                  It is good that this convention is supported, but it seems slightly overcomplicated - simple doubling of the delimiter is easier to produce and to parse, and should provide (marginally) better performance, which is not a bad thing in batch applications. Also, processing a quoted string requires virtually unlimited look-ahead (especially if the file being processed is malformed), while simple delimiter duplication requires only single-character look-ahead and is much safer in this respect.

                  So it seems quite reasonable to support such a strategy as well. Anyway, it is not very likely (though still possible) that CSV files for batch processing would be created by Excel, as Excel is mostly an interactive tool.

                  Regards,
                  Oleksandr

                  Originally posted by Dave Syer View Post
                  N.B. The DelimitedLineTokenizer adopts the Microsoft-inspired convention that a field containing a delimiter (line or field delimiter) can be escaped by quoting it. Inside such a field a quote character is escaped by repeating it. This is what you get from Excel (for instance) when you do Save As... -> CSV, so it covers a large constituency already. The Javadocs mention this behaviour in the setter for the quote character (which defaults to ").



                  • #10
                    One thing I would also like to point out is that delimiters are set for an entire file, meaning that there is one setter for the delimiter that is used. Attempting to set a delimiter per field would require significantly more configuration than there is currently, for very little value. At a minimum, if this feature is needed, it would need to be a separate tokenizer altogether, so that the more common use case would be easier to configure. That said, I still don't understand what setting a delimiter per field would add that couldn't more easily be accommodated by using fixed-length formatting.



                    • #11
                      To put it shortly: size reduction, sometimes very significant.
                      But anyway, a delimiter with the possibility to escape it is a much better solution than a delimiter per field, and often a better one than fixed length as well.
                      Originally posted by lucasward View Post
                      One thing I would also like to point out is that delimiters are set for an entire file. Meaning, that there is one setter for a delimiter that is used. Attempting to set a delimiter per field would require significantly more configuration than there is currently, for very little value. At a minimum, if this feature is needed, it would need to be a separate tokenizer all together, so that the more common use case would be easier to configure. However, with that being said, I still don't understand what setting a delimiter per field would add that couldn't more easily be accommodated by using fixed-length formatting.



                      • #12
                        ItemReader for excel

                        Hi, I need to implement an ItemReader for an Excel file. How do I do this? I've taken a look at this thread because it's the closest thing I can find related to creating an ItemReader for Excel. Anyway, I'm not sure, but I think I cannot use FlatFileItemReader for Excel files, especially since the Excel file that I need to parse contains multiple tabs.

                        Regards,
                        Raymond



                        • #13
                          The LineReader that we use internally and the DelimitedLineTokenizer are designed to work with Excel-generated CSV files. Not sure about tabs, but I can't see why it wouldn't work. Did you try it?



                          • #14
                            Hi Dave,

                            Thanks for the quick reply.

                            You see, I'm a bit confused about how the read() method is called. I'm trying to parse an Excel file. I pass the Excel file name and sheet name to the constructor of the CustomItemReader I created (which extends FlatFileItemReader and implements ItemReader). In the constructor of my CustomItemReader, I used JExcelAPI to load the worksheet whose data I need to process; in the read() method, I'm supposed to return the contents of each row. However, for some reason, the read() method is not called at all. I'm at a loss as to why this happens. Please see below my job configuration and my classes:

[code]
<beans ...>

    <import resource="applicationContext.xml"/>

    <bean id="myJob" class="org.springframework.batch.core.job.SimpleJob">
        <property name="name" value="myJob" />
        <property name="steps">
            <list>
                <bean id="process" parent="simpleStep">
                    <property name="itemReader" ref="customExcelReader"/>
                    <property name="itemWriter" ref="customExcelWriter"/>
                </bean>
            </list>
        </property>
    </bean>

    <bean id="customExcelReader" class="testing.reader.CustomExcelSheetReader">
        <constructor-arg value="d:/sample.xls"/>
        <constructor-arg value="Sample"/>
        <property name="lineTokenizer">
            <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer" />
        </property>
        <property name="fieldSetMapper">
            <bean class="testing.mapping.MyFieldSetMapper" />
        </property>
    </bean>

    <bean id="customExcelWriter" class="testing.writer.CustomExcelSheetWriter" />
</beans>
[/code]

                            Here's my code for the custom ItemReader I created:

[code]
public class CustomExcelSheetReader extends FlatFileItemReader implements ItemReader {

    private List<DomObject> itemList;

    public CustomExcelSheetReader(String excelFileName, String sheetName) {
        Workbook workbook = null;
        try {
            workbook = Workbook.getWorkbook(new File(excelFileName));
        } catch (BiffException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Sheet sheet = null;
        if (workbook.getSheet(sheetName) != null) {
            itemList = new ArrayList<DomObject>();
            sheet = workbook.getSheet(sheetName);

            for (int i = 1; i < sheet.getRows(); i++) {
                DomObject data = new DomObject();
                // call setter methods of DomObject
                itemList.add(data);
            }
        }
    }

    /* @see org.springframework.batch.item.ItemReader#mark() */
    public void mark() throws MarkFailedException {
    }

    /* @see org.springframework.batch.item.ItemReader#read() */
    public Object read() throws Exception, UnexpectedInputException,
            NoWorkFoundException, ParseException {
        System.out.println("read called");
        System.out.println(itemList.isEmpty());
        if (!itemList.isEmpty()) {
            return itemList.remove(0);
        }
        return null;
    }

    /* @see org.springframework.batch.item.ItemReader#reset() */
    public void reset() throws ResetFailedException {
    }
}
[/code]

                            Thanks!
                            Raymond



                            • #15
                              I don't understand why you extend FlatFileItemReader. There isn't much point if you are not reading a flat file. Also, your implementation of ItemReader is not honouring the reset() and mark() contract (so rollbacks will not work), and it doesn't implement ItemStream with the index of your row list, so it isn't restartable.

                              Other than that I can't see any issues explaining why read() is not working. How did you launch the job?

                              (Please use [code][/code] tags to post code and stack traces.)
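For comparison, a minimal list-backed reader that honours a mark()/reset() contract could be as simple as the sketch below. It deliberately does not implement the real Spring Batch interfaces, so the signatures here are simplified for illustration:

```java
import java.util.List;

// Hypothetical sketch of a list-backed reader that honours a mark()/reset()
// contract with a simple index, instead of extending FlatFileItemReader.
public class ListBackedReader<T> {

    private final List<T> items;
    private int current = 0;
    private int marked = 0;

    public ListBackedReader(List<T> items) {
        this.items = items;
    }

    // Return the next item, or null when the input is exhausted.
    public T read() {
        if (current < items.size()) {
            return items.get(current++);
        }
        return null;
    }

    // Remember the current position (commit point).
    public void mark() {
        marked = current;
    }

    // Roll back to the last marked position.
    public void reset() {
        current = marked;
    }
}
```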

