There has got to be a faster way of loading dynamic CSV headers...
  • There has got to be a faster way of loading dynamic CSV headers...

    I'm writing an importer for very bog-standard CSVs where the exact order the attributes come in is user-defined in the header (where attribute names are comma-separated). Now, I'm new to Spring Batch (particularly 2.0), but I thought this would be a relatively simple thing to arrange. I've now got it to work (I think), but since I had to go about it in such an arse-backwards way, I thought I would post my findings here for three reasons:

    1) Someone might be able to explain a way of doing it better and faster.
    2) Someone might be able to spot some important step I missed (I'm still not sure my job will recover from interruption correctly).
    3) This might be a useful record for anyone else trying to solve the same (very common I would have thought) problem.

    Let's start with the FlatFileItemReader, mainly because that was where I started and originally thought I would very quickly finish up.

    Code:
    	<bean id="productFileItemReader" 
    	class="org.springframework.batch.item.file.FlatFileItemReader">
    		<property name="resource" value="classpath:dataload/products.csv" />
    		<property name="lineMapper" ref="lineMapper" />
    		<property name="linesToSkip" value="1" />
    		<property name="skippedLinesCallback" ref="csvHeaderCallbackHandler" />
    	</bean>
    The last two attributes seem to be new to Spring Batch 2.0. Back when I was doing the same thing in Spring Batch 1.1 I only had a "firstLineIsHeader" attribute instead. The new flexibility is cooler, but one thing the old style did "for free" was automatically assume that the first line was full of attribute names and populate my tokenizer with them. The new FlatFileItemReader doesn't seem to do that, so I had to create a LineCallbackHandler to try and achieve the same thing myself.
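    For anyone unfamiliar with the contract: the reader hands each of the first `linesToSkip` lines to the `skippedLinesCallback` instead of mapping them as items, which is why a callback is the natural hook for grabbing the header. A rough plain-Java sketch of that behaviour (the class and method names below are illustrative, not Spring Batch's actual internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SkipDemo {
    // Mimics the reader's skip behaviour: the first `linesToSkip`
    // lines go to the callback instead of being returned as items.
    static List<String> readItems(List<String> lines, int linesToSkip,
                                  Consumer<String> skippedLinesCallback) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (i < linesToSkip) {
                skippedLinesCallback.accept(lines.get(i)); // e.g. the CSV header
            } else {
                items.add(lines.get(i));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> file = List.of("name,price,sku", "Widget,9.99,W-1");
        String[] header = new String[1];
        List<String> items = readItems(file, 1, line -> header[0] = line);
        System.out.println(header[0]);    // name,price,sku
        System.out.println(items.size()); // 1
    }
}
```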

    Code:
    	<bean id="csvHeaderCallbackHandler" 
    	class="com.javelingroup.dataload.CSVHeaderCallbackHandler">
    	</bean>
    Code:
    public class CSVHeaderCallbackHandler implements LineCallbackHandler
    {
    	ExecutionContext jobExecutionContext;
    
    	@BeforeStep
    	public void setJobExecutionContext(final StepExecution stepExecution)
    	{
    		final JobExecution jobExecution = stepExecution.getJobExecution();
    		jobExecutionContext = jobExecution.getExecutionContext();
    	}
    
    	@Override
    	public void handleLine(final String line)
    	{
    		jobExecutionContext.put("header", line);
    	}
    }
    As you can see, handling the line is easy, but finding somewhere to save it for later reference is a major drag. After digging around in the forums (e.g. here and here) it began to look like I should get a handle on the job's ExecutionContext and bung the line in there. This would have the added bonus (in theory) of saving the header if my job was interrupted. However, getting hold of that context is a bit of a drag. As well as defining my @BeforeStep method, I had to register the listener with my job:

    Code:
    <batch:job id="ImportProductsJob" parent="simpleJob" >
    	<batch:step id="step1" parent="simpleStep">
    		<batch:tasklet>
    			<batch:chunk 
    			reader="productFileItemReader" 
    			writer="itemWriter" commit-interval="10" />
    			<batch:listeners>
    				<batch:listener ref="jobParamsProvider" />
    				<batch:listener ref="csvHeaderCallbackHandler" />
    				<batch:listener ref="csvHeaderBasedLineTokenizer" />
    			</batch:listeners>
    		</batch:tasklet>
    	</batch:step>
    </batch:job>
    In that job, I also had to register a listener for the bean that needs to get the header line back out of the context again in order to interpret individual lines (csvHeaderBasedLineTokenizer). Making that happen took a whole lot more config and code:

    Code:
    	<bean id="csvHeaderBasedLineTokenizer" 
    	class="com.javelingroup.dataload.CSVHeaderBasedLineTokenizer">
    	</bean>
    Code:
    	<bean id="lineMapper"
    		class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
    		<property name="lineTokenizer" ref="csvHeaderBasedLineTokenizer" />
    		<property name="fieldSetMapper">
    			<bean class="com.javelingroup.dataload.product.ProductMapper" />
    		</property>
    	</bean>
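    For context, DefaultLineMapper is conceptually just tokenize-then-map: the tokenizer splits the raw line into named fields, and the FieldSetMapper turns those fields into a domain object. A stripped-down plain-Java analogue of that pipeline (the `Product` record and field names here are made up for illustration; the real API uses FieldSet, not a Map):

```java
import java.util.HashMap;
import java.util.Map;

public class LineMapperDemo {
    // Tokenizer stage: split the line and pair values with header names.
    static Map<String, String> tokenize(String[] names, String line) {
        String[] values = line.split(",");
        Map<String, String> fieldSet = new HashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            fieldSet.put(names[i], values[i]);
        }
        return fieldSet;
    }

    // Mapper stage: hypothetical domain object built from named fields.
    record Product(String name, double price) {}

    static Product mapFieldSet(Map<String, String> fieldSet) {
        return new Product(fieldSet.get("name"),
                           Double.parseDouble(fieldSet.get("price")));
    }

    public static void main(String[] args) {
        String[] header = "name,price".split(",");
        Product p = mapFieldSet(tokenize(header, "Widget,9.99"));
        System.out.println(p); // Product[name=Widget, price=9.99]
    }
}
```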
    Code:
    public class CSVHeaderBasedLineTokenizer extends DelimitedLineTokenizer
    {
    	private static Logger log = Logger.getLogger(CSVHeaderBasedLineTokenizer.class);
    	ExecutionContext jobExecutionContext;
    
    	@Override
    	public FieldSet tokenize(final String line)
    	{
    		if (!hasNames())
    		{
    			// Capture the header values locally; the names field
    			// itself is private to the superclass.
    			final String[] headerNames = super.tokenize(
    				(String) jobExecutionContext.get("header")).getValues();
    			setNames(headerNames);
    
    			log.info("Token names successfully picked up from header [" + //
    					StringUtils.join(headerNames, ", ") + "]");
    		}
    		return super.tokenize(line);
    	}
    
    	@BeforeStep
    	public void setJobExecutionContext(final StepExecution stepExecution)
    	{
    		final JobExecution jobExecution = stepExecution.getJobExecution();
    		jobExecutionContext = jobExecution.getExecutionContext();
    	}
    }
    With all that plugged together and wired in, the whole thing works, but REALLY! What a drag! Is the architecture really this cumbersome? I like having the header string saved in state, but I'd LOVE to find a cleaner way of doing it.

  • #2
    I think there are two main parts to this post:

    1. Obtaining the JobContext is cumbersome.

    I agree, and I think the EL support coming as part of Spring 3.0, which Spring Batch 2.1 will be using, will solve a lot of those problems. The EL that was put in Spring Batch 2.0 was just a hack so that we could get some of the benefits of EL without having to wait for 3.0 to come out.

    2. Using the first line of a CSV file to configure a tokenizer.

    Personally, I would read in the first line of the file separately, when the tokenizer is created, so that I could set the various column names on it, rather than attempting to hook into the reader itself. That said, I agree that the various APIs involved make it more difficult than it should be. To be honest, it just hasn't come up very often. This is the kind of thing that I think should be raised as an issue and put to the community to vote on. There are a lot of issues in the 2.1 pipeline contending for resources, so I think we want to be careful about where we allocate our time, to give the community at large the most helpful improvements possible. Having said that, I would like to create a namespace for file reading in 2.1 and will be making improvements to these various APIs, so I might be able to sneak in some things like this where possible (or at least make it easier to do).
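    The "read the header yourself before the job runs" approach can be sketched in plain Java like this (the file contents and column names are placeholders; in a real configuration you would do this in a factory method or bean initializer and then call setNames on the DelimitedLineTokenizer):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class HeaderFirstDemo {
    // Reads just the first line of the CSV and splits it into column
    // names, before any batch machinery gets involved.
    static String[] readHeader(BufferedReader reader) throws IOException {
        String first = reader.readLine();
        if (first == null) {
            throw new IOException("CSV file is empty, no header to read");
        }
        return first.split(",");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for opening dataload/products.csv
        BufferedReader csv = new BufferedReader(
            new StringReader("sku,name,price\nW-1,Widget,9.99\n"));
        String[] names = readHeader(csv);
        // In real code: tokenizer.setNames(names);
        System.out.println(String.join("|", names)); // sku|name|price
    }
}
```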



    • #3
      It would be quite a lot more succinct if you just implement LineCallbackHandler in your tokenizer. Then you don't even really need to be a listener - you can just use an instance variable and be step-scoped (assuming you get the header line callback on restart, and if not, that's easy for us to add).
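      The shape of that suggestion, roughly: one object plays both roles, so the header lands directly in an instance field with no trip through the ExecutionContext. A plain-Java sketch (HeaderAwareTokenizer is a made-up name; the Spring Batch version would extend DelimitedLineTokenizer, implement LineCallbackHandler, and return a FieldSet rather than a Map):

```java
import java.util.HashMap;
import java.util.Map;

public class HeaderAwareTokenizer {
    private String[] names; // populated by the skipped-line callback

    // Would be handleLine(String) from LineCallbackHandler.
    public void handleLine(String headerLine) {
        names = headerLine.split(",");
    }

    // Would be tokenize(String) overridden from DelimitedLineTokenizer.
    public Map<String, String> tokenize(String line) {
        if (names == null) {
            throw new IllegalStateException("header line not seen yet");
        }
        String[] values = line.split(",");
        Map<String, String> fields = new HashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            fields.put(names[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        HeaderAwareTokenizer t = new HeaderAwareTokenizer();
        t.handleLine("sku,name");  // the reader's skipped-lines callback
        System.out.println(t.tokenize("W-1,Widget").get("name")); // Widget
    }
}
```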
