Reading Multiple Files from HDFS using MultiResourceItemReader and HdfsItemReader

  • Reading Multiple Files from HDFS using MultiResourceItemReader and HdfsItemReader

    Hi,

    I have a step in my Spring Batch job in which I have to read multiple files from a directory in HDFS. I am using MultiResourceItemReader as my reader, with HdfsItemReader as the delegate. I am currently on the Spring Hadoop 1.0.0.RC1 release and Spring Batch 2.1.9.RELEASE.

    Below is my configuration for the MultiResourceItemReader bean:

    Code:
    <bean id="multiResourceReader"
          class="org.springframework.batch.item.file.MultiResourceItemReader"
          scope="step">
        <property name="resources" value="hdfs://${hdfs_dir}/*.txt" />
        <property name="delegate" ref="hdfsItemReader" />
        <property name="strict" value="true" />
    </bean>
    The problem is that even though files exist inside the specified directory, the step fails with the following error:


    Code:
    java.lang.IllegalStateException: No resources to read. Set strict=false if this is not an error condition

    Does anyone know why, even though there are resources to read, the reader is not able to pick them up?

    One more question: I saw that in the 1.0.0.RC2 release HdfsItemReader no longer exists, since the batch package has been removed entirely. What should be used as a replacement for HdfsItemReader if I upgrade to 1.0.0.RC2?

    Thanks!

  • #2
    Make sure you're registering the hdfs URL prefix, otherwise hdfs:// URIs will not be understood; you can do this through the file-system element. Also make sure to specify the host and port, as these are sometimes required.
    HdfsItemReader was removed because it didn't add any value - MultiResourceItemReader can be used just fine on its own.

    Additionally, to allow hdfs resources to be loaded through a ResourceLoader, one can use CustomResourceLoaderRegistrar from the fs package.
    It's a generic class, but combined with the hdfs resource-loader it allows resources to be resolved from the HDFS space instead of just the classpath:
    Code:
    <bean id="customRL" class="org.springframework.data.hadoop.fs.CustomResourceLoaderRegistrar" p:loader-ref="hadoopResourceLoader" />
    
    <!-- batch is read from hdfs:/ space by default -->
    <bean id="mrir" class="org.springframework.batch.item.file.MultiResourceItemReader" p:resources="/batch/*" ... />
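
    For instance, the registration could look along these lines - a sketch only, and the namenode host and port below are assumptions, so substitute your own:

    Code:
    <!-- host and port are placeholders for your namenode -->
    <hdp:configuration>
        fs.default.name=hdfs://localhost:9000
    </hdp:configuration>

    <!-- the file-system element makes the hdfs:// prefix understood -->
    <hdp:file-system id="hadoopFs" />

    <hdp:resource-loader id="hadoopResourceLoader" />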



    • #3
      Thanks for the reply.

      I tried to use CustomResourceLoaderRegistrar as you mentioned, but now I am getting a BeanCreationException for multiResourceReader.

      Code:
      org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'scopedTarget.multiResourceReader' defined in URL: Initialization of bean failed; nested exception is java.lang.NullPointerException
      My Hadoop configuration is as follows:

      Code:
      <hdp:configuration resources="classpath:/core-site.xml">
          fs.default.name=${hdfs_address}
      </hdp:configuration>

      <hdp:resource-loader id="hadoopResourceLoader" />

      <bean id="customResourceLoader"
            class="org.springframework.data.hadoop.fs.CustomResourceLoaderRegistrar"
            p:loader-ref="hadoopResourceLoader" />
      The value of hdfs_address is:

      Code:
      hdfs_address=hdfs://localhost:9000
      My MultiResourceItemReader configuration is as follows:

      Code:
      <bean id="multiResourceReader"
            class="org.springframework.batch.item.file.MultiResourceItemReader"
            scope="step">
          <property name="resources" value="${hdfs_address}/${hdfs_dir}/*.txt" />
          <property name="delegate" ref="hdfsItemReader" />
          <property name="strict" value="true" />
      </bean>
      Am I missing something that is causing the BeanCreationException?



      • #4
        It looks like it has something to do with scope="step" in Spring Batch - probably something going wrong in the proxying process.
        Try it first without the scope and then add it back in - speaking of which, do you actually need it?
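
        As a sketch, this is what the reader could look like without the scope, using a plain FlatFileItemReader as the delegate since HdfsItemReader is gone in RC2 - the delegate bean and its lineMapper below are assumptions, so wire in whatever mapping your files need:

        Code:
        <bean id="multiResourceReader"
              class="org.springframework.batch.item.file.MultiResourceItemReader"
              p:resources="${hdfs_address}/${hdfs_dir}/*.txt"
              p:delegate-ref="flatFileReader" />

        <!-- hypothetical delegate replacing the removed HdfsItemReader -->
        <bean id="flatFileReader"
              class="org.springframework.batch.item.file.FlatFileItemReader"
              p:lineMapper-ref="lineMapper" />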



        • #5
          I added scope="step" because one of the requirements I still have to implement is getting the hdfs_dir property from the job execution context.



          • #6
            I understand, but to narrow down the problem it helps to take it step by step, since the NPE seems to come from the proxying mechanism rather than the HDFS resource itself.
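
            Once the plain version works, the step-scoped variant with late binding would look something like this - a sketch only, assuming hdfs_dir is stored in the job execution context as a fully qualified hdfs:// path and that a flatFileReader delegate bean exists:

            Code:
            <bean id="multiResourceReader"
                  class="org.springframework.batch.item.file.MultiResourceItemReader"
                  scope="step"
                  p:resources="#{jobExecutionContext['hdfs_dir']}/*.txt"
                  p:delegate-ref="flatFileReader" />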
