  • Pooling to read random access files

    Hi all,
    I'm putting the finishing touches on a large application we've written. We're taking legacy data from mainframe systems and pushing it into JMS queues for processing. This approach with JBoss and clustered MDBs works great. However, we've hit a bottleneck. The input is in the form of enormous text files, 5 GB+ on average. We have a POJO jar that is executed and transmits the data to JBoss MQ via the Spring Framework.

    Since the node that runs the POJO project has 4 processors and 8 GB of RAM, I would like to improve the performance of this component of the application. I was thinking that partitioning the file, then using a RandomAccessFile to read each partition in its own thread, would significantly improve our performance.

    Does the Spring Framework offer some sort of pooling interface I could implement? That way I could set my pool size in my configuration, use that value in my partitioning, then pass this information off to objects in the pool. I would prefer to avoid writing my own threading code if this has already been implemented; it would be much easier and less bug-prone than implementing my own pools. Any advice that could be provided would be greatly appreciated.
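
    Roughly what I have in mind, as an untested sketch (the class and parameter names here are just placeholders, and the fixed thread pool at the bottom is exactly the part I'd rather get from Spring than write myself):

    Code:
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PartitionReader implements Runnable {

        private final String fileName;
        private final long start;
        private final long length;

        public PartitionReader(String fileName, long start, long length) {
            this.fileName = fileName;
            this.start = start;
            this.length = length;
        }

        public void run() {
            try {
                RandomAccessFile raf = new RandomAccessFile(fileName, "r");
                try {
                    raf.seek(start);
                    byte[] chunk = new byte[64 * 1024]; // small, reused buffer
                    long remaining = length;
                    int read;
                    while (remaining > 0
                            && (read = raf.read(chunk, 0, (int) Math.min(chunk.length, remaining))) != -1) {
                        remaining -= read;
                        // parse the bytes and hand them to the JMS sender here
                    }
                } finally {
                    raf.close();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            int partitions = 4; // would come from configuration
            long total = new File(args[0]).length();
            long size = total / partitions;
            // this pool is the part I'd prefer to get from Spring instead of rolling my own
            ExecutorService pool = Executors.newFixedThreadPool(partitions);
            for (int i = 0; i < partitions; i++) {
                long offset = i * size;
                long len = (i == partitions - 1) ? total - offset : size;
                pool.execute(new PartitionReader(args[0], offset, len));
            }
            pool.shutdown();
        }
    }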

    Thanks,
    Todd

  • #2
    Sorry, it is not absolutely clear what exactly you want to pool. Threads?
    Then read the Spring Reference, chapter 23.4, "The Spring TaskExecutor abstraction".
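
    Something like the following, as a rough illustration (untested; the pool sizes and class names are placeholders, and when you define the executor as a bean, the container calls afterPropertiesSet() and destroy() for you):

    Code:
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

    public class PartitionLauncher {

        public static void main(String[] args) {
            ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
            executor.setCorePoolSize(4); // would come from your configuration
            executor.setMaxPoolSize(4);
            executor.afterPropertiesSet(); // the container does this for a bean

            TaskExecutor taskExecutor = executor;
            for (int i = 0; i < 4; i++) {
                taskExecutor.execute(new Runnable() {
                    public void run() {
                        // read and process one partition of the file here
                    }
                });
            }
            executor.destroy(); // the container does this too, on shutdown
        }
    }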

    Regards,
    Oleksandr




    • #3
      If the bottleneck is the I/O needed to read the file, then I doubt that reading it with multiple threads will help. I would even expect parallel access to the file to reduce I/O performance (if the filesystem allows it at all).

      Partitioning, however, seems to be a good idea. In any case you should keep only small fragments of the file in memory at any point in time; otherwise it is the memory management and the garbage collector that will throttle processing.

      Regards,
      Andreas



      • #4
        You might want to take a look at the Spring Batch framework; it is designed for exactly this kind of thing. You might even get the components you want from there.

        If there is a natural way to partition the file and ordering doesn't matter, then I would definitely partition it. But reading that much data does not usually take that long: on my laptop, I can read 200 MB in about 12 seconds using Java.
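
        (At that rate, roughly 17 MB/s, a 5 GB file works out to around five minutes of pure reading, before any parsing or JMS work.)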



        • #5
          Originally posted by ccanning View Post
          You might want to take a look at the Spring Batch framework; it is designed for exactly this kind of thing. You might even get the components you want from there.
          Interesting. I'll have a look at that.

          Originally posted by ccanning View Post
          But reading that much data does not usually take that long: on my laptop, I can read 200 MB in about 12 seconds using Java.
          Reading time always depends on the filesystem and how you are connected to it. And I think continually reading large amounts of data could itself be a problem; here especially it may be important to keep the memory footprint low (e.g. by partitioning).



          • #6
            If you use the NIO framework and its buffers correctly, your memory use while reading the file is bounded, unless you create and keep objects in memory; with a good batching framework or design, that really shouldn't be the case. (I am not making any references to what you may already have; just a general statement.)
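
            For example, something along these lines (illustrative only, not tied to anything you have) keeps memory bounded no matter how big the file is, as long as you don't hold on to what you parse:

            Code:
            import java.io.FileInputStream;
            import java.nio.ByteBuffer;
            import java.nio.channels.FileChannel;

            public class BoundedRead {

                public static void main(String[] args) throws Exception {
                    FileChannel channel = new FileInputStream(args[0]).getChannel();
                    ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024); // reused for the whole file
                    try {
                        while (channel.read(buffer) != -1) {
                            buffer.flip();
                            // parse/forward the bytes in 'buffer' here, without keeping them
                            buffer.clear();
                        }
                    } finally {
                        channel.close();
                    }
                }
            }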



            • #7
              Originally posted by ccanning View Post
              If you use the NIO framework and its buffers correctly, your memory use while reading the file is bounded, unless you create and keep objects in memory; with a good batching framework or design, that really shouldn't be the case. (I am not making any references to what you may already have; just a general statement.)
              I agree. It's just that I have already seen cases where memory consumption while processing large and/or numerous files caused problems, so I wanted to point out that increasing I/O performance might not be the only thing that needs to be considered.

              Regards,
              Andreas



              • #8
                Thanks for the replies guys. This kind of got sidelined since we have a stable process, but now I'm back to working on this.


                Basically, the file I'm reading in has a custom markup we've written. Essentially we just have an opening {{statement}} tag and a closing {{/statement}} tag; I'm sure you've noticed the similarity to XML. We parse it much like a SAX parser would. For the read operation I mentioned, we simply read from an opening statement tag to its closing tag into a buffer, then push that as a text message onto the queue. Then the buffer is cleared and refilled.
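
                Simplified, the read loop looks roughly like this (an untested sketch; the real class and field names differ, and the JmsTemplate is injected via Spring):

                Code:
                import java.io.BufferedReader;
                import java.io.FileReader;

                import org.springframework.jms.core.JmsTemplate;

                public class StatementReader {

                    private final JmsTemplate jmsTemplate; // configured with the default destination

                    public StatementReader(JmsTemplate jmsTemplate) {
                        this.jmsTemplate = jmsTemplate;
                    }

                    public void process(String fileName) throws Exception {
                        BufferedReader reader = new BufferedReader(new FileReader(fileName));
                        StringBuffer buffer = new StringBuffer();
                        boolean inStatement = false;
                        String line;
                        try {
                            while ((line = reader.readLine()) != null) {
                                if (line.contains("{{statement}}")) {
                                    inStatement = true;
                                }
                                if (inStatement) {
                                    buffer.append(line).append('\n');
                                }
                                if (line.contains("{{/statement}}")) {
                                    // one complete statement becomes one text message on the queue
                                    jmsTemplate.convertAndSend(buffer.toString());
                                    buffer.setLength(0);
                                    inStatement = false;
                                }
                            }
                        } finally {
                            reader.close();
                        }
                    }
                }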

                The biggest bottleneck is not the file system itself, but the network connectivity to the primary JBoss MQ node and the queuing. This goes back to my initial question about pooling. I need to do some tests, but I'm not sure whether it would be more efficient to pool my data handlers so they queue the data asynchronously, or to partition the input file. If I partition the file, I have to do the partitioning before I start reading it; if I use multiple async data handlers (data queuers), it's just a configuration change.

                I tried to read the user guide on the Spring Batch website, but the link is broken (as of 11:18 pm NZST, GMT+12, 4-8-2008). How stable is Spring Batch? My client can get sued for mis-reporting information, so this process needs to be solid.



                • #9
                    Are you sure that file partitioning would improve your connectivity issues?
                    Are you sure you are working with MQ in the optimal way?
                    E.g., is it possible that you constantly open and close the connection to MQ (a typical cause of bottlenecks)? If so, try pooling those connections.



                  • #10
                    Hi Alo, and thanks for your input. I have implemented a thread pool with futures similar to this post:


                    http://forum.springframework.org/showthread.php?t=25689

                    I needed the futures in order to wait until all the executor tasks were complete before my application exits.

                    In my testing, I get nearly exponential increases in speed (around a 90% improvement per new partition) just from partitioning the reading of the file. However, I'm still only using a single instance of an interface I have called "DataHandler", which is now the bottleneck. All the data handler does is take the StringBuffer of data from a partitioned reader, wrap it in an object, then push the object message onto JBoss MQ. Is it possible to create a pool of JMS connections? I'm currently using the SingleConnectionFactory, and I couldn't find any examples of JMS connection pooling in the 2.0.x documentation. Below is my Spring configuration for my JMS connection; any suggestions?

                    Code:
                    	<!-- input connection factory.  This defines the jndi properties to connect -->
                    	<bean id="inputJmsTemplate"
                    		class="org.springframework.jms.core.JmsTemplate">
                    		<property name="connectionFactory">
                    			<ref bean="inputQueueConnectionFactory" />
                    		</property>
                    		<property name="defaultDestination">
                    			<ref bean="sendDestination" />
                    		</property>
                    		<!--		Both of the following are required for XA transactions-->
                    		<property name="sessionTransacted">
                    			<value>true</value>
                    		</property>
                    		<property name="sessionAcknowledgeModeName">
                    			<value>CLIENT_ACKNOWLEDGE</value>
                    		</property>
                    	</bean>
                    
                    	<!--	Destination for our messages, just a queue name on the jndi tree.  Connects to the remote jndi-->
                    	<bean id="sendDestination"
                    		class="org.springframework.jndi.JndiObjectFactoryBean">
                    		<property name="jndiEnvironment">
                    			<ref bean="jndiProperties" />
                    		</property>
                    		<property name="jndiName">
                    			<value>queue/StatementInput</value>
                    		</property>
                    	</bean>
                    
                    	<!-- This is a wrapper over our remote JNDI object.  Rather than connect a new socket with every message, we open and maintain a single persistent connection. -->
                    	<!-- The connection is closed when the program exits.  It reconnects in the event of a network failure; however, if this occurs during a message push, the transaction -->
                    	<!-- will fail. -->
                    	<bean id="inputQueueConnectionFactory"
                    		class="org.springframework.jms.connection.SingleConnectionFactory"
                    		lazy-init="true">
                    		<property name="targetConnectionFactory" ref="remoteJMSQueue">
                    		</property>
                    		<property name="reconnectOnException">
                    			<value>true</value>
                    		</property>
                    
                    	</bean>
                    
                    	<!-- The remote  JMS Queue Connection Factory -->
                    	<bean id="remoteJMSQueue"
                    		class="org.springframework.jndi.JndiObjectFactoryBean"
                    		lazy-init="true">
                    		<property name="jndiEnvironment">
                    			<ref bean="jndiProperties" />
                    		</property>
                    		<!--		We need to make sure that we connect to the XA capable JNDI, not the single transaction JNDI-->
                    		<property name="jndiName">
                    			<value>XAConnectionFactory</value>
                    		</property>
                    		<property name="cache">
                    			<value>false</value>
                    		</property>
                    		<property name="proxyInterface">
                    			<value>javax.jms.QueueConnectionFactory</value>
                    		</property>
                    	</bean>



                    • #11
                      Originally posted by tnine View Post
                      I needed the futures in order to wait until all the executor tasks were complete before my application exits.
                      Actually, you can get by without futures: the threads in a ThreadPoolExecutor (if the default ThreadFactory is used) are non-daemon threads, so the JVM will not stop until the last of them finishes, even if your main thread has finished.
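
                      For example (sketch only), shutdown() plus awaitTermination() also blocks until all submitted work has finished, without collecting any futures:

                      Code:
                      import java.util.concurrent.ExecutorService;
                      import java.util.concurrent.Executors;
                      import java.util.concurrent.TimeUnit;

                      public class WaitForCompletion {

                          public static void main(String[] args) throws InterruptedException {
                              ExecutorService executor = Executors.newFixedThreadPool(4);
                              // submit the partition readers here ...
                              executor.shutdown(); // no new tasks accepted
                              executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS); // blocks until all tasks are done
                          }
                      }
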
                      Is it possible to create a pool of JMS connections? I'm currently using the SingleConnectionFactory, and I couldn't find any examples of JMS connection pooling in the 2.0.x documentation. Below is my Spring configuration for my JMS connection; any suggestions?
                      You could try one of the generic pools, e.g. Jakarta Commons Pool, to wrap the connections: http://commons.apache.org/pool/
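
                      A rough sketch of what that could look like with commons-pool 1.x (untested, and the wrapper class here is made up; whether you pool connections from the JNDI-obtained factory or something else is up to you):

                      Code:
                      import javax.jms.Connection;
                      import javax.jms.ConnectionFactory;

                      import org.apache.commons.pool.BasePoolableObjectFactory;
                      import org.apache.commons.pool.impl.GenericObjectPool;

                      public class JmsConnectionPool {

                          private final GenericObjectPool pool;

                          public JmsConnectionPool(final ConnectionFactory connectionFactory, int size) {
                              pool = new GenericObjectPool(new BasePoolableObjectFactory() {
                                  public Object makeObject() throws Exception {
                                      return connectionFactory.createConnection();
                                  }
                                  public void destroyObject(Object obj) throws Exception {
                                      ((Connection) obj).close();
                                  }
                              });
                              pool.setMaxActive(size); // upper bound on pooled connections
                          }

                          public Connection borrow() throws Exception {
                              return (Connection) pool.borrowObject();
                          }

                          public void release(Connection connection) throws Exception {
                              pool.returnObject(connection);
                          }
                      }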


                      Yes, if you maintain the connection, the bottleneck should not be in this area.

                      Regards,
                      Oleksandr

