Maintaining partition order

  • Maintaining partition order

    I'm processing a series of files, which are named with datestamps. I need the filename because I've got to do some record keeping, so I followed the partitioning approach found in this thread.

    The issue is that I need the files processed in order, but this isn't happening. Although the files are partitioned in order (that is, the partitioned steps are named in the correct order), they're processed out of order (that is, the step IDs are not in the correct order).

    I believe this is because, after the files are partitioned and assigned to step executions, the step executions are inserted into a HashMap. A HashMap doesn't guarantee any iteration order, so the steps end up being processed in arbitrary order.

    The fix I employed was to create a version of MultiResourcePartitioner that uses LinkedHashMap instead of HashMap in partition(), and a version of SimpleStepExecutionSplitter that uses LinkedHashSet instead of HashSet in split(). This maintains the order.

    Does this sound reasonable?
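A minimal sketch of the idea behind the fix described above, using only the JDK and not the actual Spring Batch classes. The class name `OrderedPartitionSketch`, the `partition` helper, the `partitionN` keys, and the file names are all hypothetical, chosen for illustration. It shows that a LinkedHashMap hands entries back in the order they were inserted, which a plain HashMap does not promise.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedPartitionSketch {

    // Hypothetical helper: creates one partition entry per file name,
    // in the order the names are given.
    static Map<String, String> partition(List<String> fileNames) {
        // LinkedHashMap iterates in insertion order; HashMap makes no such guarantee.
        Map<String, String> partitions = new LinkedHashMap<>();
        int i = 0;
        for (String fileName : fileNames) {
            partitions.put("partition" + i, fileName);
            i++;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Assumed example file names carrying a yyyyMMdd datestamp.
        List<String> files = List.of(
                "catalog-20090105.csv",
                "catalog-20090106.csv",
                "catalog-20090107.csv");

        // Entries come back in the same order they were inserted.
        for (Map.Entry<String, String> entry : partition(files).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The same reasoning applies to swapping HashSet for LinkedHashSet: the linked variants keep insertion order at the cost of a little extra memory per entry.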

  • #2
    Originally posted by Aidan
    Does this sound reasonable?
    I don't really understand how process ordering can be important for business reasons, and a concurrent system cannot and should not try to guarantee the order. Maybe I misunderstood the requirement. But if a LinkedHashMap is useful to you, that's fine.


    • #3
      Sorry, I was a bit unclear about the requirements in my original post.

      I'm processing catalog data, which is received daily. So I have a file from Monday, Tuesday, and Wednesday; these aren't really different files, though, they're essentially different versions of the same file. I need to process Monday, then Tuesday, then Wednesday (and in this case I can't simply discard Monday and Tuesday, because we can get partial updates).

      Ideally I process Monday's data on Monday, then Tuesday's on Tuesday. But there are cases where we need to process multiple days' worth in one run (for example, if the client was late getting the data to us).

      Partitioning the files works, now that I've got the partition order sorted out. Also, as you mentioned, I can't process them concurrently, so I use a SyncTaskExecutor.
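For illustration, the overall approach (oldest file first, one at a time) might look roughly like the JDK-only sketch below. It is not Spring Batch code: the class `SequentialDatedFilesSketch`, its methods, and the `catalog-yyyyMMdd.csv` naming scheme are assumptions made for the example, and the same-thread `Executor` merely stands in for a synchronous task executor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Executor;

public class SequentialDatedFilesSketch {

    // Assumed naming scheme: "<prefix>-yyyyMMdd.<ext>".
    // Returns the yyyyMMdd part, which sorts chronologically as a plain string.
    static String datestamp(String fileName) {
        int dash = fileName.lastIndexOf('-');
        int dot = fileName.lastIndexOf('.');
        return fileName.substring(dash + 1, dot);
    }

    // Sorts the files by datestamp, then "processes" each one on the calling thread.
    // Returns the names in the order they were processed.
    static List<String> processInOrder(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        sorted.sort(Comparator.comparing(SequentialDatedFilesSketch::datestamp));

        // Runs every task immediately in the caller's thread, so task N
        // finishes before task N+1 starts.
        Executor sameThread = Runnable::run;

        List<String> processed = new ArrayList<>();
        for (String fileName : sorted) {
            // Placeholder for the real per-file work.
            sameThread.execute(() -> processed.add(fileName));
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> late = List.of(
                "catalog-20090107.csv",
                "catalog-20090105.csv",
                "catalog-20090106.csv");
        processInOrder(late).forEach(System.out::println);
    }
}
```

Sorting before partitioning only helps if every later stage also preserves order, which is why the sequential executor matters here: with a thread pool, the tasks could still complete out of order.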