Help for partitioning data strategy

  • Help for partitioning data strategy


    I'm new to Spring Batch.

    I need to extract some data from an Oracle database and write it to a CSV file.
    Getting the data requires many joins, and the tables contain about 5 million rows (but only a small fraction will be exported each time).
    Because doing it in a single query is nearly impossible, and I guess it would be costly, I was thinking of using the "Driving Query Based ItemReaders" pattern from the Spring Batch documentation:

    - Reader: gets the list of IDs representing the objects I need to export
    - Processor: calls DAOs to load the data I need from the other tables
    - Writer: writes the processor's output

    So far, so good.
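
    For readers unfamiliar with the pattern, the reader/processor/writer split described above can be sketched in plain Java, without any Spring Batch classes. The class name `DrivingQuerySketch`, the `detailLookup` function standing in for the DAO calls, and the `;` separator are all assumptions made for illustration, not part of the original post.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class DrivingQuerySketch {

    /**
     * Walks the keys returned by the driving query, loads the fields for
     * each key, and formats them as one CSV line per key.
     */
    public static List<String> export(Iterator<Long> idReader,
                                      Function<Long, String[]> detailLookup) {
        List<String> csvLines = new ArrayList<>();
        while (idReader.hasNext()) {
            Long id = idReader.next();                 // reader: next key only
            String[] fields = detailLookup.apply(id);  // processor: per-key lookup (hypothetical DAO)
            csvLines.add(String.join(";", fields));    // writer: one CSV line
        }
        return csvLines;
    }
}
```

    In a real job the three roles would be separate reader, processor, and writer components wired into a chunk-oriented step; the loop above only shows how the work is divided.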

    But I want to partition the job, and I am having trouble doing so.

    My main problem is: how should I split the results of my reader?
    • ColumnRangePartitioner would not be efficient, because the IDs are not uniformly distributed.
    • Use pagination, with the page number as the discriminant (I don't know if that is possible)?
    • Maybe add a prior step that reads my list of IDs and writes it to flat files, then read one file per partition?
    • Any other suggestions?


    PS: Sorry for my poor English.

  • #2
    I may have found a logical field to partition on.
    But I'm curious to know whether a solution using pagination for partitioning is possible.

    Anyway, is there a way to write to the same file from multiple threads?


    • #3
      Two answers:
      1. I'd implement a custom Partitioner that executes the query required to get the list of ids per partition.
      2. No. Writing to a flat file is a single-threaded exercise for a couple of reasons:
         a. It typically isn't any faster to access disk in a multithreaded manner (I/O gets swamped).
         b. We wouldn't be able to manage restartability.
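
      One way the custom Partitioner idea in answer 1 could look, as a minimal sketch in plain Java: take the sorted ids returned by the driving query and cut them into slices of near-equal row count, so that skewed id values do not produce uneven partitions. The class `IdRangePartitionSketch`, the `partitionN` keys, and the `long[] {minId, maxId}` value are assumptions for illustration. A real Spring Batch `Partitioner` implements `partition(int gridSize)` and returns a `Map<String, ExecutionContext>`, so the bounds would be stored in each `ExecutionContext` instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdRangePartitionSketch {

    /**
     * Splits a sorted list of ids into at most gridSize contiguous slices
     * of near-equal size and returns {minId, maxId} for each slice.
     */
    public static Map<String, long[]> partition(List<Long> sortedIds, int gridSize) {
        Map<String, long[]> ranges = new LinkedHashMap<>();
        int total = sortedIds.size();
        if (total == 0 || gridSize <= 0) {
            return ranges; // nothing to split
        }
        int slices = Math.min(gridSize, total); // never create empty slices
        int base = total / slices;
        int remainder = total % slices;         // first 'remainder' slices get one extra id
        int start = 0;
        for (int i = 0; i < slices; i++) {
            int size = base + (i < remainder ? 1 : 0);
            int end = start + size - 1;
            ranges.put("partition" + i,
                       new long[] {sortedIds.get(start), sortedIds.get(end)});
            start = end + 1;
        }
        return ranges;
    }
}
```

      Each partitioned step would then read only the ids between its own minId and maxId. Because the slices are cut by position in the sorted list rather than by id value, every partition gets roughly the same number of rows even when the ids are not uniformly distributed.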