  • Neo4j Insert performance

    Hi, I am wondering how (and if) I can speed up my DB inserts. At this pace I am looking at 30 days to complete!

    I was running this via Spring Batch, but I removed that from this test case.

    The job splits up a million or so XML records and inserts them into the graph. Each record has 15 sub-element types with an average of 30 sub-elements. I then link each record to two further nodes: one for the year, and one representing the source. Some sources (150-odd of them) contain a few records, some tens of thousands.

    I have allocated a large amount of memory (2 GB) to the standalone Neo4j database.

    The code is as follows, with timings shown:

    Code:
    parseXML(...);                     // {~1ms}
    Record neoRecord = new Record(...); // {~1ms}

    Year y = yearRepository.findAllByPropertyValue("year", 2012).singleOrNull(); // {~30ms}
    if (y == null) {
        y = new Year(2012);
        yearRepository.save(y); // {not sure how long this takes - doesn't matter, it rarely happens}
    }
    neoRecord.setYear(y); // {~1ms}

    Source s = sourceRepository.findAllByPropertyValue("sourceId", foo).single(); // {~150ms}
    s.addRecord(neoRecord); // {~700-750ms}

    // Here I create the (up to) 15 sub-elements and add them to the record
    // {~1ms - surprisingly fast, including some more XML bits}
    ElementType[] ets = xml.getETArray();
    Set<SubElement> subElements = new HashSet<SubElement>();
    for (ElementType et : ets) {
        SubElement c = new SubElement();
        c.setFoo(fooString);
        c.setRecord(neoRecord);
        subElements.add(c);
    }
    neoRecord.setSubElements(subElements);
    // (repeated 15 times, once per sub-element type)

    recordRepository.save(neoRecord); // {~3000-4000ms}
    I tried various indexing combinations but couldn't speed this up. It is mainly the save that is the time-consuming part, although finding the source and adding the records could also do with speeding up.

    I would appreciate any tips or suggestions. A few ms here and there will make a big difference.

    Regards,

    Mark


    Code:
    @NodeEntity
    public class Year {

        @Indexed(unique = true)
        String year;
        ...
    }

    @NodeEntity
    public class Record {

        @Indexed
        private Long id;
    }

    @NodeEntity
    public class Source {

        // indexed property backing the findAllByPropertyValue("sourceId", ...) lookup
        @Indexed
        String sourceId;

        @RelatedTo(direction = Direction.OUTGOING, type = RelationshipTypes.CONTAINS)
        Set<Record> records = new HashSet<Record>(16, 1f);
    }

  • #2
    I have done some further testing and tried batch saving: I add each record to the source and cascade the save at the end of the collection of XML records I am processing.

    i.e. just a lot of adds and one final save: sourceRepository.save(source);
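
    In outline it looks like this (XmlRecord and buildRecord() here are just shorthand for the XML parsing and sub-element wiring from my first post):

    Code:
    Source source = sourceRepository.findAllByPropertyValue("sourceId", foo).single();
    for (XmlRecord xml : xmlBatch) {          // xmlBatch: the chunk being processed
        Record neoRecord = buildRecord(xml);  // parse XML, create sub-elements, set year
        source.addRecord(neoRecord);          // in-memory only - no save per record
    }
    sourceRepository.save(source);            // one cascading save for the whole batch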

    Here are the results...

    10 records - 27.5s - 190 nodes = 7 n/s
    20 records - 48s - 290 nodes = 6 n/s
    100 records - 197s - 1008 nodes = 5 n/s
    800 records - 1780s - 8581 nodes = 4.8 n/s

    I am aware of the batch functionality, but unfortunately I have transactional requirements.

    Hope this doesn't sound too negative; I am hugely impressed with Spring Data and Neo4j. Read performance is blistering, and it is allowing me to perform queries that would destroy even the most carefully crafted relational database.


  • #3
    Spring Data Neo4j is not designed for massive data inserts; there are some approaches that use the Neo4j BatchInserter on a local database to insert the data. The BatchInserter is able to insert one million nodes per second.

    I would suggest you test with a local embedded database and measure the difference.

    But using the Batch-Inserter (which is non-transactional, though) would be the fastest way
    (see https://groups.google.com/d/topic/ne...2YA/discussion for some code, but read the whole thread).
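
    In outline, the Batch-Inserter usage looks like this (a minimal sketch against the Neo4j 1.8-era batch API, with imports from org.neo4j.unsafe.batchinsert and org.neo4j.helpers.collection; the store path, properties and relationship type are just illustrative):

    Code:
    // writes straight to the store files - fast, but non-transactional
    BatchInserter inserter = BatchInserters.inserter("target/batch.db");

    // nodes are created from plain property maps and identified by long ids
    long source = inserter.createNode(MapUtil.map("sourceId", "foo"));
    long record = inserter.createNode(MapUtil.map("id", 42L));

    // relationships connect node ids directly
    inserter.createRelationship(source, record,
            DynamicRelationshipType.withName("CONTAINS"), null);

    inserter.shutdown(); // nothing is safe on disk until this completes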

    I think it makes sense to offer this functionality in SDN itself; I created an issue to track it: https://jira.springsource.org/browse/DATAGRAPH-231

    Cheers

    Michael


  • #4
    The repository.save() method also just iterates over your input, so it is only a convenience method; you're right that this should change for the REST case.
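
    Roughly, the iteration amounts to this (a sketch of the behaviour described above, not the actual SDN source):

    Code:
    // each entity is persisted individually, so against a REST-backed
    // store every element pays a full network round trip
    for (Record record : records) {
        recordRepository.save(record);
    }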


  • #5
    Michael, thanks very much.

    I just switched back to the embedded version and it is much faster.

    800 records - ~3s - 1008 nodes = 336 n/s

    Presumably I can import in embedded mode and then switch to REST after the import? If so, this is more than workable.
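
    For the switch-over I am picturing just swapping the GraphDatabaseService (a rough sketch of what I mean; SpringRestGraphDatabase comes from the spring-data-neo4j-rest module, and the path/URL are placeholders):

    Code:
    // import phase: a local embedded store, written directly on disk
    GraphDatabaseService importDb = new EmbeddedGraphDatabase("data/graph.db");

    // afterwards: copy the store to the server and talk to it over REST
    GraphDatabaseService restDb =
            new SpringRestGraphDatabase("http://localhost:7474/db/data");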

    I am tempted to try the batch inserter too. Maybe I can find a way of synchronising my data sources without transactions. Inserting a million a second sounds like a lot of fun...

    Regards,

    Mark


  • #6
    Sure, as I said, you can copy the whole store over to the server (the contents go into data/graph.db).

    If you want to give the Batch-Inserter a try, please keep in mind that you have to create the additional data structures for SDN yourself (the type indexes as well as your own indexes).
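
    For example, continuing the Batch-Inserter sketch from #3 before its final shutdown (sourceNodeId being the long id returned by createNode there; the "__types__" index, "className" key and "__type__" property follow SDN 2.x's default type representation strategy - do verify against your version, and see the blog post linked below):

    Code:
    // index writes also bypass transactions via the Lucene batch provider
    BatchInserterIndexProvider indexProvider =
            new LuceneBatchInserterIndexProvider(inserter);

    // your own index, backing e.g. findAllByPropertyValue("sourceId", ...)
    BatchInserterIndex sourceIndex =
            indexProvider.nodeIndex("Source", MapUtil.stringMap("type", "exact"));
    sourceIndex.add(sourceNodeId, MapUtil.map("sourceId", "foo"));

    // SDN's type index - without these entries the repositories won't find your nodes
    BatchInserterIndex typeIndex =
            indexProvider.nodeIndex("__types__", MapUtil.stringMap("type", "exact"));
    inserter.setNodeProperty(sourceNodeId, "__type__", "com.example.Source");
    typeIndex.add(sourceNodeId, MapUtil.map("className", "com.example.Source"));

    indexProvider.shutdown(); // shut the index provider down before inserter.shutdown()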

    Cheers

    Michael


  • #7
    Tero even wrote a blog post about it; see here: http://code.paananen.fi/2012/04/05/n...ata-for-neo4j/
