In memory database or Caching solution
  • #16
    I'm sure caching will help your situation, though it does sound like your current approach is very inefficient; it would be good to improve that as well as adding caching. Your cache is limited by memory, so unless you have bucketloads of the stuff you are still going to be hitting the database sometimes.



    • #17
      Thanks for the reply. The current approach comes with the legacy implementation, and we are now working on a completely new architecture. We are thinking of possibly preloading the complete graph into a cache (it's OK if a portion of it spills onto disk). There are some specific requirements that dictate the necessity of that kind of traversal, though. Each element may have to be analyzed before you decide which way to go further, etc. Some elements may contain thousands of children, and the data from all of them is necessary for the analysis to complete. If these elements are stored in the cache, it should be fairly fast.

      Do you know of any good examples of using declarative caching with Spring Modules? I have found some stuff on the net, but none of it is in working condition; obsolete, perhaps. The latest SM 0.8 distribution does not have examples for caching, for some reason.
      Thanks.



      • #18
        Hello,

        do you know all the elements that belong to the given graph in advance? If so, your best bet is to read the whole graph into memory with a single DB query and only then operate on it. 300,000 nodes are not that many. Even if the size of the data associated with each node is in the 1K-2K range, your graph would fit comfortably in memory on any decent machine.

        If you can discover elements only in the process of traversing, then a cache may help as long as you go over the same element many times. But in this very case it may (or may not) be better to go with a rudimentary DIY ("do it yourself") implementation of the cache - just
        Code:
        Map<NodeID, Node> myRudimentaryCache = new HashMap<NodeID, Node>();

        Node getNode(NodeID id) {
            Node result = myRudimentaryCache.get(id);
            if (result == null) {
                result = getNodeFromDB(id);
                myRudimentaryCache.put(id, result); // without this put, nothing is ever cached
            }
            return result;
        }
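The whole-graph, single-query load suggested above can feed a pair of plain maps; a sketch, assuming the edges arrive as (parent ID, child ID) pairs from one SELECT (the schema is hypothetical):

```java
import java.util.*;

public class GraphIndex {
    // parent -> children and child -> parents, built in one pass
    public final Map<Long, List<Long>> children = new HashMap<>();
    public final Map<Long, List<Long>> parents = new HashMap<>();

    // rows stands in for the result of a single
    // "SELECT parent_id, child_id FROM edges" over the whole graph
    public static GraphIndex fromEdges(List<long[]> rows) {
        GraphIndex g = new GraphIndex();
        for (long[] row : rows) {
            g.children.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
            g.parents.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        return g;
    }
}
```

After the one pass, traversal in either direction is a pure in-memory map lookup.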
        BTW, how many graphs do you have to traverse simultaneously?

        Regards,
        Oleksandr

        Originally posted by constv View Post
        We are facing a similar issue: we have an application that must perform traversals of very large graphs. There are use cases when we need to traverse up and down a graph structure that contains, potentially, 200,000-300,000 nodes. Today, the app makes a trip to the DB for each node element and retrieves the IDs of the element's children and parents. The application then goes on to retrieve each child or ancestor element from the database. Needless to say, it is extremely inefficient. It works ok so far only because the data volume has not yet reached its projected size. We have been considering caching the data graph (elements and their associations) using in-memory and disk cache. Would EhCache or JCS be suitable for this? Can any of you guys suggest a good approach?

        Also, are there any good examples of using spring-modules-cache - complete and working? The SM 0.8 distribution does not seem to contain any...

        Thanks for any advice!



        • #19
          Thanks Oleksandr. I am indeed planning to read all the nodes with a single query. My original thought was to read everything into a concurrent hash map, and I have actually implemented that solution. The concern is that the potential number of objects - and their sizes - could create problems. So we thought a safer solution would be to use some existing caching engine that would allow the data to spill onto disk. We could, of course, implement our own, but what's the point of reinventing the wheel?

          The service that traverses the graph will be accessed by multiple clients; however, there will be only one instance of the graph itself. The service needs to be able to traverse the graph and return subsets of the graph (e.g. by skipping the nodes that don't satisfy the client's criteria). All that can be done efficiently by storing the elements and their associations in maps, and I already have a prototype working with smaller volumes of data. I was just really tempted to try declarative caching; that would make things much cleaner.



          • #20
            Hello,

            a disk cache would not be much more effective than DB access (remember that production-class databases have their own caches for "hot" database pages/blocks). A memory cache is certainly quite effective, but it would compete with your application for memory. My assumption is that the memory overhead of advanced caches is higher than that of simple DIY solutions, and in your case, with a single graph instance, the advantages of an advanced cache over the simplest DIY solution are not so obvious.

            But - how big is the data for a single node? Is your concern about memory really well-grounded? Now that 4GB of memory is no big deal, you may easily spend more than 2GB just on your graph, and that means 1,000,000 nodes of 2KB each. That seems enough to satisfy your needs. If it is really so, forget about any kind of caching - DIY or 3rd-party. Do not forget the KISS principle.

            BTW, I did a similar (to some extent) task almost 20 years ago - on 4MB 386/486 computers with MS-DOS and C. The node data was relatively small, but the number of nodes reached ~100,000.

            Regards,
            Oleksandr




            • #21
              I think the most important information has not yet been provided; sorry if I overlooked it:

              Will the graph be modified by the traversing code or by other code?
              Will individual graph nodes be modified in a way that affects the outcome of the decisions of the traversal code?

              Also, declarative caching may not really be suitable for this task, depending on your specific needs, because it caches method invocation results, which might or might not fit here. It can give you very good response times for the cases whose calculation results have already been cached. But in the case of a cache miss you are facing the old problem that getting the information is slow. If the input parameters are well distributed across the entire parameter space, then this will probably not help too much.

              So I would say that in this case it might be better to have a proper in-memory representation of the necessary data, to provide for the worst case; then, if you still have memory to spare, you can start finding out whether declarative caching is needed to speed up the average response time for the most frequently used cases even more.
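Stripped to its essence, declarative caching just memoizes method results against their arguments; a hand-rolled sketch of that idea in plain Java (not the Spring Modules API):

```java
import java.util.*;
import java.util.function.*;

public class Memoizer {
    // Wraps a single-argument function so repeated calls with the same
    // argument return the cached result instead of recomputing it.
    public static <K, V> Function<K, V> memoize(Function<K, V> f) {
        final Map<K, V> cache = new HashMap<>();
        return k -> cache.computeIfAbsent(k, f);
    }
}
```

A cache miss still pays the full cost of the underlying call, which is exactly the caveat about a well-distributed parameter space.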

              Best regards,

              Robert
              Last edited by robvarga; Apr 19th, 2007, 04:07 AM.



              • #22
                Thanks Robert. The associations between the nodes can be changed on the fly, added, removed, etc. Also, nodes will be added and removed. It is a fairly complex set of requirements. And there's no way we can cache the complete data for each object. We don't have to. The idea was to cache "node" objects created especially for traversing only. They would contain only the element's ID and the IDs of its direct associations, plus some basic data that is necessary to decide whether the node must be filtered out of the resulting topology. We won't need the complete element data when traversing the graph.

                The method that I am thinking of intercepting is only intended to return the next node object in the graph, not the result of a complex calculation. So that should be OK. However, what I don't like about this is that, ideally, that "node object" should not be exposed to clients of the service via a public method, since it is not the actual domain object but rather (should be) a service-specific object that exists to assist the internal traversal "under the hood." So, for that reason, you may be right: declarative method interception may not be a good fit in this case.
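A traversal-only node of the kind described above might look like this; a minimal sketch with hypothetical field names:

```java
import java.util.*;

// Traversal-only view of an element: its ID, its direct associations,
// and just enough data to decide whether to filter it out.
public class TraversalNode {
    public final long id;
    public final Set<Long> parentIds = new HashSet<>();
    public final Set<Long> childIds = new HashSet<>();
    public final String type; // hypothetical "basic data" used for filtering

    public TraversalNode(long id, String type) {
        this.id = id;
        this.type = type;
    }

    public boolean matches(String wantedType) {
        return type.equals(wantedType);
    }
}
```

Keeping this separate from the domain object also keeps it small enough that hundreds of thousands of them fit in memory.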

                Thanks,
                Constantine
                Last edited by constv; Apr 19th, 2007, 08:00 AM.



                • #23

                  Hi Constantine,

                  yes, it was a bit unclear. By a completely cached data-set I mean caching the portions of data required for the decision (parent/children/other relevant attributes), but cached for all nodes, so that you do not get slow response times from having to consult the database - particularly if you are navigating the links step by step (multiple invocations to the DB are even slower than a single invocation returning all the data that would otherwise be returned in multiple steps).

                  Of course I don't know the exact criteria for the decisions to be made during traversals, nor the expected environment.

                  However, if you have client threads traversing the data while the data hierarchy is modified, then you will have to map out every contingency for what happens when data is modified while you are traversing the very nodes being changed. You will probably have very nasty race conditions.

                  Also, if this system is to be clustered and threads must traverse the hierarchy on multiple servers, then it will be very tricky.

                  Would you care to share more about the environment and about the decisions to be made during the traversal and about the data sizes?

                  Best regards,

                  Robert



                  • #24
                    There is no clustering, at least right now (but who knows what may happen in the future). There is, of course, the issue of synching the cache with the database. When client threads modify elements and associations, they would have to modify them in the DB first, and then the cache would have to be synched with the database. The number of concurrent clients is always very small - just a few at a time, maximum. However, there are still potential issues, as you point out. I find it a challenging task, really, because of the supposed requirement to be able to traverse everything up and down at any time while allowing on-the-fly modifications of the graph. It works OK with relatively small sets of data, but they are planning to potentially handle hundreds of thousands of elements with multiple associations (including historical, outdated associations). The amount of data can be huge. I wonder if there are any known solutions/approaches to large graph traversal tasks like this one... In addition, we can't get very specific requirements from the business. It seems like they have all these things in their heads and want everything to be fast and flexible... I don't blame them.
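The "DB first, then sync the cache" ordering described here is essentially a write-through update; a minimal sketch, with a hypothetical NodeStore standing in for the real persistence layer:

```java
import java.util.concurrent.*;

public class WriteThroughGraph {
    // Stands in for the real persistence layer (DAO, iBATIS mapper, ...)
    public interface NodeStore {
        void save(long id, String data);
    }

    private final NodeStore db;
    private final ConcurrentMap<Long, String> cache = new ConcurrentHashMap<>();

    public WriteThroughGraph(NodeStore db) {
        this.db = db;
    }

    // Write to the database first; only on success update the cache,
    // so the cache can never be ahead of the database.
    public void update(long id, String data) {
        db.save(id, data);
        cache.put(id, data);
    }

    public String cached(long id) {
        return cache.get(id);
    }
}
```

If the DB save throws, the cache is left untouched and readers keep seeing the last committed value.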



                    • #25
                      Then, if there are no specific requirements for modifications and decisions, you cannot pinpoint smaller parts of your data to which the changes from a specific client thread can be confined, and all you can do for handling concurrency is implement a multiple-reader-single-writer pattern around your code.

                      Once the requirements are specified, you might be able to break down the change and decision logic into smaller parts which, if executed atomically, allow you to compose your larger logic from them without having to synchronize on larger parts of the data-set than these atomic parts need.

                      In this case you can probably write an algorithm in which at most one lock is held by any thread at any point in time; or, if it is possible in your system to implement atomicity without a lock (unlikely), then you can do without locks altogether.
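The multiple-reader-single-writer pattern suggested above maps directly onto java.util.concurrent.locks.ReentrantReadWriteLock; a minimal sketch around a shared adjacency map:

```java
import java.util.*;
import java.util.concurrent.locks.*;

public class GuardedGraph {
    private final Map<Long, Set<Long>> adjacency = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Any number of traversal threads may hold the read lock at once.
    public Set<Long> childrenOf(long id) {
        lock.readLock().lock();
        try {
            Set<Long> c = adjacency.get(id);
            // defensive copy: callers iterate without holding the lock
            return c == null ? new HashSet<Long>() : new HashSet<>(c);
        } finally {
            lock.readLock().unlock();
        }
    }

    // A writer waits until all readers drain, then has exclusive access.
    public void addEdge(long parent, long child) {
        lock.writeLock().lock();
        try {
            adjacency.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Readers get defensive copies, so a writer can never mutate a set that a traversal is still iterating over.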



                      • #26
                        Hello - could you perhaps briefly explain (without breaking confidentiality) the business nature of your task?

                        BTW, what do you do if changes from one thread come into conflict with the traversal that other threads perform? I mean not just a race condition, but a situation where changes made by one thread invalidate decisions made by another thread - e.g. one thread isolates the node that is current to another thread (i.e. deletes all its neighbours).

                        Or a less obvious situation: the result of a traversal corresponds to a state of the main graph that never existed, e.g. a subgraph containing nodes that never co-existed in the main graph. Imagine a subgraph that contains (among others) nodes A and B, but node A was deleted from the main graph after the traversal thread selected it for inclusion in the resulting subgraph, and node B was added to the main graph after the deletion of node A. Does this subgraph have business meaning?

                        Regards,
                        Oleksandr



                        • #27
                          That sounds like a race condition to me: problems resulting from interleaving multiple threads that execute read or write operations (at least one write) on the same data non-atomically, when the code is written assuming that the execution of that code path is atomic/single-threaded.



                          • #28

                            There is an independent process that runs on a schedule and discovers elements which - if found to be new - are added to the database. Elements have topology-like connections between them. Each element may have a large amount of data associated with it; most of that data is irrelevant for traversals, but some may be used to decide on the traversal paths. It is OK to have a slightly stale snapshot of the data on the client; the client can refresh its data when it chooses. It is also OK for one thread to change the graph while another thread is traversing it. At least for now, that is acceptable.



                            • #29

                              Yes, in a broad sense it is a race condition, if you consider the whole graph as a single element.



                              • #30
                                About an in-memory database-like dataset

                                Hello, we are developing an application using Spring and iBATIS to connect to MySQL.
                                The application has a function that determines the city from an IP address, in order to trigger different events.

                                In ASP.NET this is easy: you cache a DataSet in memory and query the "city" rows. But in Java we have run into performance problems. Is there any object like the .NET DataSet that stores data in memory and can be queried - and that JCS could cache? Or another way to accomplish this function?

                                Any suggestions would be very helpful.

                                Thanks
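Lacking a DataSet equivalent, a plain in-memory map keyed by city gives constant-time lookups; a minimal sketch (the "IP range" column is made up for illustration):

```java
import java.util.*;

public class CityIndex {
    // city -> rows for that city (here just IP ranges, as an example)
    private final Map<String, List<String>> byCity = new HashMap<>();

    // Load once (e.g. from a single iBATIS query), then query from memory.
    public void add(String city, String ipRange) {
        byCity.computeIfAbsent(city, k -> new ArrayList<>()).add(ipRange);
    }

    public List<String> rowsFor(String city) {
        List<String> rows = byCity.get(city);
        return rows == null ? Collections.<String>emptyList() : rows;
    }
}
```

The map itself could then be handed to JCS (or any cache) as a single cached object if eviction or disk overflow is needed.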

