Re: paged query slow when fetching big lists

From: Andrus Adamchik (andru..bjectstyle.org)
Date: Tue Jun 26 2007 - 14:32:27 EDT

  • Next message: Marcin Skladaniec: "Re: paged query slow when fetching big lists"

    Hi Marcin,

    I have good news (I think). Fetching just id columns inside the
    IncrementalFaultList indeed speeds things up significantly. I just
    committed the change to optimize SelectQueries to do just that.
    Please let me know how it works for you.
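
    If anyone wants to picture the mechanism, here is a toy sketch in
    plain Java (the names are made up for illustration; this is not the
    actual IncrementalFaultList code): the initial query materializes
    only ids, and a whole page is resolved with a second, narrow query
    the first time any element of that page is accessed.

```java
import java.util.*;

// Toy illustration of an id-only paged list (names are made up;
// this is not Cayenne's IncrementalFaultList API).
public class PagedList {
    private final List<Object> elements;          // ids or resolved rows
    private final Map<Integer, String> database;  // stands in for the STUDENT table
    private final int pageSize;

    public PagedList(Map<Integer, String> database, int pageSize) {
        this.database = database;
        this.pageSize = pageSize;
        // The initial query fetches only the id column -- cheap even for
        // 100,000 rows, because no wide rows cross the wire.
        this.elements = new ArrayList<Object>(database.keySet());
    }

    // Faulting resolves the whole page containing 'index' in one go.
    public String get(int index) {
        Object e = elements.get(index);
        if (e instanceof Integer) {               // unresolved: fault the page
            int start = (index / pageSize) * pageSize;
            int end = Math.min(start + pageSize, elements.size());
            for (int i = start; i < end; i++) {
                // stands in for "SELECT * FROM STUDENT WHERE id IN (...)"
                elements.set(i, database.get(elements.get(i)));
            }
        }
        return (String) elements.get(index);
    }

    public static void main(String[] args) {
        Map<Integer, String> db = new LinkedHashMap<Integer, String>();
        for (int id = 0; id < 100; id++) {
            db.put(id, "student-" + id);
        }
        PagedList list = new PagedList(db, 10);
        System.out.println(list.get(42)); // prints student-42
    }
}
```

    In the real implementation the second query is a qualifier on the
    page's ids; the map lookup above just stands in for it.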

    Now the profiling details...

    * I profiled on Derby and MySQL. In both cases fetching a table with
    25 columns and 100000 rows took 3-4 seconds (not as long as in
    your case, but I guess the relative times are what matter)
    * YourKit clearly showed the bottleneck: ~95% of the 'fillIn' method
    is spent in the driver code, rewinding the result set (i.e. bringing
    the data from the db to the client).
    * After my change the query time went down to 0.2-0.5 seconds (0.8 if
    you count the second query needed to fault the first page). Not bad!
    * ResultSet reading still remained a bottleneck, but it became faster
    in absolute terms. And now finally Cayenne-related code (such as
    DataRow creation) started to show up on the radar (e.g. DataRow
    constructor taking 3% of the 'fillIn' method time).

    Andrus

    On Jun 26, 2007, at 10:55 AM, Marcin Skladaniec wrote:

    > Hi Andrus !
    > Many thanks for that !
    > Marcin
    >
    > On 26/06/2007, at 5:39 PM, Andrus Adamchik wrote:
    >
    >> Hi Marcin,
    >>
    >>> 1) SelectQuery(Student.class) with page size 10 takes 30-50 seconds.
    >>> 2) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") without paging takes 100 sec.
    >>> 3) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") with page size 10 takes 5 sec.
    >>> 4) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") with page size 10 and fetching raw data rows takes 4.5
    >>> sec.
    >>
    >> I see you still didn't use a profiler, but the data you provided
    >> seems to finally confirm that, at least on Derby, not fetching all
    >> columns does result in a significant speedup (#1 vs. #3). So now it
    >> is a question of implementing the right algorithm for the
    >> IncrementalFaultList.
    >>
    >>> Andrus, you mentioned using addCustomDbAttribute to fetch only
    >>> part of the data. I tried to use addCustomDbAttribute("id") on the
    >>> client, but it resulted in raw DataRows being returned. Is there
    >>> something I can do to fetch faulted objects?
    >>
    >> We should encapsulate this logic inside IncrementalFaultList on
    >> the server.
    >>
    >>> Our application was designed to use the SelectQuery. If we have
    >>> to change that and use the SQLTemplate instead, there is a lot of
    >>> work for us, including:
    >>
    >> Same thing - the right thing to do is to fix it on the server.
    >>
    >> Let me try to find a spare minute later tonight and implement the
    >> id-only fetch. I have some large tables in MySQL 5, so I can test
    >> the performance in a slightly different environment.
    >>
    >> Andrus
    >>
    >>
    >>
    >> On Jun 26, 2007, at 9:26 AM, Marcin Skladaniec wrote:
    >>> Hi
    >>>
    >>> I have done some more profiling and testing.
    >>>
    >>> executing queries on a table with >100000 records, directly on the
    >>> server (not on the client), gave the results listed below:
    >>>
    >>> 1) SelectQuery(Student.class) with page size 10 takes 30-50 seconds.
    >>> 2) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") without paging takes 100 sec.
    >>> 3) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") with page size 10 takes 5 sec.
    >>> 4) SQLTemplate(Student.class, "SELECT #result('id' 'int') from
    >>> STUDENT") with page size 10 and fetching raw data rows takes 4.5
    >>> sec.
    >>>
    >>> What's more, I found that executing the SQLTemplate does allow
    >>> faulting the objects (I sometimes discover the simplest things
    >>> last), so I tried to check how long it takes for the objects to
    >>> be faulted:
    >>> 1) first object on every page (except the first) 30-200ms, rest = 0ms
    >>> 2) each object is faulted in 20ms (average)
    >>> 3) on the first page the first object faulted in 200ms, rest ~20ms;
    >>> on any following page the first object faulted in 30-200ms, rest
    >>> 0ms (interesting that the first page does not seem to be faulted
    >>> at all)
    >>> 4) no point testing.
    >>>
    >>> I also checked whether the resizing of the ArrayList that holds
    >>> the results affects the speed, and it does not (I tried making the
    >>> ArrayList's initial size 150,000).
    >>>
    >>> My conclusion is that SelectQuery with paging is usable only for
    >>> fetching fewer than, say, 10,000 records; otherwise the performance
    >>> is too low. With SQLTemplate the performance is much better. This
    >>> applies to both ROP and 'normal' Cayenne, since I ran those tests
    >>> on the server.
    >>>
    >>> Andrus, you mentioned using addCustomDbAttribute to fetch only
    >>> part of the data. I tried to use addCustomDbAttribute("id") on the
    >>> client, but it resulted in raw DataRows being returned. Is there
    >>> something I can do to fetch faulted objects?
    >>>
    >>> Our application was designed to use the SelectQuery. If we have
    >>> to change that and use the SQLTemplate instead, there is a lot of
    >>> work for us, including:
    >>> - dealing with adding and concatenating Expressions to the
    >>> SQLTemplate (is there an easy way?)
    >>> - dealing with the declared qualifier (the one set in the Modeler)
    >>> - possibly more...
    >>> I would really like to avoid all of that, so if you have any
    >>> ideas on how to improve the performance without too much hassle I
    >>> would really appreciate it.
    >>>
    >>> Marcin
    >>>
    >>> On 25/06/2007, at 8:31 PM, Marcin Skladaniec wrote:
    >>>
    >>>> Hi Andrus
    >>>> I have not had much time to check, but with the fix the 100k
    >>>> records load in 30 instead of 50 seconds. That is some improvement,
    >>>> but not enough. I'll do some more profiling tomorrow and let you know.
    >>>>
    >>>> By the way, we are using NetBeans for profiling; the benefit:
    >>>> it is free. I will evaluate YourKit as we are moving away
    >>>> from NetBeans as a development platform.
    >>>>
    >>>> Marcin
    >>>>
    >>>> On 23/06/2007, at 5:38 PM, Andrus Adamchik wrote:
    >>>>
    >>>>> Ari, Marcin --
    >>>>>
    >>>>> going through the code I noticed one inefficiency: the
    >>>>> elements array access is synchronized in the 'fillIn' method.
    >>>>> Since 'fillIn' is called from the constructor, such
    >>>>> synchronization is unneeded and only slows things down. I just
    >>>>> checked in a fixed version to trunk. Could you try it out?
    >>>>>
    >>>>> Andrus
    >>>>>
    >>>>>
    >>>>>
    >>>>> On Jun 23, 2007, at 12:33 AM, Aristedes Maniatis wrote:
    >>>>>> On 22/06/2007, at 11:10 PM, Michael Gentry wrote:
    >>>>>>
    >>>>>>> Marcin, this thread might be of interest to you ...
    >>>>>>>
    >>>>>>> http://mail-archives.apache.org/mod_mbox/cayenne-dev/200705.mbox/browser
    >>>>>>>
    >>>>>>> Look at the "Paging and SQL queries" thread on May 25.
    >>>>>>
    >>>>>>
    >>>>>> Yes, this is the same project we are working on. I started
    >>>>>> some performance profiling and Marcin has now been able to
    >>>>>> take it much further. What is it about:
    >>>>>>
    >>>>>>> elements.add(it.nextObjectId(entity));
    >>>>>>
    >>>>>> which is so slow? The code gets a little complex at that point
    >>>>>> and we are having difficulty tracing it through to the exact
    >>>>>> performance problem in the underlying code. Is it the speed of
    >>>>>> adding the object id to the Collection or the speed of
    >>>>>> creating an object id itself? 0.5ms doesn't sound slow, but it
    >>>>>> doesn't scale well.
    >>>>>>
    >>>>>> Andrus, I got the impression from the previous thread that you
    >>>>>> suspected something slightly different: that the performance
    >>>>>> problem might be in the fat query itself. But from our tests
    >>>>>> this isn't the case. If I've got this right, the way it works is:
    >>>>>>
    >>>>>> 1. perform regular query to get all columns but return result
    >>>>>> in iterator
    >>>>>> 2. iterate through first page and build full objects
    >>>>>> 3. iterate through other pages and build just objectids (this
    >>>>>> is the slow part for us)
    >>>>>> 4. when another page is fetched perform another query and
    >>>>>> fetch those objects from the DB
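
    Steps 1-3 can be sketched in plain Java (simplified types standing
    in for the real query machinery; 'fillIn' here is a made-up
    stand-in, not the actual Cayenne method):

```java
import java.util.*;

// Sketch of the fill-in pass described above: one wide query returns an
// iterator over full rows; the first page keeps full rows, and every
// later row is reduced to just its id (the per-row cost in question).
public class FillInSketch {

    public static List<Object> fillIn(Iterator<Map<String, Object>> rows, int pageSize) {
        List<Object> elements = new ArrayList<Object>();
        while (rows.hasNext()) {
            Map<String, Object> row = rows.next();
            if (elements.size() < pageSize) {
                elements.add(row);              // step 2: full objects for page one
            } else {
                elements.add(row.get("id"));    // step 3: id only -- the slow loop
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // A tiny in-memory stand-in for the wide STUDENT result set.
        List<Map<String, Object>> table = new ArrayList<Map<String, Object>>();
        for (int i = 0; i < 25; i++) {
            Map<String, Object> row = new HashMap<String, Object>();
            row.put("id", i);
            row.put("name", "student-" + i);
            table.add(row);
        }
        List<Object> elements = fillIn(table.iterator(), 10);
        System.out.println(elements.get(0) instanceof Map); // prints true: first page is full rows
        System.out.println(elements.get(20));               // prints 20: later pages are bare ids
    }
}
```

    Step 4 then re-fetches a page's rows by those stored ids when the
    page is first touched.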
    >>>>>>
    >>>>>> So, getting just primary keys from the DB may not be any
    >>>>>> faster if the performance problem is simply in the
    >>>>>> construction of ObjectIds. If the performance problem is in
    >>>>>> the numerous resizings of the Collection (each time it runs
    >>>>>> out of space it is increased by 50% or 100% in size),
    >>>>>> then the solution could be as simple as figuring out the size
    >>>>>> of the iterator and sizing the collection appropriately from
    >>>>>> the beginning.
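
    The presizing idea is cheap to try: in plain Java it is just the
    ArrayList initial-capacity argument, which avoids the repeated
    grow-and-copy as the list crosses 100,000 elements. A sketch of
    the API (not a benchmark):

```java
import java.util.ArrayList;
import java.util.List;

// Presizing an ArrayList avoids the grow-and-copy cycles that happen
// each time the backing array runs out of space.
public class PresizeExample {

    public static List<Integer> fill(int n, boolean presize) {
        // With presize=true the backing array is allocated once up front;
        // otherwise it grows by roughly 50% whenever it fills up.
        List<Integer> list = presize ? new ArrayList<Integer>(n)
                                     : new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(fill(150000, true).size());  // prints 150000 either way;
        System.out.println(fill(150000, false).size()); // only allocation behavior differs
    }
}
```

    As Marcin's test with an initial size of 150,000 suggests, the
    resizing turned out not to be the bottleneck here, so this is a
    sanity check rather than a fix.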
    >>>>>>
    >>>>>> Any ideas on how to discover the exact cause of the
    >>>>>> performance hit?
    >>>>>>
    >>>>>>
    >>>>>> Ari Maniatis
    >>>>>>
    >>>>>>
    >>>
    >>>
    >>>
    >>
    >
    >
    >
    >



    This archive was generated by hypermail 2.0.0 : Tue Jun 26 2007 - 14:32:55 EDT