Excellent. Looks like we are on the same page now.
Re: HashSet (HashMap?) vs. []. A map definitely performs better
when you are looking up a value by its key. This may be the case when
we are assembling column descriptors inside the builder. That
operation is done once (OK, maybe it is done N times, where N is the
number of columns).
However processing the ResultSet is a different story. There's no
lookup by key. For each ResultSet row we need to apply ALL column
descriptors one by one to get the values out. So with a HashSet/
HashMap we'd have this:
    Iterator<ColumnDescriptor> it = map.values().iterator();
    while (it.hasNext()) {
        ColumnDescriptor column = it.next();
        ....
    }
With a ColumnDescriptor[] we have this:
    for (int i = 0; i < length; i++) {
        ColumnDescriptor column = columns[i];
    }
Both loops run M times, where M is the number of rows in the
ResultSet. In the worst case scenario, M is much larger than N. In the
first case, we make extra method calls (iterator() once per row, plus
hasNext() and next() for every column) and create at least one extra
object (the Iterator) per row. So the second case is marginally
faster. Now if you multiply that nanosecond-or-whatever difference by
a few million, it can become more significant.
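
To make the comparison concrete, here is a rough, self-contained sketch of the two traversals. Column is a hypothetical stand-in for ColumnDescriptor, and the "work" done per column (adding up name lengths) is made up purely so that both loops have something to compute. It only shows that the two loops visit the same columns. It is not a benchmark.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class ColumnLoopSketch {

    // Hypothetical stand-in for ColumnDescriptor; only a name is modeled.
    static final class Column {
        final String name;
        Column(String name) { this.name = name; }
    }

    // Map-based traversal: a new Iterator per row, hasNext()/next() per column.
    static long visitWithMap(Map<String, Column> map, int rows) {
        long visited = 0;
        for (int r = 0; r < rows; r++) {
            Iterator<Column> it = map.values().iterator();
            while (it.hasNext()) {
                Column column = it.next();
                visited += column.name.length();
            }
        }
        return visited;
    }

    // Array-based traversal: plain indexed access, no per-row allocation.
    static long visitWithArray(Column[] columns, int rows) {
        long visited = 0;
        int length = columns.length;
        for (int r = 0; r < rows; r++) {
            for (int i = 0; i < length; i++) {
                Column column = columns[i];
                visited += column.name.length();
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, Column> map = new LinkedHashMap<>();
        map.put("ID", new Column("ID"));
        map.put("NAME", new Column("NAME"));
        Column[] columns = map.values().toArray(new Column[0]);
        // Both traversals see the same columns in the same order.
        System.out.println(visitWithMap(map, 1000) == visitWithArray(columns, 1000));
    }
}
```

Any real difference between the two would have to be measured against an actual driver and JVM; the sketch makes no claim about its size.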
So essentially when talking about this refactoring we need to separate
the first step of preparing the columns, and the second step of using
them.
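
One way to picture that separation, as a sketch only: the builder keeps a map internally (so lookups by name and duplicate checks are cheap), and hands out a plain array for the per-row loop. The class and method names here (DescriptorBuilderSketch, addColumn, build) are invented for illustration and are not the actual Cayenne API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DescriptorBuilderSketch {

    // Hypothetical stand-in for ColumnDescriptor; only a name is modeled.
    static final class Column {
        final String name;
        Column(String name) { this.name = name; }
    }

    // Step 1 (preparation): keyed storage, insertion order preserved.
    private final Map<String, Column> byName = new LinkedHashMap<>();

    // Returns false and changes nothing if the column name was already added.
    boolean addColumn(String name) {
        if (byName.containsKey(name)) {
            return false;
        }
        byName.put(name, new Column(name));
        return true;
    }

    // Step 2 (use): a flat array for the tight per-row loop.
    Column[] build() {
        return byName.values().toArray(new Column[0]);
    }

    public static void main(String[] args) {
        DescriptorBuilderSketch builder = new DescriptorBuilderSketch();
        builder.addColumn("ID");
        builder.addColumn("NAME");
        builder.addColumn("ID"); // duplicate, ignored
        Column[] columns = builder.build();
        System.out.println(columns.length); // prints 2
    }
}
```

The map lives only as long as the builder does; the row-processing code never sees it.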
Andrus
On Oct 12, 2009, at 12:38 PM, Evgeny Ryabitskiy wrote:
>> It doesn't matter how this is represented *inside* the builder class,
>> as the builder is used only once per query. On the other hand, coming
>> out of the builder it must be optimized, as access to the column
>> descriptors array is performed N*M times during each result set
>> processing, where N is the width of the result set, and M is its
>> length. I.e. it can be a very large number (up to tens or hundreds of
>> millions of calls). Every small optimization matters here.
>
> So I was talking exactly about optimization. A HashSet instead of an
> array can be faster because we perform several scans over the whole
> array of ColumnDescriptors. It is also safer because we don't get
> duplicate columns. And I didn't get your position on this idea.
>
>> This is something I don't know. We need to check about a dozen
>> drivers from different vendors that we support to verify that. This
>> is just a getter in the interface. Implementors could've made it
>> anything.
>
>
> I have looked through the JTDS driver (not a dozen, but at least one).
> Its ResultSet has all the information about columns (just a private
> final ColInfo[] columns).
> When getMetaData is called, it constructs a new object that holds a
> reference to the array of columns from the ResultSet.
> Looks like there is no problem with JTDS.
>
>
>>> The problem is that if we don't set ResultSetMetaData, as in the
>>> current (trunk) version, then without ResultSetMetaData we don't
>>> know all the columns.
>>
>> Not true. We don't know all the columns for SQLTemplate queries. For
>> all other types of queries we DO know all the columns, as Cayenne
>> generates SQL from scratch for those queries. I think this one place
>> is where we have the biggest mismatch in our views of the
>> implementation.
>
> ah... now I see. You are right that was a mismatch in our views. I
> will work on it in the evening.
>
>> Another thing to check here is actually reading column data from the
>> returned ResultSetMetaData, as lazy resolving of it can be postponed
>> a step further.
>
> Again, in JTDS it's just an array of ColInfo (like our
> ColumnDescriptor); it's passed to RowSet through the constructor from
> the protocol implementation.
>
>
> Evgeny Ryabitskiy.
>