In a
recent article I discussed some potential performance issues that result
from doing a case sensitive join on a column that is not case sensitive and
offered one alternative. This time I'm going to look at whether it would make
sense to change the collation on the column to case sensitive. You definitely
should read the earlier article before continuing.
I've started by setting the column to case insensitive. Running the following
two queries confirms that it is working correctly. The first query returns one
row, the second query returns zero rows.
select * from instantforum_members where fullname='andy warren' select * from instantforum_members where fullname='Andy warren'
Good. I made a new table that contained just an identity column as a primary
key and the fullname column, with the collation for the fullname column set to
case insensitive. I then ran a simple query to see what would happen.
select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname
This results in an error:
This is a big problem. If you change the collation on the column and you have
existing queries that already join to it you may be breaking quite a few things.
Changing the query to include a collation (sensitive or insensitive as you
prefer) allows the query to run normally.
select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_ci_as
select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_cs_as
Or does it? I set one row to have the value 'andy warren', the other to have
'Andy Warren', both changes made to the andytemp table, verified that I had the
same configuration in members (I was experimenting while writing). I ended up
testing four versions of the query:
--#1, sensitive, where on the sensitive column select top 10 a.fullname as afullname, m.fullname as mfullname from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_cs_as where m.fullname='andy warren' --#2, sensitive, where on the insensitive column select top 10 a.fullname as afullname, m.fullname as mfullname from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_cs_as where a.fullname='andy warren' --#3, insensitive, where on the sensitive column select top 10 a.fullname as afullname, m.fullname as mfullname from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_ci_as where m.fullname='andy warren' --#4, insensitive, where on the insensitive column select top 10 a.fullname as afullname, m.fullname as mfullname from instantforum_members m inner join andytemp a on m.fullname = a.fullname collate sql_latin1_general_cp1_ci_as where a.fullname='andy warren'
Care to guess how many rows each query returns?
- #1 1 row
- #2 2 rows
- #3 2 rows
- #4 4 rows
Column level collations give me a headache. We haven't even gotten back to my
original focus which was making the query perform better, what I see here is
that if I did change the collation I could get one of 3 possible result sets
back depending on how I modified the queries.
I've looked at it three times, the headache is getting worse. How can I
finish the article if I don't know why I get three different answers! Ever
notice that growing is painful?
How to figure out what is going on? I go back to look at the query plans as a
starting point. Here is what the cost of each was:
- #1 1.15
- #2 .00994
- #3 .00994
- #4 1.18
Now let's dive into each.
Query #1 only returns one row because there is only one row in members that
matches the where clause and it has only one match in andytest. It's generating
an index scan against andytest.
Query #2 matches two rows (I think) because andy warren = andy warren and
Andy Warren = Andy Warren and the where clause is applied to a case insensitive
column. The results of the query are shown below.
Query #3 is a case insensitive join that I expect to return 4 rows - the
product - but only two are returned, apparently because the where clause
restricts the member table to one row after the join is applied.
Query #4 is an insensitive join and the where applied on the other side, and
it does return the expected four rows. It also results in an index scan against
members, a result of having to convert the case sensitive column for the join.
Am I deciphering that correctly? I'm open to other interpretations.
As far as performance, the combination of the collation and where you apply
the where filter make all the difference. If the optimizer can restrict the
results up front by filtering against a column that doesn't require applying a
different collation it's fast. If it has to apply the collation and then filter,
it's slow.
To wrap up, applying the collation to the column solves one problem but it
potentially creates another (note that this is only when you are mixing
collations) in that where and how you filter makes a huge difference in how many
rows you get back. I'm not sure I'd want to try to explaining this or be
responsible for testing all the results. Maybe I'm just used to coding it and it
works out to be the same. Definitely we know that you can't just go changing the
collation without being prepared to fix everything that joins to it.
I've still got an idea or two to pursue on this, we'll take that up next
time.
Part 3 is available.