January 18, 2011 at 10:01 am
I have a query design question related to using CASE statements vs. INNER JOINs and the effect on performance. The goal is to create a query that returns "things" and some of their associated properties. These properties have a very limited number of potential values (typically a maximum of 3 possible values). The "thing" table will have roughly 500,000 rows in it. And my typical select statements will be required to INNER JOIN on at least a few other tables for business purposes.
My first option is to create lookup tables that contain each of these possible values and then INNER JOIN to them in my SELECT statements. For some of our larger queries, this could result in as many as 5-10 joins to these lookup tables. A foreign key would be stored in the "thing" table represented as a TINYINT, and the PK would be stored in the lookup table along with the text description that would be returned in the SELECT statement.
The second option is to use a CASE statement to decide which text to display to the user. This would eliminate the need for a lookup table and an INNER JOIN. The CASE statement would check the TINYINT stored within the "thing" table and decide which text to return. One drawback of this would be if an additional option were added at some point down the road. However, this is not likely.
The third option is to store the actual text within the column instead of a TINYINT that represents the text. This would mean I could eliminate the CASE statement and just return the value from the column itself. I assume this is the best option in terms of performance.
--OPTION 1 (INNER JOINS)
SELECT
    A.ThingName,
    B.RefName1,
    C.RefName2,
    D.RefName3
FROM Thing A
    INNER JOIN FirstLookupTable B ON A.SomeCol1 = B.SomeCol1
    INNER JOIN SecondLookupTable C ON A.SomeCol2 = C.SomeCol2
    INNER JOIN ThirdLookupTable D ON A.SomeCol3 = D.SomeCol3

--OPTION 2 (CASE STATEMENTS)
SELECT
    A.ThingName,
    CASE A.SomeCol1 WHEN 0 THEN 'Black' ELSE 'White' END AS [RefName1],
    CASE A.SomeCol2 WHEN 0 THEN 'Open' ELSE 'Closed' END AS [RefName2],
    CASE A.SomeCol3 WHEN 0 THEN 'Up' ELSE 'Down' END AS [RefName3]
FROM Thing A

--OPTION 3 (TEXT STORED IN THING TABLE)
SELECT
    A.ThingName,
    A.RefName1,
    A.RefName2,
    A.RefName3
FROM Thing A
January 18, 2011 at 10:14 am
If you don't mind a little reading, here's a quick test I did in another thread:
http://www.sqlservercentral.com/Forums/Topic1046084-392-1.aspx#bm1046202
That said...
Option 3 is the worst performer if you ever change the values for your lookups. Instead of changing a single entry in a single lookup table or case statement, you have to update every record in the main table. Option 3 is usually a denormalization performed for reporting purposes and not in live systems.
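To make that concrete, here's roughly what renaming a single lookup value costs under each design, reusing the table and column names from the example queries (the 'Ebony' value is just illustrative):

```sql
-- Option 1: renaming a value touches exactly one row in one lookup table.
UPDATE FirstLookupTable
SET RefName1 = 'Ebony'
WHERE SomeCol1 = 0;

-- Option 3: the same rename has to rewrite every matching row in the
-- main table -- potentially hundreds of thousands of rows.
UPDATE Thing
SET RefName1 = 'Ebony'
WHERE RefName1 = 'Black';
```

With option 2, the equivalent change means editing the CASE expression in every stored procedure that uses it.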
From the post above, CASE runs a little faster, but maintenance is hell if you need to keep the values in sync in more than one place, and it's not enough faster to be worth it except in the most extreme circumstances. 99% of the time, you want option 1 in an OLTP system.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
January 18, 2011 at 10:28 am
Thanks for the quick response!
Definitely, option 3 is a semi-denormalized option, but would you agree it is the best performer in terms of SELECTs? In fact, I'd think it's probably a better performer than option 1 for inserts/updates/deletes too, since no foreign tables need to be considered during those operations.
In thinking about it, if we are trying to cover ourselves for the off chance that an additional value might be introduced in the future, then the CASE statements are the worst choice given the recoding that would have to occur in all stored procs. No recoding would have to occur in option 3.
Option 1 is definitely the purist's choice in terms of query and database design. However, would you consider it overkill in these situations where only 2 choices are ever really expected?
January 18, 2011 at 11:04 am
JEFFREY SCHELL (1/18/2011)
Thanks for the quick response!
No problem. 🙂
Definitely, option 3 is a semi-denormalized option, but would you agree it is the best performer in terms of SELECTs? In fact, I'd think it's probably a better performer than option 1 for inserts/updates/deletes too, since no foreign tables need to be considered during those operations.
It will be better performing in terms of this specific query under most circumstances. But let's do a little extrapolation. This will get detailed:
Table has some data, let's say an ID (4 bytes), the name of the thing (VARCHAR(50)) averaging around 25 bytes, and then 6 of your lookup columns.
If these are all TINYINTs, they're 1 byte each, so your average row is ~36 bytes once you include NULL indicators.
At 8k/page, you've got around 222 rows/page in memory. Keep in mind this will be important when scanning vs. seeking.
Now, if each of these lookups is a VARCHAR(50) averaging 25 bytes, you've extended your average row size to around 177 bytes (I'd need to revisit the NULL indicator storage rules for exact values). That means you get about 45 rows/page. A large difference.
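The arithmetic above can be sketched directly. These are back-of-envelope figures: a page holds roughly 8,060 bytes of row data, and real rows carry extra header and NULL bitmap overhead that this ignores, which is why the exact rows-per-page counts differ slightly:

```sql
-- Approximate row sizes for the two layouts (overhead ignored):
DECLARE @TinyRow INT = 4 + 25 + 6 * 1;   -- ID + avg name + 6 TINYINT codes  = 35 bytes
DECLARE @WideRow INT = 4 + 25 + 6 * 25;  -- ID + avg name + 6 VARCHAR values = 179 bytes

SELECT 8060 / @TinyRow AS TinyRowsPerPage,  -- ~230 rows per 8 KB page
       8060 / @WideRow AS WideRowsPerPage;  -- ~45 rows per 8 KB page
```

Roughly 5x fewer rows per page means roughly 5x more pages read for the same scan.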
Now, if every query uses all columns and rows, every time, this will be a better performer for READS from the table. UPDATES will be horrendous if you ever change a value in a lookup. The reason the reads will be better is because you're scanning either way and avoiding the necessary joins.
However, if your WHERE clause can use an index, the index will pull back less data to deal with thanks to the tighter rows. This may or may not give you a gain. If you don't need all of these lookup columns every time, though, you will definitely see a speed gain the fewer columns you need to look up.
Create yourself a nice million row table to test against to see the effects of this.
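For the test table, a cross join against a system view is a quick way to generate a million rows (the table and column names here are placeholders matching the earlier examples):

```sql
-- Hypothetical test harness: a million-row Thing table with TINYINT
-- lookup codes, built from a cross join of a system catalog view.
;WITH N AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns a
    CROSS JOIN sys.all_columns b
)
SELECT n                                        AS ThingID,
       'Thing ' + CAST(n AS VARCHAR(10))        AS ThingName,
       CAST(n % 2 AS TINYINT)                   AS SomeCol1,
       CAST(n % 2 AS TINYINT)                   AS SomeCol2,
       CAST(n % 2 AS TINYINT)                   AS SomeCol3
INTO dbo.ThingTest
FROM N;
```

From there you can run all three query shapes against the same data with SET STATISTICS IO ON and compare page reads.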
In thinking about it, if we are trying to cover ourselves for the off chance that an additional value might be introduced in the future, then the CASE statements are the worst choice given the recoding that would have to occur in all stored procs. No recoding would have to occur in option 3.
Or in option 1.
Option 1 is definitely the purist's choice in terms of query and database design. However, would you consider it overkill in these situations where only 2 choices are ever really expected?
That depends on how much code I have out there that relates to these. Even in these cases, if I touch these items constantly, I might generate a single user defined function to return the correct value instead of CASE in each and every proc.
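A sketch of that single user-defined function, so the code-to-text mapping lives in one place instead of being repeated in every proc (the function name and values are illustrative, borrowed from the earlier CASE example):

```sql
-- One central translation for the SomeCol1 code; every proc calls this
-- instead of repeating the CASE expression.
CREATE FUNCTION dbo.GetRefName1 (@SomeCol1 TINYINT)
RETURNS VARCHAR(10)
AS
BEGIN
    RETURN CASE @SomeCol1 WHEN 0 THEN 'Black' ELSE 'White' END;
END;
GO

-- Usage:
-- SELECT A.ThingName, dbo.GetRefName1(A.SomeCol1) AS RefName1 FROM Thing A;
```

One caveat: scalar UDFs carry per-row invocation overhead in SQL Server, so this trades some performance for maintainability.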
The situations you're discussing to try to save milliseconds per pass on properly indexed and organized subtables are only useful in incredibly high-data volume scenarios, or when you have significant and known bottlenecks you have to work around.