If I had a dollar for every CTE solution to a simple query in the forums (pick any SQL forum), I’d be Mark Cuban. With that said I thought it would be a good idea to ask if a CTE was actually the right solution or if it’s just the new trend and everyone uses them just because they can and all the cool kids are too. Let’s put it this way: If I had a dollar for every time a CTE was misused, I’d be Mark Cuban… or at least Robert Herjavec.
Can you tell I like Shark Tank? Haha. I was happy to see that Shark Tank was on TV when I went to Toronto then at a commercial break I was shocked to see the show was named “Dragons’ Den”, but enough about that… Let’s get down to business.
Why use a CTE?
To start, they’re really cool. They provide recursion which can dramatically simplify complex queries that require recursion. For more about recursive queries check out this TechNet article: https://technet.microsoft.com/en-us/library/ms186243.aspx
Should I use a CTE for everything?
Hey! Here’s a SQL question were you don’t get to answer “It depends”. That is unless every query you run needs to be recursive. haha!
Have you ever been in an interview where you were asked about how you would build a copy of IMDB? That’s the Internet Movie Database in case you didn’t already know. I’ve had this question a few times and I liked it so much that I sometimes use it when I’m interviewing a candidate in person. Once I had a candidate that said they wouldn’t know where to start because they don’t watch or like movies; so, I changed it from IMDB to a music collection which is similar in concept. They replied “I don’t like or listen to music.”
Let’s say you build a very simple database to store your movies and meta data about them. You’d probably have something that looked like this:
CREATE TABLE movie (movieid int, title varchar(10))
CREATE TABLE person (personid int, name varchar(10))
CREATE TABLE movieperson (movieid int, personid int)
You could add on to that with a table that described the person’s title, such as director, actor, key grip, and more. There’s a lot you could do to mature the data model. But let’s keep this very simple and just like it is. You have a list of movie titles, people associated with the movie, and a many-to-many join table.
You want to grab all the movies that Bob was in. Simple:
SELECT DISTINCT m.movieid, m.title
FROM movieperson am
INNER JOIN person a ON (a.personid = am.personid) and (a.name = ‘bob’)
INNER JOIN movie m ON (m.movieid = am.movieid)
Now let’s say you want to get a little more complex. Show all the movies Bob, Joe, and Sue were in together. Can you just use an IN clause?
Here’s the DDL to populate the tables with some test data. Give it a try:
INSERT INTO movie (movieid, title) VALUES (1, ‘sw i’), (2, ‘sw ii’), (3, ‘sw iii’), (4, ‘sw iv’), (5, ‘sw v’), (6, ‘sw vi’)
INSERT INTO person (personid, name) VALUES (1, ‘bob’), (2, ‘joe’), (3, ‘sue’), (4, ‘jimmy’)
INSERT INTO movieperson (movieid, personid) VALUES (1,1), (2,2), (3,1), (3,2), (4,1), (4,2), (4,4), (5,1), (5,3), (6,1), (6,3), (6,2)
What was your solution?
What did you come up with? The answer is only “sw vi”. Did you get it? What did you’re query look like? Did you use a CTE?
Let’s look at a CTE and then see how else this may be handled if any other way at all.
;WITH cte as
( SELECT row_number() over (partition by mp.movieid ORDER BY mp.movieid) as row_num,
mp.movieid,
mp.personid
FROM movieperson mp
INNER JOIN person p on mp.personid = p.personid
WHERE p.name in (‘bob’,’joe’,’sue’)
)
SELECT m.title
FROM movie m
INNER JOIN cte bj on m.movieid = bj.movieid
WHERE bj.row_num = 3
This approach seems easy enough. The names can be placed in the IN and then simply increment the row_num by however many rows you place in the IN.
I don’t like an IN and perhaps I’ll do a post on why in the coming weeks. For now let’s look at the cost of this approach.
This query produces a plan that costs 0.0402931 and has performs the following IO
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘movie’. Scan count 1, logical reads 1, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘person’. Scan count 1, logical reads 12, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘movieperson’. Scan count 1, logical reads 1, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Now let’s ask ourselves, Can this query be written better? Did this query require recursion?
Let’s say we want to make this query extremely modular. We want to bolt on as many names from the application with a very simple and straight forward way of coding. Something that a code first developer would be proud of.
We already know how to pull the data for one name. Why not use the INTERSECT function?
SELECT distinct(m.Title) from Movie m
INNER JOIN movieperson mp on m.Movieid = mp.movieID
INNER JOIN person p on mp.PersonID = p.PersonID
WHERE p.Name = ‘Bob’
INTERSECT
SELECT distinct(m.Title) from Movie m
INNER JOIN movieperson mp on m.Movieid = mp.movieID
INNER JOIN person p on mp.PersonID = p.PersonID
WHERE p.Name = ‘Joe’
INTERSECT
SELECT distinct(m.Title) from Movie m
INNER JOIN movieperson mp on m.Movieid = mp.movieID
INNER JOIN person p on mp.PersonID = p.PersonID
WHERE p.Name = ‘Sue’
This approach makes the code extremely simple to bolt on as many names as you want. Is it better?
The cost of this query is 0.0784484 nearly twice as slow. The IO performed is quite a bit more as well.
Table ‘movie’. Scan count 3, logical reads 24, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘movieperson’. Scan count 3, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘person’. Scan count 3, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Does this mean the CTE is the clear winner? For this scenario I’d say yes; but, wait! There’s more!! Don’t count out traditional SQL just yet.
Did you write your original query like this?
SELECT m.title
FROM movieperson am
INNER JOIN person a ON (a.personid = am.personid) and (a.name in (‘bob’, ‘joe’, ‘sue’))
INNER JOIN movie m ON (m.movieid = am.movieid)
GROUP BY am.movieid, m.title
HAVING count(am.movieid) = (SELECT count(*) FROM person where name in (‘bob’, ‘joe’, ‘sue’))
If you did then congratulations! You started with the fastest solution. Let’s look at the numbers:
This approach costs 0.0262887 and performs the following IO
Table ‘person’. Scan count 2, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘movieperson’. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘movie’. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The IO here is a bit higher. Is it actually faster? Let’s use STATISTICS TIME and see.
It’s very close but these are very light weight queries as well. The CTE ranges from 38 to 118 ms execution time on my Surface Pro 3 256 and the “classic” statement ranges from 33 to 110 ms.
I think the key learning points here are there are many ways to accomplish the desired query results but they aren’t all created equal. If you happen to have a developer that writes everything as a CTE because all the cool kids are doing the same you may want to coach them on some of the “old school” query techniques.
I truly hope you found this post both entertaining and helpful. Be sure to follow me on twitter (@SQLSME) and if you found this useful endorse my skills on LinkedIn (https://www.linkedin.com/in/daniel-janik-60615b12)
Also be sure to check out my other blogs on SQL Server Central (http://www.sqlservercentral.com/blogs/confessions-of-a-microsoft-addict/) and www.SQLTechBlog.com