November 12, 2009 at 9:36 am
I am looking for a way to exclude rows that have a null value in a column if the are other rows that have a value in that column for the same name. if there are only nulls in that column i would like it returned with the nulls.
create table #tmp(name varchar(10),
email varchar(50),
code varchar(5),
location varchar(50))
insert into #tmp(name,email,code,location)
values('bob','bob@bob.com','ABCDE','USA')
insert into #tmp(name,email,code)
values('bob','bob@bob.com','ZYXWV')
insert into #tmp(name,email)
values('joe','joe@bob.com')
insert into #tmp(name,email,code,location)
values('joe','joe@bob.com','ABCDE','USA')
insert into #tmp(name,email)
values('dan','dan@bob.com')
select name,email,ISNULL(code,'') AS code,MAX(isnull(location,'')) as location from #tmp
group by name,email,ISNULL(code,'')
desired result
nameemail codelocation
bobbob@bob.comABCDEUSA
bobbob@bob.comZYXWV
dandan@bob.com
joejoe@bob.comABCDEUSA
I am looking for something that would theoretically be accomplished by the following code, although this code does not work.
select name,
email,
case
when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)
else code
end as code,
MAX(isnull(location,'')) as location
from #tmp a
group by name,email,case
when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)
else code
end
November 12, 2009 at 10:10 am
How about something like this:
SELECT * FROM #tmp A
WHERE code IS NOT NULL
OR location IS NOT NULL
OR EXISTS (SELECT COUNT(*) FROM #tmp B WHERE A.name=B.name AND A.email=B.email GROUP BY name,email HAVING COUNT(*)=1)
HTH,
Supriya
November 12, 2009 at 12:01 pm
Here is one other way:SELECT
name,
email,
code,
location
FROM
(
select
name,
email,
ISNULL(code,'') AS code,
MAX(isnull(location,'')) as location,
ROW_NUMBER() OVER
(
PARTITION BY
Name,
ORDER BY
CASE WHEN code IS NULL THEN 1 ELSE 0 END,
code
) AS RowNum
from
#tmp
group by
name,
email,
code
) AS T
WHERE
Code <> ''
OR (RowNum = 1 AND Code = '')
November 12, 2009 at 12:50 pm
November 12, 2009 at 1:09 pm
@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?
@bob-2: glad it worked out for you.
Thanks,
Supriya
November 12, 2009 at 1:44 pm
Supriya.S (11/12/2009)
@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?
I just was messing around with a query to get the expected results based on the sample data. At the time I was looking at the expected output no one had come up with a solution. While, my code produces the expected output, it probably doesn't cover everything. If I ahve some time I'll see if I can tweak it.
November 12, 2009 at 2:17 pm
Here's another solution based on ROW_NUMBER.
It covers the issue of having code is null and location is not null.
Comparing the two row_number solutions will result in almost identical execution plans.
The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)
; WITH cte AS
(
SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,
name,
email,
code,
location
FROM #tmp
)
SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location
FROM cte
WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL
ORDER BY name,code
November 12, 2009 at 2:28 pm
If you change my WHERE cluase to:WHERE
RowNum = 1 OR Code <> '' OR Location <> ''I think that'll work.
lmu92's solution and mine differ slightly because of there the GROUPing happens. So, the method one would choose would depend on the actual results desired...
Cheers!
November 12, 2009 at 3:00 pm
lmu92 (11/12/2009)
Here's another solution based on ROW_NUMBER.It covers the issue of having code is null and location is not null.
Comparing the two row_number solutions will result in almost identical execution plans.
The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)
; WITH cte AS
(
SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,
name,
email,
code,
location
FROM #tmp
)
SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location
FROM cte
WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL
ORDER BY name,code
Lutz, thanks for posting the query. That goes into my "SQL tips and tricks" file too. :-). Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?
Sorry, if I am asking too many questions. 😀
Thanks,
Supriya
November 12, 2009 at 3:51 pm
Supriya.S (11/12/2009)
Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?
Like I said: The "Exists approach" does look a lot faster when just looking at the execution plan. So we both figured the same...
After that I created the following "test scenario":
create and fill sample data
SET STATISTICS TIME ON
SET STATISTICS TIME ON
print 'row_number cte'
my query
print 'Exists approach'
your query
print 'row_number sub query'
Lamprey13' query
SET STATISTICS TIME OFF
SET STATISTICS TIME OFF
The results I got showed that the 'Exists' query took longer than each
Row_number query and had more IO's and CPU time.
Since we're not talking about a large number of rows I recommended testing it against a larger data volume and see which one of the three would perform best.
Will adding a non-clustered index on [name] and help?
Seems like your query would benefit most from this index due to the join (a table scan on one side and an index seek for #tmp B. The row_number solutions probably still will perform a table scan since code and location are not part of the index (would require bookmark lookup which most probably would be too expensive).
Again, testing against a larger data volume would help eliminate my guessing (I'm by far no expert on performance tuning compared to some of the folks around here...).
Edit:
For testing purposes you probably should change my ROW_Number statement from
SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,
to
SELECT row_number() OVER (partition BY name,email ORDER BY code desc, location desc) AS row,
in order to use the same assumptions like both of you did...
November 13, 2009 at 1:58 pm
Thanks for taking the time out to explain Lutz, appreciate it.
Thanks,
Supriya
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply