Conditional Group by?

Question

Post reply

Conditional Group by?

Robert Klimes

SSCoach

Points: 17865
More actions
November 12, 2009 at 9:36 am

#135240

I am looking for a way to exclude rows that have a null value in a column if the are other rows that have a value in that column for the same name. if there are only nulls in that column i would like it returned with the nulls.
create table #tmp(name varchar(10),
email varchar(50),
code varchar(5),
location varchar(50))
insert into #tmp(name,email,code,location)
values('bob','bob@bob.com','ABCDE','USA')
insert into #tmp(name,email,code)
values('bob','bob@bob.com','ZYXWV')
insert into #tmp(name,email)
values('joe','joe@bob.com')
insert into #tmp(name,email,code,location)
values('joe','joe@bob.com','ABCDE','USA')
insert into #tmp(name,email)
values('dan','dan@bob.com')
select name,email,ISNULL(code,'') AS code,MAX(isnull(location,'')) as location from #tmp
group by name,email,ISNULL(code,'')
desired result
nameemail codelocation
bobbob@bob.comABCDEUSA
bobbob@bob.comZYXWV
dandan@bob.com
joejoe@bob.comABCDEUSA
I am looking for something that would theoretically be accomplished by the following code, although this code does not work.
select name,
email,
case
when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)
else code
end as code,
MAX(isnull(location,'')) as location
from #tmp a
group by name,email,case
when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)
else code
end
Bob
-----------------------------------------------------------------------------
How to post to get the best help[/url]

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Supriya.S Ten Centuries Points: 1131 More actions · Answer 1

How about something like this:

SELECT * FROM #tmp A

WHERE code IS NOT NULL

OR location IS NOT NULL

OR EXISTS (SELECT COUNT(*) FROM #tmp B WHERE A.name=B.name AND A.email=B.email GROUP BY name,email HAVING COUNT(*)=1)

HTH,

Supriya

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 2

Here is one other way:SELECT

name,

email,

code,

location

FROM

(

select

name,

email,

ISNULL(code,'') AS code,

MAX(isnull(location,'')) as location,

ROW_NUMBER() OVER

(

PARTITION BY

Name,

Email

ORDER BY

CASE WHEN code IS NULL THEN 1 ELSE 0 END,

code

) AS RowNum

from

#tmp

group by

name,

email,

code

) AS T

WHERE

Code <> ''

OR (RowNum = 1 AND Code = '')

Robert Klimes SSCoach Points: 17865 More actions · Answer 3

Thank for the replies.

Supriya, I was able to use your example to get the results I needed.

Bob
-----------------------------------------------------------------------------
How to post to get the best help[/url]

Supriya.S Ten Centuries Points: 1131 More actions · Answer 4

@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

@bob-2: glad it worked out for you.

Thanks,

Supriya

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 5

Supriya.S (11/12/2009)
@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

I just was messing around with a query to get the expected results based on the sample data. At the time I was looking at the expected output no one had come up with a solution. While, my code produces the expected output, it probably doesn't cover everything. If I ahve some time I'll see if I can tweak it.

LutzM SSC Guru Points: 107049 More actions · Answer 6

Here's another solution based on ROW_NUMBER.

It covers the issue of having code is null and location is not null.

Comparing the two row_number solutions will result in almost identical execution plans.

The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)

; WITH cte AS

(

SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

name,

email,

code,

location

FROM #tmp

)

SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location

FROM cte

WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL

ORDER BY name,code

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 7

If you change my WHERE cluase to:WHERE

RowNum = 1 OR Code <> '' OR Location <> ''I think that'll work.

lmu92's solution and mine differ slightly because of there the GROUPing happens. So, the method one would choose would depend on the actual results desired...

Cheers!

Supriya.S Ten Centuries Points: 1131 More actions · Answer 8

lmu92 (11/12/2009)
Here's another solution based on ROW_NUMBER.
It covers the issue of having code is null and location is not null.
Comparing the two row_number solutions will result in almost identical execution plans.
The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)
; WITH cte AS
(
SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,
name,
email,
code,
location
FROM #tmp
)
SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location
FROM cte
WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL
ORDER BY name,code

Lutz, thanks for posting the query. That goes into my "SQL tips and tricks" file too. :-). Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

Sorry, if I am asking too many questions. 😀

Thanks,

Supriya

LutzM SSC Guru Points: 107049 More actions · Answer 9

Supriya.S (11/12/2009)
Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

Like I said: The "Exists approach" does look a lot faster when just looking at the execution plan. So we both figured the same...

After that I created the following "test scenario":

create and fill sample data

SET STATISTICS TIME ON

print 'row_number cte'

my query

print 'Exists approach'

your query

print 'row_number sub query'

Lamprey13' query

SET STATISTICS TIME OFF

The results I got showed that the 'Exists' query took longer than each

Row_number query and had more IO's and CPU time.

Since we're not talking about a large number of rows I recommended testing it against a larger data volume and see which one of the three would perform best.

Will adding a non-clustered index on [name] and help?

Seems like your query would benefit most from this index due to the join (a table scan on one side and an index seek for #tmp B. The row_number solutions probably still will perform a table scan since code and location are not part of the index (would require bookmark lookup which most probably would be too expensive).

Again, testing against a larger data volume would help eliminate my guessing (I'm by far no expert on performance tuning compared to some of the folks around here...).

Edit:

For testing purposes you probably should change my ROW_Number statement from

SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

to

SELECT row_number() OVER (partition BY name,email ORDER BY code desc, location desc) AS row,

in order to use the same assumptions like both of you did...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Supriya.S Ten Centuries Points: 1131 More actions · Answer 10

Thanks for taking the time out to explain Lutz, appreciate it.

Thanks,

Supriya