Query Duplicates

Question

Query Duplicates

gspeedtech

SSC Rookie

Points: 49
More actions
August 17, 2010 at 7:43 pm

#157155

Hello, I just had an interview where I was asked to write a query that only returned records when a member id occured 3 or more times.
Example: A table contains records of customer complaints with name, email, comments etc.
Return results where the customer name has appeared in the table 3 or more times.
I tried:
SELECT name, email, comments,
COUNT(name) AS NumOccurrences
FROM custComplaint
GROUP BY name, email, comments
HAVING ( COUNT(name) > 2 )
but NumOccurrances always equals 1 so no results.
If I change to HAVING ( COUNT(name) = 1 )
it returns all records.
Am I taking the wrong approach? Or is there something missing in my query?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

pankushmehta Ten Centuries Points: 1217 More actions · Answer 1

This approach seems correct. the NumOfOccurences always showing 1 means there are no duplicate names in your table. try and insert 2 duplicate values(in dev environment) and then re-run the query. it will return rows that have more than 3 occurences.

try below:

declare @t table (name varchar(255), email varchar(255), comments varchar(255))

insert into @t

select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

insert into @t

select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

insert into @t

select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

insert into @t

select 'cba' as name, 'cba.cba' as email, 'Single' as comments

SELECT name, email, comments,

COUNT(name) AS NumOccurrences

FROM @t

GROUP BY name, email, comments

HAVING (COUNT(name) > 2)

Bhuvnesh SSC Guru Points: 59351 More actions · Answer 2

try to use ROWNUMBER() function , it will give you more flexibilty

-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

gspeedtech SSC Rookie Points: 49 More actions · Answer 3

Thanks for the reply,

When I paste your query it works perfectly, but when I change it to apply to one of my existing tables NumOccurrences always = 1 even if there are obviuos duplicates.

The table I am using is :

tblContacts

With Columns:

CONID int

Type nvarchar(10)

LID int

CID int

ContactCode int

Notes text

Date datetime

EmpID int

Query:

SELECT CONID, Type, LID, CID, ContactCode,

COUNT(CID) AS NumOccurrences

FROM tblContacts

GROUP BY CONID, Type,LID, CID, ContactCode, Date, EmpID

HAVING (COUNT(CID) > 2 )

Returns an empty table.

When I change > 2 to = 1 I see why.

The results I am getting:

CONID TYPE LID CID ContactCode NumOccurrences

30772 Group 12 1 0 1

31249 Group 12 1 3 1

31840 Group 15 3 0 1

31511 Group 10 7 0 1

31515 Group 10 7 3 1

31251 Group 13 8 3 1

31061 Group 13 8 0 1

31374 Group 13 8 3 1

As you can see there are many instances where CID duplicates occur and yet the NumOccurrences = 1

Richard M. SSCertifiable Points: 7401 More actions · Answer 4

The problem here is that you are grouping by "CONID, Type,LID, CID, ContactCode, Date, EmpID", so only repetitions of same data for all those rows would "add" to the count you are doing.

You probably want to aggregate the data first and then display those and the other columns you want to display.

_______________________________________________________________________
For better assistance in answering your questions, click here[/url]

Ron McCullough SSC Guru Points: 63877 More actions · Answer 5

gspeedtech

To assist you in understanding the problem (Or at least I hope this help)

Execute this command

SET SHOWPLAN_ALL ON

GO

1. Execute your query

2. Examine the execution plan. Note the Query Optimizer has elected to use a "Stream Aggregate Operator".

3. Using Books On Line subject "Stream Aggregate Showplan Operator " execute the example given.

4. Review what this implies;(Emphasis added by the poster)

the query optimizer uses the Stream Aggregate operator to group the rows in the SalesPerson table by the TerritoryID column and then calculate the values for the AVG(Bonus) and SUM(SalesYTD) columns based on that grouping.

I hope this assists you in gaining an understanding of what is happening "under the covers" - so to speak.

If you are further interested in learning what value an execution plan can be to understanding what is happening vist:

http://www.sqlservercentral.com/articles/books/65831/

The above link will allow you to download in PDF format an excellent book by Grant Fritchey.

If the above does not assist you in understand what is happening post again and some one more familiar with T-SQL might be able to assist you further.

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]

gspeedtech SSC Rookie Points: 49 More actions · Answer 6

Eureka! I Found it.

I created a temp table and Inner Joined it to produce the desired results.

SELECT CID,

COUNT(CID) AS NumOccurrences

Into tblCID FROM tblContacts

GROUP BY CID

HAVING (COUNT(CID) > 2)

order by CID

Select CONID, Type, LID, tblContacts.CID, ContactCode

from tblcontacts

inner Join tblCID

on tblContacts.CID = tblCID.CID

order by tblContacts.CID

Drop Table tblCID

Please let me know if you can see why my approach would be ill advised or offer inconsistant results.

Thanks to all for the help.

gspeedtech SSC Rookie Points: 49 More actions · Answer 7

bitbucket-25253

I will definitely look into this to get a better understanding.

Thanks for the reply!

Ron McCullough SSC Guru Points: 63877 More actions · Answer 8

Review the effect of your "SELECT INTO"

From BOL:

Note that your last posting of the T-SQL that works:

The SELECT INTO statement creates a new table and populates it with the result set of the SELECT statement

IF, and that is a big assumption on my part.... looking at your last posted T_SQL the table

SELECT CID,

COUNT(CID) AS NumOccurrences

Into tblCID FROM tblContacts

table "tblCID" does not exist until your T-SQL executes and since the query optimizer is attempting to process the data in the most efficient manner - BEFORE the query is actually executed it of course can not find the nonexistent table. You might want to use a CREATE TABLE statement and then insert the data into it... Check CREATE TABLE in BOL and by the way read up on a true temporary table defined as #tablename .. or #tblCid - which must be created and then populated.

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]