help with query that isolates records I need,and still allows me to list all columns

Question

Post reply

help with query that isolates records I need,and still allows me to list all columns

xenophilia

SSCommitted

Points: 1891
More actions
January 26, 2012 at 6:30 pm

#255561

I have an employee table where in many instances the same ssn is listed for the more than one employee.
I am able to identify which rows contain duplicate with group by.
Will someone show me how I can retrieve all the columns from a table, but only for those rows containing an ssn which is duplicated in another row? Group by limitation prevents listing other columns unless they are being aggregated or included in the group by.
---retrieves all
select * from dbo.employee;
---returns
empidfnamelnamemngridssn
1LarryBechold5213456123
2BarryYoung5213456123
3RobSinclaire4212671777
4LabanMeeseNULL212671777
5SargeLutvokNULL523656667
11SeigfriendAnsel16123456789
12SharpiLancaster18123456789
13LucindaWilliams20567398456
14HarrietSoldanova35567398456
---counts duplicates and groups result by lname
select count(lname) as "number of duplicates", ssn
from dbo.employee
group by ssn;
---returns. *but* I want a way to get around the group by limitation, which prevents listing other columns
number of duplicatesssn
2123456789
2212671777
2213456123
1389023456
1523656667
2567398456
---DML for table and data
CREATE TABLE [dbo].[employee](
[empid] [int] IDENTITY(1,1) NOT NULL,
[fname] [varchar](30) NULL,
[lname] [varchar](30) NULL,
[mngrid] [int] NULL,
[ssn] [int] NOT NULL)
INSERT into employee (fname, lname, mngrid, ssn)
values
('Seigfriend', 'Ansel', 16, 123456789),
('Sharpi', 'Lancaster', 18, 123456789),
('Lucinda', 'Williams', 20, 567398456),
('Harriet', 'Soldanova', 35,567398456),
('Aitch', 'Candoo', 43, 389023456)
-

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Paul White SSC Guru Points: 150468 More actions · Answer 1

There are a couple of ways to do this, here's one:

SELECT * FROM

(

SELECT

*,

SSNCount = COUNT_BIG(*) OVER (PARTITION BY e.ssn)

FROM dbo.employee AS e

) AS q

WHERE

q.SSNCount > 1;

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 2

Another one:

SELECT *

FROM dbo.employee AS e

WHERE EXISTS

(

SELECT 1

FROM dbo.employee AS e2

WHERE

e2.ssn = e.ssn

AND e2.empid <> e.empid

);

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 3

A third:

SELECT e.*

FROM dbo.employee AS e

JOIN

(

SELECT e2.ssn

FROM dbo.employee AS e2

GROUP BY

e2.ssn

HAVING COUNT_BIG(*) > 1

) AS Dup ON

Dup.ssn = e.ssn;

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

xenophilia SSCommitted Points: 1891 More actions · Answer 4

Thank you so much. I can't believe how good you guys are to me. I am learning here.

-

xenophilia SSCommitted Points: 1891 More actions · Answer 5

Want to give everything a label, right now, memorize, and then draw on these from memory like tools from a waist belt... So, the first query utilizes as subquery, the second, a correlated subquery, and the third, a join.

What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?

-

Paul White SSC Guru Points: 150468 More actions · Answer 6

aitchkcandoo (1/27/2012)
What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?

There is no one right answer to this question, because there are a number of factors to consider. First, what aspect of cost or performance do you want to optimize for? There are many possible benchmarks: elapsed time; or resource usage (processor, disk, memory) for example. Also performance may vary depending on whether the data needed by the query has to be fetched from disk, or can be reasonably assumed to be in memory at the time. Elapsed time is easy enough to measure with Profiler, SET STATISTICS TIME statements or by recording the start and end times using T-SQL. Resource usage can be measured with Profiler, SET STATISTICS TIME and SET STATISTICS IO statements, and some of the system Dynamic Management Views (DMVs) like sys.dm_exec_query_stats. There is also the question of reading and interpreting the execution plan, but that is a huge topic in itself. One thing you should not do, is to compare total estimated costs from the execution plan - these are just used by the optimizer when choosing between plan alternatives, and may not mean very much at all in the context of your hardware and priorities. Which query form is faster or more efficient for you may also depend on the distribution of the data in the tables accessed by the query. By default, I recommend you write the query in the form that seems most natural to you, and only consider alternatives if actual performance is worse than you need. For my part, I would probably choose the EXISTS form to start with, because it can stop searching one of the tables as soon as a match is found.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi