January 26, 2012 at 6:30 pm
I have an employee table where in many instances the same ssn is listed for the more than one employee.
I am able to identify which rows contain duplicate with group by.
Will someone show me how I can retrieve all the columns from a table, but only for those rows containing an ssn which is duplicated in another row? Group by limitation prevents listing other columns unless they are being aggregated or included in the group by.
---retrieves all
select * from dbo.employee;
---returns
empidfnamelnamemngridssn
1LarryBechold5213456123
2BarryYoung5213456123
3RobSinclaire4212671777
4LabanMeeseNULL212671777
5SargeLutvokNULL523656667
11SeigfriendAnsel16123456789
12SharpiLancaster18123456789
13LucindaWilliams20567398456
14HarrietSoldanova35567398456
---counts duplicates and groups result by lname
select count(lname) as "number of duplicates", ssn
from dbo.employee
group by ssn;
---returns. *but* I want a way to get around the group by limitation, which prevents listing other columns
number of duplicatesssn
2123456789
2212671777
2213456123
1389023456
1523656667
2567398456
---DML for table and data
CREATE TABLE [dbo].[employee](
[empid] [int] IDENTITY(1,1) NOT NULL,
[fname] [varchar](30) NULL,
[lname] [varchar](30) NULL,
[mngrid] [int] NULL,
[ssn] [int] NOT NULL)
INSERT into employee (fname, lname, mngrid, ssn)
values
('Seigfriend', 'Ansel', 16, 123456789),
('Sharpi', 'Lancaster', 18, 123456789),
('Lucinda', 'Williams', 20, 567398456),
('Harriet', 'Soldanova', 35,567398456),
('Aitch', 'Candoo', 43, 389023456)
-
January 26, 2012 at 8:19 pm
There are a couple of ways to do this, here's one:
SELECT * FROM
(
SELECT
*,
SSNCount = COUNT_BIG(*) OVER (PARTITION BY e.ssn)
FROM dbo.employee AS e
) AS q
WHERE
q.SSNCount > 1;
January 26, 2012 at 8:32 pm
Another one:
SELECT *
FROM dbo.employee AS e
WHERE EXISTS
(
SELECT 1
FROM dbo.employee AS e2
WHERE
e2.ssn = e.ssn
AND e2.empid <> e.empid
);
January 26, 2012 at 8:34 pm
A third:
SELECT e.*
FROM dbo.employee AS e
JOIN
(
SELECT e2.ssn
FROM dbo.employee AS e2
GROUP BY
e2.ssn
HAVING COUNT_BIG(*) > 1
) AS Dup ON
Dup.ssn = e.ssn;
January 26, 2012 at 10:02 pm
Thank you so much. I can't believe how good you guys are to me. I am learning here.
-
January 27, 2012 at 12:45 pm
Want to give everything a label, right now, memorize, and then draw on these from memory like tools from a waist belt... So, the first query utilizes as subquery, the second, a correlated subquery, and the third, a join.
What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?
-
January 27, 2012 at 5:42 pm
aitchkcandoo (1/27/2012)
What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?
There is no one right answer to this question, because there are a number of factors to consider. First, what aspect of cost or performance do you want to optimize for? There are many possible benchmarks: elapsed time; or resource usage (processor, disk, memory) for example. Also performance may vary depending on whether the data needed by the query has to be fetched from disk, or can be reasonably assumed to be in memory at the time. Elapsed time is easy enough to measure with Profiler, SET STATISTICS TIME statements or by recording the start and end times using T-SQL. Resource usage can be measured with Profiler, SET STATISTICS TIME and SET STATISTICS IO statements, and some of the system Dynamic Management Views (DMVs) like sys.dm_exec_query_stats. There is also the question of reading and interpreting the execution plan, but that is a huge topic in itself. One thing you should not do, is to compare total estimated costs from the execution plan - these are just used by the optimizer when choosing between plan alternatives, and may not mean very much at all in the context of your hardware and priorities. Which query form is faster or more efficient for you may also depend on the distribution of the data in the tables accessed by the query. By default, I recommend you write the query in the form that seems most natural to you, and only consider alternatives if actual performance is worse than you need. For my part, I would probably choose the EXISTS form to start with, because it can stop searching one of the tables as soon as a match is found.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply