help with query that isolates records I need,and still allows me to list all columns

  • I have an employee table where in many instances the same ssn is listed for the more than one employee.

    I am able to identify which rows contain duplicate with group by.

    Will someone show me how I can retrieve all the columns from a table, but only for those rows containing an ssn which is duplicated in another row? Group by limitation prevents listing other columns unless they are being aggregated or included in the group by.

    ---retrieves all

    select * from dbo.employee;

    ---returns

    empidfnamelnamemngridssn

    1LarryBechold5213456123

    2BarryYoung5213456123

    3RobSinclaire4212671777

    4LabanMeeseNULL212671777

    5SargeLutvokNULL523656667

    11SeigfriendAnsel16123456789

    12SharpiLancaster18123456789

    13LucindaWilliams20567398456

    14HarrietSoldanova35567398456

    ---counts duplicates and groups result by lname

    select count(lname) as "number of duplicates", ssn

    from dbo.employee

    group by ssn;

    ---returns. *but* I want a way to get around the group by limitation, which prevents listing other columns

    number of duplicatesssn

    2123456789

    2212671777

    2213456123

    1389023456

    1523656667

    2567398456

    ---DML for table and data

    CREATE TABLE [dbo].[employee](

    [empid] [int] IDENTITY(1,1) NOT NULL,

    [fname] [varchar](30) NULL,

    [lname] [varchar](30) NULL,

    [mngrid] [int] NULL,

    [ssn] [int] NOT NULL)

    INSERT into employee (fname, lname, mngrid, ssn)

    values

    ('Seigfriend', 'Ansel', 16, 123456789),

    ('Sharpi', 'Lancaster', 18, 123456789),

    ('Lucinda', 'Williams', 20, 567398456),

    ('Harriet', 'Soldanova', 35,567398456),

    ('Aitch', 'Candoo', 43, 389023456)

    -

  • There are a couple of ways to do this, here's one:

    SELECT * FROM

    (

    SELECT

    *,

    SSNCount = COUNT_BIG(*) OVER (PARTITION BY e.ssn)

    FROM dbo.employee AS e

    ) AS q

    WHERE

    q.SSNCount > 1;

  • Another one:

    SELECT *

    FROM dbo.employee AS e

    WHERE EXISTS

    (

    SELECT 1

    FROM dbo.employee AS e2

    WHERE

    e2.ssn = e.ssn

    AND e2.empid <> e.empid

    );

  • A third:

    SELECT e.*

    FROM dbo.employee AS e

    JOIN

    (

    SELECT e2.ssn

    FROM dbo.employee AS e2

    GROUP BY

    e2.ssn

    HAVING COUNT_BIG(*) > 1

    ) AS Dup ON

    Dup.ssn = e.ssn;

  • Thank you so much. I can't believe how good you guys are to me. I am learning here.

    -

  • Want to give everything a label, right now, memorize, and then draw on these from memory like tools from a waist belt... So, the first query utilizes as subquery, the second, a correlated subquery, and the third, a join.

    What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?

    -

  • aitchkcandoo (1/27/2012)


    What is the advised way within SQL Server Management Studio to assess relative cost or performance of running each of these queries?

    There is no one right answer to this question, because there are a number of factors to consider. First, what aspect of cost or performance do you want to optimize for? There are many possible benchmarks: elapsed time; or resource usage (processor, disk, memory) for example. Also performance may vary depending on whether the data needed by the query has to be fetched from disk, or can be reasonably assumed to be in memory at the time. Elapsed time is easy enough to measure with Profiler, SET STATISTICS TIME statements or by recording the start and end times using T-SQL. Resource usage can be measured with Profiler, SET STATISTICS TIME and SET STATISTICS IO statements, and some of the system Dynamic Management Views (DMVs) like sys.dm_exec_query_stats. There is also the question of reading and interpreting the execution plan, but that is a huge topic in itself. One thing you should not do, is to compare total estimated costs from the execution plan - these are just used by the optimizer when choosing between plan alternatives, and may not mean very much at all in the context of your hardware and priorities. Which query form is faster or more efficient for you may also depend on the distribution of the data in the tables accessed by the query. By default, I recommend you write the query in the form that seems most natural to you, and only consider alternatives if actual performance is worse than you need. For my part, I would probably choose the EXISTS form to start with, because it can stop searching one of the tables as soon as a match is found.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply