Label Partitioning Data based on matching attributes

  • Hi,

    I am trying to figure if there is a way to group partitioned data based on duplication criteria.

    Please see the attached file for test data and resulting output that I am trying to accomplish.

    It is easier to explain through the example attached. Also, pasting the content of the attachment here.

    --Table with unique itemnumbers with their attributes

    create table T1

    (

    ItemNum int not null primary key,

    Attribute1 varchar(100) null,

    Attribute2 varchar(100) null,

    Attribute3 varchar(100) null

    )

    --Populating the table with Itemnumbers and their attributes.

    --Some items numbers has matching attributes with other items and they are considered as duplicate --items

    insert into T1

    values

    (1,'AAA','BBB','CCC'),

    (2,'DDD','EEE','FFF'),

    (3,'GGG','HHH','III'),

    (4,'AAA','BBB','CCC'),

    (5,'DDD','EEE','FFF'),

    (6,'GGG','HHH','III'),

    (7,'AAA','BBB','CCC'),

    (8,'DDD','EEE','FFF'),

    (9,'JJJ','KKK','LLL'),

    (10,'JJJ','KKK','LLL')

    --Result Desired, the items are considered duplicates based on matching values for Attribute1,

    --Attribute2, and Attribute3. For example, item 1,4,and 7 are duplicates based on their attributes and --hence are labeled aa GroupA.

    Item Category

    1 GroupA

    4 GroupA

    7 GroupA

    2 GroupB

    5 GroupB

    8 GroupB

    3 GroupC

    6 GroupC

    9 GroupD

    10 GroupD

    --I am starting off with following script to partition the data, however,not sure how to label them into groups.

    select

    ItemNum,

    row_number() over(partition by Attribute1,Attribute2,Attribute3 order by ItemNum) as Rownum

    from T1

    order by ItemNum

  • Hi,

    I wanted to edit the title of this post but I believe that is not possible.

    I wanted review the title of this post to state " Grouping unique items based on matching attributes" instead as partitioning word might confuse this topic with table/data partitioning.

    Any feedback to the post will be appreciated.

    Thanks

  • I am using the following query to partition the items based on matching attributes, however, not sure if there is any function which will label them into groups based on the matching attributes.

    select

    ItemNum,

    row_number() over(partition by Attribute1,Attribute2,Attribute3 order by ItemNum) as Rownum

    from T1

    order by ItemNum

  • SELECT

    ItemNum,

    Category =

    'Group' + CHAR(64 +

    DENSE_RANK() OVER(

    ORDER BY Attribute1,Attribute2,Attribute3))

    FROM T1;

  • SQL Kiwi (2/13/2012)


    SELECT

    ItemNum,

    Category =

    'Group' + CHAR(64 +

    DENSE_RANK() OVER(

    ORDER BY Attribute1,Attribute2,Attribute3))

    FROM T1;

    The more I learn about the rank() functions, the more impressed I am at how they can be used. Thanks Paul!

    Jared
    CE - Microsoft

  • Thanks a lot Paul for the tip.

    I was guessing it would be a rank functions but since I never used before, could not get it to work.

    Is there way in this solution to only return the items that have matching attribute with another items?

    The only thing I care about is the to find the items that have matching attribute as they are duplicates in my case.

    Thanks

  • Thanks a lot Paul for the tip.

  • Similar to Paul's suggestion:

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR

    FROM T1;

    All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:

    WITH cte AS

    (

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

    FROM T1

    )

    DELETE cte WHERE DupNR > 1



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Hi Rozema,

    I am not trying to delete but I am rather trying to display the ones that are selected as duplicates.

    This requirement makes it trickier.

    Thanks

    R.P.Rozema (2/15/2012)


    Similar to Paul's suggestion:

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR

    FROM T1;

    All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:

    WITH cte AS

    (

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

    FROM T1

    )

    DELETE cte WHERE DupNR > 1

  • Ok, then don't delete but just select the rows that have DupNR > 1. Or do I not understand your requirements correctly?



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • R.P.Rozema (2/15/2012)


    Similar to Paul's suggestion:

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR

    FROM T1;

    All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:

    WITH cte AS

    (

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

    FROM T1

    )

    DELETE cte WHERE DupNR > 1

    No, just change the DELETE to SELECT * FROM

    WITH cte AS

    (

    SELECT

    ItemNum,

    ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

    FROM T1

    )

    SELECT * FROM cte WHERE DupNR > 1

    Jared
    CE - Microsoft

  • I also need the record that has DupNR > 1 as part of the results.

    What I don't want is the rows that only have DupNR = 1, i.e. that are not duplicates based on my criteria.

    That is the reason why I was trying to label them and Paul's solutions works for that.

    I am trying to filter out the rows that have matching labels from Paul's solution but not able to.

    Does that makes sense?

  • Combine the both solutions. Take Paul's DENSE_RANK() to generate a unique name per 'group' and add another column, using my ROW_NUMBER() suggestion to give each individual member of the group a number from 1 up to the number of members. Does that answer your questions?



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Actually I did that, however I am stuck at how to only select the one that are duplicates and not the ones that are not duplicates. The row_number() function will label the same groups as 1,2,3..so on..the records are the only one in the group will have just 1 as run_number. The problem is how to only select the rows with consecutive row numbers.

  • bdba (2/15/2012)


    Actually I did that, however I am stuck at how to only select the one that are duplicates and not the ones that are not duplicates. The row_number() function will label the same groups as 1,2,3..so on..the records are the only one in the group will have just 1 as run_number. The problem is how to only select the rows with consecutive row numbers.

    Please provide clear sample data and *expected output*. Any one of us can provide a good T-SQL solution if the requirement is explained well...

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply