Label Partitioning Data based on matching attributes

Question

Post reply

Label Partitioning Data based on matching attributes

bdba

SSCommitted

Points: 1504
More actions
February 12, 2012 at 5:05 pm

#391542

Hi,
I am trying to figure if there is a way to group partitioned data based on duplication criteria.
Please see the attached file for test data and resulting output that I am trying to accomplish.
It is easier to explain through the example attached. Also, pasting the content of the attachment here.
--Table with unique itemnumbers with their attributes
create table T1
(
ItemNum int not null primary key,
Attribute1 varchar(100) null,
Attribute2 varchar(100) null,
Attribute3 varchar(100) null
)
--Populating the table with Itemnumbers and their attributes.
--Some items numbers has matching attributes with other items and they are considered as duplicate --items
insert into T1
values
(1,'AAA','BBB','CCC'),
(2,'DDD','EEE','FFF'),
(3,'GGG','HHH','III'),
(4,'AAA','BBB','CCC'),
(5,'DDD','EEE','FFF'),
(6,'GGG','HHH','III'),
(7,'AAA','BBB','CCC'),
(8,'DDD','EEE','FFF'),
(9,'JJJ','KKK','LLL'),
(10,'JJJ','KKK','LLL')
--Result Desired, the items are considered duplicates based on matching values for Attribute1,
--Attribute2, and Attribute3. For example, item 1,4,and 7 are duplicates based on their attributes and --hence are labeled aa GroupA.
Item Category
1 GroupA
4 GroupA
7 GroupA
2 GroupB
5 GroupB
8 GroupB
3 GroupC
6 GroupC
9 GroupD
10 GroupD
--I am starting off with following script to partition the data, however,not sure how to label them into groups.
select
ItemNum,
row_number() over(partition by Attribute1,Attribute2,Attribute3 order by ItemNum) as Rownum
from T1
order by ItemNum

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

bdba SSCommitted Points: 1504 More actions · Answer 1

Hi,

I wanted to edit the title of this post but I believe that is not possible.

I wanted review the title of this post to state " Grouping unique items based on matching attributes" instead as partitioning word might confuse this topic with table/data partitioning.

Any feedback to the post will be appreciated.

Thanks

bdba SSCommitted Points: 1504 More actions · Answer 2

I am using the following query to partition the items based on matching attributes, however, not sure if there is any function which will label them into groups based on the matching attributes.

select

ItemNum,

row_number() over(partition by Attribute1,Attribute2,Attribute3 order by ItemNum) as Rownum

from T1

order by ItemNum

Paul White SSC Guru Points: 150468 More actions · Answer 3

SELECT

ItemNum,

Category =

'Group' + CHAR(64 +

DENSE_RANK() OVER(

ORDER BY Attribute1,Attribute2,Attribute3))

FROM T1;

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jared SSC Guru Points: 61793 More actions · Answer 4

SQL Kiwi (2/13/2012)
SELECT ItemNum, Category = 'Group' + CHAR(64 + DENSE_RANK() OVER( ORDER BY Attribute1,Attribute2,Attribute3))
FROM T1;

The more I learn about the rank() functions, the more impressed I am at how they can be used. Thanks Paul!

Jared
CE - Microsoft

bdba SSCommitted Points: 1504 More actions · Answer 5

Thanks a lot Paul for the tip.

I was guessing it would be a rank functions but since I never used before, could not get it to work.

Is there way in this solution to only return the items that have matching attribute with another items?

The only thing I care about is the to find the items that have matching attribute as they are duplicates in my case.

Thanks

ahicks001 Newbie Points: 5 More actions · Answer 6

ahicks001

Newbie

Points: 5

February 15, 2012 at 12:39 am

#1447146

Thanks a lot Paul for the tip.

R.P.Rozema SSChampion Points: 12317 More actions · Answer 7

Similar to Paul's suggestion:

SELECT

ItemNum,

ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR

FROM T1;

All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:

WITH cte AS

(

SELECT

ItemNum,

ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

FROM T1

)

DELETE cte WHERE DupNR > 1

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

bdba SSCommitted Points: 1504 More actions · Answer 8

Hi Rozema,

I am not trying to delete but I am rather trying to display the ones that are selected as duplicates.

This requirement makes it trickier.

Thanks

R.P.Rozema (2/15/2012)
Similar to Paul's suggestion:
SELECT
ItemNum,
ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR
FROM T1;
All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:
WITH cte AS
(
SELECT
ItemNum,
ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR
FROM T1
)
DELETE cte WHERE DupNR > 1

R.P.Rozema SSChampion Points: 12317 More actions · Answer 9

Ok, then don't delete but just select the rows that have DupNR > 1. Or do I not understand your requirements correctly?

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

Jared SSC Guru Points: 61793 More actions · Answer 10

R.P.Rozema (2/15/2012)
Similar to Paul's suggestion:
SELECT
ItemNum,
ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) as DupNR
FROM T1;
All rows having a value higher than 1 in DupNR are duplicates of the row that has DupNR = 1 (with no preference on the row that does get DupNR = 1). Put the above query in a cte and you can use it in a delete statement directly, like so:
WITH cte AS
(
SELECT
ItemNum,
ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR
FROM T1
)
DELETE cte WHERE DupNR > 1

No, just change the DELETE to SELECT * FROM

WITH cte AS

(

SELECT

ItemNum,

ROW_NUMBER() OVER(PARTITION BY Attribute1,Attribute2,Attribute3 ORDER BY (SELECT NULL)) AS DupNR

FROM T1

)

SELECT * FROM cte WHERE DupNR > 1

Jared
CE - Microsoft

bdba SSCommitted Points: 1504 More actions · Answer 11

I also need the record that has DupNR > 1 as part of the results.

What I don't want is the rows that only have DupNR = 1, i.e. that are not duplicates based on my criteria.

That is the reason why I was trying to label them and Paul's solutions works for that.

I am trying to filter out the rows that have matching labels from Paul's solution but not able to.

Does that makes sense?

R.P.Rozema SSChampion Points: 12317 More actions · Answer 12

Combine the both solutions. Take Paul's DENSE_RANK() to generate a unique name per 'group' and add another column, using my ROW_NUMBER() suggestion to give each individual member of the group a number from 1 up to the number of members. Does that answer your questions?

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

bdba SSCommitted Points: 1504 More actions · Answer 13

Actually I did that, however I am stuck at how to only select the one that are duplicates and not the ones that are not duplicates. The row_number() function will label the same groups as 1,2,3..so on..the records are the only one in the group will have just 1 as run_number. The problem is how to only select the rows with consecutive row numbers.

Paul White SSC Guru Points: 150468 More actions · Answer 14

bdba (2/15/2012)
Actually I did that, however I am stuck at how to only select the one that are duplicates and not the ones that are not duplicates. The row_number() function will label the same groups as 1,2,3..so on..the records are the only one in the group will have just 1 as run_number. The problem is how to only select the rows with consecutive row numbers.

Please provide clear sample data and *expected output*. Any one of us can provide a good T-SQL solution if the requirement is explained well...

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi