How To Find or Remove Duplicate Groups of Records

Question

Post reply

How To Find or Remove Duplicate Groups of Records

ThomasH

SSCertifiable

Points: 7437
More actions
March 22, 2004 at 9:20 am

#107798

I have table I am "processing" from imported text data:
Create Table #TempGroups (GroupNum Int, Product Varchar(10))
I must export text data dealing with these "product" groups. I want to
greatly reduce the number of records in this table. There are many groups of
products that are identical. i.e. Group #1 contains the same 100 products as
in Group #55, so I just want to keep Group #1, and have other "parent" records
using Group #55 to point to Group #1.
How do I (quickly) identify the Groups that are the same?
I do not need to DELETE any rows, just need results so that when data is exported I can JOIN with
these results to substitute remaining Group #s.
My example Table Stats:
~37000 recs, ~1000 Groups, 260 Products in largest Group,
85 different Group sizes, 716 "Duplicate" Groups with ~28000 recs
Create Clustered Index #TempGroups on #TempGroups (GroupNum, Product)
My 1st step was to...
Create Table #TempGroupCount (GroupNum Int, Many Int)
Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum
Create Clustered Index #TempGroupCount on #TempGroupCount (GroupNum)
2nd Step to identify the Groups where ALL the Groups of the same size had the same products...
Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep
From ( Select Many, Min(TG.GroupNum) as MinGroup,
   from #TempGroups TG
   Join #TempGroupCount TGC
    On TG.GroupNum = TGC.GroupNum
  Group by Many) Inside
Join #TempGroupCount Outside
  On Inside.Many = Outside.Many and Inside.MinGroup <> Outside.GroupNum
Where Inside.Many = Inside.Prods
This results in identifying 87 of the 716 duplicate Groups, and would let me eliminate ~10000 recs,
and runs quickly (~1 sec.).
3rd step. This one I want to be able to deal with in a "set" operation instead of a cursor / loop...
Need to identify most if not all the (remaining) duplicate groups.
I'm currently using a cursor driven by:
Select G1.GroupNum, G2.GroupNum, G1.Many
From (Select GroupNum, Many
   from #TempGroupCount
   Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G1
Join (Select GroupNum, Many
   from #TempGroupCount
   Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G2
  On G1.GroupNum < G2.GroupNum and G1.Many = G2.Many
Where G1.Many > 1
Then looping with Dynamic SQL Comparing Count([UNION of 2 Groups of Products]) with Count([Single Group]) etc.
The "set" SELECTS I've tried are all slower than the cursor-loop-dynamic code. They do however eliminate Step 2.
I've had to limit my loop to "Where G1.Many > 50" to remain within my 3-5 second desired processing time. The
"set" SQLs are taking 15-25+ sec. or so, so not fast enough .
So I'm looking for some forum help for ideas.
I'll email dump of my example data, just "Private Message" me.
Thanks in advance.

Once you understand the BITs, all the pieces come together

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply

ThomasH SSCertifiable Points: 7437 More actions · Answer 1

I found a solution that works for what I need.

Since the "Product" data in the #TempGroups table is numeric,

I could use this to my advantage:

Instead of (1st Step)

Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum

I used

Insert into #TempGroupCount Select GroupNum, Count(*) as Many, Sum(Convert(Int, Product)) as AllProd from #TempGroups Group by GroupNum Order by GroupNum

This resulted in more descerning records since now Groups with

differing products could most likely have rows where AllProd value was different.

Then I could change Step 2 to something like

Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep

From ( Select Many, AllProd, Min(TG.GroupNum) as MinGroup, Count(Distinct TG.GroupNum) as Groups

from #TempGroups TG

Join #TempGroupCount TGC

On TG.GroupNum = TGC.GroupNum and Many <> 1

Group by Many, AllProd Having Count(Distinct TG.GroupNum) <> 1

&nbsp Inside

Join #TempGroupCount Outside

On Inside.Many = Outside.Many and Inside.AllProd = Outside.AllProd and Inside.MinGroup <> Outside.GroupNum

Where Inside.Many = Inside.Prods

This results in catching nearly all the duplicate Groups in my data situation,

and takes no more time to execute than the original Step 2 .

I guess the key was to try to make a #Temp... data a "better" test of difference. Even

if my Product data was not pure numeric, I could have still done something to fill my

AllProd column of data, such as maybe concatenating the MIN() and MAX() Product etc.,

anything to be able to better ID the differnces.

Once you understand the BITs, all the pieces come together