March 22, 2004 at 9:20 am
I have table I am "processing" from imported text data:
Create Table #TempGroups (GroupNum Int, Product Varchar(10))
I must export text data dealing with these "product" groups. I want to
greatly reduce the number of records in this table. There are many groups of
products that are identical. i.e. Group #1 contains the same 100 products as
in Group #55, so I just want to keep Group #1, and have other "parent" records
using Group #55 to point to Group #1.
How do I (quickly) identify the Groups that are the same?
I do not need to DELETE any rows, just need results so that when data is exported I can JOIN with
these results to substitute remaining Group #s.
My example Table Stats:
~37000 recs, ~1000 Groups, 260 Products in largest Group,
85 different Group sizes, 716 "Duplicate" Groups with ~28000 recs
Create Clustered Index #TempGroups on #TempGroups (GroupNum, Product)
My 1st step was to...
Create Table #TempGroupCount (GroupNum Int, Many Int)
Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum
Create Clustered Index #TempGroupCount on #TempGroupCount (GroupNum)
2nd Step to identify the Groups where ALL the Groups of the same size had the same products...
Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep
From ( Select Many, Min(TG.GroupNum) as MinGroup,
from #TempGroups TG
Join #TempGroupCount TGC
On TG.GroupNum = TGC.GroupNum
Group by Many) Inside
Join #TempGroupCount Outside
On Inside.Many = Outside.Many and Inside.MinGroup <> Outside.GroupNum
Where Inside.Many = Inside.Prods
This results in identifying 87 of the 716 duplicate Groups, and would let me eliminate ~10000 recs,
and runs quickly (~1 sec.).
3rd step. This one I want to be able to deal with in a "set" operation instead of a cursor / loop...
Need to identify most if not all the (remaining) duplicate groups.
I'm currently using a cursor driven by:
Select G1.GroupNum, G2.GroupNum, G1.Many
From (Select GroupNum, Many
from #TempGroupCount
Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G1
Join (Select GroupNum, Many
from #TempGroupCount
Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G2
On G1.GroupNum < G2.GroupNum and G1.Many = G2.Many
Where G1.Many > 1
Then looping with Dynamic SQL Comparing Count([UNION of 2 Groups of Products]) with Count([Single Group]) etc.
The "set" SELECTS I've tried are all slower than the cursor-loop-dynamic code. They do however eliminate Step 2.
I've had to limit my loop to "Where G1.Many > 50" to remain within my 3-5 second desired processing time. The
"set" SQLs are taking 15-25+ sec. or so, so not fast enough .
So I'm looking for some forum help for ideas.
I'll email dump of my example data, just "Private Message" me.
Thanks in advance.
Once you understand the BITs, all the pieces come together
March 25, 2004 at 7:08 am
I found a solution that works for what I need.
Since the "Product" data in the #TempGroups table is numeric,
I could use this to my advantage:
Instead of (1st Step)
Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum
I used
Insert into #TempGroupCount Select GroupNum, Count(*) as Many, Sum(Convert(Int, Product)) as AllProd from #TempGroups Group by GroupNum Order by GroupNum
This resulted in more descerning records since now Groups with
differing products could most likely have rows where AllProd value was different.
Then I could change Step 2 to something like
Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep
From ( Select Many, AllProd, Min(TG.GroupNum) as MinGroup, Count(Distinct TG.GroupNum) as Groups
from #TempGroups TG
Join #TempGroupCount TGC
On TG.GroupNum = TGC.GroupNum and Many <> 1
Group by Many, AllProd Having Count(Distinct TG.GroupNum) <> 1
  Inside
Join #TempGroupCount Outside
On Inside.Many = Outside.Many and Inside.AllProd = Outside.AllProd and Inside.MinGroup <> Outside.GroupNum
Where Inside.Many = Inside.Prods
This results in catching nearly all the duplicate Groups in my data situation,
and takes no more time to execute than the original Step 2 .
I guess the key was to try to make a #Temp... data a "better" test of difference. Even
if my Product data was not pure numeric, I could have still done something to fill my
AllProd column of data, such as maybe concatenating the MIN() and MAX() Product etc.,
anything to be able to better ID the differnces.
Once you understand the BITs, all the pieces come together
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply