October 16, 2008 at 8:37 am
Hi Everyone,
I have sample code below Item#1 which will create the sample data for testing and Item #2 which is the code to get the interleaving rows and range of unique data.
Item #1: This will create a sample table and data
use tempdb;
SET NOCOUNT ON;
GO
IF OBJECT_ID ('#TempTable','U') IS NOT NULL
BEGIN
DROP TABLE #TempTable
END
CREATE TABLE #TempTable
(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(1000),
Locale varchar(10),
Code char(1)
)
GO
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
GO
Item #2: This will get the interleaving rows of duplicates resulting to a unique range result.
SELECT * FROM #TempTable
DECLARE @RecordCount int;
DECLARE @MaxRowID int; -- Stores the overall maximum row id at the beginning.
-- Initialize the MaxROwID to the max row id number + 1
-- So we can set this value as the last value in the range
SET @MaxRowID = (SELECT MAX(RowID) FROM #TempTable) +1; -- Store this before we delete rows
SET @RecordCount = 0;
SELECT @RecordCount = COUNT(*) FROM #TempTable
IF @RecordCount > 0
BEGIN;
DECLARE @Results TABLE(RowID int); -- Here we assemble the range numbers (result)
DECLARE @RowID int; -- Stores the first rowid which is a duplicate
DECLARE @OptRows int; -- variable which records the optimum number of rows to scan
DECLARE @DeletedRowID int; -- Track the deleted rowids.
SET @DeletedRowID = -1
SET @OptRows = 100;
WHILE 1=1
BEGIN;
SET @RowID = (
SELECT MIN(RowID) FROM (
SELECT RowID,RowKey,
row_number() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) AS DupRank
FROM (SELECT TOP (@OptRows) *
FROM #TempTable
WHERE RowID > @DeletedRowID
ORDER BY RowID ASC
) SubsetRows
) [Aggregate]
WHERE DupRank > 1
);
RAISERROR('RowID: %d, DeletedRowID: %d, OptRows: %d', 10, 1, @RowID, @DeletedRowID, @OptRows);
IF @RowID IS NULL BEGIN;
IF @OptRows >= 2*@MaxRowID BEGIN;
INSERT INTO @Results SELECT @MaxRowID;
BREAK;
END;
ELSE
BEGIN;
SET @OptRows = @OptRows + @OptRows;
CONTINUE;
END;
END;
ELSE BEGIN;
INSERT INTO @Results SELECT @RowID;
SET @DeletedRowID = @RowID - 1;
SET @OptRows = @OptRows - 1;
END;
END;
END;
SELECT * FROM @Results;
GO
My Objectives:
1.) Using Item #2, I need to process data for every batch of A (see Item #1 table data per batch), the problem with Item #2 is it will process the whole data without looking on every A batch.
2.) Aside from processing data for every batch of A, I need to make (Item #2) more faster on certain patterns of data for every batch of A. Since it is much slower when there are consecutive concentrated duplicates in the data like for example pattern (Item #3) below:
Item#3:
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 4,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 11,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 18,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
Low performance numbers will be seen if tested on 750,000 records or 1 Million records using the pattern above. It will actually take about 2 hours+ by actually processing only 200k using the pattern (Item#3) above but If there aren't that much consecutive duplicates like (Item #1), It would only take 28 minutes to process the 1 million data.
Thank you for your time and help.
October 16, 2008 at 4:01 pm
Thanks for posting the sample data and DML.
I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.
Gary Johnson
Sr Database Engineer
October 16, 2008 at 10:51 pm
Gary Johnson (10/16/2008)
Thanks for posting the sample data and DML.I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.
Thank you for your prompt response Gary. Sorry I forgot to show you the output for the following data sample.
For this data sample, Item #1:
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A <- dup on RowID 10 (checked)
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
Final Output on Item #1 should be:
SELECT 3 as RowID, 'A' as Code UNION ALL
SELECT 11 as RowID, 'A' as Code UNION ALL
SELECT 12 as RowID, 'C' as Code UNION ALL
SELECT 14 as RowID, 'A' as Code UNION ALL
SELECT 18 as RowID, 'A' as Code UNION ALL
SELECT 20 as RowID, 'C'
To explain on how did I come up with the final output on Item #1.
First, I need to get the interleaving rows per Code so it should give me the output of:
SELECT 3 as RowID, 'A' as Code UNION ALL
SELECT 12 as RowID, 'C' as Code UNION ALL
SELECT 14 as RowID, 'A' as Code UNION ALL
SELECT 20 as RowID, 'C'
Next is I need to process the interleaving rows of duplicates per batch of A's that would result on a unique range result. So for the first batch of A, it should give me the output.
SELECT 11 as RowID, 'A'
and the second batch of A will have an output:
SELECT 18 as RowID, 'A'
So that is how I came up with the final output. I put some comments on Item #1 hoping it would help understand the process.
Thank you for your time and help.
October 17, 2008 at 12:43 pm
Hi,
I also have an attachment here named demo to help understand on what I'm about to achieve or help better support my objectives.
-dans
October 17, 2008 at 1:53 pm
Great image. That goes a lot farther towards explaining what you are after. This seems like it'll be a very interesting query to write. I haven't the slightest idea of *why* you would want something like that, but at least now I think I grasp what you're looking for.
October 17, 2008 at 2:40 pm
OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up.
The logic of this is simple. I scan through your table and break things into groups by finding changes of Code.
I then go back through, and analyze that group for any duplicates.
I take those duplicates and the first member of each group and make it the output.
If I'm wrong on any part of that logic, please correct me. (I may not answer right away, as I'll probably be out tonight, but someone else here can probably take this one from here, or I'll answer when I get back). This solution is based in part on the running total technique described here:
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/ (Yeah, I know, I kinda get fixated on certain methods and use them everywhere for a while, sorry Jeff 😉 )
As part of this solution, I've added two fields to your table, GroupID and Dupe, both INT's.
[font="Courier New"]------ Declare Variables ----------------------------------
DECLARE @GroupID INT,
@PrevCode CHAR(5),
@RowID INT
SET @RowID = 0
SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)
SET @GroupID = 1
-----------------------------------------------------------
-- Breaks your data out into groups so it can be scanned --
UPDATE #TempTable
SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,
GroupID = @GroupID,
@PrevCode = Code
FROM #TempTable WITH (INDEX(0))
-----------------------------------------------------------
------------- Goes through your data and flags Dupes ------
UPDATE T1
SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2
WHERE T2.GroupID = T1.GroupID AND
T2.Locale = T1.Locale AND
T2.RowKey = T1.RowKey AND
T2.RowID < T1.RowID) > 0 THEN 1 ELSE 0 END,
@RowID = RowID
FROM #Temptable T1 WITH INDEX(0)
-----------------------------------------------------------
---------- Selects The first member of each group (which is what I took as process 1) and
---------- any rows marked as duplicates and displays them
SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code
FROM #TempTable T1
LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1
LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow
WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL
[/font]
October 17, 2008 at 5:06 pm
Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.
Bob
------------------------------------------------
-- CTE Solution
------------------------------------------------
;with cte1 as -- identify breaks
(select code,rowid
from #temptable t1
where t1.code <> (select top 1 code
from #temptable t2
where t2.rowID < t1.rowID
order by t2.rowID desc)
)
,cte2 as -- identify ranges for 'A' codes only
(select c1.code,c1.rowid as minRow,min(c1a.rowID) as maxRow
from cte1 c1
join cte1 c1a on c1a.code <> c1.code and c1a.rowID > c1.rowID
where c1.code = 'A'
group by c1.code,c1.rowID
)
,cte3 as -- identify duplicates in 'A' ranges
(select c.minrow,max(rowid) as dupID
from #temptable t
cross join cte2 c
where t.rowid >=minRow and t.rowID < maxRow
group by c.minRow,t.rowKey,t.locale,c.code
having count(*) > 1
)
,cte4 as -- just get the *first* duplicates
(select min(dupID) as rowID
from cte3
group by minRow
)
,cte5 as -- put the breaks and the first "A" dups together
(select rowID from cte1
union all
select rowID from cte4)
select t.*
from cte5 c
join #temptable t on t.rowID = c.rowID
order by c.rowID
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 17, 2008 at 9:17 pm
Garadin (10/17/2008)
OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up....
I can't seem to test your solution since it gave me this error...
Msg 1018, Level 15, State 1, Line 28
Incorrect syntax near 'INDEX'. If this is intended as a part of a table hint, A WITH keyword and parenthesis are now required. See SQL Server Books Online for proper syntax.
Any idea? I'm sure i am missing something here..:(
October 17, 2008 at 9:25 pm
bhovious (10/17/2008)
Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.Bob
I tested your solution for data integrity but it seems it is giving me the wrong output on some other patterns. It is correct on Item #1 data that I supplied but for some other patterns it isn't, there is probably a problem with the logic. The image that I attached on my previous reply would explain how to get the result, I also tried the sample data on that image presentation (attached) and the output is lacking using your suggested solution :(.
Thank you so much for the help.
Here are some other patterns that might help on testing data integrity, logic in getting the output is still the same as stated in the attached image on my previous post.
Pattern #1
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 21,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
select * from #temptable
--//Output
SELECT 3 as RowID,'1' as RowKey,'zh' as Locale,'A' as [Action] UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 30,'1','en','C'
Pattern #2
INSERT #TempTable
SELECT 1,'1','en','D' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'1','zh','A' UNION ALL
SELECT 5,'1','zh','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','C' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'3','en','A' UNION ALL
SELECT 21,'3','de','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','D' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
select * from #temptable
--//Output
SELECT 2 as RowID,'1' as RowKey,'de' as Locale,'C' as [Action] UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'1','zh','A' UNION ALL
SELECT 5,'1','zh','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','C' UNION ALL
SELECT 20,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 24,'3','de','D' UNION ALL
SELECT 30,'1','en','C'
Pattern #3
INSERT #TempTable
SELECT 1,'5','en','D' UNION ALL
SELECT 2,'5','de','C' UNION ALL
SELECT 3,'4','zh','C' UNION ALL
SELECT 5,'5','zh','A' UNION ALL
SELECT 6,'4','zh','A' UNION ALL
SELECT 7,'3','zh','A' UNION ALL
SELECT 11,'2','zh','A' UNION ALL
SELECT 15,'1','zh','A' UNION ALL
SELECT 16,'1','zh','A' UNION ALL
SELECT 20,'2','en','C' UNION ALL
SELECT 21,'3','de','C' UNION ALL
SELECT 22,'4','de','A' UNION ALL
SELECT 23,'2','en','A' UNION ALL
SELECT 24,'4','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
select * from #temptable
--//Output
SELECT 2 as RowID,'5' as RowKey,'de' as Locale,'C' as [Action] UNION ALL
SELECT 5,'5','zh','A' UNION ALL
SELECT 16,'1','zh','A' UNION ALL
SELECT 20,'2','en','C' UNION ALL
SELECT 22,'4','de','A' UNION ALL
SELECT 24,'4','de','A' UNION ALL
SELECT 30,'1','en','C'
October 17, 2008 at 10:39 pm
Using the data at the bottom of this message, the CTE based code produces the following:
------------------------------------------------------------------------------------------------
3 5 zh A
112zhA
121zhC
141enA
181enA
201enC
------------------------------------------------------------------------------------------------
This corresponds to your expected output as specified in your second post of this thread. All you have to do is remove rowKey and locale from the final select statement.
SELECT 3 as RowID, 'A' as Code UNION ALL
SELECT 11 as RowID, 'A' as Code UNION ALL
SELECT 12 as RowID, 'C' as Code UNION ALL
SELECT 14 as RowID, 'A' as Code UNION ALL
SELECT 18 as RowID, 'A' as Code UNION ALL
SELECT 20 as RowID, 'C'
I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:
11 2 zh A
12 2 zh A
13 2 zh A
If so, which rowID should be returned, 12 or 13 ??
-----------------------------------------------------------------------
rowID rowKey locale code
11enC
21deC
35zhA
44enA
53enA
102zhA
112zhA
121zhC
131deC
141enA
153enA
163deA
181enA
193deA
201enC
211enC
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 17, 2008 at 11:16 pm
bhovious (10/17/2008)
I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:
11 2 zh A
12 2 zh A
13 2 zh A
If so, which rowID should be returned, 12 or 13 ??
Yes it is possible, so if you have that kind of combination inside A range then it should return both 12 & 13. It can even have more than that, its what I called a concentrated consecutive duplicates which causes my solution previously (Item #2 - original post) to slow down on 1 million or more data. So this may happen in a pattern:
11 1 de A
13 1 de A
15 1 de A
16 1 de A
20 1 zh A
21 1 de A
22 1 zh A
which will have an output of 13,15,16, and 21.
thanks for your help.
October 18, 2008 at 12:17 am
Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.
WITH INDEX(0)
Needs to be:
WITH (INDEX(0))
October 20, 2008 at 7:55 am
Garadin (10/18/2008)
Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.WITH INDEX(0)
Needs to be:
WITH (INDEX(0))
You're right Garadin, there was one line on your code which has:
WITH INDEX(0)
Needs to be:
WITH (INDEX(0))
I tested it for data integrity but it's not giving me the correct result, there some discrepancies.
I have this example which has an explanation on the attachment (demo.gif) on
how to get its final result.
Here is the sample running code which I tested.
use tempdb;
SET NOCOUNT ON;
GO
IF OBJECT_ID ('#TempTable','U') IS NOT NULL
BEGIN
DROP TABLE #TempTable
END
CREATE TABLE #TempTable
(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(1000),
Locale varchar(10),
Code char(1),
GroupID int,
Dupe int
)
GO
INSERT #TempTable(RowID,RowKey,Locale,Code)
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 21,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
GO
SELECT * FROM #TempTable
------ Declare Variables ----------------------------------
DECLARE @GroupID INT,
@PrevCode CHAR(5),
@RowID INT
SET @RowID = 0
SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)
SET @GroupID = 1
-----------------------------------------------------------
-- Breaks your data out into groups so it can be scanned --
UPDATE #TempTable
SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,
GroupID = @GroupID,
@PrevCode = Code
FROM #TempTable WITH (INDEX(0))
-----------------------------------------------------------
------------- Goes through your data and flags Dupes ------
UPDATE T1
SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2
WHERE T2.GroupID = T1.GroupID AND
T2.Locale = T1.Locale AND
T2.RowKey = T1.RowKey AND
T2.RowID 0 THEN 1 ELSE 0 END,
@RowID = RowID
FROM #Temptable T1 WITH (INDEX(0))
-----------------------------------------------------------
---------- Selects The first member of each group (which is what I took as process 1) and
---------- any rows marked as duplicates and displays them
SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code
FROM #TempTable T1
LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1
LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow
WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL
The idea in getting the result inside A range is, lets say I have this sample
1 2 en A
2 2 en A
3 2 en A
5 2 zh A
6 3 de A
10 3 zh A
11 3 de A
12 3 en A
13 3 zh A
It should have the final output:
2 2 en A -- This is a dupe from (1 2 en A)
3 2 en A -- This is a dupe from (2 2 en A)
11 3 de A -- This is a dupe from (6 3 de A)
Probably you might be wondering why (13 3 zh A) wasn't included in the final result.
That is because there is no duplicate of (13 3 zh A), starting from the last dupe found which is (11 3 de A)
to (13 3 zh A). It should always start looking for duplicates from the last duplicate found onwards.
Hope this helps.
Thank you for your help.
October 20, 2008 at 9:24 am
Ok, so a slight change in logic. How about this one?
[font="Courier New"]
------ Declare Variables ----------------------------------
DECLARE @GroupID INT,
@PrevCode CHAR(5),
@RowID INT,
@LastDupe INT,
@Dupe TinyInt
SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)
SET @GroupID = 1
SET @LastDupe = 1
-----------------------------------------------------------
-- Breaks your data out into groups so it can be scanned --
UPDATE #TempTable
SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,
GroupID = @GroupID,
@PrevCode = Code
FROM #TempTable WITH (INDEX(0))
-----------------------------------------------------------
------------- Goes through your data and flags Dupes ------
UPDATE T1
SET @Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2
WHERE T2.GroupID = T1.GroupID AND
T2.Locale = T1.Locale AND
T2.RowKey = T1.RowKey AND
T2.RowID < T1.RowID AND
T2.RowID > @LastDupe) > 0 THEN 1 ELSE 0 END,
Dupe = @Dupe,
@LastDupe = CASE WHEN @Dupe = 1 THEN RowID ELSE @LastDupe END,
@RowID = RowID
FROM #Temptable T1 WITH (INDEX(0))
-----------------------------------------------------------
---------- Selects The first member of each group (which is what I took as process 1) and
---------- any rows marked as duplicates and displays them
SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code, T1.GroupID
FROM #TempTable T1
LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1
LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow
WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL[/font]
Using that code gives me the following results on your test data:
RESULTS
-----------------
35zhA2
112zhA2
121zhC3
141enA4
181enA4
193deA4
201enC5
211enC5
If these are incorrect, point out which ones should or should not be there / why.
October 20, 2008 at 9:44 am
Garadin (10/20/2008)
Using that code gives me the following results on your test data:
RESULTS
-----------------
35zhA2
112zhA2
121zhC3
141enA4
181enA4
193deA4
201enC5
211enC5
If these are incorrect, point out which ones should or should not be there / why.
I think you're using this data, the first post that I gave:
INSERT #TempTable(RowID,RowKey,Locale,Code)
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
The results are incorrect, it should be:
CORRECT RESULTS
------------------
35zhA2
112zhA2
121zhC3
141enA4
181enA4
201enC5
These two should not be included because...
19 3deA4
-- you should look for duplicates from the last duplicate found and in this case the last duplicate found before (193deA4) is (18
1enA4)
21 1enC5
-- I just want to look at duplicates on range of A's and not C's as explained on the attached image and previous post.
Viewing 15 posts - 1 through 15 (of 115 total)
You must be logged in to reply to this topic. Login to reply