October 28, 2008 at 7:40 am
AnzioBake (10/28/2008)
The above change produces the following results:

RowID  RowKey  Locale  Code
1      1       en      A
2      1       en      A
5      1       en      C
7      1       en      A
10     1       en      A
12     1       zh      A
13     1       zh      A
14     1       zh      A
20     1       en      D
Note: the first row in the table (RowID = 0) is removed, and rows 12, 13, and 14 are all in the result set, as required.
Am I still missing something in the solution?
Anzio, your approach looks good... I tested it for data integrity and it looks great; it's giving me the correct output, but I still have to test its performance later tonight when I get home. I'll keep you posted. Thanks.
October 28, 2008 at 9:26 am
Dan,
Despite your explanation, I'm still not sure I completely understand the requirements, but I now have what I think is a solution. I'm not too sure about its performance, but it does use an UPDATE against a table variable, which could easily be replaced with a temp table. I took the initial 3 patterns from your original post, and this code produces correct results for all 3. I don't have time to look through the other 6 pages for other test patterns, so you'll have to test this against those yourself. I have patterns 2 and 3 commented out using the /* and */ construct:
--===== Set up the initial test data
DECLARE @data TABLE(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(2),
Locale char(2),
Code char(1)
)
INSERT INTO @data
--===== PATTERN ONE
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 21,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
--===== PATTERN TWO
/*
SELECT 1,'1','en','D' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'1','zh','A' UNION ALL
SELECT 5,'1','zh','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','C' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'3','en','A' UNION ALL
SELECT 21,'3','de','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','D' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
*/
--===== PATTERN THREE
/*
SELECT 1,'5','en','D' UNION ALL
SELECT 2,'5','de','C' UNION ALL
SELECT 3,'4','zh','C' UNION ALL
SELECT 5,'5','zh','A' UNION ALL
SELECT 6,'4','zh','A' UNION ALL
SELECT 7,'3','zh','A' UNION ALL
SELECT 11,'2','zh','A' UNION ALL
SELECT 15,'1','zh','A' UNION ALL
SELECT 16,'1','zh','A' UNION ALL
SELECT 20,'2','en','C' UNION ALL
SELECT 21,'3','de','C' UNION ALL
SELECT 22,'4','de','A' UNION ALL
SELECT 23,'2','en','A' UNION ALL
SELECT 24,'4','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
*/
--===== Set up a results table
DECLARE @EDATA TABLE(
RN int PRIMARY KEY CLUSTERED,
RowID int,
RowKey char(1),
Locale char(2),
Code char(1),
STARTS_GRP char(1),
IS_DUPE char(1),
IS_NEXT_NON_A char(1)
)
--===== Populate our results table with initial data and row numbers
INSERT INTO @EDATA (RN, RowID, RowKey, Locale, Code)
SELECT ROW_NUMBER() OVER(ORDER BY RowID) AS RN, RowID, RowKey, Locale, Code
FROM @data
--===== Update our results to indicate where groups of A's start
UPDATE E
SET STARTS_GRP = 'Y'
FROM @EDATA AS E
WHERE E.Code = 'A' AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code <> 'A' AND
E2.RN = E.RN - 1
)
--===== Update our results to indicate any consecutive duplicates
UPDATE E
SET IS_DUPE = 'Y'
FROM @EDATA AS E
WHERE E.Code = 'A' AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN = E.RN - 1 AND
E2.RowKey = E.RowKey AND
E2.Locale = E.Locale
)
--===== Update our results to indicate every break from 'A'
UPDATE E
SET IS_NEXT_NON_A = 'Y'
FROM @EDATA AS E
WHERE E.Code <> 'A' AND
(EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN = E.RN - 1
) OR
EXISTS (
SELECT E5.RN
FROM @EDATA AS E5
WHERE E5.RN = E.RN - 1 AND
E5.Code <> 'A' AND
E5.Code <> E.Code
)
)
--===== Update our results to find non-consecutive dupes
UPDATE E
SET IS_DUPE = 'Y'
FROM @EDATA AS E
WHERE Code = 'A' AND
IS_DUPE IS NULL AND
(SELECT E4.IS_DUPE FROM @EDATA AS E4 WHERE E4.RN = E.RN - 1) IS NULL AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN < E.RN - 1 AND
E2.RN >= (
SELECT MAX(E3.RN)
FROM @EDATA AS E3
WHERE E3.RN < E.RN AND
(E3.STARTS_GRP = 'Y' OR E3.IS_DUPE = 'Y')
) AND
E2.RowKey = E.RowKey AND
E2.Locale = E.Locale AND
E2.Code = E.Code
)
--===== Update our results to eliminate non-consecutive dupes that don't qualify
UPDATE E
SET IS_DUPE = NULL
FROM @EDATA AS E INNER JOIN @EDATA AS E2
ON E.RN = E2.RN + 1
WHERE E.IS_DUPE = 'Y' AND
E2.IS_DUPE = 'Y' AND
E.Code = 'A' AND
E2.Code = 'A' AND
(E.RowKey <> E2.RowKey OR
E.Locale <> E2.Locale)
--===== Select our entire results table and also the proper results rows
SELECT *
FROM @EDATA
SELECT RowID
FROM @EDATA
WHERE STARTS_GRP IS NOT NULL OR
IS_DUPE IS NOT NULL OR
IS_NEXT_NON_A IS NOT NULL
Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of the performance is that I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query was 0, which means the results were already in cache. As I have no means (nor authority) to clear the cache, I can't generate a valid timed result until at least tomorrow.
Steve
(aka smunson)
:):):)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
October 28, 2008 at 9:32 am
UPDATE: I just got lucky and generated a 13 ms execution time on pattern 1. Subsequent attempts went right back to zero, despite changing to other patterns. I have no idea whether this is the least bit representative, nor how it might apply to a considerably larger test dataset.
Steve
(aka smunson)
:):):)
October 28, 2008 at 10:01 am
Hi,
As I mentioned in my first post, I tested the code against a table with 1 million records, and it completed in under 3 minutes, so let me know if you are getting a very different result...
October 28, 2008 at 10:58 am
I'm glad to see you guys finally slaying this beast 🙂 Wish I could have participated more, but I had to work on my personal performance to improve the integrity of my job retention. 😉
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 28, 2008 at 11:24 am
Anzio,
If you get the chance, could you test MY code against that same 1 million records? I'd love to know a) if it produces the same result, and b) what kind of performance it has. I don't have an environment available for this kind of testing. Thanks!
Steve
(aka smunson)
:):):)
AnzioBake (10/28/2008)
Hi. As I mentioned in my first post, I tested the code against a table with 1 million records and it completed in under 3 minutes, so let me know if you are getting a very different result...
October 28, 2008 at 12:56 pm
smunson (10/28/2008)
Dan,.....
Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of performance is because I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query is 0, which means the results were already in cache, and as I have no means (nor authority) to clear it, I can't generate a valid timed result until at least tomorrow.
Steve
(aka smunson)
:):):)
Hi Steve, I have tested your code on the following pattern, but it's giving me an incorrect result.
--===== Set up the initial test data
DECLARE @data TABLE(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(2),
Locale char(2),
Code char(1)
)
INSERT INTO @data
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'2','en','A' UNION ALL
SELECT 3,'3','en','A' UNION ALL
SELECT 4,'1','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 13,'3','en','A' UNION ALL
SELECT 14,'1','en','A' UNION ALL
SELECT 15,'2','en','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'1','en','A' UNION ALL
SELECT 31,'2','en','A'
---// Output/Result should be: 4,14,24
/*
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'2','en','A' UNION ALL
SELECT 3,'1','en','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'1','en','A' UNION ALL
SELECT 13,'2','en','A' UNION ALL
SELECT 14,'2','en','A' UNION ALL
SELECT 15,'2','en','A' UNION ALL
SELECT 23,'4','en','A' UNION ALL
SELECT 24,'5','en','A' UNION ALL
SELECT 31,'1','en','A'
--// Output/Result should be: 3,5,14,15
*/
Thanks for your help and time on this.
October 28, 2008 at 1:35 pm
Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.
October 28, 2008 at 3:03 pm
Garadin (10/28/2008)
Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.
Seth, your solution works great. It gives me the correct output for every possible pattern that I've created. I still have to test its performance later tonight after work.
Thanks for the help and time.
October 28, 2008 at 3:36 pm
Ok. For anyone reading this thread who is still somewhat confused about how this works, I'm going to attempt to break it down the way I understand it (now that I'm fairly certain I understand it correctly). One of the pivotal concepts that might get missed is that he *does* include the row that *is* a duplicate, but not any rows before it, in the next search. This is how you can see 3 rows in a row that are all marked as duplicates: the second and third are duplicates of the one directly before them, which is included in the new scan range.
To lower the number of rows I need to effectively demonstrate my point, I will use the same Locale in all example rows. Keep in mind however, that locale must match as well for this to be considered a dupe. If the row should be marked, it has a comment next to it in the following list. Rows not flagged are at the end.
RowID  RowKey  Locale  Code  Reason for Flagging
1      1       en      A
2      1       en      A     Duplicate of 1 (Range 1-2)
3      2       en      C     New Code Group
4      1       en      C
5      1       en      A     New Code Group
6      2       en      A
7      1       en      A     Duplicate of 5 (Range 5-7)
8      1       en      A     Duplicate of 7 (Range 7-8)
9      2       en      A
10     2       en      A     Duplicate of 9 (Range 8-10)
11     1       en      A
12     2       en      A     Duplicate of 10 (Range 10-12)
13     1       en      C     New Code Group
14     1       en      C
15     1       en      A     New Code Group
Reasons the others were not chosen:
1. First record in table. This is the only one that does not follow the rule of first in a new code group.
4. He does not want duplicates of C's... only A's.
6. No Duplicate in the group which started at 5.
9. No match starting at 8 (when the last dupe was found)
11. No match starting at 10 (when the last dupe was found)
14. He does not want duplicates of C's... only A's.
I don't know if that is any clearer than all of his explanations and diagrams, but I figured it couldn't hurt (unless, of course, it's wrong).
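To make the rules above concrete, here is a minimal Python sketch of the flagging logic as described (my own reconstruction for illustration, not anyone's posted T-SQL): flag the first row of each new code group, and flag an 'A' row as a duplicate when its RowKey/Locale pair already appears in the current search range, restarting the range at each duplicate found.

```python
def flag_rows(rows):
    """rows: list of (row_id, row_key, locale, code) tuples, ordered by row_id.
    Returns the row_ids that should be flagged under the rules described above."""
    flagged = []
    search_start = 0  # index where the current duplicate-search range begins
    for i, (rid, key, loc, code) in enumerate(rows):
        if i == 0:
            continue                  # the very first row is never flagged
        if code != rows[i - 1][3]:
            flagged.append(rid)       # first row of a new code group
            search_start = i          # duplicate search restarts at the group start
        elif code == 'A':
            # duplicate = same RowKey + Locale anywhere in the current range
            if any(r[1] == key and r[2] == loc for r in rows[search_start:i]):
                flagged.append(rid)
                search_start = i      # the dupe itself opens the next range
    return flagged
```

Run against the 15-row example above, this returns [2, 3, 5, 7, 8, 10, 12, 13, 15], matching the table; against Dan's all-'A' pattern, it returns [4, 14, 24] as required.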
October 28, 2008 at 6:10 pm
I believe that is true, Seth; well pointed out. Thank you for helping me explain, and for your help.
The one thing I'm still concerned about is data with concentrated duplicates, like the example below, which I'd guess would affect performance more.
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'1','en','A' UNION ALL
SELECT 3,'1','en','A' UNION ALL
SELECT 4,'1','en','A' UNION ALL
SELECT 5,'1','en','A' UNION ALL
SELECT 6,'1','en','A' UNION ALL
....
....
.....
SELECT 199999,'1','en','A' UNION ALL
SELECT 200000,'1','en','A'
I also came up with a solution, but it's much slower in that scenario and quite a bit faster otherwise. Hopefully someone can enlighten me on how to improve its performance for the scenario above.
DECLARE @RecordCount bigint
SET @RecordCount = (SELECT COUNT(*) FROM #TempTable)
IF @RecordCount > 0
BEGIN
SELECT RowID,RowKey,Locale,Code,NULL GroupID
INTO #MyTable
FROM #TempTable
CREATE CLUSTERED INDEX Index_RowID_Clstd ON #MyTable(RowID)
DECLARE @PrevAction char(1),
@GroupID int,
@DeletedRowID int,
@PrevRowID int,
@MinRowID int,
@MinRow int,
@MaxRow int,
@NumMRange int,
@RowCnt int,
@OptimumRows int
DECLARE @RangeOfMs TABLE (
ID int IDENTITY PRIMARY KEY CLUSTERED,
MinRow int,
MaxRow int
)
DECLARE @Results TABLE (
RowID int,
Code char(1)
)
SET @PrevAction = (SELECT Code FROM #MyTable WHERE RowID = 1)
SET @GroupID = 1
UPDATE #MyTable
SET @GroupID = CASE WHEN Code <> @PrevAction THEN @GroupID + 1 ELSE @GroupID END,
GroupID = @GroupID,
@PrevAction = Code
FROM #MyTable WITH (INDEX(0))
SELECT * INTO #RankedData
FROM (SELECT *,
rank() OVER (PARTITION BY Code,GroupID ORDER BY RowID) AS MyRank
FROM #MyTable
)[Subset]
;WITH RangePerActionCTE (MinRange,MaxRange) AS (
SELECT MIN(RowID), MAX(RowID)
FROM #RankedData
WHERE Code = 'A'
GROUP BY Code,GroupID
)
INSERT @RangeOfMs
SELECT * FROM RangePerActionCTE
SET @NumMRange = (SELECT COUNT(*) FROM @RangeOfMs)
SET @DeletedRowID = -1
SET @RowCnt = 1
WHILE @RowCnt <= @NumMRange
BEGIN
SELECT
@MinRow = MinRow,
@MaxRow = MaxRow
FROM @RangeOfMs
WHERE ID = @RowCnt
SET @MinRowID = (SELECT MIN(RowID)
FROM (SELECT *,
rank() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) MyDup
FROM #MyTable
WHERE RowID BETWEEN @MinRow AND @MaxRow
AND RowID >= @DeletedRowID)[Aggregate]
WHERE MyDup > 1)
IF @MinRowID IS NOT NULL BEGIN
INSERT INTO @Results SELECT @MinRowID,'A'
SET @DeletedRowID = @MinRowID
CONTINUE;
END
ELSE BEGIN
SET @RowCnt = @RowCnt + 1
SET @DeletedRowID = -1
END
END
--//Process result
SELECT * FROM @Results
UNION ALL
SELECT RowID,Code FROM #RankedData
WHERE MyRank = 1
AND RowID > 1
ORDER BY RowID
DROP TABLE #RankedData
DROP TABLE #MyTable
END
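For anyone puzzling over the grouping step at the top of that script: the "quirky update" assigns a GroupID that increments whenever Code changes between consecutive rows, relying on the clustered index scan order (hence the WITH (INDEX(0)) hint). As a rough, hypothetical illustration of what that pass computes, here is the equivalent run-length grouping sketched in Python:

```python
def assign_groups(codes):
    """Assign each row a group id that increments whenever the Code value
    changes from the previous row (rows assumed ordered by RowID)."""
    group_ids = []
    gid, prev = 1, None
    for code in codes:
        if prev is not None and code != prev:
            gid += 1
        group_ids.append(gid)
        prev = code
    return group_ids
```

For example, assign_groups(['A', 'A', 'C', 'C', 'A', 'A']) yields [1, 1, 2, 2, 3, 3], i.e. one GroupID per consecutive run of the same Code, which is what the later PARTITION BY Code, GroupID steps depend on.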
October 29, 2008 at 1:48 am
Hi Steve (smunson),
To test your solution, I first ran mine again, as I had made some changes. Execution time was 5 min 36 sec.
I added a print statement after each update statement in your solution. These are the results
Process Start 2008-10-29 08:47:48
Completed Step 0 2008-10-29 08:47:55
Completed Step 1 2008-10-29 08:47:59
Completed Step 2 2008-10-29 08:48:03
Completed Step 3 2008-10-29 08:48:07
The statement has been terminated.
Query was cancelled by user.
I stopped the query after 54 min 24 sec.
October 29, 2008 at 6:02 am
AnzioBake (10/29/2008)
Hi Steve (smunson). To test your solution, I first ran mine again, as I had made some changes. Execution time was 5 min 36 sec.
I added a print statement after each update statement in your solution. These are the results
Process Start 2008-10-29 08:47:48
Completed Step 0 2008-10-29 08:47:55
Completed Step 1 2008-10-29 08:47:59
Completed Step 2 2008-10-29 08:48:03
Completed Step 3 2008-10-29 08:48:07
The statement has been terminated.
Query was cancelled by user.
I stopped the query after 54 min 24 sec.
For how many rows?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 29, 2008 at 6:10 am
The one thing I'm still concerned about is data with concentrated duplicates, like the example below, which I'd guess would affect performance more.
Due to the way my solution works (I haven't looked at Anzio's or Steve's closely enough to tell if theirs behave the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?
October 29, 2008 at 6:16 am
I tested the solutions against the test data set that I created. The table contained 1 million records, with sequences ranging from 1 to 35 rows.
My solution returned 670,000+ rows in the result set.