October 28, 2008 at 7:40 am
AnzioBake (10/28/2008)
The above change produces the following results:

RowID  RowKey  Locale  Code
1      1       en      A
2      1       en      A
5      1       en      C
7      1       en      A
10     1       en      A
12     1       zh      A
13     1       zh      A
14     1       zh      A
20     1       en      D
Note: the first row in the table (RowID = 0) is removed, and rows 12, 13, and 14 are all in the result set, as required.
Am I still missing something in the solution?
Anzio, your approach looks good... I tested it for data integrity and it looks great; it's giving me the correct output, but I still have to test its performance later tonight when I get home. I'll keep you posted. Thanks.
October 28, 2008 at 9:26 am
Dan,
Despite your explanation, I'm still not sure I completely understand the requirements, but I now have what I think is a solution. I'm not too sure about its performance, but it does use an UPDATE against a table variable, which could easily be replaced with a temp table. I took the initial 3 patterns from your original post, and this code produces correct results for all 3. I don't have time to look through the other 6 pages for other test patterns, so you'll have to test this against those yourself. I have patterns 2 and 3 commented out using the /* and */ construct:
--===== Set up the initial test data
DECLARE @data TABLE(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(2),
Locale char(2),
Code char(1)
)
INSERT INTO @data
--===== PATTERN ONE
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','A' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'1','en','A' UNION ALL
SELECT 21,'3','en','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
--===== PATTERN TWO
/*
SELECT 1,'1','en','D' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','zh','A' UNION ALL
SELECT 4,'1','zh','A' UNION ALL
SELECT 5,'1','zh','A' UNION ALL
SELECT 10,'1','zh','A' UNION ALL
SELECT 11,'1','zh','C' UNION ALL
SELECT 15,'1','zh','C' UNION ALL
SELECT 16,'1','de','C' UNION ALL
SELECT 20,'3','en','A' UNION ALL
SELECT 21,'3','de','A' UNION ALL
SELECT 22,'3','de','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'3','de','D' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
*/
--===== PATTERN THREE
/*
SELECT 1,'5','en','D' UNION ALL
SELECT 2,'5','de','C' UNION ALL
SELECT 3,'4','zh','C' UNION ALL
SELECT 5,'5','zh','A' UNION ALL
SELECT 6,'4','zh','A' UNION ALL
SELECT 7,'3','zh','A' UNION ALL
SELECT 11,'2','zh','A' UNION ALL
SELECT 15,'1','zh','A' UNION ALL
SELECT 16,'1','zh','A' UNION ALL
SELECT 20,'2','en','C' UNION ALL
SELECT 21,'3','de','C' UNION ALL
SELECT 22,'4','de','A' UNION ALL
SELECT 23,'2','en','A' UNION ALL
SELECT 24,'4','de','A' UNION ALL
SELECT 30,'1','en','C' UNION ALL
SELECT 31,'1','en','C'
*/
--===== Set up a results table
DECLARE @EDATA TABLE(
RN int PRIMARY KEY CLUSTERED,
RowID int,
RowKey char(1),
Locale char(2),
Code char(1),
STARTS_GRP char(1),
IS_DUPE char(1),
IS_NEXT_NON_A char(1)
)
--===== Populate our results table with initial data and row numbers
INSERT INTO @EDATA (RN, RowID, RowKey, Locale, Code)
SELECT ROW_NUMBER() OVER(ORDER BY RowID) AS RN, RowID, RowKey, Locale, Code
FROM @data
--===== Update our results to indicate where groups of A's start
UPDATE E
SET STARTS_GRP = 'Y'
FROM @EDATA AS E
WHERE E.Code = 'A' AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code <> 'A' AND
E2.RN = E.RN - 1
)
--===== Update our results to indicate any consecutive duplicates
UPDATE E
SET IS_DUPE = 'Y'
FROM @EDATA AS E
WHERE E.Code = 'A' AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN = E.RN - 1 AND
E2.RowKey = E.RowKey AND
E2.Locale = E.Locale
)
--===== Update our results to indicate every break from 'A'
UPDATE E
SET IS_NEXT_NON_A = 'Y'
FROM @EDATA AS E
WHERE E.Code <> 'A' AND
(EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN = E.RN - 1
) OR
EXISTS (
SELECT E5.RN
FROM @EDATA AS E5
WHERE E5.RN = E.RN - 1 AND
E5.Code <> 'A' AND
E5.Code <> E.Code
)
)
--===== Update our results to find non-consecutive dupes
UPDATE E
SET IS_DUPE = 'Y'
FROM @EDATA AS E
WHERE Code = 'A' AND
IS_DUPE IS NULL AND
(SELECT E4.IS_DUPE FROM @EDATA AS E4 WHERE E4.RN = E.RN - 1) IS NULL AND
EXISTS (
SELECT E2.RN
FROM @EDATA AS E2
WHERE E2.Code = 'A' AND
E2.RN < E.RN - 1 AND
E2.RN >= (
SELECT MAX(E3.RN)
FROM @EDATA AS E3
WHERE E3.RN < E.RN AND
(E3.STARTS_GRP = 'Y' OR E3.IS_DUPE = 'Y')
) AND
E2.RowKey = E.RowKey AND
E2.Locale = E.Locale AND
E2.Code = E.Code
)
--===== Update our results to eliminate non-consecutive dupes that don't qualify
UPDATE E
SET IS_DUPE = NULL
FROM @EDATA AS E INNER JOIN @EDATA AS E2
ON E.RN = E2.RN + 1
WHERE E.IS_DUPE = 'Y' AND
E2.IS_DUPE = 'Y' AND
E.Code = 'A' AND
E2.Code = 'A' AND
(E.RowKey <> E2.RowKey OR
E.Locale <> E2.Locale)
--===== Select our entire results table and also the proper results rows
SELECT *
FROM @EDATA
SELECT RowID
FROM @EDATA
WHERE STARTS_GRP IS NOT NULL OR
IS_DUPE IS NOT NULL OR
IS_NEXT_NON_A IS NOT NULL
Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of the performance is that I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query was 0, which means the results were already in cache. As I have no means (nor authority) to clear the cache, I can't generate a valid timed result until at least tomorrow.
Steve
(aka smunson)
:):):)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
October 28, 2008 at 9:32 am
UPDATE: I just got lucky and generated a 13 ms execution time on pattern 1. Subsequent attempts went right back to zero, despite changing to other patterns. I have no idea whether this is the least bit representative, nor how it might apply to a considerably larger test dataset.
Steve
(aka smunson)
:):):)
October 28, 2008 at 10:01 am
Hi,
As I mentioned in my first post, I tested the code against a table with 1 million records, and it completed in under 3 minutes, so let me know if you are getting a very different result...
October 28, 2008 at 10:58 am
I'm glad to see you guys finally slaying this beast 🙂 Wish I could have participated more, but I had to work on my personal performance to improve the integrity of my job retention. 😉
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 28, 2008 at 11:24 am
Anzio,
If you get the chance, could you test MY code against that same 1 million records? I'd love to know a) if it produces the same result, and b) what kind of performance it has. I don't have an environment available for this kind of testing. Thanks!
Steve
(aka smunson)
:):):)
AnzioBake (10/28/2008)
Hi. As I mentioned in my first post, I tested the code against a table with 1 million records and it completed in under 3 minutes, so let me know if you are getting a very different result...
October 28, 2008 at 12:56 pm
smunson (10/28/2008)
Dan,.....
Let me know if you can test this against other patterns and see if it works. The reason I can't be sure of performance is because I timed my query after perfecting it, and the difference in milliseconds between a GETDATE() before and after the query is 0, which means the results were already in cache, and as I have no means (nor authority) to clear it, I can't generate a valid timed result until at least tomorrow.
Steve
(aka smunson)
:):):)
Hi Steve, I have tested your code on the following pattern, but it's giving me an incorrect result.
--===== Set up the initial test data
DECLARE @data TABLE(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(2),
Locale char(2),
Code char(1)
)
INSERT INTO @data
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'2','en','A' UNION ALL
SELECT 3,'3','en','A' UNION ALL
SELECT 4,'1','en','A' UNION ALL
SELECT 5,'2','en','A' UNION ALL
SELECT 13,'3','en','A' UNION ALL
SELECT 14,'1','en','A' UNION ALL
SELECT 15,'2','en','A' UNION ALL
SELECT 23,'3','en','A' UNION ALL
SELECT 24,'1','en','A' UNION ALL
SELECT 31,'2','en','A'
---// Output/Result should be: 4,14,24
/*
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'2','en','A' UNION ALL
SELECT 3,'1','en','A' UNION ALL
SELECT 4,'2','en','A' UNION ALL
SELECT 5,'1','en','A' UNION ALL
SELECT 13,'2','en','A' UNION ALL
SELECT 14,'2','en','A' UNION ALL
SELECT 15,'2','en','A' UNION ALL
SELECT 23,'4','en','A' UNION ALL
SELECT 24,'5','en','A' UNION ALL
SELECT 31,'1','en','A'
--// Output/Result should be: 3,5,14,15
*/
Thanks for your help and time on this.
October 28, 2008 at 1:35 pm
Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.
October 28, 2008 at 3:03 pm
Garadin (10/28/2008)
Oooh. Do me! Do me! (New solution a few posts back.) Despite the cursor, I'd wager that it'd run *fairly* quick. Unless I screwed something up, I think I finally have the logic of this down, which isn't actually as complicated as it originally sounded.
Seth, your solution works great. It gives me the correct output for every possible pattern that I've created. I still have to test its performance later tonight after work.
Thanks for the help and time.
October 28, 2008 at 3:36 pm
Ok. For anyone reading this thread who is still somewhat confused about how this works, I'm going to attempt to break it down the way I understand it (now that I'm fairly certain I understand it correctly). One of the pivotal concepts that might get missed is that he *does* include the row that *is* a duplicate, but not any rows before it, in the next search. This is how you can see 3 rows in a row that are all marked as duplicates: the second and third are duplicates of the one directly before them, which is included in the new scan range.
To lower the number of rows I need to effectively demonstrate my point, I will use the same Locale in all example rows. Keep in mind however, that locale must match as well for this to be considered a dupe. If the row should be marked, it has a comment next to it in the following list. Rows not flagged are at the end.
RowID  RowKey  Locale  Code  Reason for Flagging
1      1       en      A
2      1       en      A     Duplicate of 1 (Range 1-2)
3      2       en      C     New Code Group
4      1       en      C
5      1       en      A     New Code Group
6      2       en      A
7      1       en      A     Duplicate of 5 (Range 5-7)
8      1       en      A     Duplicate of 7 (Range 7-8)
9      2       en      A
10     2       en      A     Duplicate of 9 (Range 8-10)
11     1       en      A
12     2       en      A     Duplicate of 10 (Range 10-12)
13     1       en      C     New Code Group
14     1       en      C
15     1       en      A     New Code Group
Reasons the others were not chosen:
1. First record in table. This is the only one that does not follow the rule of first in a new code group.
4. He does not want duplicates of C's... only A's.
6. No Duplicate in the group which started at 5.
9. No match starting at 8 (when the last dupe was found)
11. No match starting at 10 (when the last dupe was found)
14. He does not want duplicates of C's... only A's.
I don't know if that is any clearer than all of his explanations and diagrams, but I figured it couldn't hurt (unless, of course, it's wrong).
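To make the rules above concrete, here is a minimal Python sketch of the flagging logic as described (my own reconstruction for illustration, not anyone's posted T-SQL): flag the first row of each new code group, and flag an 'A' row as a duplicate when its RowKey/Locale pair already appears in the current search range, restarting the range at each duplicate found.

```python
def flag_rows(rows):
    """rows: list of (row_id, row_key, locale, code) tuples, ordered by row_id.
    Returns the row_ids that should be flagged under the rules described above."""
    flagged = []
    search_start = 0  # index where the current duplicate-search range begins
    for i, (rid, key, loc, code) in enumerate(rows):
        if i == 0:
            continue                  # the very first row is never flagged
        if code != rows[i - 1][3]:
            flagged.append(rid)       # first row of a new code group
            search_start = i          # duplicate search restarts at the group start
        elif code == 'A':
            # duplicate = same RowKey + Locale anywhere in the current range
            if any(r[1] == key and r[2] == loc for r in rows[search_start:i]):
                flagged.append(rid)
                search_start = i      # the dupe itself opens the next range
    return flagged
```

Run against the 15-row example above, this returns [2, 3, 5, 7, 8, 10, 12, 13, 15], matching the table; against Dan's all-'A' pattern, it returns [4, 14, 24] as required.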
October 28, 2008 at 6:10 pm
I believe that is true, Seth; well pointed out. Thank you for helping me explain, and for your help.
The one thing I'm still concerned about is data with concentrated duplicates, like the example below, which I'd guess would affect performance more.
SELECT 1,'1','en','A' UNION ALL
SELECT 2,'1','en','A' UNION ALL
SELECT 3,'1','en','A' UNION ALL
SELECT 4,'1','en','A' UNION ALL
SELECT 5,'1','en','A' UNION ALL
SELECT 6,'1','en','A' UNION ALL
....
....
.....
SELECT 199999,'1','en','A' UNION ALL
SELECT 200000,'1','en','A'
I also came up with a solution, but it's much slower in that scenario and quite a bit faster otherwise. Hopefully someone can enlighten me on how to improve its performance for the scenario above.
DECLARE @RecordCount bigint
SET @RecordCount = (SELECT COUNT(*) FROM #TempTable)
IF @RecordCount > 0
BEGIN
SELECT RowID,RowKey,Locale,Code,NULL GroupID
INTO #MyTable
FROM #TempTable
CREATE CLUSTERED INDEX Index_RowID_Clstd ON #MyTable(RowID)
DECLARE @PrevAction char(1),
@GroupID int,
@DeletedRowID int,
@PrevRowID int,
@MinRowID int,
@MinRow int,
@MaxRow int,
@NumMRange int,
@RowCnt int,
@OptimumRows int
DECLARE @RangeOfMs TABLE (
ID int IDENTITY PRIMARY KEY CLUSTERED,
MinRow int,
MaxRow int
)
DECLARE @Results TABLE (
RowID int,
Code char(1)
)
SET @PrevAction = (SELECT Code FROM #MyTable WHERE RowID = 1)
SET @GroupID = 1
UPDATE #MyTable
SET @GroupID = CASE WHEN Code <> @PrevAction THEN @GroupID + 1 ELSE @GroupID END,
GroupID = @GroupID,
@PrevAction = Code
FROM #MyTable WITH (INDEX(0))
SELECT * INTO #RankedData
FROM (SELECT *,
rank() OVER (PARTITION BY Code,GroupID ORDER BY RowID) AS MyRank
FROM #MyTable
)[Subset]
;WITH RangePerActionCTE (MinRange,MaxRange) AS (
SELECT MIN(RowID), MAX(RowID)
FROM #RankedData
WHERE Code = 'A'
GROUP BY Code,GroupID
)
INSERT @RangeOfMs
SELECT * FROM RangePerActionCTE
SET @NumMRange = (SELECT COUNT(*) FROM @RangeOfMs)
SET @DeletedRowID = -1
SET @RowCnt = 1
WHILE @RowCnt <= @NumMRange
BEGIN
SELECT
@MinRow = MinRow,
@MaxRow = MaxRow
FROM @RangeOfMs
WHERE ID = @RowCnt
SET @MinRowID = (SELECT MIN(RowID)
FROM (SELECT *,
rank() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) MyDup
FROM #MyTable
WHERE RowID BETWEEN @MinRow AND @MaxRow
AND RowID >= @DeletedRowID)[Aggregate]
WHERE MyDup > 1)
IF @MinRowID IS NOT NULL BEGIN
INSERT INTO @Results SELECT @MinRowID,'A'
SET @DeletedRowID = @MinRowID
CONTINUE;
END
ELSE BEGIN
SET @RowCnt = @RowCnt + 1
SET @DeletedRowID = -1
END
END
--//Process result
SELECT * FROM @Results
UNION ALL
SELECT RowID,Code FROM #RankedData
WHERE MyRank = 1
AND RowID > 1
ORDER BY RowID
DROP TABLE #RankedData
DROP TABLE #MyTable
END
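For anyone puzzling over the grouping step at the top of that script: the "quirky update" assigns a GroupID that increments whenever Code changes between consecutive rows, relying on the clustered index scan order (hence the WITH (INDEX(0)) hint). As a rough, hypothetical illustration of what that pass computes, here is the equivalent run-length grouping sketched in Python:

```python
def assign_groups(codes):
    """Assign each row a group id that increments whenever the Code value
    changes from the previous row (rows assumed ordered by RowID)."""
    group_ids = []
    gid, prev = 1, None
    for code in codes:
        if prev is not None and code != prev:
            gid += 1
        group_ids.append(gid)
        prev = code
    return group_ids
```

For example, assign_groups(['A', 'A', 'C', 'C', 'A', 'A']) yields [1, 1, 2, 2, 3, 3], i.e. one GroupID per consecutive run of the same Code, which is what the later PARTITION BY Code, GroupID steps depend on.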
October 29, 2008 at 1:48 am
Hi Steve (smunson),
To test your solution, I first ran mine again, as I had made some changes. Execution time was 5 min 36 sec.
I added a print statement after each update statement in your solution. These are the results
Process Start 2008-10-29 08:47:48
Completed Step 0 2008-10-29 08:47:55
Completed Step 1 2008-10-29 08:47:59
Completed Step 2 2008-10-29 08:48:03
Completed Step 3 2008-10-29 08:48:07
The statement has been terminated.
Query was cancelled by user.
I stopped the query after 54 min 24 sec.
October 29, 2008 at 6:02 am
AnzioBake (10/29/2008)
Hi Steve (smunson). To test your solution, I first ran mine again, as I had made some changes. Execution time was 5 min 36 sec.
I added a print statement after each update statement in your solution. These are the results
Process Start 2008-10-29 08:47:48
Completed Step 0 2008-10-29 08:47:55
Completed Step 1 2008-10-29 08:47:59
Completed Step 2 2008-10-29 08:48:03
Completed Step 3 2008-10-29 08:48:07
The statement has been terminated.
Query was cancelled by user.
I stopped the query after 54 min 24 sec.
For how many rows?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 29, 2008 at 6:10 am
The one thing I'm still concerned about is data with concentrated duplicates, like the example below, which I'd guess would affect performance more.
Due to the way my solution works (I haven't looked at Anzio's or Steve's closely enough to tell if theirs behave the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?
October 29, 2008 at 6:16 am
I tested the solutions against the test data set that I created. The table contained 1 million records, with sequences ranging from 1 to 35 rows.
My solution returned 670,000+ rows in the result set.