February 23, 2011 at 12:30 pm
This query finds duplicate values and inserts the results into another table, using the basic structure shown below. The table will grow from 35 million to probably 65 million records over the next year. The query is causing a spike in tempdb of up to 14,000 milliseconds of wait per IO for about an hour each night. We need a better approach.
INSERT INTO .......
SELECT ROW_NUMBER() OVER (
    ORDER BY .... -- orders by two varchar fields
)
......
FROM SUMMARY_TABLE AS Summary02
CROSS JOIN SUMMARY_TABLE AS Summary13
WHERE
February 23, 2011 at 12:42 pm
Couldn't you just do a COUNT/HAVING approach?
Select Field1, Field2, COUNT(*) AS Records
FROM SUMMARY_TABLE
GROUP BY Field1, Field2
HAVING COUNT(*) > 1
That will identify the multiple occurrences and from there you can resolve as needed.
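One common way to "resolve as needed" is to delete everything past the first row in each duplicate group. This is only a sketch: `Field1`/`Field2` stand in for the two varchar columns and `PK` for whatever uniquely identifies a row in the real table (all hypothetical names).

```sql
-- Sketch only: Field1/Field2 are the duplicate-defining columns, PK a unique key.
-- ROW_NUMBER restarts at 1 for each (Field1, Field2) group; deleting rn > 1
-- keeps the lowest PK per group and removes the rest.
WITH Dupes AS
(
    SELECT PK,
           ROW_NUMBER() OVER (PARTITION BY Field1, Field2 ORDER BY PK) AS rn
    FROM dbo.SUMMARY_TABLE
)
DELETE FROM Dupes
WHERE rn > 1;
```

Deleting through the CTE works in SQL Server 2005 and up; test on a copy first, since which row "wins" depends entirely on the ORDER BY you choose.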
--------------------------------------------------------------------------
When you realize you've dug yourself into a hole....Step 1...stop digging.
February 23, 2011 at 1:11 pm
I tested yours with our info and it ran in 1.5 minutes instead of 17 on a test box. But I'll have to see what the developers are trying to accomplish. Since most of our SQL is object-generated, I usually just pass the culprits on as bugs via our ticketing system.
February 23, 2011 at 1:13 pm
Well let's hope it helps. Just based on the numbers you replied back with they should at least look at other "options".
February 23, 2011 at 4:29 pm
I guess I'd have to ask... why do they allow the "dupes" to continue?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 23, 2011 at 4:32 pm
I'm with you on that one. I didn't want to clutter up the quick answer, but that is a great question. I think I'm afraid to hear the answer, though.
February 23, 2011 at 5:44 pm
I'm not sure. These are car loans, so it might be a case where a VIN exists twice: one record is a paid-off loan still in the database, and the other is the same VIN (vehicle identification number) on a newer loan.
I'm more familiar with our legacy system, where a VIN can exist twice, but anything beyond the primary active record has a special status indicating it is in some sort of error condition. In other words, there is a unique index on VIN excepting that error status flag. Files from clients (banks) and state motor vehicle offices can arrive with a VIN that already exists, and simply refusing to load files where the VIN was a duplicate would create its own problems.
February 23, 2011 at 5:48 pm
Sounds like historical records without dates or status flags 🙂
Anyway, I hope the new version works well for you. A non-clustered index on those two fields should dramatically speed up the query as well (though I don't know how adversely it would impact your INSERTs/UPDATEs/DELETEs). Definitely test it if you decide to create an index.
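For what that index would look like, here's a minimal sketch, again assuming the two varchar columns are called `Field1` and `Field2` (hypothetical names):

```sql
-- Hypothetical column names; substitute the two varchar fields from the real query.
-- Covering both grouping columns lets the GROUP BY/HAVING duplicate check scan
-- this narrow index instead of the full 35-million-row table.
CREATE NONCLUSTERED INDEX IX_SUMMARY_TABLE_Field1_Field2
    ON dbo.SUMMARY_TABLE (Field1, Field2);
```

As noted above, measure the extra write cost on INSERT/UPDATE/DELETE before putting this on the production table.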
February 23, 2011 at 5:53 pm
It will be up to the developers to decide if they're going to revise anything, unless Systems decides to "insist". I've done a lot of googling on ROW_NUMBER and have yet to see anyone else using it with a cross join of a huge table against a copy of itself.
February 23, 2011 at 5:55 pm
Let us know how it goes!
February 23, 2011 at 11:54 pm
Indianrock (2/23/2011)
This query finds duplicate values and inserts the results into another table using the basic structure shown below.
An alternative solution would be to create insert, update, and delete triggers on the summary table to maintain a list of primary key values in the table where there are duplicates for the two character fields. This will add a small overhead to data-changing operations on the table, but finding duplicates would be trivial, and practically instantaneous. As an alternative to the triggers, you could also consider an indexed view (with a filtered non-clustered index in 2008).
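To make the trigger idea concrete, here is a rough sketch of the INSERT case only, using hypothetical names (`Field1`/`Field2` for the two varchar columns, `PK` for the key, `SummaryDuplicates` for the side table); a real implementation would need matching UPDATE and DELETE triggers as well:

```sql
-- Hypothetical side table: PKs of rows whose (Field1, Field2) combination
-- currently appears more than once in dbo.SUMMARY_TABLE.
CREATE TABLE dbo.SummaryDuplicates (PK INTEGER PRIMARY KEY);
GO
-- INSERT trigger sketch. It re-checks only the (Field1, Field2) groups
-- touched by the insert, so the per-statement overhead stays small.
CREATE TRIGGER trg_Summary_Insert ON dbo.SUMMARY_TABLE
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Record every row in a now-duplicated group that isn't listed yet.
    INSERT dbo.SummaryDuplicates (PK)
    SELECT s.PK
    FROM dbo.SUMMARY_TABLE AS s
    JOIN (SELECT DISTINCT Field1, Field2 FROM inserted) AS i
        ON  i.Field1 = s.Field1
        AND i.Field2 = s.Field2
    WHERE NOT EXISTS (SELECT 1 FROM dbo.SummaryDuplicates AS d WHERE d.PK = s.PK)
      AND EXISTS (SELECT 1
                  FROM dbo.SUMMARY_TABLE AS s2
                  WHERE s2.Field1 = s.Field1
                    AND s2.Field2 = s.Field2
                    AND s2.PK <> s.PK);
END;
```

With this in place, finding duplicates is just `SELECT PK FROM dbo.SummaryDuplicates`. The DELETE trigger would need the reverse logic: remove entries whose group has dropped back to a single row.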
February 25, 2011 at 10:31 am
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
February 25, 2011 at 10:59 am
Indianrock (2/25/2011)
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
I think I know where Paul's going with that (I'm never quite sure, I'm still playing catchup to his knowledge... 🙂 ), but since you're in 2k5 you can't use a filtered index, so it's a moot point.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
February 25, 2011 at 11:30 am
Indianrock (2/25/2011)
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
I'll post an example in a second.
February 25, 2011 at 12:03 pm
CREATE TABLE dbo.Summary
(
    PK INTEGER PRIMARY KEY,
    group1 INTEGER NOT NULL,
    group2 INTEGER NOT NULL,
    value INTEGER NOT NULL
);
GO
INSERT dbo.Summary VALUES (1, 1, 1, 100);
INSERT dbo.Summary VALUES (2, 1, 2, 200);
INSERT dbo.Summary VALUES (3, 2, 3, 300);
INSERT dbo.Summary VALUES (4, 2, 4, 400);
INSERT dbo.Summary VALUES (5, 2, 3, 500);
INSERT dbo.Summary VALUES (6, 3, 5, 600);
INSERT dbo.Summary VALUES (7, 3, 5, 700);
INSERT dbo.Summary VALUES (8, 4, 5, 800);
INSERT dbo.Summary VALUES (9, 5, 5, 900);
GO
CREATE INDEX nc1 ON dbo.Summary (group1, group2);
GO
-- View definition
CREATE VIEW dbo.IV WITH SCHEMABINDING AS
SELECT group1, group2, COUNT_BIG(*) AS row_count
FROM dbo.Summary
GROUP BY
group1,
group2;
GO
-- Index it
CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.IV (group1, group2);
CREATE NONCLUSTERED INDEX nc1 ON dbo.IV (row_count);
GO
-- NOEXPAND hint required for non-Enterprise SKUs
SELECT IV.group1,
IV.group2
FROM dbo.IV AS IV WITH (NOEXPAND)
WHERE row_count > 1
GO
-- In Enterprise SKUs, this query seeks on the indexed view
SELECT group1,
group2,
COUNT_BIG(*) AS row_count
FROM dbo.Summary
GROUP BY
group1,
group2
HAVING COUNT_BIG(*) > 1
GO
DROP VIEW dbo.IV;
DROP TABLE dbo.Summary;