February 23, 2011 at 12:30 pm
This query finds duplicate values and inserts the results into another table, using the basic structure shown below. The table will grow from 35 million to probably 65 million records over the next year. The query is causing a spike in tempdb of up to 14,000 milliseconds of wait per IO for about an hour each night. We need a better approach.
INSERT INTO .......
SELECT ROW_NUMBER() OVER (
    ORDER BY .... -- orders by two varchar fields
)
......
FROM SUMMARY_TABLE AS Summary02
CROSS JOIN SUMMARY_TABLE AS Summary13
WHERE
February 23, 2011 at 12:42 pm
Couldn't you just do a COUNT/HAVING approach?
Select Field1, Field2, COUNT(*) AS Records
FROM SUMMARY_TABLE
GROUP BY Field1, Field2
HAVING COUNT(*) > 1
That will identify the multiple occurrences and from there you can resolve as needed.
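One common way to "resolve as needed" is to delete everything past the first row in each duplicate group. This is only a sketch: `Field1`/`Field2` stand in for the two varchar columns and `PK` for whatever uniquely identifies a row in the real table (all hypothetical names).

```sql
-- Sketch only: Field1/Field2 are the duplicate-defining columns, PK a unique key.
-- ROW_NUMBER restarts at 1 for each (Field1, Field2) group; deleting rn > 1
-- keeps the lowest PK per group and removes the rest.
WITH Dupes AS
(
    SELECT PK,
           ROW_NUMBER() OVER (PARTITION BY Field1, Field2 ORDER BY PK) AS rn
    FROM dbo.SUMMARY_TABLE
)
DELETE FROM Dupes
WHERE rn > 1;
```

Deleting through the CTE works in SQL Server 2005 and up; test on a copy first, since which row "wins" depends entirely on the ORDER BY you choose.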
--------------------------------------------------------------------------
When you realize you've dug yourself into a hole....Step 1...stop digging.
February 23, 2011 at 1:11 pm
I tested yours with our info and it ran in 1.5 minutes instead of 17 on a test box. But I'll have to see what the developers are trying to accomplish. Since most of our SQL is object-generated, I usually just pass the culprits on as bugs via our ticketing system.
February 23, 2011 at 1:13 pm
Well let's hope it helps. Just based on the numbers you replied back with they should at least look at other "options".
February 23, 2011 at 4:29 pm
I guess I'd have to ask... why do they allow the "dupes" to continue?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 23, 2011 at 4:32 pm
I'm with you on that one. I didn't want to clutter up the quick answer, but that is a great question. I think I'm afraid to hear the answer, though.
February 23, 2011 at 5:44 pm
I'm not sure. These are car loans, so it might be a case where a VIN exists twice: one record is a paid-off loan still in the database, and the other is the same VIN (vehicle identification number) on a newer loan.
I'm more familiar with our legacy system, where a VIN can exist twice, but anything beyond the primary active record has a special status indicating it is in some sort of error condition. In other words, there is a unique index on VIN excepting that error status flag. Files from clients (banks) and state motor vehicle offices can arrive with a VIN that already exists, and simply refusing to load files where the VIN was a duplicate would create its own problems.
February 23, 2011 at 5:48 pm
Sounds like historical records without dates or status flags 🙂
Anyway, I hope the new version works well for you. A non-clustered index on those two fields should dramatically speed up the query as well (though I don't know how adversely it would impact your INSERTs/UPDATEs/DELETEs). Definitely test it if you decide to create an index.
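For what that index would look like, here's a minimal sketch, again assuming the two varchar columns are called `Field1` and `Field2` (hypothetical names):

```sql
-- Hypothetical column names; substitute the two varchar fields from the real query.
-- Covering both grouping columns lets the GROUP BY/HAVING duplicate check scan
-- this narrow index instead of the full 35-million-row table.
CREATE NONCLUSTERED INDEX IX_SUMMARY_TABLE_Field1_Field2
    ON dbo.SUMMARY_TABLE (Field1, Field2);
```

As noted above, measure the extra write cost on INSERT/UPDATE/DELETE before putting this on the production table.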
February 23, 2011 at 5:53 pm
It will be up to the developers to decide if they're going to revise anything, unless Systems decides to "insist". I've done a lot of googling on ROW_NUMBER and have yet to see anyone else using it with a cross join of a huge table against a copy of itself.
February 23, 2011 at 5:55 pm
Let us know how it goes!
February 23, 2011 at 11:54 pm
Indianrock (2/23/2011)
This query finds duplicate values and inserts the results into another table using the basic structure shown below.
An alternative solution would be to create insert, update, and delete triggers on the summary table to maintain a list of primary key values in the table where there are duplicates for the two character fields. This will add a small overhead to data-changing operations on the table, but finding duplicates would be trivial, and practically instantaneous. As an alternative to the triggers, you could also consider an indexed view (with a filtered non-clustered index in 2008).
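To make the trigger idea concrete, here is a rough sketch of the INSERT case only, using hypothetical names (`Field1`/`Field2` for the two varchar columns, `PK` for the key, `SummaryDuplicates` for the side table); a real implementation would need matching UPDATE and DELETE triggers as well:

```sql
-- Hypothetical side table: PKs of rows whose (Field1, Field2) combination
-- currently appears more than once in dbo.SUMMARY_TABLE.
CREATE TABLE dbo.SummaryDuplicates (PK INTEGER PRIMARY KEY);
GO
-- INSERT trigger sketch. It re-checks only the (Field1, Field2) groups
-- touched by the insert, so the per-statement overhead stays small.
CREATE TRIGGER trg_Summary_Insert ON dbo.SUMMARY_TABLE
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Record every row in a now-duplicated group that isn't listed yet.
    INSERT dbo.SummaryDuplicates (PK)
    SELECT s.PK
    FROM dbo.SUMMARY_TABLE AS s
    JOIN (SELECT DISTINCT Field1, Field2 FROM inserted) AS i
        ON  i.Field1 = s.Field1
        AND i.Field2 = s.Field2
    WHERE NOT EXISTS (SELECT 1 FROM dbo.SummaryDuplicates AS d WHERE d.PK = s.PK)
      AND EXISTS (SELECT 1
                  FROM dbo.SUMMARY_TABLE AS s2
                  WHERE s2.Field1 = s.Field1
                    AND s2.Field2 = s.Field2
                    AND s2.PK <> s.PK);
END;
```

With this in place, finding duplicates is just `SELECT PK FROM dbo.SummaryDuplicates`. The DELETE trigger would need the reverse logic: remove entries whose group has dropped back to a single row.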
February 25, 2011 at 10:31 am
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
February 25, 2011 at 10:59 am
Indianrock (2/25/2011)
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
I think I know where Paul's going with that (I'm never quite sure, I'm still playing catchup to his knowledge... 🙂 ), but since you're in 2k5 you can't use a filtered index, so it's a moot point.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
February 25, 2011 at 11:30 am
Indianrock (2/25/2011)
We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?
I'll post an example in a second.
February 25, 2011 at 12:03 pm
CREATE TABLE dbo.Summary
(
    PK INTEGER PRIMARY KEY,
    group1 INTEGER NOT NULL,
    group2 INTEGER NOT NULL,
    value INTEGER NOT NULL
);
GO
INSERT dbo.Summary VALUES (1, 1, 1, 100);
INSERT dbo.Summary VALUES (2, 1, 2, 200);
INSERT dbo.Summary VALUES (3, 2, 3, 300);
INSERT dbo.Summary VALUES (4, 2, 4, 400);
INSERT dbo.Summary VALUES (5, 2, 3, 500);
INSERT dbo.Summary VALUES (6, 3, 5, 600);
INSERT dbo.Summary VALUES (7, 3, 5, 700);
INSERT dbo.Summary VALUES (8, 4, 5, 800);
INSERT dbo.Summary VALUES (9, 5, 5, 900);
GO
CREATE INDEX nc1 ON dbo.Summary (group1, group2);
GO
-- View definition
CREATE VIEW dbo.IV WITH SCHEMABINDING AS
SELECT group1, group2, COUNT_BIG(*) AS row_count
FROM dbo.Summary
GROUP BY
group1,
group2;
GO
-- Index it
CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.IV (group1, group2);
CREATE NONCLUSTERED INDEX nc1 ON dbo.IV (row_count);
GO
-- NOEXPAND hint required for non-Enterprise SKUs
SELECT IV.group1,
IV.group2
FROM dbo.IV AS IV WITH (NOEXPAND)
WHERE row_count > 1
GO
-- In Enterprise SKUs, this query seeks on the indexed view
SELECT group1,
group2,
COUNT_BIG(*) AS row_count
FROM dbo.Summary
GROUP BY
group1,
group2
HAVING COUNT_BIG(*) > 1
GO
DROP VIEW dbo.IV;
DROP TABLE dbo.Summary;