October 8, 2008 at 1:20 pm
I have a table in the following format.
ID RecType PKIdentifier RecordData
1 01 acc00108 Text
2 02 NULL Text
3 03 NULL Text
4 01 acc00208 Text
5 03 NULL Text
6 04 NULL Text
7 05 NULL Text
8 01 acc00307 Text
9 02 NULL Text
10 03 NULL Text
etc.
I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.
For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.
What is the best query for this?
October 8, 2008 at 2:52 pm
October 8, 2008 at 3:53 pm
That said, I went ahead and wrote a script to do this, mainly because it interested me.
The following script assumes several things. One is that there is no other piece of information that can tie these rows together. If there is, shame on you for not posting it. The second is that the id column is absolutely correct on the ordering. If it's not, you're prettymuch boned and can't really do anything with this data anyways. (assuming the first assumption is true)
This solution is RBAR, and may not perform well on 20 million rows. I'd test the second part on 100,000 rows or so to see how it runs on your server and decide whether or not it's a viable solution. This is a SQL 2000 solution, and I'm very interested to see what 2K5 solutions would have made this a whole lot easier.
[font="Courier New"]
----------------- Test Data Setup -------------------------
CREATE TABLE test1(
id INT,
rectype CHAR(2),
pkidentifier VARCHAR(10),
recorddata VARCHAR(10))
INSERT INTO test1(id, rectype, pkidentifier, recorddata)
SELECT 1, '01', 'acc00108', 'Text' UNION ALL
SELECT 2, '02', NULL, 'Text' UNION ALL
SELECT 3, '03', NULL, 'Text' UNION ALL
SELECT 4, '01', 'acc00208', 'Text' UNION ALL
SELECT 5, '03', NULL, 'Text' UNION ALL
SELECT 6, '04', NULL, 'Text' UNION ALL
SELECT 7, '05', NULL, 'Text' UNION ALL
SELECT 8, '01', 'acc00307', 'Text' UNION ALL
SELECT 9, '02', NULL, 'Text' UNION ALL
SELECT 10, '03', NULL, 'Text'
-----------------------------------------------------------
-----------------
--Grab the Min identity for each actual value
SELECT pkidentifier, MIN(id) minid
INTO test2
FROM test1
WHERE pkidentifier IS NOT NULL
GROUP BY pkidentifier
-- Populate the identity value that holds the correct pkidentifier
SELECT t1.id, (SELECT MAX(minid) FROM test2 t2 WHERE t1.id >= t2.minid) pkid
INTO #3
FROM test1 t1
-- Update the original Table
UPDATE T1
SET pkidentifier = T.pkidentifier
FROM test1 t1 INNER JOIN #3 T3 ON t1.id = t3.id
INNER JOIN test1 T ON T3.pkid = T.id
[/size][/font]
October 9, 2008 at 5:50 am
cdex3 (10/8/2008)
I have a table in the following format.ID RecType PKIdentifier RecordData
1 01 acc00108 Text
2 02 NULL Text
3 03 NULL Text
4 01 acc00208 Text
5 03 NULL Text
6 04 NULL Text
7 05 NULL Text
8 01 acc00307 Text
9 02 NULL Text
10 03 NULL Text
etc.
I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.
For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.
What is the best query for this?
Garadin is correct... you'll get better answers faster if you follow the suggestions in the article linked in both our signatures.
Anyway, this is a "smear" job... I'm talking about what needs to happen to accomplish this. Basically, it's identical to a "running total" problem as in the following link...
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
... but instead of adding data from the "previous" row, you're just copying the data. The following solution requires the right kind of clustered index and if you can't do that on your real table, you need to temporarily copy the data to another table that you can.
Also, updating 20 million rows may reach a "tipping point" on your computer. What I mean by that is it'll only take the code I wrote for this below about 7 seconds per million rows to update. It may only take 14 seconds to update 2 million rows. But, once that "tipping point" is reached, it may take several minutes to update 3 million rows and several hours to update 20 million rows. Either way, the log file is going to take a beating on a 20 million row update.
Of course, if the tipping point doesn't occur (each machine is different), then it'll update all 20 million rows in about 140 seconds.
Here's the code...
--===== Create and populate a test table like the "etiquette" article suggests
-- This is NOT part of the solution but having a Clustered index on the ID column IS!
--drop table #yourtable
GO
CREATE TABLE #yourtable
(ID INT PRIMARY KEY CLUSTERED, RecType CHAR(2),PKIdentifier VARCHAR(10),RecordData VARCHAR(10))
INSERT INTO #yourtable
(ID,RecType,PKIdentifier,RecordData)
SELECT '1','01','acc00108','Text' UNION ALL
SELECT '2','02',NULL,'Text' UNION ALL
SELECT '3','03',NULL,'Text' UNION ALL
SELECT '4','01','acc00208','Text' UNION ALL
SELECT '5','03',NULL,'Text' UNION ALL
SELECT '6','04',NULL,'Text' UNION ALL
SELECT '7','05',NULL,'Text' UNION ALL
SELECT '8','01','acc00307','Text' UNION ALL
SELECT '9','02',NULL,'Text' UNION ALL
SELECT '10','03',NULL,'Text'
--===== Declare a couple of necessary variables with obvious names
DECLARE @PrevID INT,
@PrevPKIdentifier VARCHAR(10)
SELECT @PrevID = 0,
@PrevPKIdentifier = ''
--===== Do the update for rank like in the running total article based on a decrease in position
UPDATE #yourtable
SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,
@PrevID = ID --Provides a "stabilizing anchor"
FROM #yourtable WITH (INDEX(0))
--===== Show the final results...
SELECT *
FROM #yourtable
What you MAY have to do if the tipping point occurs is to build an outer "control loop" to update, say, just a million rows at a time. No matter what, though... this will beat the pants off any solution that does use RBAR including a nice straight forward cursor.
Of course, a RBAR solution will make each row "atomic" and will keep from locking the entire table like a set based solution. I'd rather make the table unavailable for the 2 1/2 minutes I think the set based solution would take, but if that's not an option, then a RBAR update crawler that uses a While loop may actually be the correct solution.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2008 at 10:10 am
Well, I spent a couple hours attempting to beat the speed of Jeff's method(not that I actually believed I could, but I figured a lot out about it while trying) and while I cleaned up my code considerably, it still fell short. I came pretty close with a version of code about halfway in between his and mine, which could possibly help in the instance that an operation like this involves multiple tables, but otherwise, there'd be no reason to use it.
The following tests were done with a dataset of 5 Million rows. The speeds are all way slower than they'd be on any database server, as this was just done on my Dev workstation.
SETUP:
[font="Courier New"]----------------- Test Data Setup -------------------------
IF OBJECT_ID('dbo.Test1') IS NOT NULL DROP TABLE Test1
CREATE TABLE Test1(
id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
rectype VARCHAR(20),
pkidentifier VARCHAR(30),
recorddata VARCHAR(30))
INSERT INTO test1(rectype, recorddata)
SELECT 'Rec' + CAST(N AS VARCHAR(20)), 'A bit of useless text'
FROM Tally -- 5 Million Rows, 1:25 from my local Dev Machines DB
UPDATE Test1
SET pkidentifier = 'PKID' + CAST(id AS VARCHAR(10))
WHERE rectype LIKE '%7%' -- Add some PK's (2342795 row(s) affected) 1:48 on Local
-----------------------------------------------------------[/font]
I added indexes and dropped the monstrously inefficient subquery out of my original code after a few minutes of thought and replaced it with this:
[font="Courier New"]-------------------METHOD 1--------------------------------
IF OBJECT_ID('dbo.Test2') IS NOT NULL DROP TABLE Test2
CREATE TABLE Test2(
pkidentifier VARCHAR(30),
minid INT PRIMARY KEY CLUSTERED)
--Grab the Min identity for each actual value
INSERT INTO Test2(pkidentifier, minid)
SELECT pkidentifier, MIN(id) minid
FROM test1
WHERE pkidentifier IS NOT NULL
GROUP BY pkidentifier -- (2342795 row(s) affected) 54 Seconds Local Machine.
UPDATE Test1 -- Update the table with the PKID correlating to the min(ID) for each row.
SET pkidentifier = ISNULL(pkidentifier, (SELECT TOP 1 pkidentifier FROM Test2 t2 WHERE t1.id >= t2.minid ORDER BY minid DESC))
FROM Test1 T1 -- 3:40 on Local Machine.
-----------------------------------------------------------[/font]
Jeff's Method (Still wins hands down!)
[font="Courier New"]-------------------- METHOD 2 -----------------------------
DECLARE@PrevID INT,
@PrevPKIdentifier VARCHAR(30)
SELECT @PrevID = 0,
@PrevPKIdentifier = ''
--===== Do the update for rank like in the running total article based on a decrease in position
UPDATE Test1
SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,
@PrevID = ID --Provides a "stabilizing anchor"
FROM Test1 WITH (INDEX(0)) -- 3:52 on Local Machine.
-----------------------------------------------------------
[/font]
P.S. Attempting to roll back a 5 million row update transaction on a workstation can be very bad.
October 9, 2008 at 11:02 am
Thanks for the info!
This is my first post so I apologize for any lack of etiquette.
Your never too old or too young to learn new things and ideas.
October 9, 2008 at 12:07 pm
Jeff:
Over the last few months, I've lost track of what all of the conditions are to make the UPDATE "pseudocursor" trick work. What I recall is:
1) The Clustered INDEX must be on the columns that determine the ordering (and in the right order themselves).
2) It must not be a Partitioned Table.
and
3) MAXDOP must be 1.
Is that correct? Or did I get something wrong?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
October 9, 2008 at 6:01 pm
That and you need a "anchor" column even if it doesn't do anything for the code... one that is just @variable = @column. Sometimes it works without one... don't let that fool you. Matt Miller kinda proved to himself that it's an absolute necessity in 2k5 and I pretty much trust Matt when he says he tested something.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2008 at 6:04 pm
Oh yeah... almost forgot... you can usually get away without using an Index "Hint", but like ORDER BY in a Select, it's always best to have one to force the query to use the clustered index. Using WITH(INDEX(0)) works great because it forces a scan and it doesn't have to be explicitly named... works great on Temp Tables where you should really be naming constraints.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2008 at 6:17 pm
cdex3 (10/9/2008)
Thanks for the info!This is my first post so I apologize for any lack of etiquette.
Your never too old or too young to learn new things and ideas.
Heh... You're ok. I should really rename that article... it's not about "etiquette"... it's about getting better answers quicker. Thanks for reading it.
The big key is, have we solved your problem?
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply