Updating NULL Values

Question

Post reply

Updating NULL Values

cdex3-408940

Old Hand

Points: 379
More actions
October 8, 2008 at 1:20 pm

#199529

I have a table in the following format.
ID RecType PKIdentifier RecordData
1 01 acc00108 Text
2 02 NULL Text
3 03 NULL Text
4 01 acc00208 Text
5 03 NULL Text
6 04 NULL Text
7 05 NULL Text
8 01 acc00307 Text
9 02 NULL Text
10 03 NULL Text
etc.
I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.
For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.
What is the best query for this?

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Garadin One Orange Chip Points: 29613 More actions · Answer 1

Please reference the post in my signature for how to post data on this forum.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Garadin One Orange Chip Points: 29613 More actions · Answer 2

That said, I went ahead and wrote a script to do this, mainly because it interested me.

The following script assumes several things. One is that there is no other piece of information that can tie these rows together. If there is, shame on you for not posting it. The second is that the id column is absolutely correct on the ordering. If it's not, you're prettymuch boned and can't really do anything with this data anyways. (assuming the first assumption is true)

This solution is RBAR, and may not perform well on 20 million rows. I'd test the second part on 100,000 rows or so to see how it runs on your server and decide whether or not it's a viable solution. This is a SQL 2000 solution, and I'm very interested to see what 2K5 solutions would have made this a whole lot easier.

[font="Courier New"]

----------------- Test Data Setup -------------------------

CREATE TABLE test1(

id INT,

rectype CHAR(2),

pkidentifier VARCHAR(10),

recorddata VARCHAR(10))

INSERT INTO test1(id, rectype, pkidentifier, recorddata)

SELECT 1, '01', 'acc00108', 'Text' UNION ALL

SELECT 2, '02', NULL, 'Text' UNION ALL

SELECT 3, '03', NULL, 'Text' UNION ALL

SELECT 4, '01', 'acc00208', 'Text' UNION ALL

SELECT 5, '03', NULL, 'Text' UNION ALL

SELECT 6, '04', NULL, 'Text' UNION ALL

SELECT 7, '05', NULL, 'Text' UNION ALL

SELECT 8, '01', 'acc00307', 'Text' UNION ALL

SELECT 9, '02', NULL, 'Text' UNION ALL

SELECT 10, '03', NULL, 'Text'

-----------------------------------------------------------

-----------------

--Grab the Min identity for each actual value

SELECT pkidentifier, MIN(id) minid

INTO test2

FROM test1

WHERE pkidentifier IS NOT NULL

GROUP BY pkidentifier

-- Populate the identity value that holds the correct pkidentifier

SELECT t1.id, (SELECT MAX(minid) FROM test2 t2 WHERE t1.id >= t2.minid) pkid

INTO #3

FROM test1 t1

-- Update the original Table

UPDATE T1

SET pkidentifier = T.pkidentifier

FROM test1 t1 INNER JOIN #3 T3 ON t1.id = t3.id

INNER JOIN test1 T ON T3.pkid = T.id

[/size][/font]

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 3

cdex3 (10/8/2008)
I have a table in the following format.
ID RecType PKIdentifier RecordData
1 01 acc00108 Text
2 02 NULL Text
3 03 NULL Text
4 01 acc00208 Text
5 03 NULL Text
6 04 NULL Text
7 05 NULL Text
8 01 acc00307 Text
9 02 NULL Text
10 03 NULL Text
etc.
I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.
For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.
What is the best query for this?

Garadin is correct... you'll get better answers faster if you follow the suggestions in the article linked in both our signatures.

Anyway, this is a "smear" job... I'm talking about what needs to happen to accomplish this. Basically, it's identical to a "running total" problem as in the following link...

http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

... but instead of adding data from the "previous" row, you're just copying the data. The following solution requires the right kind of clustered index and if you can't do that on your real table, you need to temporarily copy the data to another table that you can.

Also, updating 20 million rows may reach a "tipping point" on your computer. What I mean by that is it'll only take the code I wrote for this below about 7 seconds per million rows to update. It may only take 14 seconds to update 2 million rows. But, once that "tipping point" is reached, it may take several minutes to update 3 million rows and several hours to update 20 million rows. Either way, the log file is going to take a beating on a 20 million row update.

Of course, if the tipping point doesn't occur (each machine is different), then it'll update all 20 million rows in about 140 seconds.

Here's the code...

--===== Create and populate a test table like the "etiquette" article suggests

-- This is NOT part of the solution but having a Clustered index on the ID column IS!

--drop table #yourtable

GO

CREATE TABLE #yourtable

(ID INT PRIMARY KEY CLUSTERED, RecType CHAR(2),PKIdentifier VARCHAR(10),RecordData VARCHAR(10))

INSERT INTO #yourtable

(ID,RecType,PKIdentifier,RecordData)

SELECT '1','01','acc00108','Text' UNION ALL

SELECT '2','02',NULL,'Text' UNION ALL

SELECT '3','03',NULL,'Text' UNION ALL

SELECT '4','01','acc00208','Text' UNION ALL

SELECT '5','03',NULL,'Text' UNION ALL

SELECT '6','04',NULL,'Text' UNION ALL

SELECT '7','05',NULL,'Text' UNION ALL

SELECT '8','01','acc00307','Text' UNION ALL

SELECT '9','02',NULL,'Text' UNION ALL

SELECT '10','03',NULL,'Text'

--===== Declare a couple of necessary variables with obvious names

DECLARE @PrevID INT,

@PrevPKIdentifier VARCHAR(10)

SELECT @PrevID = 0,

@PrevPKIdentifier = ''

--===== Do the update for rank like in the running total article based on a decrease in position

UPDATE #yourtable

SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,

@PrevID = ID --Provides a "stabilizing anchor"

FROM #yourtable WITH (INDEX(0))

--===== Show the final results...

SELECT *

FROM #yourtable

What you MAY have to do if the tipping point occurs is to build an outer "control loop" to update, say, just a million rows at a time. No matter what, though... this will beat the pants off any solution that does use RBAR including a nice straight forward cursor.

Of course, a RBAR solution will make each row "atomic" and will keep from locking the entire table like a set based solution. I'd rather make the table unavailable for the 2 1/2 minutes I think the set based solution would take, but if that's not an option, then a RBAR update crawler that uses a While loop may actually be the correct solution.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Garadin One Orange Chip Points: 29613 More actions · Answer 4

Well, I spent a couple hours attempting to beat the speed of Jeff's method(not that I actually believed I could, but I figured a lot out about it while trying) and while I cleaned up my code considerably, it still fell short. I came pretty close with a version of code about halfway in between his and mine, which could possibly help in the instance that an operation like this involves multiple tables, but otherwise, there'd be no reason to use it.

The following tests were done with a dataset of 5 Million rows. The speeds are all way slower than they'd be on any database server, as this was just done on my Dev workstation.

SETUP:

[font="Courier New"]----------------- Test Data Setup -------------------------

IF OBJECT_ID('dbo.Test1') IS NOT NULL DROP TABLE Test1

CREATE TABLE Test1(

id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

rectype VARCHAR(20),

pkidentifier VARCHAR(30),

recorddata VARCHAR(30))

INSERT INTO test1(rectype, recorddata)

SELECT 'Rec' + CAST(N AS VARCHAR(20)), 'A bit of useless text'

FROM Tally -- 5 Million Rows, 1:25 from my local Dev Machines DB

UPDATE Test1

SET pkidentifier = 'PKID' + CAST(id AS VARCHAR(10))

WHERE rectype LIKE '%7%' -- Add some PK's (2342795 row(s) affected) 1:48 on Local

-----------------------------------------------------------[/font]

I added indexes and dropped the monstrously inefficient subquery out of my original code after a few minutes of thought and replaced it with this:

[font="Courier New"]-------------------METHOD 1--------------------------------

IF OBJECT_ID('dbo.Test2') IS NOT NULL DROP TABLE Test2

CREATE TABLE Test2(

pkidentifier VARCHAR(30),

minid INT PRIMARY KEY CLUSTERED)

--Grab the Min identity for each actual value

INSERT INTO Test2(pkidentifier, minid)

SELECT pkidentifier, MIN(id) minid

FROM test1

WHERE pkidentifier IS NOT NULL

GROUP BY pkidentifier -- (2342795 row(s) affected) 54 Seconds Local Machine.

UPDATE Test1 -- Update the table with the PKID correlating to the min(ID) for each row.

SET pkidentifier = ISNULL(pkidentifier, (SELECT TOP 1 pkidentifier FROM Test2 t2 WHERE t1.id >= t2.minid ORDER BY minid DESC))

FROM Test1 T1 -- 3:40 on Local Machine.

-----------------------------------------------------------[/font]

Jeff's Method (Still wins hands down!)

[font="Courier New"]-------------------- METHOD 2 -----------------------------

DECLARE@PrevID INT,

@PrevPKIdentifier VARCHAR(30)

SELECT @PrevID = 0,

@PrevPKIdentifier = ''

--===== Do the update for rank like in the running total article based on a decrease in position

UPDATE Test1

SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,

@PrevID = ID --Provides a "stabilizing anchor"

FROM Test1 WITH (INDEX(0)) -- 3:52 on Local Machine.

-----------------------------------------------------------

[/font]

P.S. Attempting to roll back a 5 million row update transaction on a workstation can be very bad.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

cdex3-408940 Old Hand Points: 379 More actions · Answer 5

Thanks for the info!

This is my first post so I apologize for any lack of etiquette.

Your never too old or too young to learn new things and ideas.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

Jeff:

Over the last few months, I've lost track of what all of the conditions are to make the UPDATE "pseudocursor" trick work. What I recall is:

1) The Clustered INDEX must be on the columns that determine the ordering (and in the right order themselves).

2) It must not be a Partitioned Table.

and

3) MAXDOP must be 1.

Is that correct? Or did I get something wrong?

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 7

That and you need a "anchor" column even if it doesn't do anything for the code... one that is just @variable = @column. Sometimes it works without one... don't let that fool you. Matt Miller kinda proved to himself that it's an absolute necessity in 2k5 and I pretty much trust Matt when he says he tested something.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 8

Oh yeah... almost forgot... you can usually get away without using an Index "Hint", but like ORDER BY in a Select, it's always best to have one to force the query to use the clustered index. Using WITH(INDEX(0)) works great because it forces a scan and it doesn't have to be explicitly named... works great on Temp Tables where you should really be naming constraints.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 9

cdex3 (10/9/2008)
Thanks for the info!
This is my first post so I apologize for any lack of etiquette.
Your never too old or too young to learn new things and ideas.

Heh... You're ok. I should really rename that article... it's not about "etiquette"... it's about getting better answers quicker. Thanks for reading it.

The big key is, have we solved your problem?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)