Updating NULL Values

  • I have a table in the following format.

    ID RecType PKIdentifier RecordData

    1 01 acc00108 Text

    2 02 NULL Text

    3 03 NULL Text

    4 01 acc00208 Text

    5 03 NULL Text

    6 04 NULL Text

    7 05 NULL Text

    8 01 acc00307 Text

    9 02 NULL Text

    10 03 NULL Text

    etc.

    I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.

    For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.

    What is the best query for this?

  • Please reference the post in my signature for how to post data on this forum.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • That said, I went ahead and wrote a script to do this, mainly because it interested me.

    The following script assumes several things. One is that there is no other piece of information that can tie these rows together. If there is, shame on you for not posting it. The second is that the id column is absolutely correct on the ordering. If it's not, you're prettymuch boned and can't really do anything with this data anyways. (assuming the first assumption is true)

    This solution is RBAR, and may not perform well on 20 million rows. I'd test the second part on 100,000 rows or so to see how it runs on your server and decide whether or not it's a viable solution. This is a SQL 2000 solution, and I'm very interested to see what 2K5 solutions would have made this a whole lot easier.

    [font="Courier New"]

    ----------------- Test Data Setup -------------------------

    CREATE TABLE test1(

    id     INT,

    rectype        CHAR(2),

    pkidentifier   VARCHAR(10),

    recorddata VARCHAR(10))

    INSERT INTO test1(id, rectype, pkidentifier, recorddata)

    SELECT 1,  '01', 'acc00108',  'Text' UNION ALL

    SELECT 2,  '02', NULL,        'Text' UNION ALL

    SELECT 3,  '03', NULL,        'Text' UNION ALL

    SELECT 4,  '01', 'acc00208',  'Text' UNION ALL

    SELECT 5,  '03', NULL,        'Text' UNION ALL

    SELECT 6,  '04', NULL,        'Text' UNION ALL

    SELECT 7,  '05', NULL,        'Text' UNION ALL

    SELECT 8,  '01', 'acc00307',  'Text' UNION ALL

    SELECT 9,  '02', NULL,        'Text' UNION ALL

    SELECT 10, '03', NULL,        'Text'

    -----------------------------------------------------------

    -----------------

    --Grab the Min identity for each actual value

    SELECT pkidentifier, MIN(id) minid

    INTO test2

    FROM test1

    WHERE pkidentifier IS NOT NULL

    GROUP BY pkidentifier

    -- Populate the identity value that holds the correct pkidentifier

    SELECT t1.id, (SELECT MAX(minid) FROM test2 t2 WHERE t1.id >= t2.minid) pkid

    INTO #3

    FROM test1 t1

    -- Update the original Table

    UPDATE T1

    SET pkidentifier = T.pkidentifier

    FROM test1 t1    INNER JOIN #3 T3 ON t1.id = t3.id

           INNER JOIN test1 T ON T3.pkid = T.id

    [/size][/font]

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • cdex3 (10/8/2008)


    I have a table in the following format.

    ID RecType PKIdentifier RecordData

    1 01 acc00108 Text

    2 02 NULL Text

    3 03 NULL Text

    4 01 acc00208 Text

    5 03 NULL Text

    6 04 NULL Text

    7 05 NULL Text

    8 01 acc00307 Text

    9 02 NULL Text

    10 03 NULL Text

    etc.

    I need a query that will Update all of the NULL PKIdentifiers with the value of the PKIdentifier for RecType 01 that precedes it.

    For example records 2 and 3 need to be updated with the PKIdentifier from record 1. Records 5 -7 need to be updated with the PKIdentifiers from record 4 and so on. RecordType 01 is the parent and all their types are the child until you get to the next 01 RecType. I have approximately 20Million rows to update.

    What is the best query for this?

    Garadin is correct... you'll get better answers faster if you follow the suggestions in the article linked in both our signatures.

    Anyway, this is a "smear" job... I'm talking about what needs to happen to accomplish this. Basically, it's identical to a "running total" problem as in the following link...

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    ... but instead of adding data from the "previous" row, you're just copying the data. The following solution requires the right kind of clustered index and if you can't do that on your real table, you need to temporarily copy the data to another table that you can.

    Also, updating 20 million rows may reach a "tipping point" on your computer. What I mean by that is it'll only take the code I wrote for this below about 7 seconds per million rows to update. It may only take 14 seconds to update 2 million rows. But, once that "tipping point" is reached, it may take several minutes to update 3 million rows and several hours to update 20 million rows. Either way, the log file is going to take a beating on a 20 million row update.

    Of course, if the tipping point doesn't occur (each machine is different), then it'll update all 20 million rows in about 140 seconds.

    Here's the code...

    --===== Create and populate a test table like the "etiquette" article suggests

    -- This is NOT part of the solution but having a Clustered index on the ID column IS!

    --drop table #yourtable

    GO

    CREATE TABLE #yourtable

    (ID INT PRIMARY KEY CLUSTERED, RecType CHAR(2),PKIdentifier VARCHAR(10),RecordData VARCHAR(10))

    INSERT INTO #yourtable

    (ID,RecType,PKIdentifier,RecordData)

    SELECT '1','01','acc00108','Text' UNION ALL

    SELECT '2','02',NULL,'Text' UNION ALL

    SELECT '3','03',NULL,'Text' UNION ALL

    SELECT '4','01','acc00208','Text' UNION ALL

    SELECT '5','03',NULL,'Text' UNION ALL

    SELECT '6','04',NULL,'Text' UNION ALL

    SELECT '7','05',NULL,'Text' UNION ALL

    SELECT '8','01','acc00307','Text' UNION ALL

    SELECT '9','02',NULL,'Text' UNION ALL

    SELECT '10','03',NULL,'Text'

    --===== Declare a couple of necessary variables with obvious names

    DECLARE @PrevID INT,

    @PrevPKIdentifier VARCHAR(10)

    SELECT @PrevID = 0,

    @PrevPKIdentifier = ''

    --===== Do the update for rank like in the running total article based on a decrease in position

    UPDATE #yourtable

    SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,

    @PrevID = ID --Provides a "stabilizing anchor"

    FROM #yourtable WITH (INDEX(0))

    --===== Show the final results...

    SELECT *

    FROM #yourtable

    What you MAY have to do if the tipping point occurs is to build an outer "control loop" to update, say, just a million rows at a time. No matter what, though... this will beat the pants off any solution that does use RBAR including a nice straight forward cursor.

    Of course, a RBAR solution will make each row "atomic" and will keep from locking the entire table like a set based solution. I'd rather make the table unavailable for the 2 1/2 minutes I think the set based solution would take, but if that's not an option, then a RBAR update crawler that uses a While loop may actually be the correct solution.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Well, I spent a couple hours attempting to beat the speed of Jeff's method(not that I actually believed I could, but I figured a lot out about it while trying) and while I cleaned up my code considerably, it still fell short. I came pretty close with a version of code about halfway in between his and mine, which could possibly help in the instance that an operation like this involves multiple tables, but otherwise, there'd be no reason to use it.

    The following tests were done with a dataset of 5 Million rows. The speeds are all way slower than they'd be on any database server, as this was just done on my Dev workstation.

    SETUP:

    [font="Courier New"]----------------- Test Data Setup -------------------------

    IF OBJECT_ID('dbo.Test1') IS NOT NULL DROP TABLE Test1

    CREATE TABLE Test1(

    id             INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    rectype            VARCHAR(20),

    pkidentifier   VARCHAR(30),

    recorddata     VARCHAR(30))

    INSERT INTO test1(rectype, recorddata)

    SELECT 'Rec' + CAST(N AS VARCHAR(20)), 'A bit of useless text'

    FROM Tally -- 5 Million Rows, 1:25 from my local Dev Machines DB

    UPDATE Test1

    SET pkidentifier = 'PKID' + CAST(id AS VARCHAR(10))

    WHERE rectype LIKE '%7%' -- Add some PK's  (2342795 row(s) affected)  1:48 on Local

    -----------------------------------------------------------[/font]

    I added indexes and dropped the monstrously inefficient subquery out of my original code after a few minutes of thought and replaced it with this:

    [font="Courier New"]-------------------METHOD 1--------------------------------

    IF OBJECT_ID('dbo.Test2') IS NOT NULL DROP TABLE Test2

    CREATE TABLE Test2(

    pkidentifier   VARCHAR(30),

    minid          INT PRIMARY KEY CLUSTERED)

    --Grab the Min identity for each actual value

    INSERT INTO Test2(pkidentifier, minid)

    SELECT pkidentifier, MIN(id) minid

    FROM test1

    WHERE pkidentifier IS NOT NULL

    GROUP BY pkidentifier   -- (2342795 row(s) affected) 54 Seconds Local Machine.

    UPDATE Test1 -- Update the table with the PKID correlating to the min(ID) for each row.

    SET pkidentifier = ISNULL(pkidentifier, (SELECT TOP 1 pkidentifier FROM Test2 t2 WHERE t1.id >= t2.minid ORDER BY minid DESC))

    FROM Test1 T1  -- 3:40 on Local Machine.

    -----------------------------------------------------------[/font]

    Jeff's Method (Still wins hands down!)

    [font="Courier New"]-------------------- METHOD 2 -----------------------------

    DECLARE@PrevID           INT,

            @PrevPKIdentifier VARCHAR(30)

    SELECT @PrevID           = 0,

            @PrevPKIdentifier = ''

    --===== Do the update for rank like in the running total article based on a decrease in position

    UPDATE Test1

        SET @PrevPKIdentifier = PKIdentifier = CASE WHEN PKIdentifier IS NULL THEN @PrevPKIdentifier ELSE PKIdentifier END,

            @PrevID = ID  --Provides a "stabilizing anchor"

       FROM Test1 WITH (INDEX(0)) -- 3:52 on Local Machine.

    -----------------------------------------------------------

    [/font]

    P.S. Attempting to roll back a 5 million row update transaction on a workstation can be very bad.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Thanks for the info!

    This is my first post so I apologize for any lack of etiquette.

    Your never too old or too young to learn new things and ideas.

  • Jeff:

    Over the last few months, I've lost track of what all of the conditions are to make the UPDATE "pseudocursor" trick work. What I recall is:

    1) The Clustered INDEX must be on the columns that determine the ordering (and in the right order themselves).

    2) It must not be a Partitioned Table.

    and

    3) MAXDOP must be 1.

    Is that correct? Or did I get something wrong?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • That and you need a "anchor" column even if it doesn't do anything for the code... one that is just @variable = @column. Sometimes it works without one... don't let that fool you. Matt Miller kinda proved to himself that it's an absolute necessity in 2k5 and I pretty much trust Matt when he says he tested something.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Oh yeah... almost forgot... you can usually get away without using an Index "Hint", but like ORDER BY in a Select, it's always best to have one to force the query to use the clustered index. Using WITH(INDEX(0)) works great because it forces a scan and it doesn't have to be explicitly named... works great on Temp Tables where you should really be naming constraints.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • cdex3 (10/9/2008)


    Thanks for the info!

    This is my first post so I apologize for any lack of etiquette.

    Your never too old or too young to learn new things and ideas.

    Heh... You're ok. I should really rename that article... it's not about "etiquette"... it's about getting better answers quicker. Thanks for reading it.

    The big key is, have we solved your problem?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply