Change column from smallint to int (300 million records)

  • Unless you expect your database to remain small, the lesson seems to be: make your data types large enough for the next 10 years, not for the next 10 days.

  • Indianrock (2/9/2016)


    Unless you expect your database to remain small, the lesson seems to be: make your data types large enough for the next 10 years, not for the next 10 days.

    Lessons from history teach us that they don't teach us anything.

    🙂

    I've yet to find a commercial software house which includes scalability tests in its pre-production release routine...

    _____________
    Code for TallyGenerator

  • Sergiy (2/9/2016)


    Igor Micev (2/9/2016)


    I don't say Sergiy's approach is not going to work, but it's riskier. You can simply do it without dropping the table.

    -- Steps:

    -- 1. Script the DROP and CREATE statements for the dependencies of column Line_Number

    -- 2. Drop the constraints and indexes on column Line_Number

    Igor, 3,300 updates of 100k rows each will take a rather long time to execute.

    And all this time the constraints and indexes will be missing, affecting current functionality in who knows what way.

    I'd say this is risky.

    --5. Run the create/enable statements from step 1

    This step can fail too, you know.

    It must be placed inside a transaction, so all the changes can be reversed if some of them cannot be completed.

    Look how it's done in SSMS.

    The table is operational. You can add an index on the two columns for faster work. It's not just a matter of running the query; it requires extra work, of course, and so do your steps.

    Igor Micev, My blog: www.igormicev.com

  • Igor Micev (2/9/2016)


    Put the db in simple recovery model

    That alone would be enough to see your stuff packed in a box.

    Igor, all your suggestions may be good for a sandbox database in a DEV environment.

    But they are absolutely unacceptable for a live database in PROD.

    Especially for a big live database in PROD.

    _____________
    Code for TallyGenerator

  • Igor Micev (2/9/2016)


    The table is operational.

    Yeah, right.

    I once disabled an index which was a duplicate of another one and was only taking up space without any benefit (or so I thought).

    Hours later it turned out the Invoicing module failed because one of the functions within it was querying the table with a hint WITH (INDEX ...) which, of course, was pointing at exactly that index.

    Who would have thought?

    😉

    _____________
    Code for TallyGenerator

  • Sergiy (2/9/2016)


    Igor Micev (2/9/2016)


    The table is operational.

    Yeah, right.

    I once disabled an index which was a duplicate of another one and was only taking up space without any benefit (or so I thought).

    Hours later it turned out the Invoicing module failed because one of the functions within it was querying the table with a hint WITH (INDEX ...) which, of course, was pointing at exactly that index.

    Who would have thought?

    😉

    Then, won't you meet that problem again?

    Igor Micev, My blog: www.igormicev.com

  • Igor Micev (2/9/2016)


    Then, won't you meet that problem again?

    With my approach?

    No.

    Read it again:

    1. Create a new table with the desired design.

    2. Start populating the new table with portions of data from the existing table.

    Choose an appropriate chunk size, so you do not overwhelm your transaction log, and take TRN backups in between the INSERT statements.

    It may take many hours or even days - it does not matter, as the original table is still operational.

    Make sure that any changes made to the data in the original table get copied to the new one.

    Where do you see anything about altering the existing table?

    While the data is being copied you have plenty of time to prepare and test all the scripts for indexes, privileges, dependent objects, etc.

    When everything is ready - lock the whole table in a transaction, copy the last bit of freshly changed data, drop the old table, rename the new one and run the prepared script for the dependencies.

    If everything went fine - COMMIT.

    Otherwise - ROLLBACK.

    Either way - it's done within seconds (as you modify only system catalogues), with no noticeable interruption to normal operation.

    You end up either with the updated column or (in case of a failure) with the same two tables, old and new, and the background process which keeps synchronising the data between them.

    Fix the errors in your script and repeat the attempt at an appropriate moment.

    _____________
    Code for TallyGenerator
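
    A minimal T-SQL sketch of the final swap step described above, assuming hypothetical names dbo.BigTable (the original) and dbo.BigTable_New (the pre-populated copy), and that a background process has already kept the copy in sync; the sync predicate and the prepared dependency script are placeholders, not the poster's actual code:

    BEGIN TRY
        BEGIN TRANSACTION;

        -- Take an exclusive table lock so no further changes arrive during the swap.
        SELECT TOP (0) * FROM dbo.BigTable WITH (TABLOCKX, HOLDLOCK);

        -- Copy the last bit of freshly changed data (placeholder predicate).
        INSERT INTO dbo.BigTable_New (ID, Line_Number /* , other columns */)
        SELECT s.ID, s.Line_Number /* , other columns */
        FROM dbo.BigTable AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.BigTable_New AS t WHERE t.ID = s.ID);

        -- Metadata-only changes: drop the old table, rename the new one,
        -- then run the prepared script for indexes, constraints and privileges.
        DROP TABLE dbo.BigTable;
        EXEC sp_rename N'dbo.BigTable_New', N'BigTable', 'OBJECT';
        -- EXEC (@PreparedDependencyScript);

        COMMIT;
    END TRY
    BEGIN CATCH
        -- Both tables and the sync process are still in place; fix the script and retry later.
        IF @@TRANCOUNT > 0
            ROLLBACK;
        RAISERROR('Table swap failed and was rolled back.', 16, 1);
    END CATCH;

    Because only the last-minute rows and the system catalogue changes happen inside the transaction, the exclusive lock is held for seconds rather than for the duration of the full copy.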

  • Sergiy (2/9/2016)


    Igor Micev (2/9/2016)


    Then, won't you meet that problem again?

    With my approach?

    No.

    Read it again:

    1. Create a new table with the desired design.

    2. Start populating the new table with portions of data from the existing table.

    Choose an appropriate chunk size, so you do not overwhelm your transaction log, and take TRN backups in between the INSERT statements.

    It may take many hours or even days - it does not matter, as the original table is still operational.

    Make sure that any changes made to the data in the original table get copied to the new one.

    Where do you see anything about altering the existing table?

    While the data is being copied you have plenty of time to prepare and test all the scripts for indexes, privileges, dependent objects, etc.

    When everything is ready - lock the whole table in a transaction, copy the last bit of freshly changed data, drop the old table, rename the new one and run the prepared script for the dependencies.

    If everything went fine - COMMIT.

    Otherwise - ROLLBACK.

    Either way - it's done within seconds (as you modify only system catalogues), with no noticeable interruption to normal operation.

    You end up either with the updated column or (in case of a failure) with the same two tables, old and new, and the background process which keeps synchronising the data between them.

    Fix the errors in your script and repeat the attempt at an appropriate moment.

    Not so different from my approach. I do a column drop and rename, and you do a table drop and rename. The prepared dependencies are then created within a transaction, all steps together. In case of failure the table is still alive.

    Igor Micev, My blog: www.igormicev.com

  • Igor Micev (2/9/2016)


    Not so different from my approach. I do a column drop and rename, and you do a table drop and rename. The prepared dependencies are then created within a transaction, all steps together. In case of failure the table is still alive.

    Totally different.

    Whole world away.

    Altering a column changes every record in the table which is supposed to stay operational.

    Every data page and most index pages get affected by "drop and rename column".

    Massive data change.

    And you need to do it within a transaction, so if anything goes wrong you can roll back to the original state.

    Which means - the table is not accessible for the whole time the 0.5 TB table is being rewritten.

    With my approach all the data gets altered outside the transaction, by an asynchronous process which makes it ready when it's ready.

    The transaction only covers "last minute" data changes in the original table plus system catalogue alterations.

    Renaming a table does not change a single data page in it, so it usually gets done in no time.

    See the difference now?

    _____________
    Code for TallyGenerator

  • Sergiy (2/9/2016)


    Igor Micev (2/9/2016)


    Not so different from my approach. I do a column drop and rename, and you do a table drop and rename. The prepared dependencies are then created within a transaction, all steps together. In case of failure the table is still alive.

    Totally different.

    Whole world away.

    Altering a column changes every record in the table which is supposed to stay operational.

    Every data page and most index pages get affected by "drop and rename column".

    Massive data change.

    And you need to do it within a transaction, so if anything goes wrong you can roll back to the original state.

    Which means - the table is not accessible for the whole time the 0.5 TB table is being rewritten.

    With my approach all the data gets altered outside the transaction, by an asynchronous process which makes it ready when it's ready.

    The transaction only covers "last minute" data changes in the original table plus system catalogue alterations.

    Renaming a table does not change a single data page in it, so it usually gets done in no time.

    See the difference now?

    Correct.

    Igor Micev, My blog: www.igormicev.com

  • Here is a quickly baked test script:

    DROP TABLE TestCalendar;

    -- Original table with the smallint column.
    CREATE TABLE [dbo].[TestCalendar](
        [ID] [bigint] NOT NULL,
        [N_Date] [datetime] NOT NULL,
        [WeekDay] [smallint] NULL,
        PRIMARY KEY (ID) WITH FILLFACTOR = 100
    );

    -- New table with the int column, to be populated in the background and swapped in.
    CREATE TABLE [dbo].[Tmp_TestCalendar](
        [ID] [bigint] NOT NULL,
        [N_Date] [datetime] NOT NULL,
        [WeekDay] [int] NULL,
        PRIMARY KEY (ID) WITH FILLFACTOR = 100
    );

    SET NOCOUNT ON;

    -- Initial population of the original table.
    INSERT INTO dbo.TestCalendar ( ID, N_Date, WeekDay )
    SELECT tg.N ID, DATEADD(dd, N, 0) N_Date, CONVERT(SMALLINT, N % 7 + 1) WeekDay
    FROM Service.dbo.TallyGenerator(0, 50000, NULL, 1) tg
    ORDER BY ID;

    -- Copy the data into the new table in chunks of 1000 rows.
    WHILE EXISTS (SELECT * FROM dbo.TestCalendar T1
                  WHERE NOT EXISTS (SELECT * FROM dbo.[Tmp_TestCalendar] T2
                                    WHERE T2.ID = T1.ID))
    BEGIN
        INSERT INTO dbo.Tmp_TestCalendar ( ID, N_Date, WeekDay )
        SELECT TOP 1000 ID, N_Date, WeekDay
        FROM dbo.TestCalendar T1
        WHERE NOT EXISTS (SELECT * FROM dbo.[Tmp_TestCalendar] T2
                          WHERE T2.ID = T1.ID)
        ORDER BY ID;
    END

    DBCC SHOWCONTIG (TestCalendar) WITH ALL_INDEXES--, TABLERESULTS
    DBCC SHOWCONTIG (Tmp_TestCalendar) WITH ALL_INDEXES--, TABLERESULTS

    EXEC sys.sp_spaceused @objname = N'TestCalendar', @updateusage = 'true';

    -- In-place change: add an int column, copy the values over, drop the old column, rename.
    ALTER TABLE [dbo].TestCalendar ADD WeekDay_int int NULL;

    EXEC sys.sp_spaceused @objname = N'TestCalendar', @updateusage = 'true';
    GO

    UPDATE [dbo].TestCalendar
    SET WeekDay_int = WeekDay;

    EXEC sys.sp_spaceused @objname = N'TestCalendar', @updateusage = 'true';

    ALTER TABLE [dbo].TestCalendar DROP COLUMN [WeekDay];
    EXEC sp_rename 'dbo.TestCalendar.WeekDay_int', 'WeekDay', 'COLUMN';

    EXEC sys.sp_spaceused @objname = N'TestCalendar', @updateusage = 'true';

    DBCC SHOWCONTIG (TestCalendar) WITH ALL_INDEXES--, TABLERESULTS

    -- Table swap: drop the original and rename the pre-populated copy.
    DROP TABLE dbo.TestCalendar;
    EXECUTE sp_rename N'dbo.Tmp_TestCalendar', N'TestCalendar', 'OBJECT';

    EXEC sys.sp_spaceused @objname = N'TestCalendar', @updateusage = 'true';

    DBCC SHOWCONTIG (TestCalendar) WITH ALL_INDEXES--, TABLERESULTS

    And here is what SHOWCONTIG returns:

    Initial population of TestCalendar:

    DBCC SHOWCONTIG scanning 'TestCalendar' table...
    Table: 'TestCalendar' (997578592); index ID: 1, database ID: 2
    TABLE level scan performed.
    - Pages Scanned................................: 168
    - Extents Scanned..............................: 23
    - Extent Switches..............................: 22
    - Avg. Pages per Extent........................: 7.3
    - Scan Density [Best Count:Actual Count].......: 91.30% [21:23]
    - Logical Scan Fragmentation ..................: 1.79%
    - Extent Scan Fragmentation ...................: 4.35%
    - Avg. Bytes Free per Page.....................: 60.1
    - Avg. Page Density (full).....................: 99.26%

    "Sequentially" populated Tmp_TestCalendar:

    DBCC SHOWCONTIG scanning 'Tmp_TestCalendar' table...
    Table: 'Tmp_TestCalendar' (1061578820); index ID: 1, database ID: 2
    TABLE level scan performed.
    - Pages Scanned................................: 180
    - Extents Scanned..............................: 27
    - Extent Switches..............................: 26
    - Avg. Pages per Extent........................: 6.7
    - Scan Density [Best Count:Actual Count].......: 85.19% [23:27]
    - Logical Scan Fragmentation ..................: 2.78%
    - Extent Scan Fragmentation ...................: 11.11%
    - Avg. Bytes Free per Page.....................: 40.3
    - Avg. Page Density (full).....................: 99.50%

    The number of pages is slightly bigger, as we replaced a 2-byte smallint with a 4-byte int.

    Now - after altering the column in TestCalendar:

    DBCC SHOWCONTIG scanning 'TestCalendar' table...
    Table: 'TestCalendar' (997578592); index ID: 1, database ID: 2
    TABLE level scan performed.
    - Pages Scanned................................: 335
    - Extents Scanned..............................: 43
    - Extent Switches..............................: 334
    - Avg. Pages per Extent........................: 7.8
    - Scan Density [Best Count:Actual Count].......: 12.54% [42:335]
    - Logical Scan Fragmentation ..................: 99.70%
    - Extent Scan Fragmentation ...................: 4.65%
    - Avg. Bytes Free per Page.....................: 3469.0
    - Avg. Page Density (full).....................: 57.14%

    As promised - the number of pages doubled.

    Fragmentation - through the roof.

    Scan density - miserable.

    There must have been a lot of data manipulation along the way, I suppose.

    And we ended up with a mess in the table that you have to tidy up afterwards.

    Rebuilding the indexes on a big table would require as many resources as the whole previous exercise.

    And now - after renaming Tmp_TestCalendar to TestCalendar:

    DBCC SHOWCONTIG scanning 'TestCalendar' table...
    Table: 'TestCalendar' (1061578820); index ID: 1, database ID: 2
    TABLE level scan performed.
    - Pages Scanned................................: 180
    - Extents Scanned..............................: 27
    - Extent Switches..............................: 26
    - Avg. Pages per Extent........................: 6.7
    - Scan Density [Best Count:Actual Count].......: 85.19% [23:27]
    - Logical Scan Fragmentation ..................: 2.78%
    - Extent Scan Fragmentation ...................: 11.11%
    - Avg. Bytes Free per Page.....................: 40.3
    - Avg. Page Density (full).....................: 99.50%

    Nothing has changed compared to Tmp_TestCalendar.

    Not a single byte of the data got relocated.

    We've got perfectly allocated and contiguous data and index pages.

    Which way would you suggest for a half-terabyte table?

    _____________
    Code for TallyGenerator
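
    For reference, on SQL Server 2005 and later the same page counts and fragmentation figures can be read from sys.dm_db_index_physical_stats instead of the deprecated DBCC SHOWCONTIG used above; a minimal sketch against the test table:

    SELECT OBJECT_NAME(ips.object_id)          AS table_name,
           ips.index_id,
           ips.page_count,
           ips.avg_fragmentation_in_percent,   -- roughly the logical scan fragmentation
           ips.avg_page_space_used_in_percent  -- roughly the average page density
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.TestCalendar'),
                                        NULL, NULL, 'DETAILED') AS ips
    WHERE ips.index_level = 0;  -- leaf level only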

  • If you create a new column and copy the existing data to that column, you don't have to UPDATE every row in a single transaction. And you don't have to drop the table, with all the huge risks involved in that.

    You can batch the UPDATEs based on the clustering key. It's easier if that key is always ascending, but workable even if it's not. A sketch of the batching follows after this post.

    SQL DBA, SQL Server MVP (07, 08, 09). "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
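
    A minimal sketch of that batched UPDATE, assuming a hypothetical table dbo.BigTable clustered on an ascending integer ID, with a new int column Line_Number_int added alongside the old smallint Line_Number (the names are illustrative, not from the thread):

    DECLARE @LastID    bigint = 0,
            @MaxID     bigint,
            @BatchSize int    = 100000;

    SELECT @MaxID = MAX(ID) FROM dbo.BigTable;

    WHILE @LastID < @MaxID
    BEGIN
        -- Each batch is its own small transaction over one contiguous clustered-key range.
        UPDATE dbo.BigTable
        SET    Line_Number_int = Line_Number
        WHERE  ID >  @LastID
           AND ID <= @LastID + @BatchSize;

        SET @LastID = @LastID + @BatchSize;

        -- Optional: pause, or take a log backup, between batches to keep the log small.
        -- WAITFOR DELAY '00:00:01';
    END;

    Because each range update commits on its own, readers and writers are only blocked for the duration of one small batch rather than for the whole table.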

  • Yes, this is a method I'm using to gradually purge old, unneeded data from various tables (in this case deleting one row at a time):

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SET NOCOUNT ON;
    ---SET DEADLOCK_PRIORITY 10;

    USE COLLxxxxxxxx;
    GO

    DECLARE @errmessage VARCHAR(150);
    DECLARE @msg VARCHAR(300);
    DECLARE @logpercentage BIGINT;
    DECLARE @ErrMsg NVARCHAR(4000) ,
            @ErrSeverity INT;
    DECLARE @rowcount BIGINT;
    DECLARE @counter INT;

    SET @errmessage = 'Starting at ' + CAST(GETDATE() AS VARCHAR(20));
    RAISERROR(@errmessage, 10, 1) WITH NOWAIT;

    ---- Begin PurgeProdFile_Record.sql 1 hr 10 min perfqa

    -- Build a work table of the keys to be purged.
    IF OBJECT_ID('tempdb..#tempdataFR') IS NOT NULL
        DROP TABLE #tempdataFR;

    CREATE TABLE #tempdataFR
    (
        ID INT IDENTITY(1, 1) PRIMARY KEY ,
        File_Record_ID INT
    );

    INSERT INTO #tempdataFR
    SELECT TOP ( 2000000 )
           FILE_RECORD_ID
    FROM COLLxxxxxxxx.dbo.FILE_RECORD
    WHERE FILE_RECORD.FILE_RECORD_ID IN (
          SELECT FILE_RECORD.FILE_RECORD_ID
          FROM FILE_RECORD
               INNER JOIN MESSAGE ON FILE_RECORD.MESSAGE_ID = MESSAGE.MESSAGE_ID
               INNER JOIN [FILE] ON [FILE].FILE_ID = FILE_RECORD.FILE_ID
          WHERE FILE_RECORD.CONCRETE_TYPE = 'Fdi.Po.ClientFileRecord'
                AND [FILE].CONCRETE_TYPE = 'Fdi.Po.ClientFile'
                AND MESSAGE.CONCRETE_TYPE = 'Fdi.Po.ClientUploadMessage'
                AND FILE_RECORD.MESSAGE_DIRECTION = 'OUTGOING'
                AND FILE_TRANSFORMATION_COMPLETION_STATE = 'EXPORTED'
                AND FILE_RECORD.CREATED_DATE_TIME < GETDATE() - 365 );

    SET @rowcount = @@ROWCOUNT;
    SET @errmessage = 'temp table for File_Record has '
                      + CAST(@rowcount AS VARCHAR(20)) + ' at ' + CAST(GETDATE() AS VARCHAR(20));
    RAISERROR(@errmessage, 10, 1) WITH NOWAIT;

    CREATE INDEX IDX_tempdataFRIdcolumn ON #tempdataFR(File_Record_ID);

    SELECT @counter = MAX(ID)
    FROM #tempdataFR;

    SET @rowcount = 1;

    -- Delete one keyed row per iteration, working backwards through the work table.
    WHILE @rowcount > 0
          AND @counter > 0
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;

            DELETE TOP ( 1 )
            FROM COLLxxxxxxxx.dbo.FILE_RECORD
            WHERE FILE_RECORD.FILE_RECORD_ID IN ( SELECT File_Record_ID
                                                  FROM #tempdataFR
                                                  WHERE [ID] = @counter );

            SET @rowcount = @@ROWCOUNT;
            SET @counter = @counter - 1;

            IF ( @counter % 10000 ) = 0
            BEGIN
                SET @errmessage = 'Counter down to '
                                  + CAST(@counter AS VARCHAR(20))
                                  + ' in File Record at '
                                  + CAST(GETDATE() AS VARCHAR(20));
                RAISERROR(@errmessage, 10, 1) WITH NOWAIT;
            END;

            COMMIT;

            -- Throttle if the transaction log is filling up.
            SELECT @logpercentage = cntr_value
            FROM sys.dm_os_performance_counters
            WHERE instance_name = 'COLLxxxxxxxx'
                  AND counter_name = 'Percent Log Used';

            IF @logpercentage > 30
            BEGIN
                SET @msg = ' log more than 30 percent full, waiting 5 min';
                RAISERROR(@msg, 10, 1) WITH NOWAIT;
                WAITFOR DELAY '00:05:00';
            END;
        END TRY
        BEGIN CATCH
            -- There was an error
            IF @@TRANCOUNT > 0
                ROLLBACK;

            -- Raise an error with the details of the exception
            SELECT @ErrMsg = ERROR_MESSAGE() ,
                   @ErrSeverity = ERROR_SEVERITY();
            RAISERROR(@ErrMsg, @ErrSeverity, 1);

            --DECLARE @emailmessage VARCHAR(350)
            -- SET @emailmessage='Purge script failed: ' + @errmsg
            --EXECUTE msdb.dbo.sp_notify_operator @name=N'TestOperator',@body=@emailmessage
        END CATCH;

        --------------WAITFOR DELAY '00:00:05';
    END;

    IF OBJECT_ID('tempdb..#tempdataFR') IS NOT NULL
        DROP TABLE #tempdataFR;

    --- End PurgeProdFile_Record.sql

  • Deleting a single row at a time is probably not the smartest idea.

    The process by itself creates so much overhead that deleting 10 or even 100 rows at a time may not affect the performance of each cycle significantly.

    Estimate how many rows you have per page and delete in batches of the number of rows that would fit in an extent.

    Another thing - for the best effectiveness you need to follow the order defined by the clustered index.

    Removing a contiguous block of, say, 100 rows from the tail of the table would be insanely faster than 100 scattered single-row removals.

    So, when you choose the records for the #temp table and when you do the actual deletion, make sure you force the appropriate order of records in the TOP (N) queries (see the sketch after this post).

    _____________
    Code for TallyGenerator
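
    A minimal sketch of that advice, assuming the clustered index on FILE_RECORD is on FILE_RECORD_ID and reusing the #tempdataFR work table from the script above; the block size of 100 is only a guess at roughly an extent's worth of rows:

    DECLARE @BatchSize int = 100;
    DECLARE @Keys TABLE (File_Record_ID int PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        DELETE FROM @Keys;

        -- Take the next block of keys from the tail of the clustered index order.
        ;WITH nextBlock AS (
            SELECT TOP (@BatchSize) File_Record_ID
            FROM   #tempdataFR
            ORDER BY File_Record_ID DESC
        )
        DELETE FROM nextBlock
        OUTPUT deleted.File_Record_ID INTO @Keys;

        IF @@ROWCOUNT = 0
            BREAK;

        -- Remove the matching rows from the base table as one contiguous block.
        DELETE fr
        FROM COLLxxxxxxxx.dbo.FILE_RECORD AS fr
             INNER JOIN @Keys AS k ON k.File_Record_ID = fr.FILE_RECORD_ID;
    END;

    The block of keys is consumed from the work table as it is processed; each DELETE remains a short transaction, so blocking stays limited while far fewer statements are issued than with single-row deletes.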

  • Good idea on ordering the temp table. I may experiment with slightly larger deletions, but management's main concern is no blocking/locking.
