December 16, 2009 at 4:43 pm
I have a table that takes around 12 GB of disk space and has nearly 70 columns with different data types. The row count is more than 17 million. None of the columns uses varchar(max), text, or any other large data type.
Now I have to add 6 more columns to this table. After adding those columns, I have to update them based on other columns in the same table.
It takes a lot of time to update a single column in one statement, nearly 1 hour. I have to run update statements for 6 columns, and I don't think I should waste 6 hours on it.
I am thinking of taking a different approach which I haven't tested yet, and that's where I need your opinions: using SELECT INTO to copy the existing table into a new table that includes all 6 new columns. Once that completes, I'll drop the existing table and rename the new table to the original name.
But with this approach I have to take care of the identity columns and all the constraints and indexes on the new table. Is it worth following this approach rather than using update statements? If yes, is there anything I've missed that I need to take care of?
December 16, 2009 at 5:03 pm
If the update to those columns is based on data from columns in the same row, why not just create computed columns and call it a day with no updates?
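For example, if a new value were derived purely from other columns in the same row, something as simple as this would do it (the table and column names here are made up):
--Hypothetical example: NewCol is derived from two existing columns.
ALTER TABLE dbo.MyTable
ADD NewCol AS (ColA * ColB) PERSISTED --PERSISTED is optional; it stores the computed value on disk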
By the way... the problem you're running into is called the "tipping point" and every system has one. My recommendation would be to split the job up so you update about a million rows at a time. Depending on what you're actually doing and how many indexes you have and the shape of your clustered index and if you have any triggers, you should be able to update a million rows about every 7 seconds. And, no... that's not a typo.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2009 at 5:46 pm
Before answering your comments, I'd like to tell you more about what I am doing. I am migrating data with transformations in a staging database and then moving the data to a destination database. I have pulled data from different source databases and, based on business requirements, created this big table in the staging database. Once I am done with all the transformations, I'll migrate the clean data to the destination database.
Jeff Moden (12/16/2009)
If the update to those columns is based on data from columns in the same row, why not just create computed columns and call it a day with no updates?
Yes, these new columns will be updated based on column values in the same row. But they can't be computed columns, as the values depend on lots of different business scenarios, and I am using an UPDATE statement with a CASE expression to populate them.
Updating these new columns will be a one-off job, as no more data will be added to this staging table.
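Just to give an idea of the shape of what I am running (the table name, column names, and conditions below are simplified placeholders, not the real business rules):
--Simplified placeholder version of the one-off update; the real one has many more branches.
UPDATE dbo.StagingTable
SET NewCol1 = CASE
                  WHEN SourceColA = 'X' AND SourceColB > 0 THEN 'Scenario 1'
                  WHEN SourceColA = 'Y' THEN 'Scenario 2'
                  ELSE 'Default'
              END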
By the way... the problem you're running into is called the "tipping point" and every system has one. My recommendation would be to split the job up so you update about a million rows at a time. Depending on what you're actually doing and how many indexes you have and the shape of your clustered index and if you have any triggers, you should be able to update a million rows about every 7 seconds. And, no... that's not a typo.
I read about "tipping point" after your post and it was interesting to find how big impact it can have on the execution timings. Now about the table informaiton, there are 3 indexes all of them non clustered, where 1 is unique non clustered acting as a primary key. (a question in the last of post based on type of index). No triggers defined on the table.
About running in batches, do you mean by a script which i have to write keeping track of number of rows updated in the table and then run the update statement? If you have any reference handy about the pros and cons of using this approach, that will be great. I am in a habit of doing my research first rather than getting onto machine.
Question about choice of clustered and non clustered index. I have been told by my seniors that it is not a good practise to have clustered index on large tables when you perform updates on columns? One can have non clustered indexes which perform better on big tables. Is it TRUE?
December 16, 2009 at 6:01 pm
you'd simply change your update statement just a little bit, so it only updates where the destination column is null;
something like this:
SET ROWCOUNT 500000 --only fiddle with half a million rows at a time.
WHILE 1=1
BEGIN
UPDATE MYTABLE
SET DESTCOLUMN = CASE..... END --the END here closes the CASE expression
FROM OTHERTABLES
WHERE ...
AND MYTABLE.DESTCOLUMN IS NULL
--add a bailout condition
IF @@ROWCOUNT = 0
BREAK
END
SET ROWCOUNT 0
you just keep repeating until you are done.
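if you'd rather not rely on SET ROWCOUNT (Microsoft recommends moving away from it for INSERT/UPDATE/DELETE), the same loop can be written with UPDATE TOP; every name below is just a placeholder:
WHILE 1 = 1
BEGIN
--dbo.MyTable, dbo.OtherTable, and the column names are placeholders
UPDATE TOP (500000) t
SET t.DestColumn = o.SourceColumn
FROM dbo.MyTable t
INNER JOIN dbo.OtherTable o ON o.id = t.id
WHERE t.DestColumn IS NULL --only rows not yet populated
IF @@ROWCOUNT = 0
BREAK --nothing left to update
END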
Lowell
December 16, 2009 at 6:33 pm
Lowell (12/16/2009)
you'd simply change your update statement just a little bit, so it only updates where the destination column is null;
Lowell, as the destination column is a new column added to the table, all its values will be NULL, so why is there a need to restrict the update to rows where the destination column is null?
December 16, 2009 at 6:44 pm
Is it just so that I don't end up updating the same records again and again?
December 16, 2009 at 7:07 pm
Lowell, based on your example code, I have written my query below, with the CASE statements removed to keep it simple and make the logic clear:
SET ROWCOUNT 500000 --only fiddle with half a million rows at a time.
WHILE 1=1
BEGIN
UPDATE BH
SET BH.xx = VN.yy
FROM dbo.aa BH
INNER JOIN dbo.bb VN ON BH.recordid = VN.recid
WHERE BH.xx IS NULL
IF @@ROWCOUNT = 0
BREAK
END
SET ROWCOUNT 0
Here BH.xx is the newly added column which will be updated with VN.yy
A big question: what if VN.yy computes a NULL value, or a record is not matched by the join at all? Those records will still have BH.xx as NULL, so they will be included in every subsequent 500000-row batch as well. How can I keep them from being reprocessed? Would it be fine to use another column, BH.recordid? The idea would be to keep track of recordid in a variable and update records based on that variable's value together with BH.xx is null.
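The rough idea I have in mind is something like this (same table and column names as my query above; the 500000 is just the batch size):
--Rough sketch: walk forward through recordid ranges so no row is visited twice
DECLARE @lastid INT
SET @lastid = 0
WHILE 1 = 1
BEGIN
UPDATE BH
SET BH.xx = VN.yy
FROM dbo.aa BH
INNER JOIN dbo.bb VN ON BH.recordid = VN.recid
WHERE BH.xx IS NULL
AND BH.recordid > @lastid
AND BH.recordid <= @lastid + 500000
SET @lastid = @lastid + 500000
--stop once we have walked past the highest recordid in the table
IF @lastid >= (SELECT MAX(recordid) FROM dbo.aa)
BREAK
END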
December 16, 2009 at 9:26 pm
yes, you could use a loop from one to twenty or something, so it only iterates thru ranges of id's; you were right, the idea is to not let it process the same records over and over again.
I just assumed that every column got updated.
chances are you could use the counter and the same kind of update statement, you know,
like WHERE recid between (@counter * 500000) and (@counter * 500000) + 500000
not sure on the math on that, it's late, but it seems like the right idea.
something like this?
declare @i int, @rc int
SET @i = 0
SET ROWCOUNT 500000 --only fiddle with half a million rows at a time.
WHILE 1=1
BEGIN
UPDATE BH
SET BH.xx = VN.yy
FROM dbo.aa BH
INNER JOIN dbo.bb VN ON BH.recordid = VN.recid
WHERE BH.xx is null
--first iteration between 1 and 500000
and BH.recordid BETWEEN (@i * 500000) + 1 and (@i * 500000) + 500000
--grab the row count right away; the SET @i below would reset @@ROWCOUNT to 1
SET @rc = @@ROWCOUNT
SET @i = @i + 1
--either everything is updated
IF @rc = 0
BREAK
--or we whipped thru 20 iterations
IF @i > 20
BREAK
END
SET ROWCOUNT 0
Lowell
December 17, 2009 at 8:53 pm
Is it really possible for the execution time to drop from 1 hour to nearly 10 seconds after running the updates in batches?
That's what I am getting, and I am not confident about whether I am doing it correctly or not :unsure:
December 17, 2009 at 10:32 pm
yes! that is the "tipping point" Jeff was referring to; most huge updates take seconds/under a minute total, instead of hours.
after your update, have you done a select to confirm the data changed?
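something as simple as this would do, using the placeholder names from the earlier examples:
--how many rows still have the new column unpopulated?
SELECT COUNT(*) AS rows_still_null
FROM dbo.aa
WHERE xx IS NULL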
at a certain point the number of rows being updated/inserted/deleted can throw the execution plan to hell, and it becomes grossly inefficient/slow.
do a search for tipping point here on SSC and you'll see lots of similar situations.
Lowell
December 24, 2009 at 4:46 am
Hi and sorry for my english
I think you are facing the problem of "forwarded records".
Once you update the new columns in a heap (no clustered index), the enlarged records no longer fit into the page they belong to, so SQL Server must move them to another page and leave behind a pointer to the forwarded record.
To confirm this, you can query sys.dm_db_index_physical_stats for the table and check the forwarded_record_count column.
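For example (the query assumes the current database; the table name is a placeholder):
SELECT index_type_desc, forwarded_record_count, page_count
FROM sys.dm_db_index_physical_stats(
DB_ID(), --current database
OBJECT_ID('dbo.BigStagingTable'), --placeholder table name
NULL, NULL, 'DETAILED') --forwarded_record_count is NULL in the default LIMITED mode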
The workaround is to create a clustered index on the table. This rebuilds the table and removes the forwarded records. You can drop that index afterwards if you want, but then you may end up in the same situation again.
Be aware that it is not only the updates that are affected by these records. Every query that performs any kind of SCAN on the table will be seriously affected, because SQL Server must read all the remote pages created by the forwarded records; besides the greater number of pages, the scan is much less sequential than it should be, so you get more random reads, and that hurts performance.
hope it helps
January 11, 2010 at 7:16 pm
I am facing a problem with updates in batches. What I thought earlier, and what my understanding was, is that if I run an update on a table, either in batches or in a single statement, it will update the records sequentially (ascending) if there is a clustered index (ascending) defined.
Now, when I run the update statement in batches of 500000, the first record to be updated is record 500000.
As far as I know, we can't use an ORDER BY clause with an update statement. Is it possible to force the update to cover records 1 to 500000 in the first batch, then 500001 to 1000000, and so on?
January 11, 2010 at 7:28 pm
yes Anam, you can...the last post I made with a code example did exactly that...the code was assuming there is a column named recordid, and it was the PK identity in the table. it's using a simple counter, and doing recordids between two values, until nothing updates any more OR 20 looped iterations occur, whichever occurs first.
here it is again:
declare @i int, @rc int
SET @i = 0
SET ROWCOUNT 500000 --only fiddle with half a million rows at a time.
WHILE 1=1
BEGIN
UPDATE BH
SET BH.xx = VN.yy
FROM dbo.aa BH
INNER JOIN dbo.bb VN ON BH.recordid = VN.recid
WHERE BH.xx is null
--first iteration between 1 and 500000
and BH.recordid BETWEEN (@i * 500000) + 1 and (@i * 500000) + 500000
--grab the row count right away; the SET @i below would reset @@ROWCOUNT to 1
SET @rc = @@ROWCOUNT
SET @i = @i + 1
--either everything is updated
IF @rc = 0
BREAK
--or we whipped thru 20 iterations
IF @i > 20
BREAK
END
SET ROWCOUNT 0
Lowell
January 11, 2010 at 7:40 pm
Lowell (1/11/2010)
yes Anam, you can...the last post I made with a code example did exactly that...the code was assuming there is a column named recordid, and it was the PK identity in the table. it's using a simple counter, and doing recordids between two values, until nothing updates any more OR 20 looped iterations occur, whichever occurs first. here it is again:
declare @i int, @rc int
SET @i = 0
SET ROWCOUNT 500000 --only fiddle with half a million rows at a time.
WHILE 1=1
BEGIN
UPDATE BH
SET BH.xx = VN.yy
FROM dbo.aa BH
INNER JOIN dbo.bb VN ON BH.recordid = VN.recid
WHERE BH.xx is null
--first iteration between 1 and 500000
and BH.recordid BETWEEN (@i * 500000) + 1 and (@i * 500000) + 500000
--grab the row count right away; the SET @i below would reset @@ROWCOUNT to 1
SET @rc = @@ROWCOUNT
SET @i = @i + 1
--either everything is updated
IF @rc = 0
BREAK
--or we whipped thru 20 iterations
IF @i > 20
BREAK
END
SET ROWCOUNT 0
Lowell, I used the same code you gave here, and it has helped me with lots of update statements.
Now the update statement I am running is in turn firing a trigger that inserts records into 2 different tables. When the trigger inserts records into the first table, the first record inserted has recordid 500000. Does SQL Server behave in some different way that I am not aware of?