March 24, 2011 at 1:36 am
Hello,
We have a data processing team in our organization that needs to work with millions of records every day. The number of records has now reached 7.5 million and is expected to grow further every month.
We have a process that concatenates / updates / matches the char and varchar columns in the table (it is a single lookup table).
Recently, the queries that update the columns using functions like LEFT / RIGHT etc. have been taking more than 12 hours to complete. An example is given below.
update table_name
set address = address1 + ' ' + address2 + ' ' + address3
update table_name
set address_part = left(address,11) -- This query took more than 15 hours to complete.
The data type of the address column is varchar(255).
A former associate of our company had created a non-clustered index on the address column. We dropped the index before the first query and recreated it afterwards (the column is also used in SELECT queries once these columns are populated). The index was still in place while the second query, which took more than 15 hours, was executing.
Could it be the non-clustered index that caused the problem, because it was created on the varchar column?
Is there a better way to do this kind of update?
Thanks in advance.
Rohit
March 24, 2011 at 1:51 am
There are no WHERE clauses on any of this, and you're mass updating in multiple passes... Yeah, that's gonna hurt.
The NC index really only hurts if you're updating something in it (an included column or part of the key), because the index has to be updated as well. It's the cost of indexing, and it's *usually* not that bad.
I have to ask, are you bringing in 7.5 million records a day, or is that your total dataset?
Do you not carry something like 'lastUpdateStamp' as a timestamp so you know which addresses need to be changed and which are still good?
If you're going to flat out update 7.5 million records daily, it might be better if we could see the DDL for the table and all the indexes, and the entire list of updates you perform as daily maintenance. Then we'll probably ask some questions about your expected per-day data load/change methodology.
You'll probably end up doing some partitioning, some index manipulation, maybe an extra column or two for change identifications, things like that. We should be able to help you clean this up, but it's going to take a bunch of details.
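Just to illustrate the 'lastUpdateStamp' idea (the column and the date filter below are made up for the sketch, since we haven't seen your DDL), something along these lines lets the monthly pass touch only the rows that actually changed:
-- sketch only: add a change-tracking column (name is illustrative)
alter table dbo.table_name add LastUpdateStamp datetime null
-- your load process stamps the rows it inserts or changes:
--   ... set LastUpdateStamp = getdate() ...
-- the address rebuild then only touches recently changed rows
update dbo.table_name
set    address      = address1 + ' ' + address2 + ' ' + address3,
       address_part = left(address1 + ' ' + address2 + ' ' + address3, 11)
where  LastUpdateStamp >= dateadd(month, -1, getdate())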
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
March 25, 2011 at 6:55 am
1) I would consider doing both of those updates in a single pass.
2) Since you are potentially increasing the length of the address field, you are probably getting lots of page splits and/or have VERY fragmented table and index structures. Do you run regular index defrags? Also, a lower fill factor could help avoid the page splits, but at the expense of more reads if you do scans of the table.
3) It is almost certainly the case that you don't need to update every record all the time. Quite possibly you only need to update a very small fraction of the rows (say, newly loaded ones). If that is the case, add an "isupdatable" bit column and use that to filter the updates. Alternatively, you could simply add a WHERE clause with a <> filter comparing the actual column being updated to the value you are about to update it to. That's still a lot of reads for the table scans and some CPU burn for the comparisons, but if you only update a relatively small percentage of rows it's likely a huge perf gain. A sketch of 1) and 3) combined follows below.
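Something like this (a sketch only, using the column names from your post; NULL handling is ignored here, and remember that concatenating a NULL yields NULL under the default CONCAT_NULL_YIELDS_NULL setting):
-- single pass, and only rows whose address would actually change
update table_name
set    address      = address1 + ' ' + address2 + ' ' + address3,
       address_part = left(address1 + ' ' + address2 + ' ' + address3, 11)
where  address is null
   or  address <> address1 + ' ' + address2 + ' ' + address3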
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 25, 2011 at 7:02 am
Hello,
Thanks for your replies. We receive the updates every month. We have an encrypted database and need to run certain tools to generate the lookup table.
As both of you have suggested, we are planning to load only the incremental records into the table and then process the address part.
March 25, 2011 at 7:13 am
Rohit
You're breaking the laws of normalisation by having redundant data in your table. Why not change the columns you are updating to computed columns instead?
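For example, something like this (a sketch only, using the column names from your post; you'd have to drop the existing index on address first, and decide whether you want the columns PERSISTED so they can be indexed):
-- sketch: replace the stored columns with computed ones
alter table table_name drop column address, address_part
alter table table_name add
    address      as (address1 + ' ' + address2 + ' ' + address3),
    address_part as (left(address1 + ' ' + address2 + ' ' + address3, 11))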
John
March 25, 2011 at 7:33 am
John Mitchell-245523 (3/25/2011)
Rohit
You're breaking the laws of normalisation by having redundant data in your table. Why not change the columns you are updating to computed columns instead?
John
Computed columns would have them paying the price for computations over the course of the entire month. Probably best to take the hit once, unless the data is infrequently accessed.
Oh, one other point to the OP: I would add a step to your monthly "prep-the-data" process to defrag (possibly completely rebuild?) the indexes after your update process. Likewise, if you don't already drop all the indexes before you update all the records, it's probably best to do that and then recreate them when done.
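Something along these lines, for example (the index name is made up since we haven't seen your DDL, and the fill factor is only a suggestion tied to my earlier point about page splits):
-- before the mass update (index name is illustrative)
drop index IX_table_name_address on table_name
-- ... run the update(s) here ...
-- afterwards, recreate it and rebuild whatever else is fragmented
create nonclustered index IX_table_name_address on table_name (address)
alter index all on table_name rebuild with (fillfactor = 90)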
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 25, 2011 at 7:41 am
TheSQLGuru (3/25/2011)
Computed columns would have them paying the price for computations over the course of the entire month. Probably best to take the hit once, unless the data is infrequently accessed.
Fair point, although you have to balance that against the extra space that all this extra character data will take up in the table, meaning more page reads required in order to return any requested data. The original poster would need to take everything into account and test to see what gives the best performance. Another alternative is to dispense with those columns altogether and have the presentation layer (if there is one) do the string manipulation.
John
March 25, 2011 at 8:24 am
John Mitchell-245523 (3/25/2011)
TheSQLGuru (3/25/2011)
Computed columns would have them paying the price for computations over the course of the entire month. Probably best to take the hit once, unless the data is infrequently accessed.
Fair point, although you have to balance that against the extra space that all this extra character data will take up in the table, meaning more page reads required in order to return any requested data. The original poster would need to take everything into account and test to see what gives the best performance. Another alternative is to dispense with those columns altogether and have the presentation layer (if there is one) do the string manipulation.
John
Assuming seeks are the predominant type of access, the extra space won't matter much for performance.
Having the presentation layer do the string manipulation could be a big win if you have a scale-out model, such as multiple app/web servers, or if it is done directly on client machines by a fat app.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 25, 2011 at 5:19 pm
SQLRO (3/24/2011)
Hello,
We have a data processing team in our organization that needs to work with millions of records every day. The number of records has now reached 7.5 million and is expected to grow further every month.
We have a process that concatenates / updates / matches the char and varchar columns in the table (it is a single lookup table).
Recently, the queries that update the columns using functions like LEFT / RIGHT etc. have been taking more than 12 hours to complete. An example is given below.
update table_name
set address = address1 + ' ' + address2 + ' ' + address3
update table_name
set address_part = left(address,11) -- This query took more than 15 hours to complete.
The data type of the address column is varchar(255).
A former associate of our company had created a non-clustered index on the address column. We dropped the index before the first query and recreated it afterwards (the column is also used in SELECT queries once these columns are populated). The index was still in place while the second query, which took more than 15 hours, was executing.
Could it be the non-clustered index that caused the problem, because it was created on the varchar column?
Is there a better way to do this kind of update?
Thanks in advance.
Rohit
Every system has a "tipping point" and I suspect you've reached it. Break the query up into million-row updates.
By "tipping point" I mean that a one million row update may take just a few seconds, a two million row update may take only twice that, and a three million row update may take 2 to 4 hours. That's the tipping point being reached.
Divide'n'Conquer.
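The general shape of it looks something like this (just a sketch; the WHERE clause below is made up, so substitute whatever filter lets you tell "done" rows from "not done" rows in your table):
declare @Rows int
set @Rows = 1
while @Rows > 0
begin
    update top (1000000) table_name
    set    address      = address1 + ' ' + address2 + ' ' + address3,
           address_part = left(address1 + ' ' + address2 + ' ' + address3, 11)
    where  address is null   -- illustrative filter: must exclude rows already done
    set @Rows = @@rowcount
end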
P.S. Are you doing anything strange during nightly maintenance such as shrinking the log?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 27, 2011 at 11:46 pm
Basically, we do not have any application accessing this table, therefore we did not plan for normalization.
We are dropping the indexes and recreating them every month. From next month onward we are planning to rebuild the indexes on the same table, as we will be applying only incremental updates to this table rather than replacing the entire set of records.
March 27, 2011 at 11:49 pm
Hi Jeff,
We do shrink the log, but not every night. We receive the data in lots (45K to 400K rows) which need to be matched with the data in this table.
Once the matching process is completed for one lot, we shrink the log, as it grows to take up a few GB of disk space.
March 28, 2011 at 12:20 am
SQLRO (3/27/2011)
Hi Jeff,
We do shrink the log, but not every night. We receive the data in lots (45K to 400K rows) which need to be matched with the data in this table.
Once the matching process is completed for one lot, we shrink the log, as it grows to take up a few GB of disk space.
I'm not sure if you can do it, but update the 7.5 million rows in smaller chunks of a million rows or so. Like I said, it sounds like you may have hit the "tipping point" for your system.
Also, stop shrinking the log. It's just going to grow again. If you're not doing log backups at least twice a day, you may want to consider either a recovery model other than FULL, or maybe just move all of this work to another database that uses the SIMPLE recovery model.
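If you do go the SIMPLE recovery route (or add log backups instead), that part is just something like this (the database name and backup path are made up for the sketch):
alter database YourLookupDB set recovery simple
-- or, if you stay in full recovery, back the log up regularly instead of shrinking it
backup log YourLookupDB to disk = 'D:\Backups\YourLookupDB_log.trn'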
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2011 at 12:25 am
Thanks Jeff! We have changed the database recovery model to Simple for this database. I will check how much the log grows after the respective team runs their matching queries.
March 28, 2011 at 12:34 am
SQLRO (3/28/2011)
Thanks Jeff! We have changed the database recovery model to Simple for this database. I will check how much the log grows after the respective team runs their matching queries.
You're still going to need to split up the updates even after that. Most systems have a tipping point between 2 and 4 million rows depending on the number of columns being updated.
On my home system, I updated 1 million rows in 7 seconds. On 2 million rows, it took about 14 seconds, as expected. On 3 million rows, it took more than 20 minutes.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2011 at 12:36 am
Thanks a lot. This information is really valuable. I will check it out and will post here if I find anything interesting.