April 15, 2008 at 3:52 pm
Here's the situation...
I have 18 million rows in a table...now we are adding a few columns to this table.
I need to loop through each row and perform some nasty joins to determine the values for the new fields.
It is a system that has no primary keys or referential integrity...so that is partially why the joins are so nasty. (NOT MY DESIGN....I'm just stuck with it )
The problem is that it is taking an insane amount of time to perform this operation....which led to an argument...and this is where I am looking for some proof one way or the other.
We have a cursor for the looping structure.
I created an update cursor...and declared local variables for the new values...for each record, I update the current row in the cursor with the new values.
A co-worker tells me that it would be faster to cursor through the 18M row table and for each row...insert into a new table...appending the new fields at the time of insert. Then when we are all done...rename the tables so that the new one replaces the old one.
I originally tried copying all the data into a temp table...then looping through that...(at least this way I was able to obtain a rowID for each record...making the update faster)...but that still took too damn long and eventually hit the intra-query parallelism error....so I'm trying other avenues...which leads me to using a cursor.
Anyone else have experience using Insert and Update Cursors? Is one better than the other?
Thanks in advance...
April 15, 2008 at 5:09 pm
Your co-worker is, well... wrong! Even if you had the best indexing in the world on all the tables, updating 18 million rows with new data for the new columns would take a very long time. Updating a million rows may take only a couple of minutes, though. There seems to be a "tipping point" in each server having to do with the amount of memory and a couple of other things. By "tipping point", I mean that if 1 million rows takes a minute to update, you would expect 2 million rows to take only 2 minutes.... and it probably will. Same for 3 or 4 million rows, depending on how you're doing the update... might only take 3 or 4 minutes. But then you get to 5 million rows... from the previous pattern, you'd expect it to take only 5 minutes when in reality it may take HOURS!
The way to eliminate the tipping point problem is to make a loop that updates a million or fewer rows at a time, ending only when there's nothing left to update. If you truncate the log in each loop, the transaction log won't get so big, either. Of course, do a backup before you start the massive update.
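Something like this rough skeleton... all of the table, column, and database names here are made up, so plug in your own objects and the real update logic from your joins:
--===== ROUGH SKELETON ONLY... names are made up. Do a full backup first, and
--      only truncate the log if you're in SIMPLE recovery or can live with
--      breaking the log chain.
SET ROWCOUNT 1000000   --Limit every statement below to 1 million rows
--===== "Prime the pump"... the first batch sets @@ROWCOUNT for the loop
 UPDATE dbo.YourBigTable
    SET NewColumn = 'whatever your nasty join works out'   --placeholder
  WHERE NewColumn IS NULL
WHILE @@ROWCOUNT > 0
 BEGIN
        --===== Keep the log from exploding (SQL 2000/2005 syntax)
        BACKUP LOG YourDatabase WITH TRUNCATE_ONLY
        --===== Next batch... keep this as the LAST statement in the loop
        --      so @@ROWCOUNT feeds the WHILE correctly
        UPDATE dbo.YourBigTable
           SET NewColumn = 'whatever your nasty join works out'   --placeholder
         WHERE NewColumn IS NULL
 END
--===== Restore the ability to process more than 1 million rows
SET ROWCOUNT 0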
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2008 at 5:51 pm
Each time I read "18 Million row cursor" in your post - there's a chill going down my spine, because that will take days to process.
Is there a way you can describe what you're doing as part of this "ugly process"? I can't help but think that something set-based applied against a "walking subset" of the table to update, say, 50 to 100 thousand rows at a time, will get through it WAY faster than anything running 18 million individual row updates. There are LOTS of people here who are more than dead-set against cursors, who would likely be happy to help you come up with something set-based to do this update for you.
I don't know if you've tried this yet:
- insert all of the existing data into a new table that has the existing columns plus blank fields for the new stuff. Have an identity column auto-generate itself, and make it the clustered index (for now, might change later).
- use a process that takes chunks of those identities and updates them, much preferably using something set-based. So - update rows 1-50,000, then 50,001-100,000, etc.... until all rows are updated (see the sketch below).
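Strictly as a sketch - the staging table, the RowID identity column, and the join source are all placeholders for whatever your real objects and "nasty join" look like:
-- SKETCH ONLY - dbo.ClaimsStaging, RowID, and dbo.CoverageSource are
-- placeholders, not your real objects.
DECLARE @BatchSize INT, @Lo INT, @Hi INT
SET @BatchSize = 50000
SET @Lo = 1
SELECT @Hi = MAX(RowID) FROM dbo.ClaimsStaging
WHILE @Lo <= @Hi
 BEGIN
        -- Each pass hits one tight range of the clustered identity
        UPDATE s
           SET s.NewCoverageCode = x.CoverageCode   --result of the "nasty join"
          FROM dbo.ClaimsStaging s
          JOIN dbo.CoverageSource x                 --placeholder for the real join
            ON x.ClaimNumber = s.ClaimNumber
         WHERE s.RowID BETWEEN @Lo AND @Lo + @BatchSize - 1
        SET @Lo = @Lo + @BatchSize
 END
Since RowID is the clustered index, each batch becomes one clean range seek instead of 18 million scattered single-row updates.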
Really - give us a rundown of this "ugly process" - I would venture a guess there's a way to do it....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 15, 2008 at 6:36 pm
I agree with you 100 percent...if I can figure out a way to make a set-based update instead...I would be so happy. I do my best to avoid cursors every chance I get...usually looping through a table variable is what I end up doing, as long as the data set isn't too large...otherwise I'll use a temp table...
The good news is that I was able to determine that 12.5 million records won't have values after the "nasty join"...so I was able to eliminate them from the data set...so now I'm down to 5.5 million...a little better...
A little background....
These records are medical claims....I am joining to a transaction table which is joined to itself (so I can get the most recent transaction for each claim based on a transaction timestamp)...it's a clumsy way to do it...but as I said earlier...there are no PK's to join on....so it's the only way I can figure it out....
The new field I am appending to the claims record is a coverage code that will tell me what pay structure the claim was paid under...
Now since a particular benefit plan can change over time...a coverage code of xyz can mean different pay structures...depending on the claim date...so once I get the coverage code appended to the claim record...I can join to the benefit table on the coverage code and claim date.
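In rough form, the logic I'm after for each claim boils down to something like this (I've changed every name here, and the real join keys - SSN, timestamps, etc. - are uglier than this):
-- ROUGH SHAPE ONLY... every name is changed, and the real join keys
-- (SSN, timestamps, etc.) are uglier than this.
 SELECT c.ClaimNumber,
        c.ClaimDate,
        t.CoverageCode,
        b.PayStructure
   FROM dbo.Claims c
   -- most recent transaction per claim, picked by transaction timestamp
   JOIN dbo.Transactions t
     ON t.ClaimNumber = c.ClaimNumber
   JOIN (SELECT ClaimNumber, MAX(TransTimestamp) AS MaxTS
           FROM dbo.Transactions
          GROUP BY ClaimNumber) lastTran
     ON lastTran.ClaimNumber = t.ClaimNumber
    AND lastTran.MaxTS       = t.TransTimestamp
   -- the same coverage code can mean different pay structures over time,
   -- so the benefit lookup also has to match on the claim date range
   JOIN dbo.Benefits b
     ON b.CoverageCode = t.CoverageCode
    AND c.ClaimDate   >= b.EffectiveDate
    AND c.ClaimDate    < b.TerminationDate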
It gets confusing pretty quick...and it drives me nuts daily. Anyway, that is the ugly process that I am trying to do for each record.
I tried using a set-based query instead of a loop...but since I have to tie in the fields necessary to get the appropriate coverage code...including timestamp and SSN...any record set I came up with was still unique to a single record...
If there is a better way..I am all ears...Ideally I could redesign the whole system and implement some normalization....but I am stuck with the current system for now.
And for the most part...I agree that Cursors are Bad...and there is almost always a better way to do things...but I tried my normal method of not using a cursor...and it wasn't too productive...so I figured I'd give it a shot...
Thanks for your suggestion of another method...I did try that...before I used the cursor. It was crunching the data....but still took a long time.
I'll keep plugging away...and trying new ways....
Thanks again
April 15, 2008 at 7:24 pm
[Quote]there are no PK's to join on[/Quote]
So, there is no relational database really...
And because of this, database tools are not applicable.
All you've got is a set of flat files (don't be fooled by the name "table", without keys and constraints it's just flat files) and the best thing you can do is to treat it as flat files.
That means - read a line from a file, scan another file in order to find the relevant line (lines), update the original line, loop.
Yes, cursor, and it will ALWAYS take forever.
Or -
you may try to make a database from your data dump and use RDBMS tools to get the same result in no time.
Which way to choose - it's up to you.
_____________
Code for TallyGenerator
April 15, 2008 at 7:36 pm
One thought - since you have the claim number and timestamp - you've got a reasonable start at a PK you just haven't used yet. Turn that into your PK (or at least see if it's unique).
Using that as an index at the very least - you could create a table of unique claims and max timestamp per claim, which then gets you your coverage code. Tie that back in and tackle the pay structure. I've been on the periphery of that nonsense for years, so I have some rough idea of what you're having to take on, but like Sergiy is saying - now's the time to take this sucker by the horns and tame it into an actual relational DB. The coding will likely take a little longer, but putting it off will only increase your pain in the end....
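Just as a starting point, something along these lines (I'm guessing at every name in here, of course):
-- STARTING POINT ONLY - all of these names are guesses at your schema.
-- Give yourself something to seek on first... if this combination turns out
-- to be unique, it's your PK-in-waiting on the transaction table.
CREATE INDEX IX_Transactions_Claim_TS
    ON dbo.Transactions (ClaimNumber, TransTimestamp)

-- ...then one pass builds "latest transaction per claim", which is where
-- your coverage code lives.
SELECT  ClaimNumber,
        MAX(TransTimestamp) AS LastTransTS
  INTO  dbo.LastClaimTrans
  FROM  dbo.Transactions
 GROUP BY ClaimNumber

-- One row per claim by construction, so this one is safe to make unique.
CREATE UNIQUE CLUSTERED INDEX UIX_LastClaimTrans
    ON dbo.LastClaimTrans (ClaimNumber)
Join dbo.LastClaimTrans back to the transaction table on both columns to pull the coverage code set-based, then go after the pay structure from there.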
Never mind that it likely came from one of those god-awful hierarchical disasters.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 15, 2008 at 7:38 pm
exactly...a bunch of completely denormalized "tables" (term used loosely)
management won't let me change the existing schema...since there is so much already built on top of the existing mess.
So I'm back to looking for another job because the thought of spending years at a place like this numbs me.
I appreciate everyone's comments...it at least validates my initial thoughts...there isn't a single magic way of doing what I want...given the limitations of the current design.
Since it's only a one time update...I'll just have to plow through the data...as slow as it is...at least I don't ever have to do it again...and it has only reinforced my firm beliefs in normalization and relational data.
thanks again
April 15, 2008 at 7:50 pm
Matt,
Thanks...that is another great idea...I'll work with that tomorrow...see if I can make it work.
I'm fine with all the denormalized tables...for reporting purposes...it makes the data retrieval really quick...but only if this is in addition to a normalized schema...taking this job was a huge step back in time...like before relational tables were conceived...which happens to be about the time that the forefathers of this damn system seemed to be in their prime...
Thanks again...I'll see if the claimnumber and timestamp will work...it just may be the answer I am looking for!
Joe
April 15, 2008 at 8:16 pm
[Quote]Since it's only a one time update...[/Quote]
Don't be fooled by such promises.
It never happens.
Don't those other tables get updated from time to time?
You're gonna need to update that "main" table with new data accordingly, right?
And other reports will require another kind of denormalisation, with regular updates to follow.
And so on, until the server cannot finish the overnight updates overnight.
Run away as quick as you can.
Print my previous comment (and this one, probably ;)) and leave them.
When they see it coming they'll probably call you - then charge them really unfairly. 😎
_____________
Code for TallyGenerator
April 15, 2008 at 9:29 pm
isn't that the sad truth....
damn them once for creating the whole mess...
damn me for not jumping ship as soon as I realized what a steaming pile I jumped head first into....
My days are numbered...
April 16, 2008 at 1:22 am
Hey Highway! Don't give up, yet... turn this into a learning experience because Sergiy is correct... no matter where you go, you're going to run into steaming piles of rat patooti just like in this job.
Take a look at the following... it's an example of what I use to delete millions of rows from a table... as I suggested before, it doesn't do it all at once... it "crawls" through the delete in sets of rows instead of using the full RBAR of a cursor. Modify it to do your update! It also generates some test data you can play with... READ THE COMMENTS to learn how it works!
--===== If the test table exists, drop it
IF OBJECT_ID('dbo.JBMTestDetail','U') IS NOT NULL
DROP TABLE dbo.JBMTestDetail
GO
--===== Create and populate a 1,000,000 row test table.
-- Column "ConnID" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
-- Column "Key1" has a range of 1 to 100,000 non-unique numbers stored as VARCHAR(30)
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "Time_Stamp" has a range of >=01/01/2005 and <01/01/2015 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
SELECT TOP 1000000
ConnID = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(30)),''), --(10 rows per connection)
Key1 = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%100000+1 AS VARCHAR(20)),''), --just to test index with
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
Time_Stamp = ISNULL(CAST(RAND(CHECKSUM(NEWID()))*3652.0+38351.0 AS DATETIME),0),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTestDetail
FROM Master.dbo.SysColumns t1,
Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
--===== Create indexes similar to Troy's
CREATE CLUSTERED INDEX IXC_JBMTestDetail_Time_Stamp_ConnID ON dbo.JBMTestDetail (Time_Stamp,ConnID)
CREATE NONCLUSTERED INDEX IX_JBMTestDetail_ConnID_Key1 ON dbo.JBMTestDetail (ConnID,Key1)
GO
--===== Setup to measure performance...
SET STATISTICS TIME ON
--===== Define the cutoff date with a time of "midnight" or, if you will,
-- define the cutoff date with no time so we only delete whole days.
DECLARE @CutoffDate DATETIME
SELECT @CutoffDate = DATEADD(dd,DATEDIFF(dd,0,GETDATE()),-94)
--===== Limit all further queries, including deletes, to 25,000 rows
-- (about 1 second worth of deletes, like I said before)
SET ROWCOUNT 25000
--===== See if any rows qualify for deletion. If even just one exists,
-- then there's work to do and @@ROWCOUNT will be > 0.
-- DO NOT PUT ANY CODE BETWEEN THIS SELECT AND THE WHILE LOOP OR
-- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP
SELECT TOP 1 1 FROM dbo.JBMTestDetail WHERE Time_Stamp < @CutoffDate
--===== If the rowcount from the above is greater than 0,
-- then delete 25,000 rows at a time until there's nothing
-- left to delete
WHILE @@ROWCOUNT > 0
BEGIN
--===== Just a "marker" to separate the loop in the output
PRINT REPLICATE('=',78)
--===== This delay gives other processes breathing room
WAITFOR DELAY '00:00:10'
--===== Do the delete. Will be limited by the SET ROWCOUNT above.
-- DO NOT PUT ANY CODE BETWEEN THIS DELETE AND THE "END" OR
-- YOU'LL MESS UP THE ROWCOUNT FOR THE WHILE LOOP.
DELETE dbo.JBMTestDetail WITH (TABLOCKX)
WHERE Time_Stamp < @CutoffDate
END
--===== Restore the ability to process more than 25,000 rows
SET ROWCOUNT 0
SELECT 1000000 - COUNT(*) FROM dbo.JBMTestDetail --Number of rows deleted
--Jeff Moden
Change is inevitable... Change for the better is not.
April 16, 2008 at 8:06 am
Thanks a lot!
I will work with it today...I have plenty of time...since I am letting the previous code crunch data...I'm sure it will take at least all day...
:>)
I love this place...at work I am the only SQL guy...and it is so helpful to have a place of other SQL professionals to bounce ideas off of...
Much appreciated.
Joe
April 16, 2008 at 7:17 pm
You bet... let us know how things work out for you...
--Jeff Moden
Change is inevitable... Change for the better is not.
April 17, 2008 at 7:32 am
Well,
I took the logic you used in your code...I have another field...cycledate...which happens to be indexed...so I built an outer loop structure...and for each cycledate...I process those claims...it's not exactly using the RowCount 25000 method...but I do limit the claims I'm processing to 2 weeks worth...so it helps a bit.
Since I didn't see any improvement from the cursor...I trashed it...and went back to my first choice of looping through a copy of the original table populated with only the current cycle's claims.
Then for each of those records...I do my join to the transactions...get my new values...and insert into a final table...
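The overall shape of the loop is roughly this (names changed as usual...and the per-claim work is still row-by-row inside each pass for now):
-- ROUGH SHAPE ONLY...names changed, and the per-claim work inside each
-- pass is still row-by-row for now.
DECLARE @CycleDate DATETIME
 SELECT @CycleDate = MIN(CycleDate) FROM dbo.Claims

WHILE @CycleDate IS NOT NULL
 BEGIN
        -- Pull about two weeks of claims into the work table
        TRUNCATE TABLE dbo.ClaimsWork

        INSERT INTO dbo.ClaimsWork (ClaimNumber, ClaimDate, CycleDate)
        SELECT  ClaimNumber, ClaimDate, CycleDate
          FROM  dbo.Claims
         WHERE  CycleDate >= @CycleDate
           AND  CycleDate <  DATEADD(dd, 14, @CycleDate)

        -- ...the transaction join / insert into the final table goes here...

        -- Move the window forward to the next cycle date
        SELECT @CycleDate = MIN(CycleDate)
          FROM dbo.Claims
         WHERE CycleDate >= DATEADD(dd, 14, @CycleDate)
 END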
it's still not set-based...but the good news is that now management sees why their design sucks...and I have been given the green light to begin working on normalization...
I guess you do have to be careful of what you wish for!!!
Thanks again all,
Joe
April 17, 2008 at 8:07 am
Highwayman (4/17/2008)
but the good news is that now management sees why their design sucks...and I have been given the green light to begin working on normalization...
Then it all worked out! Congratulations on your new full time job. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.