June 23, 2013 at 9:00 am
I have a recurring problem I'm trying to help solve where I work. We take a lot of historical line-item transactional data and move it through various phases within SQL Server: Raw (unedited), Staging (for customer review), and Production (after approval of Staging).
While the data is large in volume, we're not talking about Google or Twitter volumes of data. I'm talking about millions of rows of charge detail history for medical software.
What I'm trying to understand is how best to get things from point A to point B with minimal time and effort. Part of the issue is transactional (the volume of executions running simultaneously on the same server) and part of it is syntax (am I writing these statements as efficiently as possible?).
What I'd love is some overall help understanding where I might gain more efficiency in some of the basics. An example I'm working through this weekend is the following:
Part of this detail history involves first cross-checking field lengths, etc., to ensure accuracy before moving information into Staging. One of the final steps for me is to concatenate three fields into the final patient account number on all the detail line items in history. When I first wrote the statement, it literally looked like the following:
UPDATE temp_raw_chargedetail
   SET EncounterNumber = Field1 + Field2 + Field3
One of my coworkers (former DBA/SQL guru) rewrote it quickly for me to the following recently:
SET ROWCOUNT 100000;
UPDATE temp_raw_chargedetail
   SET EncounterNumber = Field1 + Field2 + Field3
 WHERE EncounterNumber IS NULL;
WHILE @@ROWCOUNT > 0
BEGIN
    UPDATE temp_raw_chargedetail
       SET EncounterNumber = Field1 + Field2 + Field3
     WHERE EncounterNumber IS NULL;
END;
SET ROWCOUNT 0;
In initial tests it runs quicker overall, but this latter statement alone has been running against a 27-million-row set since 2 pm on Friday afternoon, and here it is 10 am on Sunday. Nothing else is running on this server right now other than SELECT statements I've been using to see where other databases/accounts stand. I just have a feeling something else could be done to mitigate this timing, because it feels like I'm waiting a ridiculous amount of time for something seemingly superficial to complete.
Any initial thoughts? I'm happy to provide additional details to start narrowing down how I could be thinking things through more efficiently. Thanks!
June 23, 2013 at 6:58 pm
Most likely, the "Devil is in the Data". Are the Field1, Field2, and Field3 columns constrained as NOT NULL? If not, that's likely your problem, because under the default connection setting (SET CONCAT_NULL_YIELDS_NULL ON, and you should NOT change it) concatenating a NULL yields NULL. If even 1 row has a NULL in any of the 3 concatenated columns, EncounterNumber stays NULL for that row, the WHERE clause keeps matching it, and the loop will run forever because it will ALWAYS find at least 1 NULL row.
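You can see that behavior in isolation with a couple of throwaway SELECTs (the literal values here are made-up examples, not your data):

```sql
-- Under the default SET CONCAT_NULL_YIELDS_NULL ON, a single NULL operand
-- makes the whole concatenation NULL.
SELECT 'A123' + NULL + '001';                         -- returns NULL

-- Wrapping the concatenation in ISNULL substitutes a sentinel value instead,
-- so the row can no longer satisfy "WHERE EncounterNumber IS NULL".
SELECT ISNULL('A123' + NULL + '001', '***ERROR***');  -- returns ***ERROR***
```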
Here's how I would rewrite the query, especially since SET ROWCOUNT has been deprecated for INSERT/UPDATE/DELETE and we need to start practicing the alternatives (UPDATE TOP, in this case). As a sidebar, this didn't take much longer to write than the original (although it IS untested, because I don't have your data), and the original told you nothing of the progress, or lack of progress, it was making. It's always worth spending the little extra time on things such as this so you don't have to guess. 😉
I've also included a "trick" to make it a bit faster still. As usual, details are in the comments in the code.
--===== Declare and preset some obviously named variables
DECLARE @LoopSize     INT,
        @LoopCount    INT,
        @RowsAffected INT
 SELECT @LoopSize     = 100000,
        @LoopCount    = 1,
        @RowsAffected = @LoopSize --Primes the loop for the first iteration
;
--===== If we just updated the max number of rows, then continue to update
     -- according to the @LoopSize. If we ever update fewer rows than the
     -- @LoopSize, then we just did the final update. Checking for
     -- @RowsAffected > 0 instead would cause an extra and very long iteration,
     -- because it takes longer to find nothing in 27 million rows than it does
     -- to find a TOP number of rows. The last iteration will be long enough as
     -- it is. No need for an extra-long iteration that isn't going to find
     -- anything.
     -- It's also nice to have a little feedback so you know what's going on.
     -- That's what the RAISERRORs are being used for. It's a trick to print the
     -- progress. Normal PRINT statements might not display until the end of the
     -- run.
     -- Note that @@ROWCOUNT must be captured into a variable immediately after
     -- the UPDATE, because nearly every statement (including the RAISERROR and
     -- the SELECT that increments @LoopCount) resets it.
  WHILE @RowsAffected = @LoopSize
  BEGIN ---------------------------------------------------------------------------------------------------------------
        RAISERROR ('Working on iteration #%u...',0,1,@LoopCount) WITH NOWAIT;
         UPDATE TOP (@LoopSize) dbo.temp_raw_chargedetail --Best practice, 2 part naming
            SET EncounterNumber = ISNULL(Field1+Field2+Field3,'***ERROR***') --I think this will fix your problem
          WHERE EncounterNumber IS NULL
        ;
         SELECT @RowsAffected = @@ROWCOUNT --Capture before anything else resets it
        ;
        RAISERROR ('%u Rows Affected...',0,1,@RowsAffected) WITH NOWAIT;
        RAISERROR ('----------------------------',0,1) WITH NOWAIT;
         SELECT @LoopCount = @LoopCount + 1
        ;
    END ---------------------------------------------------------------------------------------------------------------
;
--===== List any errors
RAISERROR ('--------------------------------------------------------------',0,1) WITH NOWAIT;
RAISERROR ('The following rows have NULLs for Field1, Field2, or Field3 or',0,1) WITH NOWAIT;
RAISERROR ('failed to process for another reason.',0,1) WITH NOWAIT;
RAISERROR ('--------------------------------------------------------------',0,1) WITH NOWAIT;
SELECT *
FROM dbo.temp_raw_chargedetail
WHERE EncounterNumber = '***ERROR***'
OR EncounterNumber IS NULL
;
--===== Notify the operator that we're finally done!
RAISERROR ('****************************',0,1) WITH NOWAIT;
RAISERROR ('*** RUN COMPLETE ***',0,1) WITH NOWAIT;
RAISERROR ('****************************',0,1) WITH NOWAIT;
Of course, a much better trick for the Dev and Staging environments might be to use a persisted computed column for this, instead. That way, when you populate the 3 other columns, the EncounterNumber will "auto-magically" be populated.
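As an untested sketch (assuming the table and column names from above, and that Field1-3 are already character types), that might look like this:

```sql
--===== Untested sketch: drop the manually populated column and replace it
     -- with a persisted computed column so it's maintained automatically.
ALTER TABLE dbo.temp_raw_chargedetail
 DROP COLUMN EncounterNumber
;
ALTER TABLE dbo.temp_raw_chargedetail
  ADD EncounterNumber AS (Field1 + Field2 + Field3) PERSISTED
;
```

Note that rows with a NULL in any of the three fields would still compute a NULL EncounterNumber unless you wrap the expression in ISNULL as in the loop above.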
--Jeff Moden
Change is inevitable... Change for the better is not.