June 19, 2018 at 12:03 am
Hi
I am tasked with building a stored procedure which performs inserts, merges, etc.
What I have now is as follows:
create procedure sp_merge_data
as
begin
set xact_abort on
merge into dest_lookup_table select lookup_values from stg_table;
merge into dest_table select * from stg_table
when matched then ....
when not matched then...;
end;
This is working as expected. However, I was thinking of logging information into a log table so I can track the status of the procedure as it runs.
create procedure sp_merge_data
as
begin
set xact_abort on
declare @batch_id int, @row_count int;
set @batch_id = cast(current_timestamp as int);
BEGIN TRY
BEGIN TRANSACTION /* Begin transaction of the main block */
BEGIN TRANSACTION
--Transaction starts for logging values into another table..
insert into log_table values(@batch_id,'Starting merge dest_lookup', current_timestamp, null, 0);
COMMIT TRANSACTION
-- do the merge into the destination table...
merge into dest_lookup_table select lookup_values from stg_table;
set @row_count = @@ROWCOUNT; -- capture the merge row count before the next BEGIN TRANSACTION resets @@ROWCOUNT
BEGIN TRANSACTION
-- update the row count of the affected statement and also the end time of the merge
update log_table
set end_time = current_timestamp
,row_count = @row_count
where batch_id = @batch_id;
COMMIT TRANSACTION
--Start the next logging
set @batch_id = cast(current_timestamp as int);
BEGIN TRANSACTION
--Transaction starts for logging values into log table .
insert into log_table values(@batch_id,'Starting merge dest_table', current_timestamp, null, 0);
COMMIT TRANSACTION
--Perform the merge into another destination table..
merge into dest_table select * from stg_table
when matched then ....
when not matched then...;
set @row_count = @@ROWCOUNT; -- capture the merge row count before the next BEGIN TRANSACTION resets @@ROWCOUNT
BEGIN TRANSACTION
-- update the row count of the affected statement and also the end time of the merge
update log_table
set end_time = current_timestamp
,row_count = @row_count
where batch_id = @batch_id;
COMMIT TRANSACTION
COMMIT TRANSACTION /* If all goes well, commit the main transaction opened at the start of the TRY block */
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
DECLARE @error_procedure nvarchar(128) = ERROR_PROCEDURE();
RAISERROR (@error_procedure, 16, 1);
END CATCH
end;
I wanted to check if this is a good approach to logging the progress of a stored procedure, or if there are any better alternatives.
Thanks
George
June 19, 2018 at 1:04 am
I've worked in places where similar audit methods have been used and, to be honest, they don't provide a great deal of useful information. You know when the code was executed and how many rows it affected, but that's it - not what changes it actually made.
I found a more useful method was using the OUTPUT clause to write the data to an audit table, with timestamp and an indication of the action taken (INSERT/UPDATE/DELETE). Then the life-cycle of the row can be extracted, showing all data changes.
An example of using MERGE with OUTPUT is here, because I could never remember that syntax for more than five minutes.
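Roughly, the pattern looks like this (a minimal sketch only - the join column, some_value and the audit table dbo.dest_table_audit are invented for illustration):
MERGE INTO dbo.dest_table AS tgt
USING dbo.stg_table AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.some_value = src.some_value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, some_value) VALUES (src.id, src.some_value)
OUTPUT $action,              -- 'INSERT', 'UPDATE' or 'DELETE'
       inserted.id,
       deleted.some_value,   -- value before the change (NULL for an INSERT)
       inserted.some_value,  -- value after the change
       SYSDATETIME()         -- when the change happened
INTO dbo.dest_table_audit (merge_action, id, old_value, new_value, audit_time);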
June 19, 2018 at 1:26 am
BrainDonor - Tuesday, June 19, 2018 1:04 AM:
I've worked in places where similar audit methods have been used and, to be honest, they don't provide a great deal of useful information. You know when the code was executed and how many rows it affected, but that's it - not what changes it actually made.
I found a more useful method was using the OUTPUT clause to write the data to an audit table, with timestamp and an indication of the action taken (INSERT/UPDATE/DELETE). Then the life-cycle of the row can be extracted, showing all data changes.
An example of using MERGE with OUTPUT is here, because I could never remember that syntax for more than five minutes.
Thanks for your reply. I learned something new here today -> the OUTPUT clause.
I am tracking the changes to the rows via the data lineage fields (created_by, created_on, updated_by, updated_on) in the destination tables of the merge SQL.
Here in the procedure, what I am trying to track down is: if there is any error in the procedure, I should get to know which portion of the procedure caused the exception.
E.g. in the above stored procedure:
If the 1st merge has succeeded I would have an entry in the log_table, and if the stored procedure errored out, then I would know that the issue was caused between the 1st merge and the COMMIT portion of the stored procedure.
The stored procedure is being called from a 3rd party app during a workflow process, and they seem to have limited ability/arcane configuration settings for successfully throwing back the error message of a stored procedure to the caller. So I was planning to insert the exceptions into another table as well. Not sure if I am going down the right path here though.
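For the exception part, something along these lines in the CATCH block is what I had in mind (just a rough sketch - dbo.error_log and its columns are made up, and THROW assumes SQL Server 2012 or later):
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;   -- undo the data changes first
    -- this insert runs outside the rolled-back transaction, so it sticks
    INSERT INTO dbo.error_log (batch_id, error_procedure, error_line, error_message, logged_at)
    VALUES (@batch_id, ERROR_PROCEDURE(), ERROR_LINE(), ERROR_MESSAGE(), SYSDATETIME());
    THROW;                      -- re-raise so the caller still sees the original error
END CATCH;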
June 19, 2018 at 2:29 am
Only concern I have is that it looks like a big transaction. Will there be multiple copies of this running at the same time? If so, you could see a lot of blocking across all the tables affected.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
June 19, 2018 at 10:12 am
After the ROLLBACK you could update the log with the error in the CATCH block. Using transaction variables (named transactions) could help you quickly identify where in the procedure an error occurs, and logging details such as ERROR_MESSAGE() records more accurately what happened, which helps you catch and correct errors.
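As a rough illustration of that idea (the transaction names, the step variable and dbo.error_log are all made up):
DECLARE @current_step varchar(100) = 'initialising';
BEGIN TRY
    BEGIN TRANSACTION MergeLookup;               -- named transaction
    SET @current_step = 'merge into dest_lookup_table';
    -- ... merge into dest_lookup_table here ...
    COMMIT TRANSACTION MergeLookup;

    BEGIN TRANSACTION MergeDest;
    SET @current_step = 'merge into dest_table';
    -- ... merge into dest_table here ...
    COMMIT TRANSACTION MergeDest;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    -- log which step failed and why, after the rollback
    INSERT INTO dbo.error_log (step_name, error_message, logged_at)
    VALUES (@current_step, ERROR_MESSAGE(), SYSDATETIME());
END CATCH;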
June 19, 2018 at 3:03 pm
Heh.... "Updated_By". One of the most commonly overlooked reasons for massive fragmentation both physical and logical.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 19, 2018 at 9:14 pm
Grant Fritchey - Tuesday, June 19, 2018 2:29 AM:
Only concern I have is that it looks like a big transaction. Will there be multiple copies of this running at the same time? If so, you could see a lot of blocking across all the tables affected.
Thanks for your feedback.
The merge completes quickly, in less than a second, and can update up to a max of 2000 rows.
The stored procedure is called from the workflow only, and there is not much chance of someone running the process while the workflow is already running.
June 19, 2018 at 9:15 pm
Jeff Moden - Tuesday, June 19, 2018 3:03 PM:
Heh.... "Updated_By". One of the most commonly overlooked reasons for massive fragmentation both physical and logical.
Could you please explain a bit more how that occurs?
June 20, 2018 at 9:22 am
george_at_sql - Tuesday, June 19, 2018 9:15 PM:
Jeff Moden - Tuesday, June 19, 2018 3:03 PM:
Heh.... "Updated_By". One of the most commonly overlooked reasons for massive fragmentation both physical and logical.
Could you please explain a bit more how that occurs?
When you "update" a table with a varchar or nvarchar column getting changed in such a way as to increase the length, the database has to shift the data to accommodate the larger value size. That could mean that you run out of available space on a "page" within the database file, and thus be forced to split some of the data for the row between database pages, and thus a "page split" occurs, which is the cause of what Jeff is talking about. He'll refer to any such update as an "expansive update". Just know that page splits can be bad, especially when they start becoming a regular event. The reason people overlook it is because they assume that just updating such a limited size column can't cause a problem, and that's just not the case.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
June 20, 2018 at 9:48 am
george_at_sql - Tuesday, June 19, 2018 9:15 PM:
Jeff Moden - Tuesday, June 19, 2018 3:03 PM:
Heh.... "Updated_By". One of the most commonly overlooked reasons for massive fragmentation both physical and logical.
Could you please explain a bit more how that occurs?
Sure.
Let's say that you've followed the recommendations of some of the giants in the industry (especially Kimberly Tripp, in this particular case) and, because you knew the table was going to become very large (say, 200GB in size), you elected to kill two birds with one stone by assigning the clustered index not only to support some of the more common queries, but to make it so that you wouldn't have to waste a lot of space using, say, an 80% Fill Factor, and so you also based the clustered index on an ever-increasing and very narrow value such as an IDENTITY or DateTime column. Everything SHOULD be wonderful there. New rows on the clustered index (which contains ALL columns of the data and IS the table) will be added to the logical end of the table and there will be absolutely no fragmentation caused in the clustered index.
Let's even say that the table only has fixed width data types that will be updated on some certain basis. That alone will not cause any fragmentation because the width of the rows in your nicely laid out, 100% FILL FACTOR, zero-maintenance-required, clustered index won't expand when you update such fixed width columns even if they start out as NULL. Even the VARCHAR(xx) column used by the "Created_BY" column won't cause fragmentation because the original value is never changed and so won't cause any row contraction or expansion (especially expansion).
Ok... so you have this wonderful table with 200GB of data (~26,214,400 pages) that contains 10 years (for example) of data and the clustered index is correctly based on an ever increasing value and so should never become fragmented. How is it then, that when you run your index maintenance, that you find this table has a level of fragmentation high enough to warrant defragmentation? The answer might be that someone changed a "status" column (for example) to say the row had been processed (or whatever). Since the column was correctly designed to be a fixed width column, how is it possible that changing something like that would cause such a large amount of fragmentation on such a large table?
The answer is because the "Modified_BY" column was originally NULL during the original INSERT and when someone changes the "status", they also enter a value in the "Modified_By" column and that causes expansion of the row. Now, one row being updated per page might not be a problem but, if the nature is to eventually update the "status" (again, just an example but a common one) of every row, then even with a 70% FILL FACTOR and especially with the highly sought after 100% FILL FACTOR for large tables, you're going to get page splits (potentially at massive scales, which happened where I worked and caused serious blocking and huge amount of log file activity after defragmentation) because of that damned "Modified_By" column being changed from NULL to something more substantial.
With a large table, this presents exceptional problems because, usually, such changes only occur on rows for the last 30 days or so. Most of the table is in perfect condition so far as a total lack of fragmentation goes and yet you quickly hit the 10% fragmentation mark every week or so (maybe even quicker) because of the updates being done after the more recent inserts (think "exploded tip broom"). That requires you to do (according to current "best practices") an index REORGANIZE, which takes a relatively long time because it's ALWAYS single threaded and fully logged. Other "hidden" problems occur because the extra logging causes your log files to be unnecessarily large and your log file backups take longer and occupy more space and any restores you might have to do will also take longer because of it.
The truly insidious thing that REORGANIZE does is that it will remove any naturally formed free space by causing any pages that are less full than the FILL FACTOR to be brought up to the level of page fullness expressed by the FILL FACTOR and that makes the next round of page splits occur at an even more frantic rate and, in a highly ironic "Catch 22", will require you to do index maintenance more often!
So, what's the fix for the problems that the "Modified_By" column causes? There are many.
1. You could convert the table to either a temporally Partitioned Table or Partitioned View (I actually prefer Partitioned Views but that would be article-worthy and too long to post here) so that the older months would eventually be "packed" by a final defrag and the most recent months (the ones that will need it) could be defragged separately. The problem with that is that unless you fix the cause of the (in such cases) totally unnecessary fragmentation, you're still going to get some totally unnecessary blocking while the pages split (which is REALLY bad the more narrow the rows and Paul Randal proved that) along with totally unnecessary and expensive log file usage all of which results in totally unnecessary longer duration queries.
2. You could simply not use the poor-man's method of inline auditing that uses a "Modified_BY" column, but convincing people to give up that which they hold sacred would be like trying to make a pig fly with a hen that has teeth riding on its back.
3. You could build a "sister" table to hold the Modified_BY/ON columns (the Modified_ON column is usually a fixed width data type and so wouldn't be a problem, but it would go along for the ride) but that would require special code or triggers, etc., and is totally not worth the effort.
4. You could change the Modified_BY column to a CHAR() instead of a VARCHAR() but that has the potential of wasting a lot of space because it would always need to be wide enough to handle unexpected things. A lot of people use VARCHAR(50) to handle such eventualities and converting it to a CHAR(50) on a multi-million row table would cause a huge amount of wasted space if the typical Modified_By value was only 10 or 20 characters.
5. With the idea of a fixed width field being valuable in the prevention of fragmentation, you could use an integer value and maintain a separate table of possible Modified_BY values, possibly having a trigger add any new values but, again, that's a fairly large complication for new tables and would require a large number of code changes on the front end and within SQL Server to accomplish on existing large tables.
6. And this is the method that I'm beginning to retrofit my large tables with... Instead of allowing NULLs in the Modified_BY column, force it to use a DEFAULT and add a constraint to be NOT NULL.
Unfortunately, all of those methods can require code changes, not to mention a paradigm shift in your coding standards, but #6 above can greatly limit the damage of a variable width MODIFIED_BY column on such "INSERT then modify" tables. Don't start the Modified_By column out being NULL or blank (usually a zero length empty string rather than blank). Instead, add a default to the Modified_BY column. What should that default be?
It's a real shame that you can't use a column name in the creation of a DEFAULT because I'd default the Modified_BY column to whatever the CREATED_BY column contained. Typically, the Created_BY column will set the width trend for possible future values of the Modified_By column. So you need to do a calculation for each table you want to do this to, instead. Remember, this is for tables that regularly suffer updates where the Modified_By column will be updated. I wouldn't do this to a table that rarely suffers such a modification.
1. You have to first check to see if any code that uses the table looks for NULL to determine if the row was modified or not, but it's definitely worth doing for large tables that suffer "Expansive" updates (which are expensive for all the reasons previously stated). Once you've determined that either there is no code affected or you have made arrangements to change the code that is, proceed to Step 2. Of course, that's for existing tables. If you know a new table is going to get large, design the code and the Modified_BY column to use a DEFAULT to begin with.
2. For existing tables, select the average DATALENGTH() of the column. This works correctly for VARCHAR() and NVARCHAR() because it counts the bytes used by both rather than the number of characters and it's the bytes the values occupy that are important here. For new tables, check on similar tables to find the average DATALENGTH() of their Modified_BY columns.
3. Add a DEFAULT consisting of the same number of spaces as the average DATALENGTH() that you determined above. This won't affect any possible joins because trailing spaces are ignored by default unless you made the serious mistake of messing with the default ANSI settings for spaces or nullability.
4. To add the NOT NULL constraint, you'll first need to update any NULLs to the new default. That's going to cause massive page splits in existing tables and you WILL need to defrag (REBUILD) the Clustered Index and any affected Non-Clustered indexes when you're done. Be VERY aware that will temporarily double the space used by the table and then return half of that as free space in the MDF/NDF file. Also be very aware that will be fully logged unless you use the BULK LOGGED recovery model (or SIMPLE, which would require at least a DIFF backup to reestablish the log chain if you value your data). To prevent blowing out the MDF file for really large tables, you can do the trick of building a new file group with its own file and rebuilding the index by doing a CREATE INDEX with DROP_EXISTING to move it to the new file group and then do it again to move it back to its original file group. When you're done, don't forget to drop the temporary file group you made.
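Put together, steps 2 through 4 look roughly like this (a sketch only, assuming a hypothetical dbo.TheTable with a Modified_BY VARCHAR(50) column and an average DATALENGTH that comes back as 20 bytes):
-- Step 2: find the average byte length of the existing values
SELECT AVG(DATALENGTH(Modified_BY)) AS avg_bytes
FROM dbo.TheTable
WHERE Modified_BY IS NOT NULL;

-- Step 3: add a DEFAULT of that many spaces (here assumed to be 20)
ALTER TABLE dbo.TheTable
    ADD CONSTRAINT DF_TheTable_Modified_BY DEFAULT (SPACE(20)) FOR Modified_BY;

-- Step 4: replace the existing NULLs, make the column NOT NULL, then rebuild
UPDATE dbo.TheTable
   SET Modified_BY = SPACE(20)
 WHERE Modified_BY IS NULL;

ALTER TABLE dbo.TheTable
    ALTER COLUMN Modified_BY VARCHAR(50) NOT NULL;

ALTER INDEX ALL ON dbo.TheTable REBUILD;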
Once all that is done, you might ask "What should I change the affected code to"? It's simple. Here are three simple examples.
--===== Finds only non-blank items (it would also exclude NULLs, but that's moot because the column is now NOT NULL)
SELECT columnlist
FROM dbo.TheTable
WHERE Modified_BY > ''
;
--===== Finds only BLANK items regardless of the number of spaces.
SELECT columnlist
FROM dbo.TheTable
WHERE Modified_BY = ''
;
--===== Finds only BLANK items regardless of the number of spaces.
-- (Alternative to the above)
SELECT columnlist
FROM dbo.TheTable
WHERE Modified_BY !> ''
;
Of course, a MUCH BETTER thing to do would be to look for dates/times in the Modified_ON column instead, because that might not require any code changes and would allow the use of NULLs without causing row expansion when it is eventually populated.
The same methods can be applied for other VARCHAR() columns that you know will eventually be updated after the original insert.
Hopefully, that answers your question about how Updated_BY or Modified_BY (or whatever they may be called) can cause both massive logical and physical fragmentation.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 20, 2018 at 10:01 am
Heh... holy moly... I think I just wrote a bloody article on the subject. Guess I'll convert it and submit it. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
June 20, 2018 at 10:02 am
sgmunson - Wednesday, June 20, 2018 9:22 AM:
george_at_sql - Tuesday, June 19, 2018 9:15 PM:
Jeff Moden - Tuesday, June 19, 2018 3:03 PM:
Heh.... "Updated_By". One of the most commonly overlooked reasons for massive fragmentation both physical and logical.
Could you please explain a bit more how that occurs?
When you "update" a table with a varchar or nvarchar column getting changed in such a way as to increase the length, the database has to shift the data to accommodate the larger value size. That could mean that you run out of available space on a "page" within the database file, and thus be forced to split some of the data for the row between database pages, and thus a "page split" occurs, which is the cause of what Jeff is talking about. He'll refer to any such update as an "expansive update". Just know that page splits can be bad, especially when they start becoming a regular event. The reason people overlook it is because they assume that just updating such a limited size column can't cause a problem, and that's just not the case.
You, good Sir, are absolutely correct!
--Jeff Moden
Change is inevitable... Change for the better is not.
June 20, 2018 at 10:56 am
Grant Fritchey - Tuesday, June 19, 2018 2:29 AM:
Only concern I have is that it looks like a big transaction. Will there be multiple copies of this running at the same time? If so, you could see a lot of blocking across all the tables affected.
Also of note, since you have one big transaction for the whole script, if it rolls back you will have lost anything you tried to log as well.
June 20, 2018 at 11:01 am
Chris Harshman - Wednesday, June 20, 2018 10:56 AM:
Grant Fritchey - Tuesday, June 19, 2018 2:29 AM:
Only concern I have is that it looks like a big transaction. Will there be multiple copies of this running at the same time? If so, you could see a lot of blocking across all the tables affected.
Also of note, since you have one big transaction for the whole script, if it rolls back you will have lost anything you tried to log as well.
There's a trick to that. Store the log in a Table Valued Variable. Don't write to a permanent table until an error occurs or the run successfully completes.
Another trick that I use is to write it out to a TAB delimited file using xp_CmdShell. There's no chance of it being rolled back then, even if the machine were to suddenly lose power. If you need to have the log in a table when you're done, just BULK INSERT the data from the file.
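A rough sketch of the table variable approach (the columns and the permanent table dbo.proc_run_log are made-up names):
DECLARE @RunLog TABLE
(
    log_time  datetime2    NOT NULL DEFAULT SYSDATETIME(),
    message   varchar(500) NOT NULL,
    rows_done int          NULL
);

BEGIN TRY
    BEGIN TRANSACTION;

    -- ... first merge here ...
    INSERT INTO @RunLog (message, rows_done)
    VALUES ('merge dest_lookup_table complete', @@ROWCOUNT);  -- row count of the merge immediately above

    -- ... second merge here ...
    INSERT INTO @RunLog (message, rows_done)
    VALUES ('merge dest_table complete', @@ROWCOUNT);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;   -- the data changes are undone...
    INSERT INTO @RunLog (message) VALUES (LEFT(ERROR_MESSAGE(), 500));
END CATCH;

-- ...but table variables are not affected by the rollback, so the log survives
-- and can be copied to a permanent table at the end
INSERT INTO dbo.proc_run_log (log_time, message, rows_done)
SELECT log_time, message, rows_done
FROM @RunLog;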
--Jeff Moden
Change is inevitable... Change for the better is not.
June 20, 2018 at 2:20 pm
Jeff Moden - Wednesday, June 20, 2018 11:01 AM:
Chris Harshman - Wednesday, June 20, 2018 10:56 AM:
Grant Fritchey - Tuesday, June 19, 2018 2:29 AM:
Only concern I have is that it looks like a big transaction. Will there be multiple copies of this running at the same time? If so, you could see a lot of blocking across all the tables affected.
Also of note, since you have one big transaction for the whole script, if it rolls back you will have lost anything you tried to log as well.
There's a trick to that. Store the log in a Table Valued Variable. Don't write to a permanent table until an error occurs or the run successfully completes.
Another trick that I use is to write it out to a TAB delimited file using xp_CmdShell. There's no chance of it being rolled back then, even if the machine were to suddenly lose power. If you need to have the log in a table when you're done, just BULK INSERT the data from the file.
Another trick is to use a loopback linked server and write outside your session.
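Roughly like this (a sketch only - 'LOOPBACK', MyDatabase and dbo.WriteLogEntry are made-up names, and the provider name may differ on newer versions, e.g. MSOLEDBSQL):
-- One-time setup, run as a sysadmin
DECLARE @this_server sysname = @@SERVERNAME;

EXEC master.dbo.sp_addlinkedserver
     @server     = N'LOOPBACK',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = @this_server;

-- Allow remote procedure calls through the linked server
EXEC master.dbo.sp_serveroption
     @server   = N'LOOPBACK',
     @optname  = N'rpc out',
     @optvalue = N'true';

-- Stop calls through it from being promoted into the caller's distributed transaction
EXEC master.dbo.sp_serveroption
     @server   = N'LOOPBACK',
     @optname  = N'remote proc transaction promotion',
     @optvalue = N'false';

-- Inside the procedure, a logging call like this runs on its own session,
-- so it survives a rollback of the local transaction
EXEC LOOPBACK.MyDatabase.dbo.WriteLogEntry @message = 'Starting merge dest_table';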