July 27, 2011 at 11:21 pm
I am currently involved in developing a cost-sensitive solution that needs to run on either SQL Server 2008 Express or Standard edition, where time-ordered data is logged into the database as fast as possible.
Currently we are using MERGE with a table-valued parameter in a stored procedure to pass chunks of data (around 50K rows at a time) into the database, and this has so far given us acceptable performance. However, given the nature and potential size of the data, and depending on the customer's requirements, we want the ability to split the data between files for simpler maintenance/backup purposes, as well as to address potential performance issues once the table sizes become large.
Given we cannot use SQL Enterprise to implement table partitioning, I am attempting to use a partitioned view to achieve something similar. The majority of data written to the database is likely to be inserted, not updated. Also, the log data is used to generate KPIs whose evaluation time range can be defined by the user, but the majority of queries are likely to be within the last few months.
The main issue now is how to replace the MERGE in the stored procedure with more conventional UPSERT syntax, since MERGE doesn't appear to support partitioned views. I have found this article (http://bwunder.wordpress.com/2010/12/10/persisting-data-part-4-upsert-and-merge/), which is closest to what I want to do, but since I'm not a seasoned SQL expert I'm struggling to adapt the example for our purposes. Any assistance/comments/suggestions/pitfalls on how to modify the stored procedure to use a standard UPDATE/INSERT (or a totally different approach) would be highly appreciated. From the results posted in the link above, the performance of the UPSERT vs. the MERGE should be quite similar at the row counts we are using.
This is the basis of current stored procedure:
USE [LogDB]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
--CREATE TYPE dbo.DataLogTableType AS TABLE
-- ( LogID int,Value float,Attribute int,[TimeStamp] DateTime2) -- matches the structure of the LogData table
--
--LogData table has PK of [LogID] and [TimeStamp], no identity fields or other indexes.
CREATE PROCEDURE [dbo].[sp_Log_BulkInsertUpdate]
@IncomingLogData DataLogTableType READONLY
AS
MERGE LogData AS Target
USING
@IncomingLogData
AS SOURCE
ON (TARGET.TimeStamp = SOURCE.TimeStamp AND TARGET.LogID = SOURCE.LogID)
WHEN NOT MATCHED -- BY TARGET
THEN
INSERT(LogID,Value,Attribute,TimeStamp)
VALUES(SOURCE.LogID,SOURCE.Value,SOURCE.Attribute,SOURCE.TimeStamp)
WHEN MATCHED
THEN
UPDATE SET
TARGET.Value = SOURCE.Value,
TARGET.Attribute = SOURCE.Attribute;
SELECT @@ROWCOUNT
GO
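For reference, a call to the procedure looks something like this (the values below are made up purely for illustration):
DECLARE @batch dbo.DataLogTableType;
-- Load a small illustrative batch into the table type
INSERT @batch (LogID, Value, Attribute, [TimeStamp])
VALUES (1, 42.0, 7, '2011-01-15 10:00:00.000'),
       (2, 17.5, 3, '2011-01-15 10:00:01.000');
EXEC dbo.sp_Log_BulkInsertUpdate @IncomingLogData = @batch;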
The partitioning plan is to programmatically create time-ranged log tables on the fly (e.g. LogData20110101, LogData20110201, LogData20110301 if a monthly partition is chosen), update the partitioned view accordingly, and make the view (called LogDataView, for example) the target in the above stored procedure.
A sample LogDataView would be:
SELECT Value, Attribute, [TimeStamp], LogID
FROM LogData20110101
UNION ALL
SELECT Value, Attribute, [TimeStamp], LogID
FROM LogData20110201
UNION ALL
SELECT Value, Attribute, [TimeStamp], LogID
FROM LogData20110301
and each LogData table would have the appropriate CHECK constraint implemented on the [TimeStamp] column.
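For example, one member table and the view could be sketched like this (a sketch only; the constraint names and date ranges are illustrative, and [TimeStamp] must be part of the primary key for the view to be updatable):
CREATE TABLE dbo.LogData20110101
( LogID int NOT NULL,
  Value float NULL,
  Attribute int NULL,
  [TimeStamp] datetime2 NOT NULL
    CONSTRAINT CK_LogData20110101_TimeStamp
    CHECK ([TimeStamp] >= '20110101' AND [TimeStamp] < '20110201'),
  CONSTRAINT PK_LogData20110101 PRIMARY KEY (LogID, [TimeStamp])
);
GO
-- Repeat for LogData20110201, LogData20110301, ... then rebuild the view:
CREATE VIEW dbo.LogDataView
AS
SELECT Value, Attribute, [TimeStamp], LogID FROM dbo.LogData20110101
UNION ALL
SELECT Value, Attribute, [TimeStamp], LogID FROM dbo.LogData20110201
UNION ALL
SELECT Value, Attribute, [TimeStamp], LogID FROM dbo.LogData20110301;
GO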
July 29, 2011 at 12:57 pm
About the only way I've ever been able to get something like this to work is with INSTEAD OF triggers on the view: one for INSERT and another for UPDATE. The trigger figures out where to actually put the data.
I've never done this sort of thing on large data sets for the INSERT/UPDATE, so I don't know how well it scales up. I've done them for onesey/twosey type applications and they work just fine.
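For illustration, an INSTEAD OF INSERT trigger on the view might look something like this (a sketch only, assuming the monthly LogData tables and LogDataView described above; not tested):
CREATE TRIGGER dbo.trg_LogDataView_Insert
ON dbo.LogDataView
INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON;
  -- Route each inserted row to the member table whose date range it falls in
  INSERT dbo.LogData20110101 (LogID, Value, Attribute, [TimeStamp])
  SELECT LogID, Value, Attribute, [TimeStamp]
  FROM inserted
  WHERE [TimeStamp] >= '20110101' AND [TimeStamp] < '20110201';

  INSERT dbo.LogData20110201 (LogID, Value, Attribute, [TimeStamp])
  SELECT LogID, Value, Attribute, [TimeStamp]
  FROM inserted
  WHERE [TimeStamp] >= '20110201' AND [TimeStamp] < '20110301';
  -- ...one branch per member table (or build the statements dynamically)
END;
GO
An INSTEAD OF UPDATE trigger would follow the same pattern, joining inserted to the member tables on the key columns.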
Todd Fifield
July 29, 2011 at 3:39 pm
mmmhhh... a TimeStamp data type column may be a deal breaker; check this: http://msdn.microsoft.com/en-us/library/aa933141(v=SQL.80).aspx
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
July 30, 2011 at 7:33 am
Maybe something like this?
DECLARE @rowcount INT;
BEGIN TRANSACTION
UPDATE TARGET
SET TARGET.Value = SOURCE.Value
, TARGET.Attribute = SOURCE.Attribute
FROM @IncomingLogData SOURCE
JOIN dbo.LogData TARGET
ON SOURCE.LogID = TARGET.LogID
AND SOURCE.TimeStamp = TARGET.TimeStamp;
SET @Rowcount = @@ROWCOUNT;
INSERT dbo.LogData
( LogID
, Value
, Attribute
, TimeStamp )
SELECT
LogID
, Value
, Attribute
, TimeStamp
FROM @IncomingLogData SOURCE
JOIN dbo.LogData TARGET
ON SOURCE.LogID = TARGET.LogID
AND SOURCE.TimeStamp = TARGET.TimeStamp
WHERE TARGET.LogId IS NULL
AND TARGET.Timestamp IS NULL;
SET @Rowcount += @@ROWCOUNT;
COMMIT TRANSACTION
It looks to me like you may benefit from an index on [LogId], [TimeStamp] on your TVP... That would give your TVP at least one of the two allowed indexes on an @table_variable. Are the incoming rows always in primary-key order? The rows are narrow, but for 50,000 rows you may want to consider staging the data, at least to tempdb. The performance advantage of isolating the import from the DML is usually impressive, especially as the number of possible concurrent imports grows. This could also increase the flexibility you anticipate. Just SELECT the TVP INTO a temp table first thing in the stored procedure, then add a primary key and any other indexes you might need. You'd need to monitor under load to know for sure what you need, and then later to confirm that what you did worked...
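Something like this inside the procedure, for example (a sketch; the column list assumes the DataLogTableType shown earlier):
-- Stage the TVP rows in tempdb and key them before running the DML
SELECT LogID, Value, Attribute, [TimeStamp]
INTO #IncomingLogData
FROM @IncomingLogData;

-- Unnamed PK to avoid constraint-name collisions across concurrent sessions
ALTER TABLE #IncomingLogData ADD PRIMARY KEY CLUSTERED (LogID, [TimeStamp]);

-- ...then run the UPDATE and INSERT above against #IncomingLogData
-- instead of @IncomingLogData.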
I did not know that MERGE and partitioned views don't play together. I do know that MERGE and table partitioning work well, and the performance and management advantages of table partitioning over the partitioned view might be worth a look.
Bill Wunder
July 31, 2011 at 11:54 pm
Bill,
Thanks for the info. Your script is quite close to what I had managed to attain in the meantime. However, I'm falling down on the INSERT part of the procedure (the UPDATE works fine). From what I can gather, the WHERE clause of the INSERT is never satisfied, hence no inserts ever occur. If I change the INSERT to a joined SELECT to see what data is actually worked upon, I get no rows, even if I manually insert data into the tables and then attempt to insert duplicate values through the stored procedure. Is there some prerequisite on the key values for them to evaluate to NULL?
August 1, 2011 at 12:50 am
After playing around with the script, I managed to gain some 'success'. I modified the INSERT part of the script as follows:
INSERT dbo.LogData
( LogID
, Value
, Attribute
, TimeStamp )
SELECT
SOURCE.LogID
, SOURCE.Value
, SOURCE.Attribute
, SOURCE.TimeStamp
FROM @IncomingLogData SOURCE
LEFT JOIN dbo.LogData TARGET
ON SOURCE.LogID = TARGET.LogID
AND SOURCE.TimeStamp = TARGET.TimeStamp
WHERE TARGET.LogId IS NULL
AND TARGET.Timestamp IS NULL;
This gives the expected SELECT results. However, I get back to an error I had previously encountered, specific to partitioned views:
Partitioned view 'dbo.LogTagData' is not updatable because the source query contains references to partition table '[dbo].[LogTagData1106]'.
Looks like the only way this is going to work is via 'software' partitioning, i.e. creating tables via code, or replacing SQL Server with something else, like a NoSQL option.
August 1, 2011 at 6:48 am
I wonder why the UPDATE does not fail too?
At any rate, nothing to do with Upsert or MERGE but yet another partitioned view issue.
Perhaps all you need to do is take that target partitioned view out of the query.
Use an OUTPUT clause on the UPDATE to capture the LogId, TimeStamp of each row updated into yet another table variable, then LEFT OUTER JOIN that table variable to the TVP. Something like this (as before, not tested):
DECLARE @rowcount INT;
DECLARE @updated TABLE (LogId INT, Timestamp DATETIME2, PRIMARY KEY (LogId, Timestamp));
BEGIN TRANSACTION
UPDATE TARGET
SET TARGET.Value = SOURCE.Value
, TARGET.Attribute = SOURCE.Attribute
OUTPUT SOURCE.LogId, SOURCE.Timestamp INTO @updated
FROM @IncomingLogData SOURCE
JOIN dbo.LogData TARGET
ON SOURCE.LogID = TARGET.LogID
AND SOURCE.TimeStamp = TARGET.TimeStamp;
SET @Rowcount = @@ROWCOUNT;
INSERT dbo.LogData
( LogID
, Value
, Attribute
, TimeStamp )
SELECT
s.LogID
, s.Value
, s.Attribute
, s.TimeStamp
FROM @IncomingLogData s
LEFT JOIN @updated u
ON s.LogID = u.LogID
AND s.Timestamp = u.Timestamp
WHERE u.LogId IS NULL
AND u.Timestamp IS NULL;
SET @Rowcount += @@ROWCOUNT;
COMMIT TRANSACTION
Bill Wunder
August 1, 2011 at 7:09 am
Unfortunately, I had already tried an OUTPUT in another scenario. SQL Server won't accept the OUTPUT because the target is a partitioned view:
The OUTPUT clause cannot be specified because the target view "TARGET" is a partitioned view.
:ermm:
August 1, 2011 at 7:53 am
You could decompose it another level and populate the @updated table yourself...
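For instance, something along these lines (again untested; it assumes dbo.LogDataView is the target view and that @updated and @IncomingLogData are declared as in the earlier script):
-- Capture the keys of rows that already exist, since OUTPUT is not allowed here
INSERT @updated (LogId, [Timestamp])
SELECT s.LogID, s.[TimeStamp]
FROM @IncomingLogData s
JOIN dbo.LogDataView t
  ON s.LogID = t.LogID
 AND s.[TimeStamp] = t.[TimeStamp];

UPDATE t
SET t.Value = s.Value
  , t.Attribute = s.Attribute
FROM dbo.LogDataView t
JOIN @IncomingLogData s
  ON s.LogID = t.LogID
 AND s.[TimeStamp] = t.[TimeStamp];

-- Anti-join against the captured keys to insert only the new rows
INSERT dbo.LogDataView (LogID, Value, Attribute, [TimeStamp])
SELECT s.LogID, s.Value, s.Attribute, s.[TimeStamp]
FROM @IncomingLogData s
LEFT JOIN @updated u
  ON s.LogID = u.LogId
 AND s.[TimeStamp] = u.[Timestamp]
WHERE u.LogId IS NULL;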
But I agree, better to step back and identify the best way to do what you need to do if not satisfied with your progress.
At least you helped me remember why I have so successfully avoided partitioned views for so long...
Why the partitioned view anyway? Distributed? File groups? Delete/archive? Legacy?
Bill Wunder
August 1, 2011 at 8:19 am
As a Microsoft partner, SQL Server would be the best fit and would have the least code ramifications at this stage. I already have the MERGE part working in a .NET environment as a proof of concept. However, given the nature and potential size of the data, and depending on the customer's requirements, I wish to have the ability to split the data between files for simpler maintenance/backup purposes, as well as to address potential performance issues once the table sizes become large. I knew about SQL Server table partitioning when embarking on the project - I just didn't know until recently that it is only available in Enterprise edition. So now I'm somewhat stuck, in that ideally I'd like to get away with as few changes to the code as possible, but that doesn't seem possible if we stick with SQL Server at this stage.
I'm busy looking at MySQL/DB2 as alternatives, as they do support partitioning and, importantly in this case, I can get away with using the free editions if required (not to say that I am opposed to paying for stuff, just not what SQL Enterprise costs; for this particular solution that is not viable). Needless to say, MySQL has its own problems (I have to store DateTimes as ticks since the MySQL DATETIME doesn't store milliseconds), and I also need to validate whether I can actually pass data to it fast enough to at least match what SQL Server can do with TVPs.
Finally, the other option would be to "manage" the partitioning in the .NET code itself by dividing the periods into separate tables that we could map into individual files, which could be backed up and rotated in and out of the working data set as required. I'm not yet 100% confident this can be done with SQL Server either. I quite like the DB2 ability to rotate the partitions as required (http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja2/index.html), but more investigation is required.
August 2, 2011 at 8:03 am
"cheap" and "high performance" just don't go together. I would bet your hardware is not up to the task you are asking either. If other database platforms offer what you need at the "cheap/free" end of the spectrum, go for it. But sometimes you just can't get where you need to be on the cheap.
I will add that I have NEVER been to a client in the last 15 years where existing things could not be improved SIGNIFICANTLY. Perhaps you need a professional to give your stuff a review to see if your existing processes can be made better?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service