Partitioned data (Non Enterprise SQL), Table-Valued Parameter and Upsert

  • I am currently involved in developing a cost-sensitive solution that needs to run on SQL Server 2008 Express or Standard edition, where time-ordered data is logged into the database as fast as possible.

    Currently, we are using MERGE with a Table-Valued Parameter in a stored procedure to pass chunks of data (around 50K rows at a time) into the database, and this has so far given us acceptable performance. However, given the nature and potential size of the data, and depending on the customer's requirements, we want the ability to split the data between files for simpler maintenance/backup purposes, as well as to address potential performance issues once the table sizes become large.

    Given that we cannot use SQL Enterprise to implement table partitioning, I am attempting to use a partitioned view to achieve something similar. The majority of data written to the database is inserted rather than updated. The log data is used to generate KPIs whose evaluation time range can be defined by the user, but most queries are likely to cover only the last few months.

    The main issue I have now is how to replace the MERGE in the stored procedure with a more conventional UPSERT syntax, since MERGE doesn't appear to support partitioned views as a target. I have found this article (http://bwunder.wordpress.com/2010/12/10/persisting-data-part-4-upsert-and-merge/) that is closest to what I want to do, but since I'm not a seasoned SQL expert, I'm struggling to adapt the example for our purposes. Any assistance/comments/suggestions/pitfalls on how to modify the stored procedure to use a standard UPDATE/INSERT (or a totally different approach) would be highly appreciated. From the results posted in the above link, it appears the performance of the UPSERT vs. the MERGE should be quite similar at the row counts we are using.

    This is the basis of the current stored procedure:

    USE [LogDB]
    GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    --CREATE TYPE dbo.DataLogTableType AS TABLE
    --    ( LogID int, Value float, Attribute int, [TimeStamp] datetime2 ) -- matches the structure of the LogData table
    --
    -- The LogData table has a PK of [LogID] and [TimeStamp]; no identity fields or other indexes.

    CREATE PROCEDURE [dbo].[sp_Log_BulkInsertUpdate]
        @IncomingLogData DataLogTableType READONLY
    AS
    MERGE LogData AS TARGET
    USING @IncomingLogData AS SOURCE
       ON (TARGET.[TimeStamp] = SOURCE.[TimeStamp] AND TARGET.LogID = SOURCE.LogID)
    WHEN NOT MATCHED -- BY TARGET
       THEN INSERT (LogID, Value, Attribute, [TimeStamp])
            VALUES (SOURCE.LogID, SOURCE.Value, SOURCE.Attribute, SOURCE.[TimeStamp])
    WHEN MATCHED
       THEN UPDATE SET
            TARGET.Value = SOURCE.Value,
            TARGET.Attribute = SOURCE.Attribute;

    SELECT @@ROWCOUNT;
    GO
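
    For reference, a minimal example of how a chunk can be handed to this procedure from T-SQL (the values here are made up; in practice the TVP is populated from our .NET code):

    -- Hypothetical illustration only: passing a small batch through the table type.
    DECLARE @chunk dbo.DataLogTableType;

    INSERT @chunk (LogID, Value, Attribute, [TimeStamp])
    VALUES (1, 42.5, 7, '2011-01-15 10:30:00.1234567'),
           (2, 17.0, 3, '2011-01-15 10:30:00.1234567');

    EXEC dbo.sp_Log_BulkInsertUpdate @IncomingLogData = @chunk;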

    The partitioning plan is to programmatically create time-ranged log tables on the fly (e.g. LogData20110101, LogData20110201, LogData20110301 if a monthly partition were chosen), update the partitioned view appropriately, and make the view (called LogDataView, for example) the target in the above stored procedure.

    A sample LogDataView would be:

    SELECT Value, Attribute, [TimeStamp], LogID
      FROM LogData20110101
    UNION ALL
    SELECT Value, Attribute, [TimeStamp], LogID
      FROM LogData20110201
    UNION ALL
    SELECT Value, Attribute, [TimeStamp], LogID
      FROM LogData20110301

    and each LogData table would have the appropriate CHECK constraint implemented on the [TimeStamp] column.
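
    To make that plan concrete, a rough sketch of what one monthly member table might look like (column types assumed from the commented-out DataLogTableType definition above; constraint names are illustrative only):

    -- Hypothetical monthly member table for the partitioned view.
    CREATE TABLE dbo.LogData20110101
    (
        LogID       int       NOT NULL,
        Value       float     NULL,
        Attribute   int       NULL,
        [TimeStamp] datetime2 NOT NULL,
        CONSTRAINT PK_LogData20110101 PRIMARY KEY (LogID, [TimeStamp]),
        -- Non-overlapping CHECK on the partitioning column so SQL Server can tell
        -- the member tables apart (required for the view to be updatable).
        CONSTRAINT CK_LogData20110101_TimeStamp
            CHECK ([TimeStamp] >= '20110101' AND [TimeStamp] < '20110201')
    );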

  • About the only way I've ever been able to get something like this to work is with INSTEAD OF triggers on the view: one for INSERT and another for UPDATE. The trigger figures out which member table the data actually belongs in.

    I've never done this sort of thing on large data sets for the INSERT/UPDATE so I don't know how well it scales up. I've done them for onesey/twosey type applications and they work just fine.
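
    A rough, untested sketch of that idea for the INSERT side, assuming the LogDataView and monthly member tables described in the first post (an INSTEAD OF UPDATE trigger would follow the same pattern):

    -- Hypothetical INSTEAD OF INSERT trigger that routes each incoming row to the
    -- member table whose date range covers its [TimeStamp].
    CREATE TRIGGER dbo.trg_LogDataView_Insert
    ON dbo.LogDataView
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT dbo.LogData20110101 (LogID, Value, Attribute, [TimeStamp])
        SELECT LogID, Value, Attribute, [TimeStamp]
          FROM inserted
         WHERE [TimeStamp] >= '20110101' AND [TimeStamp] < '20110201';

        INSERT dbo.LogData20110201 (LogID, Value, Attribute, [TimeStamp])
        SELECT LogID, Value, Attribute, [TimeStamp]
          FROM inserted
         WHERE [TimeStamp] >= '20110201' AND [TimeStamp] < '20110301';

        -- ...one INSERT per member table; the trigger has to be regenerated
        -- whenever a new monthly table is added to the view.
    END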

    Todd Fifield

  • mmmhhh... columns of the TimeStamp data type may be a deal breaker; check this: http://msdn.microsoft.com/en-us/library/aa933141(v=SQL.80).aspx

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Maybe something like this?

    DECLARE @rowcount INT;

    BEGIN TRANSACTION;

    UPDATE TARGET
       SET TARGET.Value = SOURCE.Value
         , TARGET.Attribute = SOURCE.Attribute
      FROM @IncomingLogData SOURCE
      JOIN dbo.LogData TARGET
        ON SOURCE.LogID = TARGET.LogID
       AND SOURCE.[TimeStamp] = TARGET.[TimeStamp];

    SET @rowcount = @@ROWCOUNT;

    INSERT dbo.LogData
         ( LogID
         , Value
         , Attribute
         , [TimeStamp] )
    SELECT SOURCE.LogID
         , SOURCE.Value
         , SOURCE.Attribute
         , SOURCE.[TimeStamp]
      FROM @IncomingLogData SOURCE
      JOIN dbo.LogData TARGET
        ON SOURCE.LogID = TARGET.LogID
       AND SOURCE.[TimeStamp] = TARGET.[TimeStamp]
     WHERE TARGET.LogID IS NULL
       AND TARGET.[TimeStamp] IS NULL;

    SET @rowcount += @@ROWCOUNT;

    COMMIT TRANSACTION;

    Looks to me like you may benefit from an index on [LogID], [TimeStamp] on your TVP. That would give your TVP at least one of the two allowed indexes on a table variable. Are the incoming rows always in primary-key order? The rows are narrow, but at 50,000 rows you may want to consider staging the data, at least into tempdb. The performance advantage of isolating the import from the DML is usually impressive, especially as the number of possible concurrent imports grows, and this could give you the flexibility you anticipate needing. Just SELECT the TVP INTO a temp table first thing in the stored procedure, then add a primary key and any other indexes you might need. You'd need to monitor under load to know for sure what you need, and then later to confirm that what you did worked...
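
    A minimal sketch of that staging step inside the procedure (temp table name is illustrative, not tested):

    -- Stage the TVP rows into a local temp table, then index them.
    SELECT LogID, Value, Attribute, [TimeStamp]
      INTO #IncomingLogData
      FROM @IncomingLogData;

    -- Unnamed PK so concurrent callers don't clash on a constraint name in tempdb.
    ALTER TABLE #IncomingLogData
        ADD PRIMARY KEY CLUSTERED (LogID, [TimeStamp]);

    -- The UPDATE/INSERT statements would then read from #IncomingLogData
    -- instead of the table-valued parameter.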

    I did not know that MERGE and partitioned views don't play together. I do know that MERGE and table partitioning work well, and the performance and management advantages of table partitioning over the partitioned view might be worth a look.

    Bill Wunder

  • Bill,

    Thanks for the info. Your script is quite close to what I had managed to put together in the meantime. However, I'm falling down on the INSERT part of the procedure (the UPDATE works fine). From what I can gather, the WHERE clause of the INSERT is never matched, hence no inserts ever occur. If I change the INSERT to a joined SELECT to see what data is actually worked upon, I get no rows, even if I manually insert data into the tables and then attempt to insert duplicate values through the stored procedure. Is there some prerequisite for the key values to evaluate to NULL?

  • After playing around with the script, I managed to gain some 'success'. I modified the INSERT part of the script as follows:

    INSERT dbo.LogData
         ( LogID
         , Value
         , Attribute
         , [TimeStamp] )
    SELECT SOURCE.LogID
         , SOURCE.Value
         , SOURCE.Attribute
         , SOURCE.[TimeStamp]
      FROM @IncomingLogData SOURCE
      LEFT JOIN dbo.LogData TARGET
        ON SOURCE.LogID = TARGET.LogID
       AND SOURCE.[TimeStamp] = TARGET.[TimeStamp]
     WHERE TARGET.LogID IS NULL
       AND TARGET.[TimeStamp] IS NULL;

    This gives the expected SELECT results. However, I get back to an error I had previously encountered, specific to partitioned views:

    Partitioned view 'dbo.LogTagData' is not updatable because the source query contains references to partition table '[dbo].[LogTagData1106]'.

    Looks like the only way this is going to work is via 'software' partitioning, i.e. creating the tables via code, or replacing SQL Server with something else, like a NoSQL option.

  • I wonder why the UPDATE does not fail too?

    At any rate, that has nothing to do with the UPSERT or MERGE; it's yet another partitioned view issue.

    Perhaps all you need to do is take that target partitioned view out of the query.

    Use an OUTPUT clause on the UPDATE to capture the LogID, [TimeStamp] of each row updated into yet another table variable, then LEFT OUTER JOIN that table variable to the TVP. Something like this (as before, not tested):

    DECLARE @rowcount INT;
    DECLARE @updated TABLE (LogID INT, [TimeStamp] DATETIME2, PRIMARY KEY (LogID, [TimeStamp]));

    BEGIN TRANSACTION;

    UPDATE TARGET
       SET TARGET.Value = SOURCE.Value
         , TARGET.Attribute = SOURCE.Attribute
    OUTPUT SOURCE.LogID, SOURCE.[TimeStamp] INTO @updated
      FROM @IncomingLogData SOURCE
      JOIN dbo.LogData TARGET
        ON SOURCE.LogID = TARGET.LogID
       AND SOURCE.[TimeStamp] = TARGET.[TimeStamp];

    SET @rowcount = @@ROWCOUNT;

    INSERT dbo.LogData
         ( LogID
         , Value
         , Attribute
         , [TimeStamp] )
    SELECT s.LogID
         , s.Value
         , s.Attribute
         , s.[TimeStamp]
      FROM @IncomingLogData s
      LEFT JOIN @updated u
        ON s.LogID = u.LogID
       AND s.[TimeStamp] = u.[TimeStamp]
     WHERE u.LogID IS NULL
       AND u.[TimeStamp] IS NULL;

    SET @rowcount += @@ROWCOUNT;

    COMMIT TRANSACTION;

    Bill Wunder

  • Unfortunately, I had already tried an OUTPUT clause in another scenario. SQL Server won't accept the OUTPUT because the target is a partitioned view:

    The OUTPUT clause cannot be specified because the target view "TARGET" is a partitioned view.

    :ermm:

  • You could decompose it another level and populate the @updated table yourself...
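
    For example (untested, with dbo.LogData standing for the partitioned view as in the scripts above), the keys that already exist could be captured with a plain SELECT before the UPDATE, so no OUTPUT clause is needed:

    -- Hypothetical: record which incoming keys already exist in the view.
    INSERT @updated (LogID, [TimeStamp])
    SELECT s.LogID, s.[TimeStamp]
      FROM @IncomingLogData s
      JOIN dbo.LogData t
        ON s.LogID = t.LogID
       AND s.[TimeStamp] = t.[TimeStamp];

    -- The UPDATE then proceeds as before (without OUTPUT), and the final INSERT
    -- selects from the TVP LEFT JOINed to @updated, keeping the partitioned view
    -- out of the INSERT's source query.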

    But I agree, better to step back and identify the best way to do what you need to do if you're not satisfied with your progress.

    At least you helped me remember why I have so successfully avoided partitioned views for so long...

    Why the partitioned view anyway? Distributed? Filegroups? Delete/archive? Legacy?

    Bill Wunder

  • As a Microsoft partner, SQL Server would be the best fit and would have the least code ramifications at this stage. I already have the MERGE part working in a .NET environment as a proof of concept. However, given the nature and potential size of the data, and depending on the customer's requirements, I wish to have the ability to split the data between files for simpler maintenance/backup purposes, as well as to address potential performance issues once the table sizes become large. I knew about SQL table partitioning when embarking on the project - I just didn't know until recently that it is only available in the Enterprise edition. So now I'm somewhat stuck: ideally I'd like to get away with as few changes to the code as possible, but that doesn't seem possible if we stick with SQL Server at this stage.

    I'm busy looking at MySQL/DB2 as alternatives, as they do support partitioning and, importantly in this case, I can get away with using the free editions if required (not to say that I am opposed to paying for stuff, just not what SQL Enterprise costs - for this particular solution, that is not viable). Needless to say, MySQL has its own problems (we would have to store DateTimes as ticks, since MySQL's DATETIME doesn't store milliseconds), and I also need to validate whether I can actually pass data to it fast enough to at least match what SQL Server can do with TVPs.

    Finally, the other option would be to "manage" the partitioning in the .NET code itself by dividing the periods into separate tables that we could map to individual files, which could be backed up and rotated in and out of the working data set as required. I'm not 100% confident yet that this can be done with SQL Server either. I quite like DB2's ability to rotate partitions as required (http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja2/index.html), but more investigation is required.

  • "cheap" and "high performance" just don't go together. I would bet your hardware is not up to the task you are asking either. If other database platforms offer what you need at the "cheap/free" end of the spectrum, go for it. But sometimes you just can't get where you need to be on the cheap.

    I will add that I have NEVER been to a client in the last 15 years where existing things could not be improved SIGNIFICANTLY. Perhaps you need a professional to give your stuff a review to see if your existing processes can be made better?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
