SQL Server 2017 - Transferring Transformed Data to a Staging Server VERY frequently

  • Hi all,

    I have the following scenario with a client:  The client has multiple SQL Server databases (versions range from 2008 R2 to 2016) used for various parts of the business.  He needs to consolidate data from all these databases (so far there are five) onto a central staging server.  The data is to be somewhat transformed (e.g. certain filters applied, concatenation and splitting of fields, joining of tables), but nothing a T-SQL view can't handle.  The data is to be written into tables on the SQL Server 2017 staging server, and I should note that data from one business database will never be inserted into the same table as data from another.  That is to say, each table in the staging database is always fed data from ONLY one database.
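
    To make that concrete, here is a minimal sketch of the kind of view I mean (all table and column names are invented for the example; the delimiter split uses CHARINDEX so it also works on 2008 R2):

    -- Hypothetical source view: filter, concatenate, split and join in plain T-SQL.
    CREATE VIEW dbo.vw_StagingOrders
    AS
    SELECT  o.OrderID,
            c.FirstName + ' ' + c.LastName AS CustomerName,                         -- concatenation
            LEFT(o.RegionCode, CHARINDEX('-', o.RegionCode + '-') - 1) AS Country,  -- split on '-'
            o.OrderDate,
            o.LastModified                                                          -- used later for delta extraction
    FROM    dbo.Orders    AS o
    JOIN    dbo.Customers AS c ON c.CustomerID = o.CustomerID
    WHERE   o.OrderStatus = 'Completed';                                            -- filter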

    The expected data volume is small - an estimated 5,000 to 10,000 records per day for each table (so not more than 20 MB per day) - but the challenge lies in the frequency of transfer.  The client would like the data to be available on the staging server within 5 minutes of entry into the business database.  So I see the following options:

    Option 1:  Create an SSIS package that extracts the changed data via a timestamp column (we cannot implement CDC on the business databases) and applies it with a MERGE statement (INSERT and UPDATE, no DELETE necessary), scheduled to run every five minutes.  I would have nothing against this solution if the requirements stated that the package should run only several times daily, or even once every hour.  But isn't scheduling an SSIS package to run every five minutes overkill?  A rough sketch of the MERGE step is below.
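
    This assumes the data flow first lands the delta rows from the source view into a load table on the staging server; all names are invented:

    -- Hypothetical MERGE step, run on the staging server after the data flow
    -- has landed the changed rows into stg.Orders_Load.
    MERGE stg.Orders AS tgt
    USING stg.Orders_Load AS src
       ON tgt.OrderID = src.OrderID
    WHEN MATCHED THEN
        UPDATE SET CustomerName = src.CustomerName,
                   Country      = src.Country,
                   OrderDate    = src.OrderDate
    WHEN NOT MATCHED THEN
        INSERT (OrderID, CustomerName, Country, OrderDate)
        VALUES (src.OrderID, src.CustomerName, src.Country, src.OrderDate);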

    Option 2:  Create views on the business databases that transform the necessary data from the tables, and then use REPLICATION to transfer the data every five minutes to the corresponding tables in the staging database.  I've used replication before but am no expert on it, so is it even possible to replicate a view to a table?  Also, this would depend on a frequently running SQL Server Agent job.

    Are there any better options out there using SQL Server native tools?

    Suggestions appreciated.

  • At first, the idea of replicating a view as a table struck me as potentially problematic, so I did a web search and found some text on this Microsoft page:

    https://docs.microsoft.com/en-us/sql/relational-databases/replication/publish/make-schema-changes-on-publication-databases?view=sql-server-2017

    There, the following text appears:

    Transactional Replication

    • Schema changes are propagated to Subscribers running previous versions of SQL Server, but the DDL statement should only include syntax supported by the version at the Subscriber.

      If the Subscriber republishes data, the only supported schema changes are adding and dropping a column. These changes should be made on the Publisher using sp_repladdcolumn (Transact-SQL) and sp_repldropcolumn (Transact-SQL) rather than ALTER TABLE DDL syntax.

    • Schema changes are not replicated to non-SQL Server Subscribers.
    • Schema changes are not propagated from non-SQL Server Publishers.
    • You cannot alter indexed views that are replicated as tables. Indexed views that are replicated as indexed views can be altered, but altering them will cause them to become regular views, rather than indexed views.
    • If the publication supports immediate updating or queued updating subscriptions, the system must be quiesced before making schema changes: all activity on the published table must be stopped at the Publisher and Subscribers, and pending data changes must be propagated to all nodes. After the schema changes have propagated to all nodes, activity can resume on the published tables.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • TheComedian - Monday, October 1, 2018 1:20 AM

    ... Are there any better options out there using SQL Server native tools? ...

    Assuming all the tables on the databases you are importing from have a DateStamp column that records the last time the row was inserted/updated, I would have a table on the staging database that stores the table name and the most recent DateStamp imported, then just import all rows later than that date.  You could do it with an SSIS package for each table, or just with a simple SQL script in a job that runs every 5 minutes.  A sketch of the idea follows.
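
    Something like this, with all object names invented for the sketch:

    -- Hypothetical watermark table on the staging database.
    CREATE TABLE stg.LoadWatermark
    (
        TableName     sysname   NOT NULL PRIMARY KEY,
        LastDateStamp datetime2 NOT NULL
    );

    -- Per source table, inside the 5-minute job (linked server shown; SSIS would work the same way).
    DECLARE @Last datetime2, @NewMax datetime2;
    SELECT @Last = LastDateStamp FROM stg.LoadWatermark WHERE TableName = 'Orders';

    -- Pin the upper bound first so rows arriving mid-load are picked up on the next run.
    SELECT @NewMax = MAX(DateStamp) FROM BizServer.BizDb.dbo.Orders;

    IF @NewMax > @Last
    BEGIN
        INSERT INTO stg.Orders (OrderID, CustomerName, OrderDate, DateStamp)
        SELECT OrderID, CustomerName, OrderDate, DateStamp
        FROM   BizServer.BizDb.dbo.Orders
        WHERE  DateStamp > @Last AND DateStamp <= @NewMax;

        UPDATE stg.LoadWatermark SET LastDateStamp = @NewMax WHERE TableName = 'Orders';
    END;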

  • sgmunson - Monday, October 1, 2018 8:48 AM

    ... the idea of replicating a view as a table struck me as potentially problematic ...

    Thanks for the reply.  However, on further consultation with the client, I believe that replication is out of the question: they want to do as little work as possible on the business servers, and they want to avoid giving me the db_owner/sysadmin privileges there that I would need to set up the Publishers.

    Jonathan AC Roberts - Monday, October 1, 2018 9:40 AM

    ... I would have a table on the staging database that stores the table name and the most recent DateStamp imported, then just import all rows later than that date ...

    That's what I was thinking.  But I was wondering whether a SQL Server Agent job running every 5 minutes isn't a bit overkill.

  • Running a job like that isn't really overkill - on our main prod servers we have jobs running every 10 seconds.  Very fast jobs, though.

    For data transfer we have some jobs running every 5 minutes - transferring, in sequence, around 20k rows from one server to another from 4 distinct tables.  As one of the tables has blobs, it takes between 20 and 250 seconds - extract, compress blob, load to the destination server - verifying it's not there already.

    This is done in C#, although SSIS would also work - just more complicated to do, and it would still require a C# script within it.

    The main thing really is whether the source/destination servers can cope with the process.  If you have a server with 2 CPUs at a load of 80%, will this extra job push it to 100%?
    Or will the joins you have to perform require loading data into memory that would not normally be there during working hours, and will that trash the buffer pool?
    This is what you need to consider; one way to keep an eye on the buffer pool is sketched below.
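
    For the buffer pool question, sample Page life expectancy before and after the job goes live and look for a sustained drop - this uses a standard DMV, nothing invented here:

    SELECT RTRIM(object_name)  AS object_name,
           RTRIM(counter_name) AS counter_name,
           cntr_value          AS ple_seconds
    FROM   sys.dm_os_performance_counters
    WHERE  counter_name LIKE 'Page life expectancy%'
    AND    object_name  LIKE '%Buffer Manager%';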

  • frederico_fonseca - Tuesday, October 2, 2018 5:19 AM

    This done in C# although SSIS would also work - just more complicated to do and it would still require a c# script within it.

    I agree with your other points, but not this one.

    In the OP's own words:

    ... certain filters applied, concatenation and splitting of fields, joining of tables), but nothing a T-SQL view can't handle ...

    I can't see why any C# would be required to achieve this in SSIS.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Tuesday, October 2, 2018 5:32 AM

    ... I can't see why any C# would be required to achieve this in SSIS ...

    Bit of a misunderstanding here - my code was in C# because it was required to compress particular columns from the source data on the way to the destination table/server - that is what requires C# (or, on newer versions of SQL Server, the new COMPRESS function on the source SELECT).  A rough example is below.
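
    For anyone curious, roughly what that looks like on SQL Server 2016 and later (table and column names invented):

    -- COMPRESS gzips the column on the way out, returning varbinary(max);
    -- DECOMPRESS reverses it on the other side.
    SELECT DocumentId,
           COMPRESS(DocumentBody) AS DocumentBodyGz
    FROM   dbo.SourceDocuments;

    -- Reading it back on the destination:
    -- CAST(DECOMPRESS(DocumentBodyGz) AS nvarchar(max))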

  • I set up a very similar process using the rowversion datatype, keeping track of the last RowVersion for each table.  I got the largest RowVersion greater than the previous one and then retrieved all records in between.  That way I didn't miss any rows if they changed while I was retrieving.  The process ran every 30 seconds and was very fast.  Just make sure you index the rowversion column.  We set up an SSIS package to handle it all.  A rough sketch of the pattern is below.
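
    A rough sketch of the pattern, reusing the kind of watermark table suggested above with an extra binary(8) column (all names invented; rowversion values are database-wide and strictly increasing):

    DECLARE @LastRV binary(8), @MaxRV binary(8);
    SELECT @LastRV = LastRowVersion FROM stg.LoadWatermark WHERE TableName = 'Orders';

    -- Pin the upper bound first; RV is the (indexed) rowversion column on the source table.
    SELECT @MaxRV = MAX(RV) FROM dbo.Orders;

    -- Everything that changed between the two marks, however long the read takes.
    INSERT INTO stg.Orders (OrderID, CustomerName, OrderDate)
    SELECT OrderID, CustomerName, OrderDate
    FROM   dbo.Orders
    WHERE  RV > @LastRV AND RV <= @MaxRV;

    UPDATE stg.LoadWatermark SET LastRowVersion = @MaxRV WHERE TableName = 'Orders';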

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • TheComedian - Tuesday, October 2, 2018 3:18 AM

    ... I believe that replication is out of the question: they want to do as little work as possible on the business servers, and they want to avoid giving me the db_owner/sysadmin privileges there that I would need to set up the Publishers ...

    ... I was wondering whether a SQL Server Agent job running every 5 minutes isn't a bit overkill.

    If you think running an Agent job every 5 minutes is overkill, then you must be a bit confused.  Replication would reduce the total bandwidth needed considerably.  If you have data changes in any kind of volume, having to copy from a view into the destination tables using MERGE, but only after waiting 5 or more minutes, would likely be considerably more painful than replication would be.  Privileges or not, replication is something you can lead their DBA through setting up.  If they can't handle replication, wait until you start copying considerably more data across the wire in bulk after a delay of, say, 10 to 15 minutes (because you think 5 minutes is overkill).  Let's just say it can get really ugly really quickly with any volume.  Just getting the initial population of those tables to happen might well need to be an off-hours event.  You can be quite effective with replication, but wait too long to do the updates and you may end up building in technical debt that brings the house down.  If you can wait considerably longer for refreshed data, you might be able to solve that problem, but will the customer be happy with it?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • frederico_fonseca - Tuesday, October 2, 2018 6:55 AM

    ... that is what requires C# (or, on newer versions of SQL Server, the new COMPRESS function on the source SELECT) ...

    Aha, got it & understood.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel
