Best method for moving a constant flow of data into SQL

  • Hi all,

    I have developed a process which takes a non-SQL database, extracts data, and loads it into a SQL database in a temp storage area before generating a reporting database, with the data broken out into tables optimised for reporting and designed to make the data more intuitive to report from.

    This process is a batch process to be executed once a day.

    I have now developed a process to move data across in a more real-time fashion. I am certain I have found the best method for moving the data from the non-SQL database into SQL; however, I now need to work out the best method for getting the data from the staging area into the reporting database.

    I'm looking for various thoughts and options here with pros and cons as to the best method to use... (SQL 2005 & SQL 2008)

    1. Triggers

    2. SQL Broker

    3. ?

  • martynjreid (5/16/2011)


    Hi all,

    I have developed a process which takes a non-SQL database, extracts data, and loads it into a SQL database in a temp storage area before generating a reporting database, with the data broken out into tables optimised for reporting and designed to make the data more intuitive to report from.

    This process is a batch process to be executed once a day.

    I have now developed a process to move data across in a more real-time fashion. I am certain I have found the best method for moving the data from the non-SQL database into SQL; however, I now need to work out the best method for getting the data from the staging area into the reporting database.

    I'm looking for various thoughts and options here with pros and cons as to the best method to use... (SQL 2005 & SQL 2008)

    1. Triggers

    2. SQL Broker

    3. ?

    1. With Triggers you are going to have a lot of overhead

    2. With SQL Broker you are going to have to write a lot of code to extract the data

    3. If you were running SQL Server 2008 you could use Change Data Capture

    http://msdn.microsoft.com/en-us/library/bb522489.aspx
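    To give an idea of how little setup CDC needs, it is essentially two system procedure calls per table. The database and table names below are made up for illustration, and CDC requires Enterprise or Developer edition:

    -- Hypothetical names: enable CDC on the database, then on each table of interest.
    USE StagingDB;
    GO
    EXEC sys.sp_cdc_enable_db;                -- enables CDC at the database level
    GO
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'SourceData',      -- hypothetical staging table
         @role_name     = NULL;               -- no gating role
    GO
    -- Changes are then read from the generated function, e.g.
    -- cdc.fn_cdc_get_all_changes_dbo_SourceData(@from_lsn, @to_lsn, N'all')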

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • for me the big question is the number of rows moved;

    when your procedure was running once a day, what did it do? MERGE statement? drop and recreate the data? add a new partition and use the SWITCH command to bring in a new day's worth of data?

    if the same proc was run every 2 hours instead of once a day, would that work, for example?

    so much depends on the load... if it's enough rows to require a table lock to insert (or update?) the data, it might affect the way you want to bring the data in to minimize locking.
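    to illustrate the SWITCH idea mentioned above, here's a rough sketch. All names are invented, and it assumes the load table exactly matches the partitioned target, sits on the same filegroup, and the target partition is empty:

    -- Rough sketch only: dbo.ReportFact is assumed to be partitioned on LoadDate by
    -- partition function pfLoadDate, and dbo.ReportFact_Load has an identical structure.
    ALTER TABLE dbo.ReportFact_Load
        ADD CONSTRAINT CK_ReportFact_Load_LoadDate
        CHECK (LoadDate = '20110516' AND LoadDate IS NOT NULL);  -- proves rows fit the target partition

    ALTER TABLE dbo.ReportFact_Load
        SWITCH TO dbo.ReportFact
        PARTITION $PARTITION.pfLoadDate('20110516');             -- metadata-only, effectively instant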

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement for that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Welsh Corgi (5/16/2011)


    1. With Triggers you are going to have a lot of overhead

    2. With SQL Broker you are going to have to write a lot of code to extract the data

    3. If you were running SQL Server 2008 you could use Change Data Capture

    http://msdn.microsoft.com/en-us/library/bb522489.aspx

    I need a solution that supports both 2005 and 2008.

    If you could upgrade to SQL Server 2008 and use the Change Data Capture method, you would be better off.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Lowell (5/16/2011)


    for me the big question is the number of rows moved;

    when your procedure was running once a day, what did it do? MERGE statement? drop and recreate the data? add a new partition and use the SWITCH command to bring in a new day's worth of data?

    if the same proc was run every 2 hours instead of once a day, would that work, for example?

    so much depends on the load... if it's enough rows to require a table lock to insert (or update?) the data, it might affect the way you want to bring the data in to minimize locking.

    Unfortunately, the extraction process is not mine and it can only extract everything. There is no concept of a day's delta.

    I adopt an extract-everything process, with a complete drop and recreate of both the staging area and the reporting db.

    To make matters worse, the source database allows the software administrators to create user-defined screens of fields. Each screen is stored in one massive table as the equivalent of varchar(max), with a maximum of 30 fields per screen. Another table defines this data, so a field may be defined as, say, a datetime. It gets worse again: the software doesn't validate the data stored against its datatype; the datetime field could be changed to an integer, and the old data is not deleted, it is simply masked by the application.

    I load this data into the staging area and during the reporting database build I massage this data (millions and millions of rows) into strongly typed data whilst also splitting it out into a multitude of tables.
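    To give a flavour of it, the massaging boils down to something along these lines. The table and column names here are invented and the structures are simplified (the fields are treated as plain varchar for brevity); because we're on 2005/2008 there is no TRY_CONVERT, so anything that fails validation simply comes through as NULL:

    -- Illustrative sketch only: invented names, simplified structure.
    INSERT INTO rpt.CustomerScreen (RecordId, VisitDate, Score)
    SELECT  d.RecordId,
            CASE WHEN ISDATE(d.Field01) = 1
                 THEN CONVERT(datetime, d.Field01) END AS VisitDate,  -- masked/invalid dates -> NULL
            CASE WHEN d.Field02 NOT LIKE '%[^0-9]%' AND LEN(d.Field02) BETWEEN 1 AND 9
                 THEN CONVERT(int, d.Field02) END      AS Score       -- only clean digit strings become ints
    FROM    stg.ScreenData       AS d
    JOIN    stg.ScreenDefinition AS s ON s.ScreenId = d.ScreenId
    WHERE   s.ScreenName = 'Customer';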

    Believe me, I wish more than anyone that the source application didn't do what it does, but alas I cannot change any of that...

  • Welsh Corgi (5/16/2011)


    If you could upgrade to SQL Server 2008 and use the Change Data Capture method, you would be better off.

    Thanks, unfortunately this is a released product with about 20 clients. We support SQL 2005 and cannot realistically change that. I need to support our existing clients.

  • What time constraints are you looking at?

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • martynjreid (5/16/2011)


    Welsh Corgi (5/16/2011)


    If you could upgrade to SQL Server 2008 and use the Change Data Capture method, you would be better off.

    Thanks, unfortunately this is a released product with about 20 clients. We support SQL 2005 and cannot realistically change that. I need to support our existing clients.

    If you only have 20 clients and you can't upgrade to 2008, then triggers are probably your best option.
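    Something along these lines on each staging table would work on 2005. All the names here are made up and you would tailor the queue table to your own keys; the idea is to keep the trigger itself light and let a scheduled job do the heavy lifting of pushing the deltas into the reporting tables:

    -- Hedged sketch only (invented names).  The trigger just records which keys changed;
    -- a separate job reads the queue and applies the changes to the reporting database.
    CREATE TABLE dbo.StagingChangeQueue
    (
        QueueId    int IDENTITY(1,1) PRIMARY KEY,
        RecordId   int      NOT NULL,
        ChangeType char(1)  NOT NULL,              -- I / U / D
        QueuedAt   datetime NOT NULL DEFAULT (GETDATE())
    );
    GO
    CREATE TRIGGER trg_SourceData_Audit
    ON dbo.SourceData
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- inserts and updates (deleted is non-empty only for updates)
        INSERT INTO dbo.StagingChangeQueue (RecordId, ChangeType)
        SELECT i.RecordId,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
        FROM   inserted AS i;

        -- pure deletes
        INSERT INTO dbo.StagingChangeQueue (RecordId, ChangeType)
        SELECT d.RecordId, 'D'
        FROM   deleted AS d
        WHERE  NOT EXISTS (SELECT 1 FROM inserted);
    END;
    GO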

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Welsh Corgi (5/16/2011)


    What time constraints are you looking at?

    The ETL process takes anywhere from 30 mins to 8 hrs dependent on hardware and the size of the source database.

    I have a few other questions about what you're dealing with here, to clarify that I understand your process.

    Let's start with the staging table. It's a bulk drop of everything from the other system, okay, no problem. When you're dealing with your user-typed data, I assume it's a simultaneous pull of the field definitions and the core data. You seem to have the re-typing well in hand, so we'll move forward from there; the rest will depend on a few more questions.

    1) Do you try to store the user-defined mess, or is that pretty much a throwaway?

    2) Is this an SCD (Slowly Changing Dimension) scenario where you're dealing with full delete/update/insert or is this an audit structure where it's insert only?

    3) What are your locking restrictions on the primary tables? Is a full table lock allowable even if it interferes with business briefly, or must the primary table stay as available as possible at all times?

    4) Are there user-edits on these records on the SQL Server side after delivery? This idea only comes into play if there are no user-side edits to the records being brought in from the foreign system after delivery.

    4a) If you are full dumping the data from the source into your staging table, what would be the outcome of literally swapping the staging and primary table each time? Is it a feasible idea with a minimum of work to be done to the staging table?

    5) I assume you have some form of key to uniquely identify each record at a reasonable speed?

    6) Have you looked into normalizing your inbound data to try to have less data transience? I'm thinking Star Schema type data manipulations here.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • martynjreid (5/16/2011)


    Welsh Corgi (5/16/2011)


    What time constraints are you looking at?

    The ETL process takes anywhere from 30 mins to 8 hrs dependent on hardware and the size of the source database.

    dumb question time: can you do the ETL to a staging database, and then use MERGE or other methods to push out the delta changes? Or is it 20 distinct ETLs, one for each client?

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement for that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Craig Farrell (5/16/2011)


    I have a few other questions about what you're dealing with here, to clarify that I understand your process.

    Let's start with the staging table. It's a bulk drop of everything from the other system, okay, no problem. When you're dealing with your user-typed data, I assume it's a simultaneous pull of the field definitions and the core data. You seem to have the re-typing well in hand, so we'll move forward from there; the rest will depend on a few more questions.

    That's right, the basic process is as follows:

    1, Extraction of everything from source database into text files.

    2, Analysis of text files, reading metadata -> creation of a .sql file containing T-SQL to drop and recreate the staging area as well as a .fmt format file for each source table. This ensures any source db structure changes are coped with by the process.

    3, Execution of the .sql to recreate the staging area.

    4, Bulk load with TABLOCK to load data from each text file into the staging area using the format files created in step 2 (a rough example of this step is sketched after this list).

    5, Indexing of the staging area prior to reporting schema generation.

    6, Generate the reporting schema (Stored procs). If a table is to be recreated it is first dropped.

    7, Index the reporting schema.
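    For what it's worth, step 4 is essentially one of these per extracted file. The paths and table name below are placeholders, not the real ones:

    -- Placeholder paths and table name; one of these runs per text file in step 4.
    BULK INSERT stg.SourceData
    FROM 'D:\Extract\SourceData.txt'
    WITH
    (
        FORMATFILE = 'D:\Extract\SourceData.fmt',  -- generated in step 2
        TABLOCK,                                   -- table lock for a faster, minimally logged load
        BATCHSIZE  = 100000,
        MAXERRORS  = 0                             -- fail fast on bad rows
    );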

    Craig Farrell (5/16/2011)


    1) Do you try to store the user-defined mess, or is that pretty much a throwaway?

    I'm not sure I fully understand the question. The user-defined stuff is broken out into a multitude of tables in the reporting database by strongly typing it, as I mentioned earlier.

    The mess is left in the staging area until the next time the process runs.

    Craig Farrell (5/16/2011)


    2) Is this an SCD (Slowly Changing Dimension) scenario where you're dealing with full delete/update/insert or is this an audit structure where it's insert only?

    The ETL process is a full delete/insert process.

    The real-time update process I am developing now (the source of this question) is an audit delete/update/insert process.

    Craig Farrell (5/16/2011)


    3) What are your locking restrictions on the primary tables? Is a full table lock allowable even if it interferes with business briefly, or must the primary table stay as available as possible at all times?

    The ETL process allows a full table lock as it is designed to run "out of hours".

    The real-time update process will leave reporting tables available at all times. I plan on using the staging area again during this process so full table locks can be placed on the staging area as it shouldn't be used by any business process. The reporting database has to be available at all times during "office hours" so any eventual updates must play second fiddle to reports being run.

    That said, the reports should be read-only, with the only writes occurring as part of this update procedure.

    Craig Farrell (5/16/2011)


    4) Are there user-edits on these records on the SQL Server side after delivery? This idea only comes into play if there are no user-side edits to the records being brought in from the foreign system after delivery.

    There shouldn't be. If there are, I would categorise such changes as unsupportable and the Client would have to figure out a way round the problem.

    Craig Farrell (5/16/2011)


    4a) If you are full dumping the data from the source into your staging table, what would be the outcome of literally swapping the staging and primary table each time? Is it a feasible idea with a minimum of work to be done to the staging table?

    Unfortunately not; the source database design is quite archaic and not very representative of the "user experience". The reporting database is designed to represent the user experience in an effort to make reporting from it more logical and intuitive. Each reporting table is either a breakout from one very large table or an amalgamation of data from a multitude of staging area tables.

    Craig Farrell (5/16/2011)


    5) I assume you have some form of key to uniquely identify each record at a reasonable speed?

    Yes, that's right, both in the staging area and the reporting database.

    Craig Farrell (5/16/2011)


    6) Have you looked into normalizing your inbound data to try to have less data transience? I'm thinking Star Schema type data manipulations here.

    It is something I would like to do, but the process is more ELT than ETL, and working with the data in text files is something that could take quite some time to develop. I think the amount of data that could be cast aside would not provide enough benefit for the time taken to develop such a step in the process.

    Or are you thinking of just importing a day's delta? Only a handful of tables have some kind of timestamp.

  • Lowell (5/16/2011)


    dumb question time: can you do the ETL to a staging database, and then use MERGE or other methods to push out the delta changes? Or is it 20 distinct ETLs, one for each client?

    The thing with "dumb questions" is they may be dumb to one party but to the other can still get you thinking...

    To be honest the MERGE statement (having just looked it up) isn't something I am familiar with. It appears to be 2008 only. I will certainly have a think about this to flesh the concept out a bit.

    After reading the process steps and answers I have detailed above do you still think there is a place for the MERGE statement, especially considering the amount of data almagamation / breakouts that are taking place?

    Are you thinking of the MERGE statement for the ETL process or for the audited changes being process real-time(ish)?

    To answer your question though: There is only one ETL process for all Clients. The resulting reporting database will have a different structure for each Client depending on how they have developed their "applications" within the software bu the process of creating the reporting database is consistent.

    Don't panic: MERGE is just a fancy way to do what you've already done a zillion times, but as separate steps: INSERT new rows, UPDATE matching rows, DELETE rows that were pruned away.

    Just a nice new way to handle those operations together; a shiny new tool for us geeks, as it were.

    The real question is, if you compared yesterday's data to today's data, how many rows are new versus how many changed?

    Can you post a table structure example that shows us the primary key?

    With the PK, you should (in theory, anyway) be able to compare old vs. new to identify the changes.

    Once that is done, you can evaluate just how many rows are REALLY changing vs. the total load.
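    for what it's worth, a rough MERGE sketch against invented table names looks like this (2008 only; on 2005 the equivalent is a separate UPDATE, INSERT and DELETE keyed on the same PK comparison):

    -- Rough sketch only: invented names.  Columns that allow NULL need extra care in the
    -- WHEN MATCHED comparison.
    MERGE dbo.ReportTable AS tgt
    USING stg.SourceData  AS src
          ON tgt.RecordId = src.RecordId               -- the primary key comparison
    WHEN MATCHED AND tgt.SomeCol <> src.SomeCol        -- only rows that actually changed
        THEN UPDATE SET tgt.SomeCol = src.SomeCol
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (RecordId, SomeCol) VALUES (src.RecordId, src.SomeCol)
    WHEN NOT MATCHED BY SOURCE
        THEN DELETE;                                   -- rows pruned from the source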

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement for that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!
