Staging Database Question

  • We are working on a new ETL process using SSIS 2012. As a first step, we want to extract data from a source SQL Server database into a RAW staging database. That means the staging database will be an exact replica of the source. What's the best way of doing this? Should we even create an SSIS package for this? The source database already contains historical data, so we don't need the staging database to be cumulative.

  • imani_technology (6/13/2014)


    We are working on a new ETL process using SSIS 2012. As a first step, we want to extract data from a source SQL Server database into a RAW staging database.

    RAW in SSIS means something different, so I'm glad you clarified.

    If this is a one-shot database creation, just backup/restore the database at the target.

    If this is a recurring update process, I'd still do that. It's the fastest way, under most circumstances, to bring a copy of a database up somewhere. It has the added benefit of letting you check your backups as an administrative task, too.

    For complete database copies, I would rarely, if ever, use SSIS. It's more of an administrative task. You should be able to construct a simple enough script, kept in master or a DBA_Tools database, to copy the backup file from the storage location to local disk, drop the old db, restore it, and clean up the security data. Put the rest of your processes in job steps after the restore step so they wait for the restore to complete.
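    Something along these lines would do it (a rough sketch; the database name, file paths, and logical file names here are made up, so substitute your own):

        -- Drop the old staging copy if it exists.
        IF DB_ID(N'StagingDB') IS NOT NULL
        BEGIN
            ALTER DATABASE StagingDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
            DROP DATABASE StagingDB;
        END;

        -- Restore the copied backup under the staging name.
        RESTORE DATABASE StagingDB
        FROM DISK = N'D:\LocalStage\SourceDB.bak'
        WITH MOVE N'SourceDB'     TO N'D:\Data\StagingDB.mdf',
             MOVE N'SourceDB_log' TO N'D:\Logs\StagingDB.ldf',
             REPLACE, RECOVERY;

        -- Clean up the security data afterwards, e.g. remap orphaned users:
        -- ALTER USER SomeUser WITH LOGIN = SomeLogin;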


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Thanks for the reply. I think we will need to do the backup/restore once or twice a day. What's the best way to automate this?

  • imani_technology (6/16/2014)


    Thanks for the reply. I think we will need to do the backup/restore once or twice a day. What's the best way to automate this?

    Via a SQL Agent job, which can also, of course, run your SSIS packages afterwards.
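    As a rough sketch, something like this (the job name, the DBA_Tools.dbo.usp_RefreshStagingDB wrapper procedure, and the SSIS catalog path are all placeholders for whatever you actually build):

        USE msdb;

        -- Create the job with a restore step followed by an SSIS step.
        EXEC dbo.sp_add_job @job_name = N'Refresh StagingDB';

        EXEC dbo.sp_add_jobstep
            @job_name  = N'Refresh StagingDB',
            @step_name = N'Restore staging database',
            @subsystem = N'TSQL',
            @command   = N'EXEC DBA_Tools.dbo.usp_RefreshStagingDB;';

        EXEC dbo.sp_add_jobstep
            @job_name  = N'Refresh StagingDB',
            @step_name = N'Run ETL packages',
            @subsystem = N'SSIS',
            @command   = N'/ISSERVER "\SSISDB\ETL\MyProject\Load.dtsx" /SERVER "(local)"';

        -- Run every 12 hours, starting at 06:00.
        EXEC dbo.sp_add_schedule
            @schedule_name        = N'Twice daily',
            @freq_type            = 4,  -- daily
            @freq_interval        = 1,
            @freq_subday_type     = 8,  -- repeat every n hours
            @freq_subday_interval = 12,
            @active_start_time    = 60000;

        EXEC dbo.sp_attach_schedule @job_name = N'Refresh StagingDB', @schedule_name = N'Twice daily';
        EXEC dbo.sp_add_jobserver   @job_name = N'Refresh StagingDB';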

    How big is the database?

    As Craig already suggested, please don't use the word Raw when talking about SSIS - you will confuse people. An SSIS Raw File is a native performance-optimised transient storage format. Or something.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Total database size is 553.31 MB.

  • Then the backup/restore idea will be fine.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Okay, so should I rely strictly on SQL Agent Jobs or should I use SSIS? There will be many data sources for the data warehouse, and this particular source might be the only one that needs a basic backup/restore. We will also have FTP sources, Excel spreadsheet sources, etc. Each type of source will get a separate staging database. So, do I use straight SQL Agent for the one database source and SSIS for the others, or use various SSIS packages for all?

  • Why do you want to have separate staging databases for each source? That seems like overkill to me. Remember that SQL Agent is just used for running scheduled or ad-hoc jobs; the jobs themselves can be T-SQL queries, stored procedures, cmdshell commands, SSIS packages, etc., so it's not a choice between SSIS and SQL Agent. Integration Services packages can import data from a variety of sources, including all the ones you mentioned. Typically you would develop a package for each source, test it, and once you're happy with it, automate its execution via SQL Agent.

    EDIT: Just to make it clear, I'm not advocating that you use SSIS for backup/restore!

    Regards

    Lempster

  • I was considering having a different staging database for each source because the boss wants this one staging database to be an exact replica of this one source. However, there will be other sources that have nothing to do with this one database source. Is there a better way to do this?

  • This just went off the deep end from the original question.

    Paraphrased, first question: "How do I snapshot replicate a database on another server the quickest and most effective way repeatedly?"

    With more description: "I'm setting up a monstrous data warehouse under arbitrary rules set by my manager. Can you give me your method to best make that work?"

    I misunderstood what you were looking to do at first. Bringing the source over to another DB so your current reports and some data analysis don't interfere with your OLTP system is one thing. Using it as one of multiple source DBs inside some spider-like data warehouse is another. The second requires an architect to be done well, if ever.

    I'd like to back this conversation up a bit, if you don't mind. Can you give us the 100,000-foot view of the final result you're intending to integrate this into? There are too many assumptions to be made from the smaller bits we've gotten so far. It may be beyond this forum to really get you final answers, but I'm relatively sure the good folks around here will be able to offer enough advice to get you on the right paths.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I totally agree with you. Let me start from the beginning.

    We are working on a data warehouse. The data warehouse will have a star schema that has not been designed yet. Despite the fact that the star schema has not been designed yet, the boss wants me to start on the ETL process using SSIS. There are several sources of the data that will ultimately go into the star schema data warehouse. One is a normalized database, one is a set of files sent via FTP, one is a set of Excel .xlsx files (I think). I need to be able to: 1) extract these different sources into staging database(s), 2) transform and load the staging data into the destination star schema data warehouse.

    Also, the boss wants me to have a unified logging system, hopefully one simple enough for a project manager to look at and see what packages ran, what happened, etc.

    Finally, the goal is to have the data in the staging database(s) look as much like their sources as possible.

    This is my life...

    :rolleyes:

  • imani_technology (6/18/2014)


    There are several sources of the data that will ultimately go into the star schema data warehouse.

    What is the expectation for you moving the data twice? That's really the question here. I can't see any reason to stage most of this in some default format.

    If you stage data from your SSIS components (flatfiles/excel/etc) it's usually so you can do final passes on the data for aggregation and/or sorting (where SSIS fails, horribly, compared to the SQL engine) or so you don't have to single row update the warehouse dimensions... in which case you want all your transforms done in the staging so you're not trying to transform during the update to lower downtime.

    One is a normalized database, one is a set of files sent via FTP, one is a set of Excel .xlsx files (I think). I need to be able to: 1) extract these different sources into staging database(s), 2) transform and load the staging data into the destination star schema data warehouse.

    Again, this seems odd to me. Why would you not do this in one shot? This seems like a LOT of unnecessary disk churn from my outside perspective.

    Also, the boss wants me to have a unified logging system, hopefully one simple enough to a project manager to look at to see what packages ran, what happened, etc.

    *rolls eyes* Yes, and I'd like four hot administrative assistants with various degrees in massage and bartending along with my own personal aeroplane for my daily happy hour in Tijuana.

    You can get a simple logging report going by including a table somewhere and having an INSERT run at the tail of your jobs as the final step: "Loader xyz ran packages a, b, and c", plus another column for the timestamp (a sketch follows below). The PM doesn't need to see errors and the like; he can't fix them anyway. The logging system on the jobs will be able to cover your error reporting should one fail. The problem with redirecting a failure to another job step is that the job never actually reports the failure unless you get a little crazy with the organization of it (such as a fail-on-success command for the failure insertion, which can be done).
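    A minimal version of that idea (the table and column names are just placeholders):

        CREATE TABLE dbo.EtlRunLog
        (
            RunLogId   int IDENTITY(1, 1) PRIMARY KEY,
            LoaderName varchar(100) NOT NULL,
            Detail     varchar(500) NOT NULL,
            RunAt      datetime NOT NULL DEFAULT (GETDATE())
        );

        -- Final step of each job:
        INSERT dbo.EtlRunLog (LoaderName, Detail)
        VALUES ('Loader xyz', 'Ran packages a, b, and c');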

    At this point, I'd ask your boss to define exactly what he's hoping to have logged in this, and just how much extra work he wants to do on every piece that he includes into this in the future. A few simple commands that a process succeeded is one thing, integrating row count reporting is another.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Yes, that is a LOT different to the original question! 🙂

    I agree with what Kraig has said, but I would add that jumping into the ETL process as the first step is not a good way to go about tackling a data warehouse project. Grab yourself a copy of The Data Warehouse Toolkit by Kimball et al; it suggests an approach to take, and while you might not follow it to the letter, it will be a big help.

    Generally you don't start designing ETL processes until you've done your dimensional modelling, created a Bus Matrix, worked out whether attributes are slowly changing, and completed a number of other tasks.

    With regard to logging, if you're intending to use SQL Server 2012 or above, there's a lot of logging provided out of the box for Integration Services, but you can also create an Audit dimension for...er....auditing. Again, the book goes into detail with regard to that.
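    For instance (assuming the project deployment model, so packages run from the SSISDB catalog), a query like this shows what ran and how it finished:

        SELECT  execution_id,
                folder_name,
                project_name,
                package_name,
                status,      -- 1 = created, 2 = running, 4 = failed, 7 = succeeded
                start_time,
                end_time
        FROM    SSISDB.catalog.executions
        ORDER BY start_time DESC;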

    Regards

    Lempster

  • Evil Kraig F (6/18/2014)

    If you stage data from your SSIS components (flatfiles/excel/etc) it's usually so you can do final passes on the data for aggregation and/or sorting (where SSIS fails, horribly, compared to the SQL engine) or so you don't have to single row update the warehouse dimensions... in which case you want all your transforms done in the staging so you're not trying to transform during the update to lower downtime.

    Another reason for staging is that, in the event of a failure further down the chain, you don't have to go back to the source system(s) to extract the data again, so avoiding placing unnecessary pressure on your OLTP system.

    EDIT: spelling mistakes.

    Regards

    Lempster

  • I agree with all of your points. I have done the Kimball method before. My boss is very technical and has substantial experience with the Microsoft stack. However, he seems to be new to data warehousing and dimensional modeling. I think he has recently taken the Kimball class.

    ETL is my responsibility and the data warehouse design is his responsibility. I think he wants me to start being productive while he works on the star schema.

    Anyway, the only advantage I see to the backup/restore-to-another-location approach is in the event of a failure: I wouldn't have to go back to the relational source database, which is located on another domain.

    Question: is this ENOUGH of an advantage to justify doing the backup/restore?
