ETL for 4 Databases into a Single Distributed Database

  • So, I have what I believe will be a massive undertaking. I’ve asked a similar question in a different forum; this topic is related but not the same.

    I have 4 databases that are similar in schema but with some differences, such as Schema X existing in Database A but not in Databases B, C, and D. These 4 DBs need ETL performed on a regular basis; they are production databases and their schemas change a couple of times a month. So, I need to figure out a way to perform ETL on these databases with feedback to admins.

    Requirements: Must remain production ready at all times, meaning the customer will be using the databases while development is happening, for various reasons. Must be able to provide some sort of notification of any ETL changes that require manual intervention.

  • So read the header row from your data, split it, and compare it to your table columns.  If they're not named the same, then you'll need a mapping table to help but it's just going to boil down to a full outer join between the two sets of column names.  Then you can apply whatever business logic you want for what constitutes a change.
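
    Something along these lines would do it (just a rough sketch; the header value and the target table name are made-up examples). STRING_SPLIT needs SQL Server 2016+, so use a splitter function such as DelimitedSplit8K on older versions:

    DECLARE @HeaderRow VARCHAR(8000) = 'CustomerID,CustomerName,Region'; -- header already read from the file

    WITH FileCols AS
    (   -- split the header row into individual column names
        SELECT ColName = LTRIM(RTRIM(value))
          FROM STRING_SPLIT(@HeaderRow, ',')
    )
    ,TableCols AS
    (   -- the columns currently on the target table
        SELECT ColName = name
          FROM sys.columns
         WHERE object_id = OBJECT_ID('dbo.SomeTargetTable') -- made-up table name
    )
    SELECT FileColumn  = f.ColName
          ,TableColumn = t.ColName
      FROM FileCols f
      FULL OUTER JOIN TableCols t
        ON t.ColName = f.ColName
     WHERE f.ColName IS NULL  -- in the table but not in the file
        OR t.ColName IS NULL; -- in the file but not in the table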

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for the response. So basically I can create a .NET application to provide additional functionality, while reading headers and splitting as you’ve suggested. Then I can repeat a similar process for stored procs, etc.? I’m not sure that SSIS will do everything I need, which is why I brought up .NET. These databases are several TB in size, ~10 TB.

  • SQLDude wrote:

    Thank you for the response. So basically I can create a .NET application to provide additional functionality, while reading headers and splitting as you’ve suggested. Then I can repeat a similar process for stored procs, etc.? I’m not sure that SSIS will do everything I need, which is why I brought up .NET. These databases are several TB in size, ~10 TB.

    No .NET application required.  Just use BULK INSERT to read the first row of the files.  And I wouldn't go near SSIS for this.  Splitting the headers you read is easy with a string splitter in T-SQL.  If you want to read file names, you could do that with xp_CmdShell or, if you can't convince people that forbidding its use is a mistake, you could always use xp_DirTree (with all 3 parameters) to get a listing of the files and dump that into a temp table.

    And I'm not sure what size has to do with any of this.  You're not loading 10TB of data in one go, right?
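
    If it helps, here's a rough sketch of that file-handling side (the folder and file names are placeholders only):

    -- 1) Get a file listing with xp_DirTree; the 3rd parameter = 1 makes it return files as well as folders
    IF OBJECT_ID('tempdb..#FileList') IS NOT NULL DROP TABLE #FileList;
    CREATE TABLE #FileList (SubDirectory NVARCHAR(260), Depth INT, IsFile BIT);

    INSERT INTO #FileList (SubDirectory, Depth, IsFile)
    EXEC master.sys.xp_DirTree 'C:\Imports\', 1, 1; -- made-up folder

    -- 2) Read just the header row of one of the files with BULK INSERT
    IF OBJECT_ID('tempdb..#Header') IS NOT NULL DROP TABLE #Header;
    CREATE TABLE #Header (HeaderRow VARCHAR(8000)); -- one wide column picks up the whole first row of a comma-delimited file

    BULK INSERT #Header
    FROM 'C:\Imports\SomeFile.csv'            -- made-up file name
    WITH (LASTROW = 1, ROWTERMINATOR = '\n'); -- stop after the first (header) row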

    --Jeff Moden



  • So, will this work for moving data from SQL 2012 to SQL 2017? Would this need to be set up as a SQL job?

    I’m not trying to be a pain, but my boss is a die-hard developer and codes everything. I’d rather not reinvent the wheel if we don’t have to. As mentioned, I just need to be sure that if manual intervention is required, there’s a notification. Manual intervention would be for changes in stored procs, triggers, and functions. Right now he’s planning on a .NET application with a UI. This will work with data as well, correct?

  • My apologies.  I normally don't associate the term "ETL" with moving data between databases, even when one may be a source and another is a data warehouse.  I normally associate ETL with importing data from non-SQL Server sources such as files.

    However, the same principles can apply with the requirement to check for changing schemas.  Instead of reading the header from a file, you can read schema (column names and datatypes if you need that level of detail) from tables and do just about everything the same way that I suggested with files.
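
    For example, against one of the source databases, something like this would flag new columns, dropped columns, and datatype changes (the linked server and database names are placeholders):

    WITH SourceCols AS
    (
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
          FROM SourceSrv.DatabaseA.INFORMATION_SCHEMA.COLUMNS -- made-up linked server / database
    )
    ,TargetCols AS
    (
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
          FROM INFORMATION_SCHEMA.COLUMNS                     -- the receiving database
    )
    SELECT TableName    = COALESCE(s.TABLE_NAME, t.TABLE_NAME)
          ,SourceColumn = s.COLUMN_NAME
          ,TargetColumn = t.COLUMN_NAME
          ,SourceType   = s.DATA_TYPE
          ,TargetType   = t.DATA_TYPE
      FROM SourceCols s
      FULL OUTER JOIN TargetCols t
        ON t.TABLE_NAME  = s.TABLE_NAME
       AND t.COLUMN_NAME = s.COLUMN_NAME
     WHERE t.COLUMN_NAME IS NULL       -- new column on the source
        OR s.COLUMN_NAME IS NULL       -- column no longer on the source
        OR s.DATA_TYPE <> t.DATA_TYPE; -- datatype changed

    That result set is also what you'd feed into whatever notification you want for the "manual intervention" cases (Database Mail via msdb.dbo.sp_send_dbmail, for example).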

    We do this with about 50 tables each night, reading the data from IBM PowerSystems/DB2 into SQL Server.  We maintain 2 sets of tables on the receiver side.  We load one set while having the other set online.  When the set being loaded has completed, we just repoint some synonyms.  The total downtime is measured in milliseconds and users never see it.

    It's all done with T-SQL... no SSIS needed.  You could certainly use SSIS for such a thing, but it's overkill and brings little else to the table.  It can help you avoid linked servers, but I don't find those to be a problem in the area of performance simply because of the way we use them.  SSIS can have similar performance issues if you don't use it right, so we just stuck with the linked servers.

    We also use this same technique of having duplicate tables and repointing a synonym to the new/active data tables in a large number of places between SQL Servers, as well.
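
    Just as an illustration of what the repointing looks like (the table, column, and synonym names are made up):

    -- users and code always reference dbo.Customer, which is a synonym, never the base tables
    -- while dbo.Customer points at dbo.Customer_A, reload the offline copy, dbo.Customer_B
    TRUNCATE TABLE dbo.Customer_B;

    INSERT INTO dbo.Customer_B (CustomerID, CustomerName)
    SELECT CustomerID, CustomerName
      FROM SourceSrv.DatabaseA.dbo.Customer; -- made-up linked-server source

    -- when the load is complete and validated, swap the synonym;
    -- there is no ALTER SYNONYM, so it's a drop/create pair, and that's the only "downtime"
    BEGIN TRANSACTION;
        DROP SYNONYM dbo.Customer;
        CREATE SYNONYM dbo.Customer FOR dbo.Customer_B;
    COMMIT;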

     

    --Jeff Moden


