December 5, 2019 at 11:44 am
Hi, I am wondering what people's thoughts are on "pre-staging" data in the ETL process.
I have only ever used STAGE - MAIN,
but a colleague is a fan of extracting raw, unaltered data into a PRESTAGE table before any transformation takes place, rather than (partially) transforming the data at the point of extraction from the OLTP system.
What are the pros and cons of this?
Cheers,
Dave
December 5, 2019 at 12:07 pm
Well, the biggest pro I'd say is that you can track down issues in the data more easily when a transformation fails because someone was, once again, too stupid to put the correct values in the right places in that "Manual Corrections" CSV ...
December 5, 2019 at 12:12 pm
I'm a fan of prestage,
purely because people who send you data "DO NOT KNOW WHAT THEY ARE DOING" 🙂
Date format is normally the killer, and if the data comes from Excel then you have no idea what your ETL system will hit.
I suppose the benefit of prestage is that you can meddle with it and decide whether you want to take the good rows and ignore the bad, or cancel the whole process.
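As a rough sketch of that decision (the table and column names here are made up, not from any real system):

-- good rows move on; TRY_CONVERT returns NULL instead of erroring on garbage
INSERT INTO dbo.StageOrders (OrderID, OrderDate, Amount)
SELECT TRY_CONVERT(int, OrderID),
       TRY_CONVERT(date, OrderDate),
       TRY_CONVERT(decimal(18,2), Amount)
FROM   dbo.PreStageOrders
WHERE  TRY_CONVERT(int, OrderID) IS NOT NULL
  AND  TRY_CONVERT(date, OrderDate) IS NOT NULL;

-- the bad rows stay behind in prestage where you can eyeball or reject them
SELECT *
FROM   dbo.PreStageOrders
WHERE  TRY_CONVERT(int, OrderID) IS NULL
   OR  TRY_CONVERT(date, OrderDate) IS NULL;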
MVDBA
December 5, 2019 at 12:17 pm
Well, the biggest pro I'd say is that you can track down issues in the data more easily when a transformation fails because someone was, once again, too stupid to put the correct values in the right places in that "Manual Corrections" CSV ...
hahahah - we must have been typing at the same time and hating the same thing about CSV … lol
MVDBA
December 5, 2019 at 12:18 pm
I don't necessarily pre-stage (with two separate staging areas), but I do prefer to do no (or minimal) transformations when I bring data into the staging area. All my transformations take place after the data is staged, and that allows me to have an exact copy of what was in the source.
If I don't trust the source (or the ability of SSIS to handle data type differences and conversions), I store all source attributes in my staging tables as text. This ensures that my first step (data extract) never fails.
The risk of doing transformations in flight is that if something goes wrong, you have to go back to the source system to extract the data again.
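As a minimal sketch of what that looks like (the schema, table, and columns are hypothetical):

-- every source attribute lands as text, so the extract step itself can't fail on a conversion
CREATE TABLE stg.CustomerRaw
(
    CustomerID   nvarchar(4000) NULL,
    CustomerName nvarchar(4000) NULL,
    CreatedDate  nvarchar(4000) NULL,
    LoadedAtUtc  datetime2 NOT NULL DEFAULT SYSUTCDATETIME()  -- when we staged it
);

-- type conversions happen afterwards, under my control, in the transform step
SELECT TRY_CONVERT(int, CustomerID)   AS CustomerID,
       CustomerName,
       TRY_CONVERT(date, CreatedDate) AS CreatedDate
FROM   stg.CustomerRaw;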
December 5, 2019 at 12:30 pm
I don't necessarily pre-stage (with two separate staging areas), but I do prefer to do no (or minimal) transformations when I bring data into the staging area. All my transformations take place after the data is staged, and that allows me to have an exact copy of what was in the source.
If I don't trust the source (or the ability of SSIS to handle data type differences and conversions), I store all source attributes in my staging tables as text. This ensures that my first step (data extract) never fails.
The risk of doing transformations in flight is that if something goes wrong, you have to go back to the source system to extract the data again.
This is where you hit the issue of data types... you end up declaring everything as nvarchar(4000), because you never know what you are going to get.
I once had a file come in with a "datecreated" field, and it became clear that the users were just typing free text - purely on the grounds that I got the word "yesterday" in a date field.
I guess the issue is "how much do you trust your source?"
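Just to illustrate, a quick profiling query against an all-text staging table (names are made up) shows you exactly what garbage you're dealing with:

-- list the distinct values that won't convert to a date, e.g. "yesterday"
SELECT datecreated, COUNT(*) AS BadRows
FROM   stg.ImportRaw
WHERE  datecreated IS NOT NULL
  AND  TRY_CONVERT(date, datecreated) IS NULL
GROUP  BY datecreated
ORDER  BY BadRows DESC;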
One of our teams has learned to negotiate with our clients, and we are using ~ (tilde) delimited files - works like a charm.
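Loading one of those is about as simple as it gets (the path and table name are examples):

-- tilde never appears in the data itself, so no quoting headaches
BULK INSERT stg.ImportRaw
FROM 'C:\feeds\client_extract.txt'
WITH (
    FIELDTERMINATOR = '~',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2   -- skip the header row
);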
MVDBA
December 5, 2019 at 1:01 pm
hahahah - we must have been typing at the same time and hating the same thing about CSV … lol
Literally two days ago I was asked why the "Manual Corrections CSV" import on the old system wasn't working as they expected.
It's not the first time we've told them that
a) we're replacing that old IBM Data Manager piece of crap with an SSIS package on a new OS & SQL Server, and
b) unless requested, we're not touching it, even with a poke stick.
And saying "I've been asked" - it was actually my superior (not the guys using this crap) who asked me, because they, well … fear me or something like that, because I'm plain honest. I just pulled up the logs from last night's import, which showed something like 8000 rows imported and 70+ rows dismissed. I provided my superior with a screenshot from the DM log, so any "BUT it MUST be something they did …" was instantly killed off.
For the record: I never ever did anything inside that DM thing except maybe copy & paste SQL statements from there to SSIS. And two days later, after they sorted out their crap, the data suddenly appeared in the DWH - no $#&§, Sherlock …
Things like woeful user input prevent me from blogging about work, because all those examples are so … "perfectly working" … If I blogged about my work I'd have to start with half a page of ranting about what's wrong with users and relationally acceptable data - or at least sane data - before I got to talk about anything useful.
Though don't worry about the RDMA post - no user inputs involved there, and the NICs are already ordered 😉
December 5, 2019 at 1:40 pm
… if I blogged about my work I'd have to start with half a page of ranting about what's wrong with users and relationally acceptable data - or at least sane data - before I got to talk about anything useful.
It's OK to rant - it exposes the issues (in this case with staging data), shows other people the problems they might hit, and gets it off your chest.
Anyway, I'm interested in David's opinion on pre-stage - changed your mind, or sticking with your original plan?
MVDBA
December 5, 2019 at 3:35 pm
I learned very early on that NOTHING should ever go directly from a file (or whatever) to a table, because of the crap factor that everyone mentions. I also use a staging table even for stuff that comes from well-known data vendors that actually do have the reputation (in my eyes) of doing it right all the time. The reason why is that there could be a day when they suddenly don't, and I'm going to need to make sure that doesn't actually stop the whole ball of wax unless it exceeds the thresholds that I specifically state. In any case, I want to be able to isolate the bad lines of data and either repair them, use them to educate (or hammer on) the provider, or both.
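Just as a rough sketch of the threshold idea (the table, column, and the 1% limit are examples, not anyone's production code):

DECLARE @Total int = (SELECT COUNT(*) FROM stg.VendorFeedRaw);
DECLARE @Bad   int = (SELECT COUNT(*)
                      FROM   stg.VendorFeedRaw
                      WHERE  TRY_CONVERT(date, TradeDate) IS NULL);

-- more than 1% bad rows? stop the whole ball of wax
IF @Bad * 100 > @Total
    THROW 50001, 'Bad-row threshold exceeded; aborting the load.', 1;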
To wit, my recommendation is to almost always use a staging table. The exception to the rule can sometimes be where the original table is dropped/replaced or truncated prior to import and then it serves as a dual purpose. Of course, normally such things are also part of a synonym swap so they are truly acting as a staging table until I'm happy with their content and point the "active" synonym at it.
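For the synonym swap, a minimal sketch (the object names are examples only):

-- load into the inactive copy, validate it, then repoint the synonym
IF OBJECT_ID('dbo.ActiveSales', 'SN') IS NOT NULL
    DROP SYNONYM dbo.ActiveSales;
CREATE SYNONYM dbo.ActiveSales FOR dbo.Sales_B;
-- readers querying dbo.ActiveSales now see the new load;
-- dbo.Sales_A is free to be truncated for the next import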
As a bit of a side bar, I never use SSIS for any of this.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2019 at 4:00 pm
I learned very early on that NOTHING should ever go directly from a file (or whatever) to a table, because of the crap factor that everyone mentions. I also use a staging table even for stuff that comes from well-known data vendors that actually do have the reputation (in my eyes) of doing it right all the time. The reason why is that there could be a day when they suddenly don't, and I'm going to need to make sure that doesn't actually stop the whole ball of wax unless it exceeds the thresholds that I specifically state. In any case, I want to be able to isolate the bad lines of data and either repair them, use them to educate (or hammer on) the provider, or both.
To wit, my recommendation is to almost always use a staging table. The exception to the rule can sometimes be where the original table is dropped/replaced or truncated prior to import and then it serves as a dual purpose. Of course, normally such things are also part of a synonym swap so they are truly acting as a staging table until I'm happy with their content and point the "active" synonym at it.
As a bit of a side bar, I never use SSIS for any of this.
My old mentor said that "he always assumed it was his fault during the import process, but be buggered, 99% of the time it's an accountant that can't use Excel".
At the end of the day we protect the import processes (which is why I like pre-staging).
I don't use SSIS myself (I like BULK INSERT and bcp, or OPENQUERY), but when we hand it over to our other teams we have to make it easy... SSDT (formerly BIDS) allows our R developers to visualise the data flow.
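For example, a typical bcp load of one of those tilde files might look like this (the server, database, and file names are made up):

bcp StagingDB.stg.ImportRaw in "C:\feeds\client_extract.txt" -S MYSERVER -T -c -t~ -F2

-T is a trusted connection, -c is character mode, -t~ sets the tilde field terminator, and -F2 skips the header row.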
MVDBA