June 5, 2020 at 4:35 pm
Hello experts,
I was hoping to get an idea of some common practices for storing SSIS packages (dtsx and/or ispac) in source control, specifically git.
Our SSIS packages tend to be large, but not overly complicated. Some are simple things like pulling a table from live and pushing it to the datamart, no transform or anything fancy, just a 1:1 data pull. Others involve a lot of steps and are high CPU/memory packages doing a lot of transforms and lookups.
So the questions that I have are
1 - What are some common practices when building SSIS packages? Is it more common to have a LOT of small, specific packages (like package A pulls data from table A to the datamart, package B pulls data from table B to the datamart, and package C merges those two data sets in the datamart to be consumed in a report), OR is it more common to have a few more complex packages (like package A pulls data from table A and table B into the datamart and then merges the two data sets to be consumed in a report)? Our approach has been more the second option, but this results in some tables being refreshed multiple times per ETL run, which is why the first option seems to make more sense despite feeling like a lot more packages.
2 - How do you organize this in source control? Do you have a single project in source control that holds all the DTSX files? Do you have multiple projects in git, such as SSIS_Staging, SSIS_ETL_Finance, SSIS_ETL_Timesheet, SSIS_ETL_HR, etc.? If you have a single project in source control, do you have a bunch of folders in there for it?
3 - How do you break up the folders/projects from question 2? Would you break it up like SSIS_Staging_AdventureWorks for all the staging tables that are coming from AdventureWorks, or would you not have a specific "staging" one and instead group it by the process or the area the data comes from (such as HR)? And if you do name it per process, how do you handle cases where the data from that ETL may be used in multiple processes? One example: a table in the datamart that contains products and their quantities (i.e., inventory data) but also contains the price we pay to buy and build them as well as the price they are sold for (i.e., financial data).
We have been doing it very ad hoc and have gotten into a state where our developers and DBAs are unable to easily and reliably find packages in source control. We are looking to redo the source control setup we have for our SSIS packages and are hoping to get advice from experts on it.
Also, do you store both the ISPAC and the DTSX or just the DTSX files?
Thanks for any and all input!
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
June 5, 2020 at 11:15 pm
Brian,
Regarding how to break up logic into individual packages, I'm a fan of smaller SSIS packages that each do just one unit of work. This design makes development, testing, and troubleshooting much easier. I blogged about this topic a while back: https://www.timmitchell.net/etl-atomicity/
As to separating packages into projects, I usually keep each subject area of data in its own Visual Studio solution, and then break down the parts of that load into separate projects in that solution. For example, there might be an HR solution containing two projects: HR Staging and HR Production. Because the staging load and production load would likely use different source and destination connections, it makes sense to break those apart into different projects. Keeping the individual projects separate also makes the deployment process easier (assuming you are deploying to the SSIS catalog, which uses a whole-project deployment by default) by minimizing the number of moving parts that are touched with each change.
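As a rough sketch, that layout might look something like this (the solution, project, and package names here are just examples):

    HR.sln
        HR_Staging.dtproj          (connections: source systems -> staging DB)
            Project.params
            StageEmployees.dtsx
            StageTimesheets.dtsx
        HR_Production.dtproj       (connections: staging DB -> warehouse)
            Project.params
            LoadDimEmployee.dtsx
            LoadFactTimesheet.dtsx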
For the assets to put into source control - I just source control the entire project folder. By default the build output (the ISPAC files) will end up in there too, but I consider the project source code (.dtsx packages, parameter files, project connection files, etc.) to be the assets I actually care about in source control.
I hope this helps.
~tim
Tim Mitchell, Microsoft Data Platform MVP
Data Warehouse and ETL Consultant
TimMitchell.net | @Tim_Mitchell | Tyleris.com
ETL Best Practices
June 8, 2020 at 1:27 pm
My own practices are mostly in line with Tim's.
But don't put ISPACs in source control, for the same reasons you don't put EXEs in source control ... it's a waste of time and space.
Getting the balance right with how you break your packages into projects is an art form. If you have too few projects, you may find that:
- every deployment pushes out a large number of unrelated packages (the catalog deploys a whole project at a time), and
- multiple developers end up working in the same project, increasing the chance of source control conflicts.
Whereas if you have too many projects:
- project-level connection managers and parameters get duplicated across them, and
- deployment and administration overhead multiplies.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
June 8, 2020 at 4:56 pm
Thanks Tim and Phil!
It is nice to hear that our setup isn't anything out of the ordinary and that other experts have run into the same issues with it that we have.
Our plan was not to store the ISPACs, but I wasn't sure if they were like EXEs. I know I generally exclude the entire BIN and OBJ folders for C# applications, and will likely do the same with this.
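Something like the following is what I have in mind for the .gitignore - treat it as a first draft rather than a verified list:

    # SSIS / Visual Studio build output (includes the generated .ispac)
    bin/
    obj/
    # per-user Visual Studio state
    *.dtproj.user
    *.user
    *.suo
    .vs/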
I think our biggest challenge is making sure that we name things properly so the packages are easy to find. One problem we currently have is that a package may be in any one of 4 git projects, and they are not organized well, so we don't know where to find things. We just want to redo things so it is all in a nicely organized set of git projects.
One challenge we have is that some of the process groups end up with very few packages while others have a lot. A Month End process, for example, will have a lot of packages while others may have hardly any. And we will end up with some that overlap: if we need to keep any staging tables for month-end processes, they will likely contain some of the same tables used by our nightly refresh and hourly refresh. Mind you, we are working on changing the "hourly refresh" to just pull live data, as it is mostly used to generate near-real-time reports where end users requested real-time data.
I think we have a lot of work ahead of me (as a team... not all on me), but it is good to hear that my preliminary thoughts are not too far out to lunch. Keeping the DTSX files simple and single-purpose (such as copying data from live to staging with data sanitization where applicable), and finding a good balance for projects, is what we need to do. Here I think our best bet is to grab all of the DTSX files and dump them into a single folder, then meet with the DBA team (the team who creates and maintains these) and figure out how we can sort them in a manner that makes the most sense for our team.
Currently, we are not using any project-level parameters that I am aware of. This is because we currently tend to group similar data moves into a single DTSX, which results in us having not that many DTSX files, but complex ones that are difficult to maintain, and sometimes multiple DTSX files that populate the same staging table. That needs to get fixed, and that is where having a set of "Staging" DTSX files is a much nicer process in my mind. You can refresh your staging tables in the order you see fit, and refreshing a staging table doesn't impact the reporting tables.
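As I understand it (we haven't actually done this yet, so the folder, project, and parameter names below are hypothetical), a project-level parameter such as a staging connection string could then be overridden per environment after deployment with something like:

    EXEC SSISDB.catalog.set_object_parameter_value
        @object_type     = 20,              -- 20 = project-level parameter
        @folder_name     = N'Finance',      -- hypothetical catalog folder
        @project_name    = N'Finance_Staging',
        @parameter_name  = N'StagingConnectionString',
        @parameter_value = N'Data Source=DMART01;Initial Catalog=Staging;Integrated Security=SSPI;';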
Again, thanks for the input! It is appreciated!
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
June 8, 2020 at 7:52 pm
Our plan was not to store the ISPACs, but I wasn't sure if they were like EXEs. I know I generally exclude the entire BIN and OBJ folders for C# applications, and will likely do the same with this.
Agree! No need to source control these build artifacts.
Regarding package naming: I add a two-character prefix to package names which corresponds with the project they are in. Initially, I did this because I needed package names to be unique right across the instance (my home-grown ETL job-execution framework requires this), and I've grown to like it. It removes all confusion when referencing package names.
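If you deploy to the SSIS catalog, a quick way to verify that uniqueness is a query along these lines against SSISDB:

    -- list any package names that appear more than once in the catalog
    SELECT p.name, COUNT(*) AS copies
    FROM SSISDB.catalog.packages AS p
    GROUP BY p.name
    HAVING COUNT(*) > 1;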
Sounds like you've got some fun refactoring ahead of you. Best of luck.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
June 8, 2020 at 8:33 pm
We have a set of SSIS packages for a data warehouse ETL. All of the dtsx packages are in one solution (one project), and the entire project is checked into source control (currently TFS). There is one dtsx for each table transform, plus two additional packages: one master package for the import to staging, which calls each staging-table package in turn, and another master package that calls each transform package for inserting/updating the Dim/Fact/Bridge tables.
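If that project is deployed to the SSIS catalog, the staging master package can be kicked off with something like this (the folder, project, and package names below are made up for illustration):

    DECLARE @execution_id BIGINT;
    EXEC SSISDB.catalog.create_execution
        @folder_name  = N'DW',
        @project_name = N'WarehouseETL',
        @package_name = N'Master_LoadStaging.dtsx',
        @execution_id = @execution_id OUTPUT;
    EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;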