April 26, 2021 at 4:05 pm
I've come to find myself inheriting a loose "data warehouse" approach of sorts. I'm quite new to data warehousing concepts, ETL, etc, but I have a feeling some of the things we're doing are less than optimal, so I'd appreciate any feedback on what we should be doing differently. We're a low-to-mid size company that has started by building a single Power BI report that only relies on data from a single ERP database, but we're looking to expand our data warehousing over time.
Here is the current setup:
My biggest questions:
Thank you!
April 27, 2021 at 4:10 pm
Thanks for posting your issue and hopefully someone will answer soon.
This is an automated bump to increase visibility of your question.
April 27, 2021 at 5:19 pm
Just my opinion, but do you need ALL of the ERP data?
The reason I ask is if you aren't needing all 130 GB of data, it MAY be faster to use SSIS to just pull across the data you require. SSIS can handle data transformations as well, so it MAY be able to handle the entire process for you. My preferred approach is if the source table is required by more than 1 transformation, the source table is copied 1:1 from source to destination with a few extra columns added in (package name that copied the data over and the timestamp that the package ran. This is to make troubleshooting easier). If the table is used by ONLY 1 transformation, then keep it in memory and off disk (as disk is USUALLY the slowest operation in the chain).
Two problems that come with SSIS are licensing and memory. Licensing is an issue because you NEED a SQL Server license on the server that SSIS is installed on. What I mean is that if Server A is where you are hosting your main SQL instances and Server B is where your data warehouse is and that is all you have for servers with SQL licenses, SSIS needs to be installed on Server A or Server B (doesn't need to be on both) OR you need to purchase another SQL Server license to install SSIS on Server C.
With memory - SSIS operates OUTSIDE of SQL Server, so it needs its own memory. What I mean is if you have Server A configured to use 90% of the memory for SQL Server and you install SSIS on there, when you run an SSIS package, it will request memory outside of that 90% and MAY end up paging to disk which will make your SSIS package performance take a HUGE hit. So you need to make sure SSIS has enough memory to do its magic.
One other issue that CAN come from SSIS is package sprawl. What I mean is that a simple SSIS package becomes more and more complex as business logic increases and eventually you get to a point where the SSIS package "just works" and is a huge mess that nobody wants to touch. If you do go with SSIS, I encourage you to keep SSIS packages simple. If it starts to look complex or scary, look at what can be broken out into a second SSIS package. Keeping them simple also decreases support time. If an SSIS package has 10 steps and runs in 1 minute, testing it takes 1 minute. If the package has 1000 steps and runs in 1 hour, testing takes an hour. The more simple the package is and the faster it can run, the faster and easier it is to test and debug.
There are also a lot of tips and tricks related to SSIS that you can do. My 2 big ones are:
1 - avoid "SELECT *" like the plague as it will almost always bite you in the end.
2 - Data sources, where possible, use TSQL for the data pulls rather than from a table or view.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
April 27, 2021 at 6:13 pm
Mr. Brian Gale,
Your feedback is much appreciated, this is some awesome info on SSIS, a tool I've had limited exposure to, but I've seen it come up on some ETL-related searches. I'll definitely look into it more.
Being able to restore the entire DB does save future effort in case we want to bring in more information later, and is simpler for my current backup-restore approach, but if I were to switch to SSIS, I imagine you're right in saying that I would need to be careful in only bringing over data that is needed.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply