July 12, 2019 at 6:36 pm
Let me preface this by saying I'm relatively new to SSIS, somewhat new to SQL Server, but not new to database work in general -- most of my experience is with PostgreSQL and MySQL. (And the only thing I ever want to do with MySQL again is migrate off it!)
That said, I have inherited an ETL process that I'm updating to do incremental loading. The destination is really a reporting server not a true data warehouse, so the transforms involved are not necessarily what would be used for a star or snowflake schema. Ultimately, it will be fed from multiple data sources, but for now, I'm just working with one.
In the step I'm currently working on, each source data row has (among other columns) a start date, an end date, and columns for week1-week14. The goal is to unroll this so there is one row for each week from start to end (can be 1 week up to 14), with the value from the matching weekN column.
The current, non-incremental process makes extensive use of staging tables and hand-written SQL in Execute SQL tasks; the SQL Tasks are badly formatted, hard to read, hard to maintain, and just generally complex. I've been switching to using Data Flow components, with Lookups for the incremental load, and up to this step it's been as fast or faster -- we're not dealing with huge amounts of data here; a couple hundred K rows or less for the most part, so avoiding all the disk IO makes up for a lot.
I've written a Script Task to do the date range expansion, but it's unacceptably slow. Even though it's processing less than 200K rows, it's taking about 7 hours(!!) to complete. I've had a coworker look over the code, and tried a few things to speed it up a bit (like switching from dynamically building the output row based on the columns in the input row to specifying each column individually), but with no luck so far. I've attached the two non-boilerplate functions as they currently exist, but I think the answer is more likely to be "Do [X] instead of the script task" -- I just have no idea what [X] might be, other than switching back to staging tables and Execute SQL tasks, which I'd hoped to avoid.
Any other ideas, or is that my best bet?
July 12, 2019 at 6:44 pm
From what you are saying, it seems the main issue is the unpivoting of the data.
Here is a post about using the unpivot component. This may help get you started.
July 12, 2019 at 7:03 pm
I was resistant to using staging tables at first, but later realized they can be the right tool at times. I have approached similar tasks by loading a staging table using a data flow, then doing the unpivoting in SQL. We're on 2008 R2, so I unpivoted "manually," but it's still fairly fast.
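For what it's worth, the "manual" unpivot on 2008 R2 doesn't need a cursor -- CROSS APPLY over a VALUES list works from 2008 on. A rough sketch only, with hypothetical table and column names (stg.SourceRows, RowID, StartDate, EndDate, Week1-Week14), since we haven't seen the real DDL:

```sql
-- stg.SourceRows and its columns are placeholders for the real staging table.
SELECT  s.RowID,
        w.WeekNum,
        w.WeekValue
FROM    stg.SourceRows AS s
CROSS APPLY (VALUES
    (1,  s.Week1),  (2,  s.Week2),  (3,  s.Week3),  (4,  s.Week4),
    (5,  s.Week5),  (6,  s.Week6),  (7,  s.Week7),  (8,  s.Week8),
    (9,  s.Week9),  (10, s.Week10), (11, s.Week11), (12, s.Week12),
    (13, s.Week13), (14, s.Week14)
) AS w (WeekNum, WeekValue)
-- Keep only the weeks that fall inside each row's start/end range.
WHERE   w.WeekNum <= DATEDIFF(WEEK, s.StartDate, s.EndDate) + 1;
```

The WHERE clause assumes weeks are counted from the start date; adjust the week arithmetic to however your weeks are actually defined.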
July 12, 2019 at 8:18 pm
You might find that you can expand the date ranges very rapidly in T-SQL, by joining to an appropriate calendar table or numbers table (it's hard to be sure what will work best without first seeing your source data and desired results).
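To illustrate the numbers-table idea -- just a sketch with hypothetical table and column names (dbo.Numbers as a tally table, stg.SourceRows as the source), not your actual schema:

```sql
-- dbo.Numbers is a tally table (Number = 1, 2, 3, ...).
-- stg.SourceRows and its columns are placeholders for the real source.
SELECT  s.RowID,
        n.Number                                 AS WeekNum,
        DATEADD(WEEK, n.Number - 1, s.StartDate) AS WeekStart
FROM    stg.SourceRows AS s
JOIN    dbo.Numbers    AS n
        ON n.Number BETWEEN 1
           AND DATEDIFF(WEEK, s.StartDate, s.EndDate) + 1;
```

One caveat: DATEDIFF(WEEK, ...) counts week-boundary crossings, not elapsed 7-day periods, so depending on how your weeks are defined you may want DATEDIFF(DAY, ...) / 7 instead.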
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
July 12, 2019 at 8:32 pm
@alphatangowhiskey -- Ah! I missed that one, hiding down at the bottom of the Other Transforms list! Thanks; I'll see if I can get what I need from it.
@Phil Parkin -- I do have a calendar table available, but since I can't find any way to specify a BETWEEN condition for a join in SSIS, I can't see how to use it without putting the data in a staging table and writing an Execute SQL task.
@palandri -- I'm not ruling out that option, but if I can work without it, I'd like to, mostly because I'm not using staging tables anywhere else in the process so far.
July 12, 2019 at 8:49 pm
Was also going to suggest the calendar table as a way to expand dates. I have a similar process where I need to get a date range down to 1 row per minute so I can do rolling aggs (deltas) by minute, and it's super fast.
But yeah, forget trying a BETWEEN join in SSIS -- do it in the source component's query of a separate DFT and stage the result.
July 12, 2019 at 8:52 pm
The staging table / stored procedure route is the way I would go. T-SQL will likely blow away anything you can code in SSIS to do this.
July 12, 2019 at 9:00 pm
Totally agree -- I mentioned it since they seemed to want to get away from the SQL tasks, but calling a sproc is minor.
I always tell the devs who want a set pattern for consistency that the best approach is to use whatever works best.
July 12, 2019 at 9:33 pm
Let's just say I'm using this as a learning experience and want to try different ways, even if I end up going back to staging tables in the end.
Also, seriously, what I inherited is _ugly_. The Execute SQL task in the original is using three staging tables, updating two destination tables, and using a cursor for part of the unpivot...all in a proportional font, because SSIS is dumb like that, and with inconsistent spacing and capitalization. The formatting is fixable, of course, but it's not something I want to try to maintain...or something I'd be proud to pass on to my successor. Is it any wonder I'm liking neat little boxes and arrows when possible?
July 12, 2019 at 9:38 pm
I use Aqua Data or SSMS to develop SQL code, and view the Enter SQL Query window in the SQL Task Editor strictly as a paste destination. Unless the change is extremely trivial, I copy the code back to Aqua Data to make changes.
July 12, 2019 at 9:45 pm
Oh, that does sound ugly -- and a cursor? *looks around for Jeff* Cursors aren't so bad 🙂
Yeah, I feel your pain. I rarely look at or edit T-SQL in the stupid little SQL task editor; I always put it in an SSMS query window. Why they couldn't use a fixed-width font is beyond me.
When a SQL statement outgrows the editor, I usually consider a sproc instead. I generally use SQL Tasks only for simple stuff like truncating tables or calling sprocs.
Edit: before I get roasted -- the cursor thing was a joke.
July 15, 2019 at 5:56 pm
Yeah, when I have to work with the long, ugly queries, I usually pop them into either gVim or SSMS, but it's an extra step, and it doesn't change the fundamental flaws in the design (of the ETL, but also of the SSIS query "editor").
July 29, 2019 at 9:35 pm
This update is primarily for anyone who comes across this in the future, with a similar question.
For my case, Unpivot was slower than the script component, because I was able to do some short-circuiting in the script that was not possible in the Unpivot component.
With some additional tweaks upstream, I've taken it from 7.5-8 hours down to a bit over 2 hours, but I need it to run in minutes, so now I'll be building something with a staging table for the bit that needs to do a range join.
It's been an interesting learning experience!
August 15, 2019 at 11:38 pm
Another update:
With some additional tweaks and improvements, the version that's using a staging table is running in about 10 hours, and the version that's using a script is running in about 4.5 hours. Not sure why the staging table version is taking so long -- it's something in the load-from-staging half of the process, but nothing I'm doing there seems all that complicated to me. So...I keep working.
August 15, 2019 at 11:54 pm
It would be interesting to know more about the process which is taking all the time.