April 21, 2020 at 4:43 am
(Edit: Should this be in the SSDT or Integration Services forum?)
Hi,
This post may be a bit like "how long is a piece of string", but any advice you can give is most welcome.
Background:
My source data is about 30 tables. I don't own or control this data. IMO the data modelling is poor: it's halfway between a data warehouse and a data mart; many tables are “half flattened”; many tables don't have a primary key; data is repeated across tables, yet many tables still need to be joined for typical downstream use; indexing is poor; and sometimes the schema itself is poor (e.g. float or numeric for integer data; dates, ages, etc. stored as varchar). All tables are SCD2.
The source data is in two databases, my target database is a third database, all on the same server.
I have to take this data and create a flattened, very wide (~ 400 columns) table based on the current records in the SCD2 tables.
Here is my high level approach:
• I've created dbo.<tablename> synonyms to the source tables in the other two databases. There are no naming clashes between the source table names and the tables in the target database.
• I've created dbo.vw<tablename> views using the synonyms. These are just an abstraction layer between the source tables and the calling code. They are essentially SELECT * with the column list written out explicitly, with the columns reordered so the natural keys are on the left; the SCD2 dates, current record flag, and record uuid are on the right; and all other columns are sorted alphabetically. The exception is that "matching" dates sit next to each other, e.g. episode start date, episode start time, episode end date, episode end time (these aren't in alphabetical order). Some of the source tables are wide, and this column order works for me. I've cast some columns to the correct data type - these casts are minimal. Occasionally I'll rename a column if it is the same data value with a different name in two separate tables. I use these views in downstream code, rather than the table names (or synonyms) themselves.
• I've created cur.vw<tablename> views with WHERE current_record='Y'. The source is the view above. Filtering on current_record is used ALL THE TIME in downstream ETL.
• I've created dup.vw<tablename> views. Unfortunately our upstream ETL is "flawed", and there can be more than one row with current_record='Y'. This causes Cartesian problems on downstream joins. These views also use the dbo.vw<tablename> views instead of the cur views, so I've repeated the WHERE current_record = 'Y'. I use a CTE with the typical ROW_NUMBER() OVER (PARTITION BY natural keys ORDER BY date columns DESC), then WHERE rn = 1 in the outer query (there's a sketch of this layering after this list). These views are a PITA performance-wise, but I get no joy from my upstream data provider re: fixing their crappy ETL, poor indexing (to prevent the problem), and bad historic data. BTW I like my schema names to be three characters: dup actually means (de)dup.
• So, so far I've only got two levels of nested views: dbo and cur|dup. Three if I used the cur views in the dup views.
• There are two tables that are "vertical/tall" that need to be pivoted to "horizontal/wide". For these I create 4 (2 x 2) views:
dup.vw<tablename>_V, which does some minor transformations using ROW_NUMBER and UNION ALL. It uses dup.vw<tablename> as its source, which is still vertical.
dup.vw<tablename>_H, which pivots the vertical table to a horizontal table. It uses dup.vw<tablename>_V as its source.
These _H views use Jeff Moden’s cross-tab approach to do the pivot (there's a sketch of the pattern after this list): https://www.sqlservercentral.com/articles/cross-tabs-and-pivots-part-1-%e2%80%93-converting-rows-to-columns-1
So, these _H views have 4 levels of view nesting. I will say that the pivot performance isn’t bad (tested using cur as input rather than dup). The biggest performance hit is the deduping.
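For concreteness, here's a minimal sketch of the view layering described above. The source table SourceDb.dbo.Episode, the natural key (patient_id, episode_id), and every column name are invented for illustration; only the pattern (synonym, dbo view, cur view, dup dedup view) reflects what I actually do, and the cur and dup schemas are assumed to exist already.

-- Hypothetical source table and column names; the real schema differs.
CREATE SYNONYM dbo.Episode FOR SourceDb.dbo.Episode;
GO
-- dbo layer: explicit column list, columns reordered, minimal casts.
CREATE VIEW dbo.vwEpisode
AS
SELECT  patient_id,
        episode_id,
        CAST(admission_age AS int) AS admission_age,  -- e.g. fix a varchar age
        episode_start_date,
        episode_end_date,
        scd_start_date,
        scd_end_date,
        current_record,
        record_uuid
FROM    dbo.Episode;
GO
-- cur layer: current rows only.
CREATE VIEW cur.vwEpisode
AS
SELECT *
FROM   dbo.vwEpisode
WHERE  current_record = 'Y';
GO
-- dup layer: collapse the "more than one current row" cases to one row per key.
CREATE VIEW dup.vwEpisode
AS
WITH ranked AS
(
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY patient_id, episode_id
                               ORDER BY scd_start_date DESC, record_uuid DESC) AS rn
    FROM    dbo.vwEpisode
    WHERE   current_record = 'Y'
)
SELECT  patient_id, episode_id, admission_age,
        episode_start_date, episode_end_date,
        scd_start_date, scd_end_date, current_record, record_uuid
FROM    ranked
WHERE   rn = 1;
GO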
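And a minimal sketch of the _V to _H pivot, using the MAX(CASE ...) cross-tab pattern from the linked article. The tall view dup.vwObservation_V and its measure_code values are made up; only the shape of the query is the point.

-- Hypothetical tall source: one row per (patient_id, measure_code, measure_value).
-- The _H view collapses it to one row per patient with the classic cross-tab.
CREATE VIEW dup.vwObservation_H
AS
SELECT  patient_id,
        MAX(CASE WHEN measure_code = 'HEIGHT' THEN measure_value END) AS height_cm,
        MAX(CASE WHEN measure_code = 'WEIGHT' THEN measure_value END) AS weight_kg,
        MAX(CASE WHEN measure_code = 'BP_SYS' THEN measure_value END) AS bp_systolic
FROM    dup.vwObservation_V
GROUP BY patient_id;
GO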
I’m now ready to create my huge, flattened table. I have (at least) these options:
1) I could ditch all these views and write a tailored, specific set of queries against the tables themselves. AFAIK the biggest performance gain would come from selecting only the columns actually needed to supply the final table, rather than all the available columns in the source tables. Using the views, I’m “over-selecting” for the final table: there are some (many?) columns in the views that aren’t needed to build the final table.
2) Using CTEs and these views as sources, I could create a huge "uber" view that would output the final table. This view could have many levels of CTEs, perhaps 6-8 deep at its deepest point? The view code might be 1000-1500 lines long. It might take 20-30 mins before it even started streaming output. It would never be selected directly, just used as the source for the final table. It would take advantage of any indexing on the source tables and stream the data instead of writing to intermediate tables. It should also use parallelism as determined by the query plan. I don’t know whether data would need to be written to spool files under the covers.
3) I could use SSIS to extract all the data in parallel as an extract/staging step, then join the extracted data as a transformation/load step. I could either use the above views as sources in SSIS, or embed the individual queries in SSIS. In SSIS I could select only the columns needed from each view. I haven’t been able to get SSIS to work with temporary tables, so I write my extracts to a tmp schema in my target database and drop the tables afterward. The data is sensitive; I can’t write to global temp tables. Writing to the tmp schema in the target database would have increased logging compared to tempdb tables (right?). These extracts would be heaps unless I indexed the staging tables; if they're not indexed, the many left joins to flatten the data would have no indexes to make those joins perform better. But is that also the case when data is streamed through many CTEs or subqueries?
4) I could write a stored procedure and write to temp (tempdb) tables (rough sketch below). AFAIK, though, the extracts would run serially rather than in parallel.
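To make option 4 concrete, here's a rough sketch of what such a procedure might look like. Every object name (dup.vwEpisode, dup.vwObservation_H, dbo.FlatTarget) and every column is a placeholder, not my real schema, and CREATE OR ALTER needs SQL Server 2016 SP1 or later.

-- Stage only the needed columns into temp tables, index them, then build the
-- wide table with one final set of joins. All names are placeholders.
CREATE OR ALTER PROCEDURE dbo.BuildFlatTable
AS
BEGIN
    SET NOCOUNT ON;

    -- Extract only the columns the final table actually needs.
    SELECT  patient_id, episode_id, episode_start_date, episode_end_date
    INTO    #episode
    FROM    dup.vwEpisode;

    SELECT  patient_id, height_cm, weight_kg
    INTO    #observation
    FROM    dup.vwObservation_H;

    -- Index the staged data so the big left joins have something to seek on.
    CREATE CLUSTERED INDEX cx_episode ON #episode (patient_id, episode_id);
    CREATE CLUSTERED INDEX cx_observation ON #observation (patient_id);

    -- Load the final, flattened table.
    TRUNCATE TABLE dbo.FlatTarget;

    INSERT INTO dbo.FlatTarget
            (patient_id, episode_id, episode_start_date, episode_end_date,
             height_cm, weight_kg)
    SELECT  e.patient_id, e.episode_id, e.episode_start_date, e.episode_end_date,
            o.height_cm, o.weight_kg
    FROM    #episode AS e
    LEFT JOIN #observation AS o
           ON o.patient_id = e.patient_id;
END;
GO

The statements run one after the other, but each extract and the final join can still get its own parallel plan.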
Sorry for all the detail. If you’ve made it this far, well you’re probably in the minority 😉
My questions:
1) Are all these views over-engineering? I do think they could be used in other code developed by my colleagues, though. But yeah, how hard is it to add WHERE current_record = 'Y' to a query? I did take an online course on SQL Server ETL where the instructor recommended views over the tables themselves, to prevent code from breaking if there were table schema changes in the future.
2) Are nested views ALWAYS bad? Is there any time when the code reuse is worth the performance hit caused by view nesting? Given these simple views, AFAIK the main problem is over-selecting columns that won’t get dropped until the final table, increasing memory usage, etc.
3) How would YOU approach this, given the approaches listed above (or another one I haven’t mentioned)?
Thanks
April 21, 2020 at 1:44 pm
I'm not following the benefit of the synonyms for the tables and the first layer of views. You're doing that JUST to re-order the columns?
As for the direct query vs ETL into staging, if it's already moving from one database to another, I would avoid SSIS myself and put it in a stored procedure. Index the empty staged tables, so you get the benefit as soon as the data is in. For your numbered options:
1 - You don't have to stage into the final form; maybe stage into tables that match your 'view' forms. That saves on the "transform" part of the loading. Test to see what works best.
2 - The uber view here could be built on those staged tables, rather than on a lower layer of views.
3 - Definitely index the staging tables (quick sketch after this list).
4 - I would go this route. I don't think using SSIS is going to give you better performance.
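For what it's worth, a quick sketch of that staging pattern; the tmp.Episode table, the dup.vwEpisode view, and the columns are placeholders, not your actual objects.

-- Create the staging table (shaped like the 'view' form) and index it while
-- it is still empty, so the later joins benefit as soon as the data lands.
CREATE TABLE tmp.Episode
(
    patient_id          int  NOT NULL,
    episode_id          int  NOT NULL,
    episode_start_date  date NULL,
    episode_end_date    date NULL
);

CREATE CLUSTERED INDEX cx_tmp_Episode ON tmp.Episode (patient_id, episode_id);

INSERT INTO tmp.Episode (patient_id, episode_id, episode_start_date, episode_end_date)
SELECT  patient_id, episode_id, episode_start_date, episode_end_date
FROM    dup.vwEpisode;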
For your questions:
1 - Over-engineering? Only if you don't use them for anything. If others get use out of them and it saves time elsewhere, then it's not a waste. If you're doing all this just to import the data occasionally? Then yeah, probably.
2 - Are nested views always bad? Your ROI is going to be your ROI; you have to test this. But if you're talking about 5-6 layers of views? Yeah, probably always bad at any scale.
3 - Try it with a test harness and find what works for your situation.
April 21, 2020 at 9:00 pm
Create two base source tables: one with only current_record = 'Y' rows, and one with only current_record <> 'Y' rows. Create a SQL constraint on each table that enforces that restriction.
CREATE TABLE ... ( ...cols...,
    CONSTRAINT <constraint_name> CHECK(current_record = 'Y')
);
Then you can create a view that UNION ALLs the two base tables. When a query runs against the view and "current_record = 'Y'" is specified, SQL will then "know" to read only the table with those rows. Likewise for "current_record <> 'Y'". If you don't specify a condition, naturally SQL will read both tables, but presumably that's what you'd want in that case anyway: both current and non-current data.
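A minimal sketch with invented names (Episode_Current, Episode_History, dbo.vwEpisodeAll); your real column list goes where the placeholder comments are, and the CHECK constraints must be trusted for SQL to eliminate the untouched table.

-- Current rows only.
CREATE TABLE dbo.Episode_Current
(
    /* ...cols..., */
    current_record char(1) NOT NULL,
    CONSTRAINT CK_Episode_Current CHECK (current_record = 'Y')
);

-- Non-current (history) rows only.
CREATE TABLE dbo.Episode_History
(
    /* ...cols..., */
    current_record char(1) NOT NULL,
    CONSTRAINT CK_Episode_History CHECK (current_record <> 'Y')
);
GO

-- One view over both tables.
CREATE VIEW dbo.vwEpisodeAll
AS
SELECT * FROM dbo.Episode_Current
UNION ALL
SELECT * FROM dbo.Episode_History;
GO

-- Reads only Episode_Current; the history branch is removed by its constraint.
SELECT * FROM dbo.vwEpisodeAll WHERE current_record = 'Y';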
BTW I like my schema names to be three characters: dup actually means (de)dup.
The schema name, like all business names, should be accurately descriptive, even if that takes more than 3 chars. The name being the exact opposite of its actual meaning is an especially bad idea. That's just as confusing as all <heck>.