May 10, 2018 at 11:26 am
Can anyone weigh in on why you would want to store multiple variations of the same data? In all of my time, I have always seen the following columns.
First Name
Last Name
Middle Name
I came across a DWH which had the following columns which perplexed me...
First_Name
Last_Name
Middle_Name
FullName
FullName_Caps
FullName_Caps_LastFirst
FullName_LastFirst
FullName_TitleCase
FullName_LastFirst_TitleCase
I can't think of any reason why you would want/have to store all the different permutations of this data. Possible indexing/searching came to mind, but I wasn't sure and the people who built it are long gone.
May 10, 2018 at 11:36 am
It looks like they're all formatted differently; the reason is so that they can be pulled in that format without having to do it at run time. As for why, who knows. Maybe whatever tool pulls the data isn't as good at formatting as whatever is loading the DW?
May 10, 2018 at 11:55 am
They are all formatted differently... I have always just done the formatting at runtime because I thought the overhead of storing all of those permutations of data would be huge. 50+ million rows of data versus upper(concat(lastname, ' ', firstname)). I first thought they might be calculated columns, but nope. They are being pumped in via the ETL process...
I thought there might be some hidden golden rule in the DWH toolkit book that I overlooked...
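If the goal had been the same effect without widening the ETL, a persisted computed column is one alternative. A minimal sketch, assuming a hypothetical dbo.DimCustomer table:

-- Deterministic expressions can be PERSISTED: computed once at write time,
-- stored on disk, and available to index, with no formatting cost at read time.
ALTER TABLE dbo.DimCustomer
    ADD FullName_Caps_LastFirst AS UPPER(CONCAT(Last_Name, ', ', First_Name)) PERSISTED;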
May 10, 2018 at 12:29 pm
Kevlarmpowered - Thursday, May 10, 2018 11:55 AM
It's probably to avoid doing it at runtime, in the interests of speedy retrieval. I can think of no other reason.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
May 14, 2018 at 11:24 am
Kevlarmpowered - Thursday, May 10, 2018 11:26 AM
Rarely does a decent data warehouse NOT have a certain amount of "de-normalized" data that provides useful ways to create dimensions without having to pre-process 50 million rows first. Having all the variations at hand means NOT having to do a rather sizable bunch of work at run-time. De-normalization is pretty much "normal" for a data warehouse precisely because the volumes are usually large enough to justify not wasting processing time on getting a variation on your theme, so to speak. Cube processing can benefit significantly from that kind of thing.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
May 14, 2018 at 11:29 am
Data warehouses are often denormalized. However, this data is so unlikely to be needed that all columns other than, say, FullName, should be put into a separate table, which can be joined only when needed. 1-1 table relationships are also a lot more frequent in DW and other BI-type (Business Intelligence) solutions.
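A minimal sketch of that split (names are hypothetical; the rarely used variants sit in a 1-1 side table and cost nothing unless joined):

CREATE TABLE dbo.DimCustomerNameFormats (
    CustomerKey             int NOT NULL PRIMARY KEY
        REFERENCES dbo.DimCustomer (CustomerKey),
    FullName_Caps           varchar(101),
    FullName_Caps_LastFirst varchar(102),
    FullName_LastFirst      varchar(102)
);

-- Joined only on the rare occasions a preformatted variant is needed:
SELECT c.FullName, f.FullName_Caps_LastFirst
FROM dbo.DimCustomer AS c
JOIN dbo.DimCustomerNameFormats AS f
    ON f.CustomerKey = c.CustomerKey;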
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 14, 2018 at 2:07 pm
Really depends on the server too. In the MPP world, data is essentially in columnstores. You're paying for the extra compressed space to reduce the overhead of having to format the data when called up. You also run into something similar if this is split across tables: in most instances you get one hashing key per table, which means you can essentially only index on one dimension. Thus data is duplicated to create more than one hashed table.
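For example, a sketch in Azure Synapse dedicated SQL pool syntax (my assumption about the MPP platform; table and key names are hypothetical):

-- One hash-distribution key per table, so the same fact data is duplicated
-- into a second table hashed on a different column to serve a second join path.
CREATE TABLE dbo.FactSales_ByCustomer
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stage.Sales;

CREATE TABLE dbo.FactSales_ByProduct
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stage.Sales;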
May 15, 2018 at 10:09 am
I can understand the FullName equivalent because indexing and/or searching on that is likely better than putting a concat(a, ' ', b) on the left side of the search predicate... I'm wondering how much actual CPU time is saved versus the cost of the storage involved. I can't imagine concat being an expensive function.
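One concrete reason the stored column can win, sketched with hypothetical names: the function call itself is cheap, but wrapping the columns in it makes the predicate non-sargable, while a stored (or persisted computed) column can be indexed and sought.

-- Non-sargable: the expression on the column side forces a scan of all rows.
SELECT CustomerKey FROM dbo.DimCustomer
WHERE UPPER(CONCAT(Last_Name, ', ', First_Name)) = 'SMITH, JOHN';

-- The stored variant supports an ordinary index seek.
CREATE INDEX IX_DimCustomer_FullName_Caps_LastFirst
    ON dbo.DimCustomer (FullName_Caps_LastFirst);

SELECT CustomerKey FROM dbo.DimCustomer
WHERE FullName_Caps_LastFirst = 'SMITH, JOHN';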
May 15, 2018 at 11:59 am
Personally, I think it's a totally unnecessary duplication of data and it doesn't stop at just what is visible in the table. It's also likely that the same people that rendered this expensive design also add an index to many, if not all, of the columns. Don't forget that SQL Server can't actually use what is stored on disk. It has to load it into memory before it can be used.
What's really ironic about a lot of designers (certainly, not all of them) of DWs that I've seen is that they denormalize data for supposed purposes of performance and then they pitch a fit when it doesn't perform well because of the Catch-22 they've designed into their DWs. Then they exacerbate the problems by adding a shedload of indexes and then wonder why it takes a month of Sundays to update the data in their DWs, and then also wonder why they have huge page-split/fragmentation problems (because those same designers also don't know how to design the indexes correctly), and they also wonder why maintenance periods and backups take so long and why the backups are so large.
On the other hand, it IS a good way to justify buying additional hardware even though that won't usually help performance, either. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2018 at 1:43 pm
Jeff Moden - Tuesday, May 15, 2018 11:59 AM
You guys are UPDATING your DW? What? 😀
May 15, 2018 at 2:44 pm
xsevensinzx - Tuesday, May 15, 2018 1:43 PM
You guys are UPDATING your DW? What? 😀
😀
--Jeff Moden
Change is inevitable... Change for the better is not.
May 18, 2018 at 9:49 am
The design pattern is to achieve one or more of the following, some of which have already been mentioned:
- Take the calculation hit once at load time rather than burn CPU cycles on conversion at read time
- Indexing
- and one I haven't seen mentioned: consistency. If the data is conformed on the way in, then you know that power users will use one of the conformed fields rather than roll their own (see the sketch below).
As with most things databasey, it depends whether this is a good design pattern or not.
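A minimal sketch of conforming the names once during the load (all object names here are hypothetical):

-- Format once at load time; every consumer then reads the same conformed values.
INSERT INTO dw.DimCustomer (First_Name, Last_Name, FullName, FullName_Caps_LastFirst)
SELECT s.FirstName,
       s.LastName,
       CONCAT(s.FirstName, ' ', s.LastName),
       UPPER(CONCAT(s.LastName, ', ', s.FirstName))
FROM stage.Customer AS s;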
May 20, 2018 at 8:08 pm
aaron.reese - Friday, May 18, 2018 9:49 AM
Agreed... that's why I'm suggesting that people look a little bit more closely before they make this table 6 times larger than it needs to be.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 21, 2018 at 9:34 am
Now, having said that, many decades ago I had a client who had five different ways of evaluating his inventory. Each one involved accounting rules, a legal department, and other stuff that I don't even pretend to understand. Have you ever heard the joke about the Spanish bookkeeping system? You keep multiple sets of books: the creditors' books, the government's books, the investors' books, etc.
It was not worth the complexity of computing each of the methods in a query, so we allocated a column for each method. Then we picked the one we liked for whatever purpose we were using it for. Hey, I don't teach business ethics; I teach database.
Please post DDL and follow ANSI/ISO standards when asking for help.
May 21, 2018 at 9:46 am
This is really a question of semantics. If it is necessary to have a properly conformed name or address for legal purposes, I would argue that that is not a presentation issue: this conformed data must be provided by the system. Whether you compute it once and serve it up from a computed column, or do it via a view or function at run time, is a design decision, and there are valid reasons for doing both. The lines become more blurred when trying to calculate subjective values (e.g. account delinquency). Should the database return a scalar value for the application to assign to a banding and color code, or should the database rules apply the banding and just leave the display parameters up to the application? E.g. should delinquency be 82%, 'Severe' or 'Red'? I would argue (almost) never 'Red': that is a presentation decision. The score or banding is more subtle and really depends on the business requirements.
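A minimal sketch of the middle option, with the database applying the banding and the front end owning the color (names and thresholds are hypothetical):

-- The view returns the raw score and the band; mapping 'Severe' to red
-- stays in the presentation layer.
CREATE VIEW dw.vAccountDelinquency
AS
SELECT AccountKey,
       DelinquencyScore,    -- e.g. 0.82
       CASE
           WHEN DelinquencyScore >= 0.80 THEN 'Severe'
           WHEN DelinquencyScore >= 0.50 THEN 'Moderate'
           ELSE 'Low'
       END AS DelinquencyBand
FROM dw.AccountDelinquency;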
@Jeff, in your last statement you referred to making the database bigger than it needs to be, but who defines need? Storage is cheap and the compute probably happens outside of business hours. If speed of delivery is the most important thing, then I say calculate the values and store them in derived columns; after all, speed of calculation is one reason you built a data warehouse in the first place.