February 16, 2010 at 1:29 pm
I'm back with another dimensional modeling problem, this time having to do with bridge tables and weight factors. (Also, I'm not dealing with SSAS, just SQL queries...)
For discussion, I have a fact table holding healthcare claims data. Since a given claim can have multiple diagnoses, I'll use a bridge table: the fact table carries a bridgekey that joins to the bridge table, and the bridge table carries a diagnosiskey that ultimately links to DimDiagnosis. This scenario is often described in Kimball's books, by BI bloggers, etc...
To make sure a dollar amount associated with a given claim aggregates correctly when the bridge table is in play, the bridge table carries a weighting factor that gets multiplied against the dollar amount so the original claim amount is preserved.
All of this is just fine if all diagnoses of a given claim are being considered. But, what happens when only a couple of the diagnoses of a claim are being considered for a particular query?
For example, suppose a claim for $100 has four diagnoses (A,B,C,D) in the bridge table, and the bridge table has a weight of .25 for each diagnosis of the claim. Since the claim line will blow out to four rows when joined to the bridge table, multiplying .25 * $100 and then summing the result yields a correct amount of $100 for the claim.
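In SQL terms, the standard weighted aggregation looks something like this (table and column names here are simplified placeholders, not our actual schema):

-- Each $100 claim fans out to four bridge rows at weight 0.25,
-- so the grand total still comes back to the true claim dollars.
SELECT SUM(f.ClaimAmount * b.WeightFactor) AS TotalClaimAmount
FROM FactClaim f
JOIN BridgeClaimDiagnosis b ON b.BridgeKey = f.BridgeKey
JOIN DimDiagnosis d ON d.DiagnosisKey = b.DiagnosisKey;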
But, what if my query only cares about health claims where the diagnoses are A & C? What about A, C, & D? Or even just A? How do I correctly sum the claim amount when the weighting factor is no longer relevant?
I'd love to hear how others solve this problem!
--Pete
February 16, 2010 at 4:04 pm
Then please explain why you are putting yourself through this. Why do you feel compelled to limit yourself to a fatuous design pattern you once read in a book by Ralph Kimball?
February 16, 2010 at 4:10 pm
Sounds to me like a normalization problem. The dollar amounts should either be treated as line items, in which case summing them gives a correct result, or each should be stored in a row by itself and not be summed at all.
Why denormalize it? What's the purpose of splitting a single datum (cost) into multiple tables? That's a violation of 1NF so far as I can tell. Or am I missing something on this?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
February 17, 2010 at 5:19 am
David -- what would you recommend? I'm not attempting to swallow the Kimball pill blindly; I'm merely taking a moment to understand how this technique actually works.
GSquared -- yeah, the relationships get kind of goofy. Technically, cost is associated with a claim; specifically, the cost is associated with a service, not a diagnosis. But, when attempting to identify a specific population of patients based on a given diagnosis, we use the dollar amount field as an indicator that the diagnosis for the patient is legit. As long as the dollar amount is greater than zero, we can count a claim as real; it doesn't matter what the dollar amount is, just as long as it's above zero.
Our current primary reporting table is fully denormalized; all fields for a given claim exist in a row. This approach doesn't thrill our IT director, but it does ensure that all costs are associated with all relevant claim elements on the same row -- impossible to sum dollars incorrectly.
February 17, 2010 at 7:28 am
peterzeke (2/17/2010)
...GSquared -- yeah, the relationships get kind of goofy. Technically, cost is associated with a claim; specifically, the cost is associated with a service, not a diagnosis. But, when attempting to identify a specific population of patients based on a given diagnosis, we use the dollar amount field as an indicator that the diagnosis for the patient is legit. As long as the dollar amount is greater than zero, we can count a claim as real; it doesn't matter what the dollar amount is, just as long as it's above zero.
....
For that, I'd be inclined to store the cost at the service level (whatever table that would be), and then just use a Case statement like:
case when exists (select * from dbo.Services where ServiceID = Diagnosis.ServiceID and Cost > 0) then 1 else 0 end as HasACost
(I don't know your tables, so that's very much pseudo-code. Is it clear enough?)
That would give a 1/0 binary that answers that exact question.
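In context it might end up looking something like this (again, guessing at your table and column names):

SELECT d.ClaimID,
       d.DiagnosisID,
       CASE WHEN EXISTS (SELECT *
                         FROM dbo.Services s
                         WHERE s.ServiceID = d.ServiceID
                           AND s.Cost > 0)
            THEN 1 ELSE 0 END AS HasACost
FROM dbo.Diagnosis d;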
Does that help?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
February 17, 2010 at 8:17 am
peterzeke (2/17/2010)
David -- what would you recommend? I'm not attempting to swallow the Kimball pill blindly; I'm merely taking a moment to understand how this technique actually works.
In general, let normal form be your guide, i.e. design to at least Boyce-Codd NF / 5NF. That does assume you already know what you want to model, however. You hopefully know your data better than I do.
February 18, 2010 at 2:48 am
A bridge table is nothing but a linkage table that links multiple dimension rows (diagnoses) to the main fact table. I would suggest keeping the COST in the fact table itself and adding a new column to the bridge table (say, a cost indicator: 1 = Yes, 0 = No).
As per your example, if there are 4 diagnoses and 2 of them are applicable, then the bridge table would look like:
Bridgekey   DiagCode(SK)   CostIndicator
1           A              1  -- Yes
1           B              0  -- No
1           C              1  -- Yes
1           D              0  -- No
This allows you to fetch either ALL diagnoses or only the costed diagnoses, as your requirement dictates.
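With the indicator in place, the query just filters the bridge; for example (table and column names are only illustrative):

-- Costed diagnoses only
SELECT f.ClaimKey, d.DiagCode, f.ClaimAmount
FROM FactClaim f
JOIN DiagBridge b ON b.BridgeKey = f.BridgeKey
                 AND b.CostIndicator = 1
JOIN DimDiagnosis d ON d.DiagKey = b.DiagKey;

Drop the CostIndicator condition to fetch all diagnoses.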
February 18, 2010 at 9:42 am
Thanks to all for offering to help.
It turns out that another person on one of the Kimballgroup.com discussion forums, where I posted a similar question, offered an explanation that makes sense of the whole situation. The bottom line is that I was making subconscious assumptions about how the data could be queried. I had only considered a single instance of the diagnosis dimension connected to the bridge table; I hadn't considered joining multiple instances of the bridge table in one query. Also, the weight factor only matters when all rows of a given groupkey in the bridge table come into play: e.g., if I join the fact table to the bridge table without a dimension filtering the bridge, then the weight factor is needed.
Here's the clarifying explanation from a user called ngalemmo:
It can be solved with the right data structures. First, the claim has an FK to a diagnosis group table which contains every unique combination of diagnoses found on claims. If multiple claims have the same combination of diagnoses, they would all reference the same group. You then have a bridge table containing diagnosis group key and diagnosis key. The latter references a traditional diagnosis dimension table.
The query would be structured as follows:
SELECT ...stuff you want to see...
FROM claimFact c, diagBridge b1, diagBridge b2, diagDim d1, diagDim d2
WHERE c.diagBridgeKey = b1.diagBridgeKey
and b1.diagKey = d1.diagKey
and d1.diagCode = 'X'
and c.diagBridgeKey = b2.diagBridgeKey
and b2.diagKey = d2.diagKey
and d2.diagCode = 'Y'
You can also do all sorts of logic such as claims with diagnosis X or claims with diagnosis Y and Z...
SELECT ...stuff you want to see...
FROM claimFact c, diagBridge b1, diagBridge b2, diagBridge b3, diagDim d1, diagDim d2, diagDim d3
WHERE c.diagBridgeKey = b1.diagBridgeKey
and b1.diagKey = d1.diagKey
and d1.diagCode = 'X'
or (c.diagBridgeKey = b2.diagBridgeKey
and b2.diagKey = d2.diagKey
and d2.diagCode = 'Y'
and c.diagBridgeKey = b3.diagBridgeKey
and b3.diagKey = d3.diagKey
and d3.diagCode = 'Z')
In all cases, you only get one row per selected fact.
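One thing to watch with the OR version as written: with the old-style comma joins, the aliases in whichever branch doesn't match are left unconstrained, so the query can multiply rows rather than returning one row per fact. For OR logic I'd probably fall back on EXISTS instead; something like this (same table names as above, untested):

SELECT c.*
FROM claimFact c
WHERE EXISTS (SELECT 1
              FROM diagBridge b
              JOIN diagDim d ON d.diagKey = b.diagKey
              WHERE b.diagBridgeKey = c.diagBridgeKey
                AND d.diagCode = 'X')
   OR (EXISTS (SELECT 1
               FROM diagBridge b
               JOIN diagDim d ON d.diagKey = b.diagKey
               WHERE b.diagBridgeKey = c.diagBridgeKey
                 AND d.diagCode = 'Y')
       AND EXISTS (SELECT 1
                   FROM diagBridge b
                   JOIN diagDim d ON d.diagKey = b.diagKey
                   WHERE b.diagBridgeKey = c.diagBridgeKey
                     AND d.diagCode = 'Z'));

That way each fact row appears once no matter how many diagnosis conditions it satisfies.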
February 19, 2010 at 6:26 am
A table of all possible combinations? Is that feasible? If you have 10 diagnoses, you have over 3-million possible combinations, and the numbers go up fast from there.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
February 19, 2010 at 7:02 am
A way to maintain such a bridge table is to build it from the diagnosis combinations already present in the source data rather than from every combination that could ever present themselves; new combinations would get added only as they appear in the source data.
Our health services table (which presently holds 4 diagnosis fields) has 26 million rows. Querying for the distinct combinations of diagnoses yields 900,000 rows. If I were to strip out the four diagnosis fields and just use a group-bridgekey, I imagine there'd be a significant savings in storage space. Plus, querying for a specific diagnosis should be quicker than having to query four separate diagnosis fields across 26 million rows.
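For what it's worth, the initial build of the group and bridge tables seems straightforward; a rough sketch (table and column names are made up, DiagGroupKey is assumed to be an identity column, and it assumes the four diagnosis fields are never NULL):

-- One row per distinct combination found in the source data
INSERT INTO DiagGroup (Diag1, Diag2, Diag3, Diag4)
SELECT DISTINCT Diag1, Diag2, Diag3, Diag4
FROM HealthServices;

-- Unpivot each group into one bridge row per diagnosis
INSERT INTO DiagBridge (DiagGroupKey, DiagKey)
SELECT DiagGroupKey, Diag1 FROM DiagGroup
UNION
SELECT DiagGroupKey, Diag2 FROM DiagGroup
UNION
SELECT DiagGroupKey, Diag3 FROM DiagGroup
UNION
SELECT DiagGroupKey, Diag4 FROM DiagGroup;

The incremental load would be the same idea with a NOT EXISTS check, so only combinations we haven't seen before get added.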
May 13, 2010 at 12:55 am
That sounds a bit over-engineered to me.
You didn't say whether the weighting factor is always equal (in the example you gave it was). If so, then whenever a further diagnosis is added you'll need to update the weighting factors on the other rows. And if they are always equal, why store the weighting factor at all? Why not work it out at run time?
Also, there's no requirement for your weighting factors to add up to 100 (or 1.0, depending on your notation). The effective weighting factor = the individual weighting factor / the sum of the weighting factors in play. In your example that's 0.25 / 1.0 = 0.25. But if your query only picks up 3 rows, and you want the same $100 cost attributed across 3 diagnoses instead of 4, your stored factors of 0.25 still work: each row's share becomes 0.25 / 0.75 = 0.33.
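In other words, the run-time factor is just each row's stored weight divided by the sum of the weights that survive your filter. For the claims example it could be computed on the fly with something like this (illustrative names, untested):

SELECT f.ClaimKey,
       d.DiagnosisCode,
       f.ClaimAmount * b.WeightFactor
           / SUM(b.WeightFactor) OVER (PARTITION BY f.ClaimKey) AS AllocatedAmount
FROM FactClaim f
JOIN DiagBridge b ON b.BridgeKey = f.BridgeKey
JOIN DimDiagnosis d ON d.DiagnosisKey = b.DiagnosisKey
WHERE d.DiagnosisCode IN ('A', 'C', 'D');  -- only the diagnoses in play

With three of the four diagnoses selected, each row's share works out to 0.25 / 0.75 of the claim amount, as above.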
You can construct a case statement which will work this out on the fly; I already do this in a retail app. I want to track the performance of a range of stores based on 6 attributes, e.g. region, format, size, etc. Each grouping gets a performance index. To work out an individual store's index, you create a weighted index based on the indexes of the 6 groups that the store belongs to. But not all stores belong to all groups, so for a store that exists in only 5 groups, the weighting applied by each group is its weighting factor / the sum of the weighting factors in play.
This works fine, and I've found it a lot simpler in practice than your latest suggestion sounds in theory.