Star Schema Design - Multiple Sources

Question

Star Schema Design - Multiple Sources

kevin.obrien 66193

SSCommitted

Points: 1625
More actions
September 29, 2016 at 8:40 am

#330602

Hi,
I am planning the design of a star schema for a data warehouse. The data will be loaded from 2 separate staging tables (ClientTbl and SystemTbl).
The end user will need to be able to compare any differences between the 2 sets of data. They will also need to be able to match items from either source. e.g. ItemA from SourceA matches ItemA from SourceB.
For the design, I am considering having both sources on one row in the Fact Table. So for SourceA (ClientTbl source), there would be NULL values for SourceB columns (SystemTbl source) and vice versa. The 'Source' column is then used to flag whether the data is Client source or System source.
I am thinking it will be easier to do this design especially when we may need to match items. So on a match, the NULLs would be replaced with the matched item's values.
Am I correct with this design or is there a better solution? Thanks!
Screenshots of table data and schema design attached.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 1

If you're just doing a comparison between two data sources that have the same attributes and the only difference is the source, then why not combine both of these attributes into one and add the SourceCode in the dimension not the fact?

For example:

DimProducts

ProductID

SourceCode

ProductName

The SourceCode would be:

N = No Source

C = Client Source

S = System Source

Then you can remove some of those other Key fields from your Fact and just focus on assigning your new dim like DimProducts back onto the table with the unique identifier like ProductID? This should allow you to query the Fact and filter between Client and System by filtering the dimension like SourceCode = C.

The only problems in my example is what happens if you have 100,000 products? This gets duplicated per source. If you have 2 sources, then that dimension now has 200,000 records. For each source you add, it will add 100,000 records to that table.

Jack Corbett SSC Guru Points: 184397 More actions · Answer 2

xsevensinzx (9/30/2016)
If you're just doing a comparison between two data sources that have the same attributes and the only difference is the source, then why not combine both of these attributes into one and add the SourceCode in the dimension not the fact?
For example:
DimProducts
ProductID
SourceCode
ProductName
The SourceCode would be:
N = No Source
C = Client Source
S = System Source
Then you can remove some of those other Key fields from your Fact and just focus on assigning your new dim like DimProducts back onto the table with the unique identifier like ProductID? This should allow you to query the Fact and filter between Client and System by filtering the dimension like SourceCode = C.

I agree with this design. The only thing I'd disagree with is the following:

The only problems in my example is what happens if you have 100,000 products? This gets duplicated per source. If you have 2 sources, then that dimension now has 200,000 records. For each source you add, it will add 100,000 records to that table.

The design proposed by the OP also has a row per source per product just with a bunch of NULL columns.

I'd go with:

Source

SourceIdentifierKey

PeriodKey

CurrencyKey

Quantity

Price

And all the columns should be not null.

For reporting if you need to see the Client and System data side by side in one row instead of a 2 rows you can just pivot the data.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

kevin.obrien 66193 SSCommitted Points: 1625 More actions · Answer 3

Hi,

How do I allow for an item in SourceA to be matched to an item in SourceB?

So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.

Thanks!

MMartin1 One Orange Chip Points: 27867 More actions · Answer 4

kevin.obrien 66193 (10/9/2016)
Hi,
How do I allow for an item in SourceA to be matched to an item in SourceB?
So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.
Thanks!

I am not able to see the images you attached. Perhaps you provide a sample of what your data looks like for the product dimension. My immediate hunch is that source is not really an attribute of a product. If you only get productA from source 1 and productB from source 2 , etc.. then I can go with keeping the source attribute in the product dimension. As you describe however, a product can come from both sources thus are to be 'matched.' I assume this means to see if they are the same product. So I would separate out the source especially if the number of sources can range in the hundreds.

----------------------------------------------------

Jack Corbett SSC Guru Points: 184397 More actions · Answer 5

kevin.obrien 66193 (10/9/2016)
Hi,
How do I allow for an item in SourceA to be matched to an item in SourceB?
So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.
Thanks!

I think all reporting platforms have the ability to PIVOT/Cross Tab the data in report, which is where I would to the pivoting because it is simpler there. But, you can PIVOT in SQL something like this:

DECLARE @Table TABLE

(

ProductKey INT,

SOURCE VARCHAR(20),

SourceIdentifierKey INT,

PeriodKey INT,

CurrencyKey INT,

Quantity INT,

Price DECIMAL(18, 4)

);

INSERT INTO @Table

(ProductKey, SOURCE, SourceIdentifierKey, PeriodKey, CurrencyKey, Quantity, Price)

VALUES

(1, -- SourceAccountKey - int

'A', -- SOURCE - varchar(20)

1, -- SourceIdentifierKey - int

1, -- PeriodKey - int

1, -- CurrencyKey - int

10, -- Quantity - int

15.20 -- Price - decimal(18, 4)

),

(1, -- SourceAccountKey - int

'B', -- SOURCE - varchar(20)

1, -- SourceIdentifierKey - int

1, -- PeriodKey - int

1, -- CurrencyKey - int

17, -- Quantity - int

15.80 -- Price - decimal(18, 4)

),

(2, -- SourceAccountKey - int

'A', -- SOURCE - varchar(20)

1, -- SourceIdentifierKey - int

1, -- PeriodKey - int

1, -- CurrencyKey - int

7, -- Quantity - int

11.20 -- Price - decimal(18, 4)

),

(2, -- SourceAccountKey - int

'B', -- SOURCE - varchar(20)

1, -- SourceIdentifierKey - int

1, -- PeriodKey - int

1, -- CurrencyKey - int

27, -- Quantity - int

10.80 -- Price - decimal(18, 4)

);

;

SELECT

*

FROM

(

SELECT

P.ProductKey,

P.PeriodKey,

P.CurrencyKey,

P.A,

P.B,

'Price for ' + T2.SOURCE AS Source,

T2.Price

FROM

(

SELECT

T.ProductKey,

T.SOURCE,

T.PeriodKey,

T.CurrencyKey,

T.Quantity

FROM

@Table AS T

) AS T1 PIVOT ( SUM(Quantity) FOR SOURCE IN ([A], ) ) AS P

JOIN @Table AS T2

ON P.ProductKey = T2.ProductKey

) AS P1 PIVOT ( AVG(Price) FOR SOURCE IN ([Price For A], [Price For B]) ) AS P1;

If you need to be more dynamic and can't pivot in the reporting layer, you should read the 2 Crosstab & Pivot articles linked in my signature

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Greg Edwards-268690 SSC-Insane Points: 20680 More actions · Answer 6

kevin.obrien 66193 (10/9/2016)
Hi,
How do I allow for an item in SourceA to be matched to an item in SourceB?
So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.
Thanks!

You might consider a table with surrogate key, source code, and sourceID.

When they are matched, you assign the same surrogate key.

The surrogate key is best a key only in the Data Warehouse. You may add other sources, and some Items will not exist in all systems.

This can get rather interesting, as an Item may have several suppliers, each with different suppliers and pricing.

And likely an Item number unique to each supplier.

Be sure to talk through and point out with the users some of the design considerations.

MMartin1 One Orange Chip Points: 27867 More actions · Answer 7

The issue I see is that you can have one product out of many that only uses source D for example, while the rest are varied among the other sources. This case will give the table a swiss cheese appearance with all other products in the table containing null (or the equivalent value) there. If there is no significant relationship between the product and its source, then there is nothing wrong with having a source dimension. This is done in a normal star schema with the concept of a supplier. In short, the attributes of a table should relate to the key field.

Of course... It really comes down to mirroring this business procurement process. If indeed there is a relationship between what provides what product, then that is a valid attribute of a productID.

----------------------------------------------------