August 26, 2009 at 9:38 am
Hi all!
I use the Slowly Changing Dimension transformation to import new datasets, or update existing ones, from a CSV flat file. This works fine, but there is one problem:
If two datasets with the same primary key occur in the same file, and that key does not yet exist in the database, the Slowly Changing Dimension creates a new entry from the information in the first row - so far so good.
The following row should then be an update to this newly created dataset, but the Slowly Changing Dimension tries to insert it as a new row and causes a PK violation...
Is there some way to prevent this???
best regards
Günther
August 26, 2009 at 12:47 pm
Try putting a Sort transformation directly before your SCD Transformation. Inside the Sort, configure it to sort on the PK value and check the 'Remove rows with duplicate sort values' check box.
edit...
This should work if you want to use the last row in the file for that PK as the most recent row.
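Outside of SSIS, the same keep-last policy can be made explicit in a few lines. A minimal Python sketch with a hypothetical row layout (the key in column 0), just to show the dedup rule the Sort transformation would be standing in for:

```python
import csv
from io import StringIO

def dedupe_keep_last(rows, key_index=0):
    """Keep only the last occurrence of each key value."""
    latest = {}
    for row in rows:
        latest[row[key_index]] = row  # later rows overwrite earlier ones
    # dict preserves the order in which keys were first seen
    return list(latest.values())

# Hypothetical sample mirroring the thread's scenario: PK 1 appears twice.
data = "1,Col1Val,Col2Val\n2,value,value2\n1,updatedVal,updatedVal\n"
rows = list(csv.reader(StringIO(data)))
print(dedupe_keep_last(rows))
# → [['1', 'updatedVal', 'updatedVal'], ['2', 'value', 'value2']]
```

Key 1 keeps its first-seen position but carries the last row's values, which matches the "last row wins" requirement.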
August 27, 2009 at 1:24 am
Hi!
Unfortunately I need to import both rows... Is there a way to tell the Slowly Changing Dimension to complete the transaction and reopen the table after each row?
Günther
August 27, 2009 at 9:53 am
You can't import both rows if they have the same Primary Key. Even if the rows were imported in different transactions, it would fail because of the Primary Key constraint. Since both rows represent the same data, you need to decide which row's values override the other. Once you've done that, we can talk about solutions.
So if your data file looked like this:
PK_Column, Col1, Col2
1, Col1Val, Col2Val
2, value, value2
1, updatedVal, updatedVal
Which values for PK value 1 would you want to see in your database?
August 30, 2009 at 2:45 am
Hi!
The import files represent the change history of an item. Generally, the last line for an item (same PK) in the file contains the current version of the dataset.
So it should be okay if the last line is imported...
lacky
August 31, 2009 at 3:58 am
John Rowan's first response gives you a solution to this, it would seem.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
September 3, 2009 at 10:06 am
The assumption being made in the replies is that the PK of the data source is also the PK in the dimension, which normally it wouldn't be. The source PK acts as the business key, and the dimension has its own surrogate PK.
Theoretically, therefore, your two rows can be imported; however, it appears the SCD transform does not work this way.
You could try not using the SCD transform and code the dimension changes yourself (we use a script task to check for changes and apply the Type 1 and Type 2 inserts, updates and deletes ourselves, as the out-of-the-box SCD has performance issues on large datasets).
If you want to persevere with the SCD, then the only way to do it is to filter the duplicate rows out to another table at the start, pass the first set of data through the SCD, and then the second set.
Paul R Williams.
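The filter-then-multiple-passes idea above generalises to keys that occur more than twice. A minimal Python sketch (outside SSIS, hypothetical row layout with the key in column 0) of partitioning the rows so that each pass contains at most one row per key - a key occurring N times simply needs N passes through the SCD:

```python
def split_into_passes(rows, key_index=0):
    """Partition rows into passes with at most one row per key per pass.

    Pass 0 gets each key's first occurrence, pass 1 the second occurrence,
    and so on, so no single pass can cause a duplicate-key collision."""
    seen = {}    # key -> number of occurrences handled so far
    passes = []
    for row in rows:
        k = row[key_index]
        i = seen.get(k, 0)
        if i == len(passes):
            passes.append([])    # first key to need an i-th pass creates it
        passes[i].append(row)
        seen[k] = i + 1
    return passes

# Hypothetical file where PK 1 occurs three times:
rows = [["1", "a"], ["2", "x"], ["1", "b"], ["1", "c"]]
for n, p in enumerate(split_into_passes(rows)):
    print(f"pass {n}: {p}")
```

Each pass would then be fed through the SCD transform in order, so later occurrences arrive as updates rather than as duplicate inserts.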
September 3, 2009 at 10:28 am
Paul brings up an interesting point: the business key from the file would not normally be the PK in the table, so this whole thing may not happen in a table with a surrogate key. With that said, using a surrogate key on the table may solve the problem. I'd like to know. I don't have time to mock this up right now, but I may later. If someone else can, please do.
I typically try to make sure that, when I use an SCD inside a data flow, the data source ensures there are no duplicates. Subsequent runs of the data flow can contain duplicates - hence the need for an SCD transformation - but having the data source pull distinct data may not be necessary.
Günther,
Are you using a surrogate key on your dimension? Have you set some of the attributes to Historical or Changing? Can you post your table DDL?
September 3, 2009 at 10:39 am
That is true.
Thinking about it, I'm surprised the SCD behaves this way, because unless a DW is fed in real time it is quite common for source data to be batched up and so contain duplicates on the PK - not duplicate rows, since the duplicate-key rows contain the changes to the non-key attributes.
The batched-up data would then be fed into the DW.
As JR points out, whether the SCD has been set up for Type 1 or Type 2 changes would be useful to know, but as I've said we don't use the out-of-the-box SCD transform. I'd have to set something up and see if I get the same behaviour.
Paul R Williams.
September 9, 2009 at 6:15 am
Hi all!
Thanks for your answers. I think this behavior of the SCD is a bug.
The idea of moving all duplicates to another table and importing them afterwards would only be a solution if no PK occurs three or more times, and I cannot rule that case out. I think I will go with the script task, and maybe use a stored procedure for this task...
best regards
lacky
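For reference, the change-detection logic such a script task or stored procedure would implement can be sketched as a minimal in-memory Type 2 example. This is Python rather than the VB/C# a script task would use, and the table layout and column names are illustrative, not the poster's actual schema; the point is that the surrogate key is generated internally, so repeated business keys never collide:

```python
from datetime import date

class Type2Dimension:
    """Minimal in-memory sketch of Type 2 SCD handling."""

    def __init__(self):
        self.rows = []      # each row: surrogate key, business key, attrs, validity
        self.next_sk = 1    # surrogate key counter (an IDENTITY column in SQL)

    def apply(self, business_key, attributes, effective=None):
        """Insert a new version if attributes changed; otherwise do nothing."""
        effective = effective or date.today()
        current = next((r for r in self.rows
                        if r["bk"] == business_key and r["current"]), None)
        if current is not None:
            if current["attrs"] == attributes:
                return current["sk"]          # unchanged row: nothing to do
            current["current"] = False        # expire the old version
            current["end_date"] = effective
        self.rows.append({"sk": self.next_sk, "bk": business_key,
                          "attrs": attributes, "start_date": effective,
                          "end_date": None, "current": True})
        self.next_sk += 1
        return self.next_sk - 1

# Two source rows with the same business key, as in the original file:
dim = Type2Dimension()
dim.apply(1, {"Col1": "Col1Val", "Col2": "Col2Val"})
dim.apply(1, {"Col1": "updatedVal", "Col2": "updatedVal"})
# Both versions are now stored under different surrogate keys, with the
# first version expired and the second marked current.
```

Handled row by row like this, the second occurrence of a business key becomes an update-and-insert rather than a duplicate-key insert, which is exactly the behaviour the out-of-the-box SCD fails to deliver within a single buffer.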