Surrogate lookup in DWH incidentally fails using T-sql (esp. on large datasets)

Question

Post reply

Surrogate lookup in DWH incidentally fails using T-sql (esp. on large datasets)

giovi

Mr or Mrs. 500

Points: 588
More actions
November 14, 2009 at 6:30 pm

#385793

Within a small DWH project I'm performing lookups of dimension surrogates using T-SQL lookups
SK=surrogate, BK= business key
SK_Organization BK_Organization Rowstartdate Rowenddate
-1----------------NULL-------------01-01-2001----NULL
1-----------------5----------------01-01-2004----NULL
2-----------------6----------------01-01-2004----NULL
(the -1 represents the unknown value , when a BK value doesn't exist, for instance 102, the t-sql lookup simply uses coalesce(@result,-1) in order to assign an unknown. )
Well, within the dimension table, a unique index exists on the surrogate key (not on the business key because it holds a null instance).
Next to this I have a staging table which contains the business key.
Within the staging I assign the surrogate at SK_Organizationkey using a stored procedure, which calls the lookup function and is filled with the business key.
When handling large datasets, some rows, without a pattern, are associated a -1 value at the SK while having a valid business key.
When taking the business key of the invalid row, and executing it in the lookup function, without the procedure, a valid value occurs rather than the invalid -1.
So...function is correct.
I could give you all the programmation , but as having testing this about 10 times I can tell you there's something going on in the configuration, and I hope someone over can help me.
Or..I should use some hint in the T-SQL..? I've already put it serializable
Think of "large datasets", "update statement", "T-SQL lookups", "assign invalid values incidentally"

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

I don't want to sound offending but there still might be something wrong with the sp.

The way you describe it it sounds like you're using a cursor which might be the root cause for the wrong insert/update.

Without knowing the structure of your sp all I can say is: SQL Server is not really known for "assign invalid values incidentally". Usually, SQL Server does exactly what it's told to do. But that's not always the same as someone actually intended to do. 😉

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

giovi Mr or Mrs. 500 Points: 588 More actions · Answer 2

Thank you IMU, i think incidental things might occur only when something in the configuration might be wrong unless code is wrong.

Hereby the anatomy of the stored procedure, would be helpful if you could take a look.

It's not cursor based

This is from my test bench which only assigns the organizationkey. And well..at the test bench also the odd -1 occurs

The etl.fn_Lookup_SK_OrganizationKey_Costcenter_Admunit is the function which is filled with the applicable fields from table staging.postedhourline

ALTER PROCEDURE [etl].[Usp_Up_Serviceledger_Surrogates]

/*

Author:

Creation date:

Modification date:

*/

AS

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

--SK_Organizationkey

BEGIN TRAN

UPDATE staging.[Postedhourline]

SETSK_Organizationkey=(

select etl.fn_Lookup_SK_OrganizationKey_Costcenter_Admunit(

[Business unit]

,[Employee department code]

,getdate()

,NULL

)

FROM staging.[Postedhourline]

COMMIT TRAN

LutzM SSC Guru Points: 107049 More actions · Answer 3

Would you mind sharing the code for the function etl.fn_Lookup_SK_OrganizationKey_Costcenter_Admunit as well?

Maybe it's possible to change it into a normal join...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

giovi Mr or Mrs. 500 Points: 588 More actions · Answer 4

Hi,

Well, the function calls 2 other functions. Sadly it won't be possible to put it in one join because of the logic inside. The logic covers three tables, and if a certain value is true in one table for the underlying fact data, the handling is different.

The same things can happen to the cost center which has its surrogate looked up. In contrary to the organizationkey, the surrogate of the cost center is really simple. And maybe that one applies to be handled into a join

The function is much alike the other one but uses 3 instead of 4 fields.

etl.fn_lookup_costcenter ([global code 1]

,getdate()

,NULL)

ALTER FUNCTION [etl].[fn_Lookup_SK_Costcenter]

(

@costcenternvarchar(20)

,@Startdatedatetime

,@Enddatedatetime

)

RETURNS bigint

AS

BEGIN

-- Declare the return variable here

DECLARE @Result bigint

-- Add the T-SQL statements to compute the return value here

SELECT @Result =(

SELECTSK_Costcenter

FROM[DWH].[dbo].tbl_Dim_CostCenter

WHERErtrim(BK_Costcenter)=@Costcentercode AND

RowStartDate<= coalesce(@Startdate,getdate())And

coalesce(RowEndDate,getdate()) >=coalesce(@Enddate,getdate())

)

-- Return the result of the function

RETURN coalesce(@Result,-1)

END

LutzM SSC Guru Points: 107049 More actions · Answer 5

This one actually does look like it can be included without a function.

Regarding the other ones: I still don't know.

If you need to do a different handling the CASE statement might help.

Again: until now there's nothing that would make the server "assign invalid values incidentally".

But since you can't (or won't) show the other functions it's still guessing...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

giovi Mr or Mrs. 500 Points: 588 More actions · Answer 6

Hi Lutz,

The costcenter also makes mistakes so let's start of with this one.

It's easy yes to join the table.

But for clarity i've used the function as lookups might easily grow more complex, so for clarity that each distinct type of lookup is represented by 1 function I've kept it this way.

Otherwise my logic would still become divergent, which means if I have to change the logic of a lookup throughout the total solution I would have to walk through all procedures which inhabit for instance the costcenter key lookup.

Do you agree, using this function, that this shouldn't go wrong? So putting in a join is just a workaround without finding the cause

My latest test shows the following:

- at the end of the function I had coalesce(@Result,-1)

- I've changed this to coalesce(@result,-2)

When running the procedure, for the flawed records, it gives back -1.

So my conclusion is, for these records, the function receives a NULL value as business key instead of the value of the record. Because -1 is a surrogate in the dimension table which represents the NULL business key value.

Thanks for your help so far!

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 7

Is there an order of operations issue?

I would assume that you would populate your dimension from your staging table not populate your staging table from your dimension. So, I assume that the dimension should be populated before you try to get the surrogate key that you are putting into the staging table? Perhaps you left out the order from your description, but I thought I'd check.

Also, not sure if it matters to you, but I would change the RowEndDate from a nullable column to a non-nullable column and use a terminus date instead of NULL (something like 9999-12-31).

giovi Mr or Mrs. 500 Points: 588 More actions · Answer 8

Hi Lamprey,

The dimension table is always filled prior to loading the fact table.

The logic is handled by a SQL Server SSIS package.

In my testbase the dimension is always the same, so i don't run SSIS , I just use a static set.

I will test with the remark about the rowenddate.

What comes to mind to me, is that the update goes over the total table, each row has a lookup embedded, so wouldn't this be a too stressful situation for the server?

I would think the database engine splits up tasks intelligently to prevent server insanities so I can't imagine such.

It doesn't keep running endlessly , it's done in 1 minute to update about 200000 rows

regards

giovi Mr or Mrs. 500 Points: 588 More actions · Answer 9

Well, typical flaw in many database platforms is handling NULL.

Today I found out that I should place 31/12/9999 in my slowly changing dimension "current records".

Next I had to take out function calls which called the rowenddate range with NULL and had to remove coalesce() 's.

Now it's working sane