January 24, 2017 at 4:37 pm
I'd like your opinion on the best practice for SCD2 processing with respect to surrogate keys. A typical scenario would be a customer number.
Say we have this schema (simplified, I'm not including valid from/valid to dates)
PK (identity column)
CustomerNumber (surrogate key)
BusinessKey (say FirstName + LastName + DateOfBirth)
SatelliteVariables (whatever triggers an SCD2 change)
CurrentRecord
And say an initial load:
1 1001 John A Y
2 1002 Mary B Y
3 1003 Jack C Y
So, the identity column generated the PK, and "some process" generated the customer number (SK) based on the BusinessKey (name) for SCD2 "new" records.
Say the next load has:
Mary D (update)
Mark E (new)
John F (update)
Suri G (new)
The new data could be handled in (at least) two ways:
1) CustomerNumber (surrogate key) is based on the max previous Customer Number:
1 1001 John A N
2 1002 Mary B N
3 1003 Jack C Y
4 1002 Mary D Y
5 1004 Mark E Y <<< max (1003) + 1
6 1001 John F Y
7 1005 Suri G Y <<< max (now 1004) + 1
So, the previously assigned CustomerNumber is carried forward on the change record, and new records get assigned the "max+1" CustomerNumber
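In T-SQL, the "max+1" assignment can be done set-based as part of the load itself rather than row by row. This is only a sketch against the simplified schema above; dbo.Customer, dbo.Staging, and the column names are placeholders:

```sql
-- Approach #1 sketch: carry existing CustomerNumbers forward, assign max+1 to new ones.
-- Seed at 1000 so the first customer gets 1001, as in the example above.
DECLARE @MaxSK int = (SELECT COALESCE(MAX(CustomerNumber), 1000) FROM dbo.Customer);

-- 1. Insert the incoming versions. For a known business key, reuse its
--    CustomerNumber; for a new one, number sequentially from the current max.
--    The PARTITION BY splits new rows from updates so only genuinely new
--    business keys consume the sequence (no gaps from interleaved updates).
INSERT INTO dbo.Customer (CustomerNumber, BusinessKey, SatelliteVariables, CurrentRecord)
SELECT
    COALESCE(c.CustomerNumber,
             @MaxSK + ROW_NUMBER() OVER (
                 PARTITION BY CASE WHEN c.CustomerNumber IS NULL THEN 1 ELSE 0 END
                 ORDER BY s.BusinessKey)),
    s.BusinessKey,
    s.SatelliteVariables,
    'Y'
FROM dbo.Staging s
LEFT JOIN dbo.Customer c
       ON c.BusinessKey = s.BusinessKey
      AND c.CurrentRecord = 'Y';

-- 2. Expire every current row that now has a newer version (larger PK).
UPDATE c
   SET c.CurrentRecord = 'N'
  FROM dbo.Customer c
 WHERE c.CurrentRecord = 'Y'
   AND EXISTS (SELECT 1 FROM dbo.Customer newer
               WHERE newer.BusinessKey = c.BusinessKey
                 AND newer.PK > c.PK);
```

Doing the insert before the expiry keeps the join simple: the carry-forward lookup can target the single row with CurrentRecord = 'Y' per business key.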
2) CustomerNumber (surrogate key) is based on the PK (resulting in gaps in the CustomerNumber)
1 1001 John A N
2 1002 Mary B N
3 1003 Jack C Y
4 1002 Mary D Y
5 1005 Mark E Y <<< PK (5) + offset (1000)
6 1001 John F Y
7 1007 Suri G Y <<< PK (7) + offset (1000)
So, the previously assigned CustomerNumber is carried forward on the change record, and new records get derived from the PK
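Because the PK is an identity column, its value is not known until after the insert, so approach #2 typically inserts a sentinel and derives the CustomerNumber afterwards. A minimal sketch, with the same placeholder names and the fixed offset of 1000 from the example:

```sql
-- Approach #2 sketch: carry existing CustomerNumbers forward; mark new rows
-- with -1 and then derive their CustomerNumber from the identity PK.
INSERT INTO dbo.Customer (CustomerNumber, BusinessKey, SatelliteVariables, CurrentRecord)
SELECT
    COALESCE(c.CustomerNumber, -1),   -- -1 = "new; derive from PK below"
    s.BusinessKey,
    s.SatelliteVariables,
    'Y'
FROM dbo.Staging s
LEFT JOIN dbo.Customer c
       ON c.BusinessKey = s.BusinessKey
      AND c.CurrentRecord = 'Y';

-- Derive the surrogate key from the identity value plus a fixed offset.
UPDATE dbo.Customer
   SET CustomerNumber = PK + 1000
 WHERE CustomerNumber = -1;
```

The second statement can equally live in an AFTER INSERT trigger, which keeps the derivation out of the load code entirely.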
Questions:
1) What is the best practice? Or is each approach equivalent? Should I even care about "gaps" in the CustomerNumber?
2) For a very large table, would #2 perform better than #1?
3) Based on your best practice, what approach do you use to implement this? Insert trigger? Function+Default Value? Something else?
4) Do you have any example code to get me started?
(Note: I'm trying to adapt the code from https://www.sqlservercentral.com/Forums/Topic1851256.aspx to implement #1, although this use case is a bit more complex. #2 should be simple to implement. Regardless, I wanted to post this to get your thoughts on best practice.)
Apologies if this is a noob question. I Googled "sql server scd2 surrogate key customer number" before posting. I'm happy to follow a link and educate myself rather than being spoon-fed, if you'll point me in the right direction.
(The Customer Number analogy is just an example. My actual scenario is a single table, not a full-fledged dimensional data warehouse.)
Thanks...
January 24, 2017 at 5:47 pm
One thing I will recommend is to grab a copy of The Data Warehouse Toolkit.
It should answer your questions around this.
It's a great book and well worth it.
January 25, 2017 at 12:06 am
This sort of works for #1, but I'm not sure it's best practice. It moves some of my application logic into the database layer, and I'm not sure whether that's a good thing.
ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_cr]
GO
ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_SK]
GO
DROP TABLE [tmp].[test]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [tmp].[test](
[PK] [int] IDENTITY(1,1) NOT NULL,
[SK] [int] NOT NULL,
[BK] [varchar](10) NOT NULL,
[sv] [varchar](10) NULL,
[cr] [char](1) NOT NULL,
CONSTRAINT [PK_test] PRIMARY KEY CLUSTERED
(
[PK] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_SK] DEFAULT ((-1)) FOR [SK]
GO
ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_cr] DEFAULT ('Y') FOR [cr]
GO
CREATE TRIGGER tmp.trgTest_UpdateSK
ON tmp.test
INSTEAD OF INSERT
AS
BEGIN
	SET NOCOUNT ON;

	-- Highest surrogate key assigned so far (0 on the very first load)
	DECLARE @MaxSK INT = (SELECT COALESCE(MAX(SK), 0) FROM tmp.test);

	-- New business keys: assign max+1, max+2, ... in business-key order.
	-- (Ordering by i.BK, not t.PK -- t.PK is always NULL on these unmatched rows.)
	;WITH Inserts AS (
		SELECT
			@MaxSK + ROW_NUMBER() OVER (ORDER BY i.BK) AS SK
			,i.BK
			,i.sv
		FROM INSERTED i
		LEFT JOIN tmp.test t
			ON i.BK = t.BK
		WHERE t.BK IS NULL
	)
	-- Changed business keys: carry the existing surrogate key forward
	,Updates AS (
		SELECT
			t.SK
			,i.BK
			,i.sv
		FROM INSERTED i
		INNER JOIN tmp.test t
			ON i.BK = t.BK
		WHERE t.cr = 'Y'
	)
	INSERT INTO tmp.test (SK, BK, sv)
	SELECT SK, BK, sv FROM Updates
	UNION ALL  -- UNION would silently drop duplicate staged rows
	SELECT SK, BK, sv FROM Inserts;
END
GO
INSERT INTO tmp.test
(BK,sv)
VALUES
('John','A'),('Mary','B'),('Jack','C')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
INSERT INTO tmp.test
(BK,sv)
VALUES
('Mary','D'),('Mark','E'),('John','F'),('Suri','G')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
INSERT INTO tmp.test
(BK,sv)
VALUES
('John','H'),('Sara','I'),('Mark','J'),('Walt','K')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
But I don't understand why this fails (code excerpt only):
;WITH Inserts AS (
SELECT
1000 + @MaxSK + ROW_NUMBER() OVER (ORDER BY t.PK) AS SK
,i.BK
,i.sv
FROM
INSERTED i
LEFT JOIN
tmp.test t
ON
i.BK=t.BK
WHERE
t.BK IS NULL
)
January 30, 2017 at 4:43 pm
From the paucity of replies, perhaps the original question belonged in a data warehousing forum rather than here.
In my actual application, I've got an application layer (SAS) conducting ETL against the database layer (SQL Server) via ODBC. This is analogous to any ETL tool or .Net providing the application layer against the SQL Server database layer.
I've gone with approach #2 from my original post, with gaps in the CustomerNumber, primarily because it was easier to implement, but also because when CustomerNumber = PK, I know that was the initial record (a nice to have only - it could also be derived based on minimum ValidFrom date).
The bulk of the logic is in the application layer. The application layer code knows the difference between a new record (no existing CustomerNumber) and an update record (existing CustomerNumber), and can inject the existing CustomerNumber into the INSERT statement.
The AFTER INSERT trigger is simply:
CREATE TRIGGER [content].[trgUpdateRecnum]
ON [content].[COD_URF]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON
UPDATE t
SET
t.RECNUM = t.COD_URF_ID
FROM
[content].[COD_URF] t
INNER JOIN
INSERTED i
ON
t.COD_URF_ID = i.COD_URF_ID
WHERE
t.RECNUM = -1
END
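The calling pattern from the application layer then looks something like this (a sketch only: the business-key and satellite column names are placeholders, and RECNUM is assumed to default to -1 so the trigger can recognize new records):

```sql
-- New record: RECNUM takes its default of -1, and the AFTER INSERT
-- trigger overwrites it with the identity value COD_URF_ID.
INSERT INTO [content].[COD_URF] (BusinessKey, SatelliteVariables)
VALUES ('Suri', 'G');

-- Update record: the application layer already knows the existing
-- CustomerNumber and injects it, so the trigger's WHERE clause skips it.
INSERT INTO [content].[COD_URF] (RECNUM, BusinessKey, SatelliteVariables)
VALUES (1002, 'Mary', 'D');
```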
I've read some blogs recommending against the use of triggers, but in this case it seems the easiest approach.
(I've got "The Data Warehouse Toolkit, Second Edition" from a colleague, but a review of the book's index on "Slowly Changing Dimensions" does not indicate a best practice approach to my OP).