January 24, 2017 at 4:37 pm
I'd like your opinion on the best practice for SCD2 processing with respect to surrogate keys. A typical scenario would be a customer number.
Say we have this schema (simplified, I'm not including valid from/valid to dates)
PK (identity column)
CustomerNumber (surrogate key)
BusinessKey (say FirstName + LastName + DateOfBirth)
SatelliteVariables (whatever triggers an SCD2 change)
CurrentRecord
And say an initial load:
1 1001 John A Y
2 1002 Mary B Y
3 1003 Jack C Y
So, the identity column generated the PK, and "some process" generated the customer number (SK) based on the BusinessKey (name) for SCD2 "new" records.
Say the next load has:
Mary D (update)
Mark E (new)
John F (update)
Suri G (new)
The new data could be handled in (at least) two ways:
1) CustomerNumber (surrogate key) is based on the max previous Customer Number:
1 1001 John A N
2 1002 Mary B N
3 1003 Jack C Y
4 1002 Mary D Y
5 1004 Mark E Y <<< max (1003) + 1
6 1001 John F Y
7 1005 Suri G Y <<< max (now 1004) + 1
So, the previously assigned CustomerNumber is carried forward on the change record, and new records get assigned the "max+1" CustomerNumber
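In T-SQL, the "max+1" assignment can be done set-based as part of the load itself rather than row by row. This is only a sketch against the simplified schema above; dbo.Customer, dbo.Staging, and the column names are placeholders:

```sql
-- Approach #1 sketch: carry existing CustomerNumbers forward, assign max+1 to new ones.
-- Seed at 1000 so the first customer gets 1001, as in the example above.
DECLARE @MaxSK int = (SELECT COALESCE(MAX(CustomerNumber), 1000) FROM dbo.Customer);

-- 1. Insert the incoming versions. For a known business key, reuse its
--    CustomerNumber; for a new one, number sequentially from the current max.
--    The PARTITION BY splits new rows from updates so only genuinely new
--    business keys consume the sequence (no gaps from interleaved updates).
INSERT INTO dbo.Customer (CustomerNumber, BusinessKey, SatelliteVariables, CurrentRecord)
SELECT
    COALESCE(c.CustomerNumber,
             @MaxSK + ROW_NUMBER() OVER (
                 PARTITION BY CASE WHEN c.CustomerNumber IS NULL THEN 1 ELSE 0 END
                 ORDER BY s.BusinessKey)),
    s.BusinessKey,
    s.SatelliteVariables,
    'Y'
FROM dbo.Staging s
LEFT JOIN dbo.Customer c
       ON c.BusinessKey = s.BusinessKey
      AND c.CurrentRecord = 'Y';

-- 2. Expire every current row that now has a newer version (larger PK).
UPDATE c
   SET c.CurrentRecord = 'N'
  FROM dbo.Customer c
 WHERE c.CurrentRecord = 'Y'
   AND EXISTS (SELECT 1 FROM dbo.Customer newer
               WHERE newer.BusinessKey = c.BusinessKey
                 AND newer.PK > c.PK);
```

Doing the insert before the expiry keeps the join simple: the carry-forward lookup can target the single row with CurrentRecord = 'Y' per business key.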
2) CustomerNumber (surrogate key) is based on the PK (resulting in gaps in the CustomerNumber)
1 1001 John A N
2 1002 Mary B N
3 1003 Jack C Y
4 1002 Mary D Y
5 1005 Mark E Y <<< PK (5) + offset (1000)
6 1001 John F Y
7 1007 Suri G Y <<< PK (7) + offset (1000)
So, the previously assigned CustomerNumber is carried forward on the change record, and new records get derived from the PK
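Because the PK is an identity column, its value is not known until after the insert, so approach #2 typically inserts a sentinel and derives the CustomerNumber afterwards. A minimal sketch, with the same placeholder names and the fixed offset of 1000 from the example:

```sql
-- Approach #2 sketch: carry existing CustomerNumbers forward; mark new rows
-- with -1 and then derive their CustomerNumber from the identity PK.
INSERT INTO dbo.Customer (CustomerNumber, BusinessKey, SatelliteVariables, CurrentRecord)
SELECT
    COALESCE(c.CustomerNumber, -1),   -- -1 = "new; derive from PK below"
    s.BusinessKey,
    s.SatelliteVariables,
    'Y'
FROM dbo.Staging s
LEFT JOIN dbo.Customer c
       ON c.BusinessKey = s.BusinessKey
      AND c.CurrentRecord = 'Y';

-- Derive the surrogate key from the identity value plus a fixed offset.
UPDATE dbo.Customer
   SET CustomerNumber = PK + 1000
 WHERE CustomerNumber = -1;
```

The second statement can equally live in an AFTER INSERT trigger, which keeps the derivation out of the load code entirely.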
Questions:
1) What is the best practice? Or is each approach equivalent? Should I even care about "gaps" in the CustomerNumber?
2) For a very large table, would #2 perform better than #1?
3) Based on your best practice, what approach do you use to implement this? Insert trigger? Function+Default Value? Something else?
4) Do you have any example code to get me started?
(Note: I'm trying to adapt the code from https://www.sqlservercentral.com/Forums/Topic1851256.aspx to implement #1, although this use case is a bit more complex. #2 should be simple to implement. Regardless, I wanted to post this to get your thoughts on best practice.)
Apologies if this is a noob question. I Googled "sql server scd2 surrogate key customer number" before posting. I'm happy to follow a link and educate myself rather than being spoon-fed, if you'll point me in the right direction.
(The Customer Number analogy is just an example. My actual scenario is a single table, not a full-fledged dimensional data warehouse.)
Thanks...
January 24, 2017 at 5:47 pm
One thing I will recommend is to grab a copy of The Data Warehouse Toolkit.
It should answer your questions around this.
It's a great book and well worth it.
January 25, 2017 at 12:06 am
This sort of works for #1, but I'm not sure it's best practice. It moves some of my application logic into the database layer, and I'm not sure whether that's a good thing.
ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_cr]
GO
ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_SK]
GO
DROP TABLE [tmp].[test]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [tmp].[test](
[PK] [int] IDENTITY(1,1) NOT NULL,
[SK] [int] NOT NULL,
[BK] [varchar](10) NOT NULL,
[sv] [varchar](10) NULL,
[cr] [char](1) NOT NULL,
CONSTRAINT [PK_test] PRIMARY KEY CLUSTERED
(
[PK] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_SK] DEFAULT ((-1)) FOR [SK]
GO
ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_cr] DEFAULT ('Y') FOR [cr]
GO
CREATE TRIGGER tmp.trgTest_UpdateSK
ON tmp.test
INSTEAD OF INSERT
AS
BEGIN
	SET NOCOUNT ON;

	-- Highest surrogate key assigned so far (0 on the very first load)
	DECLARE @MaxSK INT = (SELECT COALESCE(MAX(SK), 0) FROM tmp.test);

	-- New business keys: assign max+1, max+2, ... in business-key order.
	-- (Ordering by i.BK, not t.PK -- t.PK is always NULL on these unmatched rows.)
	;WITH Inserts AS (
		SELECT
			@MaxSK + ROW_NUMBER() OVER (ORDER BY i.BK) AS SK
			,i.BK
			,i.sv
		FROM INSERTED i
		LEFT JOIN tmp.test t
			ON i.BK = t.BK
		WHERE t.BK IS NULL
	)
	-- Changed business keys: carry the existing surrogate key forward
	,Updates AS (
		SELECT
			t.SK
			,i.BK
			,i.sv
		FROM INSERTED i
		INNER JOIN tmp.test t
			ON i.BK = t.BK
		WHERE t.cr = 'Y'
	)
	INSERT INTO tmp.test (SK, BK, sv)
	SELECT SK, BK, sv FROM Updates
	UNION ALL  -- UNION would silently drop duplicate staged rows
	SELECT SK, BK, sv FROM Inserts;
END
GO
INSERT INTO tmp.test
(BK,sv)
VALUES
('John','A'),('Mary','B'),('Jack','C')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
INSERT INTO tmp.test
(BK,sv)
VALUES
('Mary','D'),('Mark','E'),('John','F'),('Suri','G')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
INSERT INTO tmp.test
(BK,sv)
VALUES
('John','H'),('Sara','I'),('Mark','J'),('Walt','K')
GO
SELECT * FROM tmp.test t ORDER BY t.BK,t.PK
But I don't understand why this fails (code excerpt only):
;WITH Inserts AS (
SELECT
1000 + @MaxSK + ROW_NUMBER() OVER (ORDER BY t.PK) AS SK
,i.BK
,i.sv
FROM
INSERTED i
LEFT JOIN
tmp.test t
ON
i.BK=t.BK
WHERE
t.BK IS NULL
)
January 30, 2017 at 4:43 pm
From the paucity of replies, perhaps the original question belonged in a data warehousing forum rather than here.
In my actual application, I've got an application layer (SAS) conducting ETL against the database layer (SQL Server) via ODBC. This is analogous to any ETL tool or .Net providing the application layer against the SQL Server database layer.
I've gone with approach #2 from my original post, with gaps in the CustomerNumber, primarily because it was easier to implement, but also because when CustomerNumber = PK, I know that was the initial record (a nice to have only - it could also be derived based on minimum ValidFrom date).
The bulk of the logic is in the application layer. The application layer code knows the difference between a new record (no existing CustomerNumber) and an update record (existing CustomerNumber), and can inject the existing CustomerNumber into the INSERT statement.
The AFTER INSERT trigger is simply:
CREATE TRIGGER [content].[trgUpdateRecnum]
ON [content].[COD_URF]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON
UPDATE t
SET
t.RECNUM = t.COD_URF_ID
FROM
[content].[COD_URF] t
INNER JOIN
INSERTED i
ON
t.COD_URF_ID = i.COD_URF_ID
WHERE
t.RECNUM = -1
END
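The calling pattern from the application layer then looks something like this (a sketch only: the business-key and satellite column names are placeholders, and RECNUM is assumed to default to -1 so the trigger can recognize new records):

```sql
-- New record: RECNUM takes its default of -1, and the AFTER INSERT
-- trigger overwrites it with the identity value COD_URF_ID.
INSERT INTO [content].[COD_URF] (BusinessKey, SatelliteVariables)
VALUES ('Suri', 'G');

-- Update record: the application layer already knows the existing
-- CustomerNumber and injects it, so the trigger's WHERE clause skips it.
INSERT INTO [content].[COD_URF] (RECNUM, BusinessKey, SatelliteVariables)
VALUES (1002, 'Mary', 'D');
```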
I've read some blogs recommending against the use of triggers, but in this case it seems the easiest approach.
(I've got "The Data Warehouse Toolkit, Second Edition" from a colleague, but a review of the book's index on "Slowly Changing Dimensions" does not indicate a best practice approach to my OP).