January 28, 2014 at 1:50 am
Hi everybody,
Background: we receive every day customer data, including new customer and changes to previous customer. The new customer are inserted in our CRM system using the cozyroc component. The changes in previous customer data must be analysed and only updated in CRM if the changes are relevant.
A relevant change is for us a change in a relevant column. The customer table, where we store the incoming data has 10 columns, and only 5 are relevant to the CRM system. So if we receive updates only in the non relevant columns we should discard these rows in our incremental load.
Initial solution: add a column to store a hash value with the sum of the relevant columns and then calculate the hash value of the incoming data in order to compare those values
hashvalue =
Convert(nvarchar(35),hashbytes('MD5', ISNULL(Col1,'') + ISNULL(Col2, '2000-01-01')+...ISNULL(Col5, '')))
The data type of the hashvalue column is nvarchar(35).
The strange thing is that the stored value are weird, something like " ೢムό䜨غ䟊"
and when new data come the new calculated hashvalue never matches the previous one
The question:
I try with a hashvalue column of type binary(16) and seems to work, but my question is why I cannot used the nvarchar solution?
It is also painful because I cannot compare binary values using a conditional split, and I have to convert again the binary values to DT_WSTR for the comparison.
Any comment would be appreciated.
Kind Regards,
January 28, 2014 at 7:57 am
I'm a bit confused as to why you need the hash value. Why not select those 5 records in your lookup, and test for changes to any of them in your conditional split?
January 28, 2014 at 9:37 am
Paul,
It's isn't it the convert that throwing it off? When I've done this sort of thing before I've always had by hash as a computed column (obviously persisted and indexed appropriately). Such as the simplified example below.
I'm confused as to why you'd do a conditional split, would it not be more performant to do an update based on the key value where the hash doesn't match?
Applologies if I'm missing something.
IF OBJECT_ID('dbo.Emp') IS NOT NULL
DROP TABLE dbo.Emp
GO
CREATE TABLE dbo.Emp
(
Id INT PRIMARY KEY
,Forename VARCHAR(50)
,Surname VARCHAR(50)
,JobTitle VARCHAR(50)
,HashCol AS HASHBYTES('MD5',Forename + Surname + JobTitle)
)
GO
INSERT INTO dbo.Emp
VALUES(1,'Bob','Smith','IT Manager')
,(2,'Harry','Hill','DBa')
,(3,'Chris','Brown','Programmer')
GO
UPDATE dbo.Emp
SET Forename = 'Robert'
WHERE HASHBYTES('MD5', 'BobSmithIT Manager') = HashCol
GO
SELECT * FROM dbo.Emp
GO
January 28, 2014 at 10:09 pm
Perhaps you could try this.
1) Send all of your potential updates to a staging table.
2) Do a Merge from the staging table to perform the update using a source something like
(Select * from staging)
Except
(Select relevant cols from target join staging on col = col)
The except condition should remove the identical rows without fuss.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
January 29, 2014 at 12:50 am
Hi Folks,
Thanks for your answers.
@Nevyn: my case is actually more complex, as an example I said 10 columns, 5 relevant and 5 non relevant, but I have a table with more than 40 columns, with at least 25 relevant columns
@Farlzy: that´s an elegant solution and I can solve almost everything in the database, I mean, more in the SQL Server and less in SSIS.
@Phil: you always give useful answers 🙂 I have chosen your solution, in fact I already have a Staging DB to control the information exchange between our Data Mart and the CRM. I think your solution is not only simple to implement but also has a better performance. I´m a fan of the merge command.
Kind Regards,
January 29, 2014 at 10:30 am
I agree with Phil's approach.
SELECT
EXCEPT
SELECT
is a tried and true way of doing a column specific "table diff". I use hashing as a last resort for extremely large and wide dimensions with many type-2 columns. Personally, I love Todd McDermid's SCD Transform (free on codeplex), but in this situation, simplicity rules...especially if there are not many on your team that know and can support SSIS.
Unfortunately, SQL Server does NOT include the ANSI standard "CORRESPONDING BY" functionality yet....maybe someday....
January 29, 2014 at 12:47 pm
I've done this before, also with the Persisted computed column for only relevant fields, though I usually use a key-checksum combination.
What I typically do is load the key/checksum combination into a lookup component, then feed in changes from the source determined by their delta detection (usually a datetime field and a 'last run' field in my configurations) and determine those that don't match for the combination. In a lookup, the unmatched error rows can be easily re-directed, and those are what I feed to my staging table. The matching rows are allowed to rot in dev/null. 🙂
After that, once they're in the staging structure, I do what Phil recommends and MERGE those I know are changed. It just eases up on the processing and since I have to move the data between source and target anyway, a bit of transformation along the way is far more lightweight to the computations than processing 2-3x the rows on the target.
Just keep the lookup tight. If you're past 30 bytes across or so, it's far too large for easy cache'ing on umpteen million row targets, at least on my equipment.
EDIT: I should mention that yes, I know checksum can have varchar continuation concerns, where 'abc','def' is the same value as 'ab','cdef'. If I can't split my varchars with integers I'll usually append a pipe in the calculation function, so:
CheckSumVal AS CHECKSUM( VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4)
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 29, 2014 at 1:05 pm
There's one issue with using the EXCEPT approach as the data source for a subsequent MERGE: it becomes impossible to tell from within the MERGE if rows have been deleted from the data source (as they are combined with rows that are identical across both the source and target). If that's okay (you may not care about deletes or you may know that the source never has deletes) then fine, but if not, you are left with the problem of dealing with deletes separately.
An alternative is to use MERGE itself to do the work for you. You end up with doing the following:
MERGE INTO B
USING A
ON A.key = B.key
WHEN MATCHED AND (
A.x is null and B.x is not null or
A.x is not null and B.x is null or
A.x <> B.x or
A.y is null and B.y is not null or
A.y is not null and B.y is null or
A.y <> B.y
)
THEN -- update
WHEN NOT MATCHED BY TARGET
THEN -- insert
WHEN NOT MATCHED BY SOURCE
THEN -- delete
Okay, so that's more code you have to write, unless you take into account the additional code you have to write to handle deletes using the other approach.
I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column by column comparison, it can "short circuit" the comparison as soon as it finds a mis-match. If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach with using MERGE to determine the changes will work either way.
January 29, 2014 at 1:19 pm
Bruce W Cassidy (1/29/2014)
I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column by column comparison, it can "short circuit" the comparison as soon as it finds a mis-match.
A persisted calculated column is relatively lightweight, and doesn't require continuous computation. The key being it's persisted. Also INTs are small.
If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach with using MERGE to determine the changes will work either way.
True, it depends on change volume expectations of the columns required vs. the columns that aren't transferred. If it's low, I probably wouldn't pursue the removal of them. If it's high removing it entirely is relatively inexpensive.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 29, 2014 at 1:32 pm
Checksum by itself is risky.
Checksum in combination with the business key is less risky.
HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.
Many approaches exist for SCDs depending on data volumes, load windows, and a few other factors.
January 29, 2014 at 4:09 pm
sneumersky (1/29/2014)
HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.
From MSDN:
Return Value
--------------------------------------------------------------------------------
varbinary (maximum 8000 bytes)
If I'm going to possibly compare 8000 bytes I'm usually better off just doing the column compare, thus why I rarely use HASHBYTES.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 30, 2014 at 9:38 am
I've never had HASHBYTES return more than 14 bytes.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply