July 14, 2010 at 8:53 pm
Thanks for the quick response.
Regards,
Pravasis
September 2, 2010 at 10:22 am
[font="Comic Sans MS"]
This is an excellent article.
However, I am facing a peculiar problem: sometimes the generated MD5 hash value is NOT the same for the same data. This happens intermittently - say it works fine for 5 days and then not the next day. I have also noticed that it only happens when the data volume is high (around 5M records).
For roughly the first 50K records it generated the same hash value as the previous day (say day 0), but for the rest of the records it generated different hash values, so it treats the remaining records (of the 5M) as delta. The next day it rectifies itself and creates the same hash values we had on day 0.
It looks like:
(columnA) (columnB) (columnC) columnHash
A, B, C, hashValue1 ----> day 0
A, B, C, hashValue2 ----> day 1 (record inserted)
(I forced all inserts even when the hash value was the same)
A, B, C, hashValue1 ----> day 2
(I have 16+ columns - a combination of varchar, int and datetime. Some of the columns contain NULL values, and the hashValue column is nvarchar(35).)
I was unable to replicate this from one environment to another. I loaded those 3 days' data into TEST and got the desired result (~800 delta records generated).
This got me thinking: could it be an environment-related issue? Is it related to some sort of buffer overflow (note that the first ~50K records generated the correct hash values)?
I have loaded more than 500 files, ranging from 1,000 records to 5M records, and the problem has happened only 3 times, always with the 5M-record files. However, I have also loaded the 5M file around 20 times without any issue.
Any clue, or suggestion on how to debug this, would be highly appreciated.
[font="Comic Sans MS"]--
Sabya[/font]
September 2, 2010 at 10:36 am
I've noticed that certain character fields can sometimes cause this problem. The best solution I've found, and what I'm currently using in my environment, is the SSIS Multiple Hash data flow component. It seems to be a little more reliable when dealing with these fields.
The only change you'll have to make is to switch the HashValue columns to the varbinary data type, sized for whichever hash you choose to use. You'll also need to convert the new binary hash value to text for the comparison in the conditional split, as the conditional split cannot compare binary values.
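Something along these lines in a small script component would do the binary-to-text conversion. This is just a sketch - the column names HashValueBinary and HashValueText are placeholders for whatever you call the Multiple Hash output column and the string column you add for the comparison, and it assumes the default input name of "Input 0":

' Rough sketch only - assumes a DT_BYTES (varbinary) input column named
' HashValueBinary coming from Multiple Hash, and a DT_WSTR output column
' named HashValueText added to the script component for the Conditional Split.
Imports System
Imports Microsoft.SqlServer.Dts.Pipeline
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Render the binary hash as a hex string so the Conditional Split can
        ' compare it as text (it cannot compare DT_BYTES values directly).
        If Not Row.HashValueBinary_IsNull Then
            Row.HashValueText = BitConverter.ToString(Row.HashValueBinary).Replace("-", "")
        End If
    End Sub
End Class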
September 2, 2010 at 1:44 pm
[font="Comic Sans MS"]
Brett,
Thanks for the quick reply.
The problem is that I don't have control over the environment I am working in, and I am not allowed to use any custom components. So I am stuck with the VB.NET script component.
Do you have any idea why this might have happened? Anything related to the memory buffer?
Can you provide any alternate code snippet that might help?
[font="Comic Sans MS"]--
Sabya[/font]
September 2, 2010 at 1:51 pm
I will do a little more testing on it during this long holiday weekend and see what I might be able to come up with.
One thing you could try is using a different type of hash such as SHA instead of MD5. I provided an example earlier in the thread.
I don't think it's a buffer issue, though it could be if you are experiencing problems at the same threshold every time.
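For reference, here's roughly what a SHA-256 version of the CreateHash function could look like (a sketch only - swap it in for the MD5 version inside the ScriptMain class):

' Sketch of a SHA-256 variant of the article's CreateHash function.
' It still expects the concatenated column values as its input string.
' Requires Imports System.Text and Imports System.Security.Cryptography,
' which are already in the script.
Public Shared Function CreateHash(ByVal data As String) As String
    ' Encode the concatenated values as UTF-16 bytes, as the MD5 version does.
    Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
    ' Compute the SHA-256 hash (32 bytes) instead of MD5 (16 bytes).
    Dim sha As SHA256 = New SHA256Managed()
    Dim hashedData As Byte() = sha.ComputeHash(dataToHash)
    ' Base64 of a 32-byte hash is 44 characters.
    Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
End Function

Note that the longer hash means the nvarchar(35) hash column you mentioned would need to be widened to at least 44 characters.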
September 2, 2010 at 2:12 pm
[font="Comic Sans MS"]
Thanks!
I am going to do that too!
In the meantime I am removing 4 columns from the pipeline that will always be fixed strings, to see if that helps.
Let me thank you once more for this article. I saw it through the RSS feed and bookmarked it a while ago. It came in really handy when we needed this 5M vs 5M data compare, where any of these 15-16 column values can change. It made the end-to-end process run in less than 30 minutes, where it used to take more than 16 hours!
[font="Comic Sans MS"]--
Sabya[/font]
September 3, 2010 at 5:23 pm
Nice article.
Consider adding the checksum column as an INCLUDE column on the table's PK index. Your mileage may vary because:
1) The lookup transformation component only needs to read the PK index, not the actual data rows. [Good for this process.]
2) The PK index is larger because of the checksum values. [A minus - hurts storage and any access using the PK.]
3) The PK index entry must be updated with a new checksum value whenever the data gets updated. Since the checksum value is always the same size, this will not fragment the index pages unless you use the "update trigger sets checksum column to null" idea. [A minus for any process that updates the data.]
David Lathrop
DBA
WA Dept of Health
September 23, 2010 at 2:48 pm
I wonder if you can do an incremental load from MySQL to SQL Server 2008. I used an ADO.NET source connecting to MySQL and I could not make the connection to the script component. I read another article, Anatomy of an Incremental Load, and I cannot use the lookup transformation editor; it produced a "cannot link data because of data type mismatch" error when I tried to match the MySQL key to the SQL Server 2008 table. How can I convert an unsigned 4-byte integer from MySQL to a bigint key in MS SQL?
thank you!
September 24, 2010 at 1:59 pm
Nice article - wish you could write SSIS packages for our team.
September 24, 2010 at 9:30 pm
I actually resolved the problem. Not sure about performance; I'll add my comments later when I get results. Thank you all.
March 19, 2012 at 3:56 pm
Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record using a script component to calculate the HASH value. Also, in the lookup component, a full cache on even two columns of the target table would pretty much consume most of the memory, as the table contains over 1 billion records.
Any comments on large tables?
Thanks
Hui
March 19, 2012 at 4:02 pm
huishi.ca (3/19/2012)
Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record using a script component to calculate the HASH value. Also, in the lookup component, a full cache on even two columns of the target table would pretty much consume most of the memory, as the table contains over 1 billion records.
Any comments on large tables?
Thanks
Hui
For extremely large tables it might make more sense to implement change capture in the source system, to limit the number of records you have to work with in the package. If that's not an option, this method will only really work on an SSIS box that has a ton of memory. You can also modify the caching in the lookup to take pressure off the memory.
Also, I'd highly recommend using SSIS Multiple Hash instead of this method; it's much easier to incorporate into your ETL infrastructure than a custom script.
June 25, 2012 at 8:39 am
sabyasm (9/2/2010)
[font="Comic Sans MS"]Brett,
Thanks for the quick reply.
Problem is - I don't have the control on the environment I am working and I am not allowed to use any custom component. So I am stuck with VB.NET script component.
Do you have any idea why this might have happened? Anything related to memory buffer?
Can you provide any alternate code snippet that might help.
[/font]
I am having the same issue with this script. In each buffer, rows with the same values get a different MD5 value from the previous buffer.
I've requested that the Multiple Hash component be installed, but in the meantime it would be great to get this script working correctly.
June 25, 2012 at 11:27 am
For testing purposes I created a spreadsheet with 3 columns - Column A, B and C - each with the values A, B and C respectively, in ~60,000 rows, and used this as my data source.
Using the script below:
' Microsoft SQL Server Integration Services Script Component
' Write scripts using Microsoft Visual Basic 2008.
' ScriptMain is the entry point class of the script.
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Private InputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)
        InputBuffer = Buffer
        MyBase.ProcessInput(InputID, Buffer)
    End Sub

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' Add your code here for preprocessing or remove if not needed
    End Sub

    Public Overrides Sub PostExecute()
        MyBase.PostExecute()
        ' Add your code here for postprocessing or remove if not needed
        ' You can set read/write variables here, for example:
        ' Me.Variables.MyIntVar = 100
    End Sub

    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
        Dim counter As Integer = 0
        Dim values As New StringBuilder

        For counter = 0 To InputBuffer.ColumnCount - 1
            Dim value As Object
            value = InputBuffer.Item(counter)
            values.Append(value)
        Next

        Row.Concat = values.ToString
        Row.HashValue = CreateHash(values.ToString())
    End Sub

    Public Shared Function CreateHash(ByVal data As String) As String
        Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
        Dim md5 As MD5 = New MD5CryptoServiceProvider()
        Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
        RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
        Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
        Return s
    End Function

End Class
I get the following results:
Row 1 onwards:
Column A = A, Column B = B, Column C = C
HashValue = cWd0tTqSuYUlu6F5oVnMcQ==
Concat = ABC

Row 11167 onwards:
Column A = A, Column B = B, Column C = C
HashValue = EP2rhBgzqDEc0qmEjWlQBA==
Concat = ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row 24194 onwards:
Column A = A, Column B = B, Column C = C
HashValue = dLHcMhE2RJZ6ew7Jd8oezQ==
Concat = ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row 37221 onwards:
Column A = A, Column B = B, Column C = C
HashValue = 93k+mMDI1x5OoZ0cjtz7Hw==
Concat = ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row 50248 onwards:
Column A = A, Column B = B, Column C = C
HashValue = RHGVCjMX9usJb160IvP6RQ==
Concat = ABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC

Row 58881 onwards:
Column A = A, Column B = B, Column C = C
HashValue = cWd0tTqSuYUlu6F5oVnMcQ==
Concat = ABC
Looking at the Concat results, the previous buffer's HashValue is getting included in the InputSource for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount, the value is 5 for each row, even though I've only selected 3 columns as part of the InputSource.
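A possible workaround while waiting for Multiple Hash is to stop walking the raw pipeline buffer and build the string from the strongly typed Row columns instead, so the HashValue and Concat output columns can never leak into the hash input. Just a sketch - the column names below come from my test spreadsheet, so substitute your own:

' Sketch of a workaround: use the strongly typed Row properties rather than
' looping over InputBuffer.Item(), so the HashValue and Concat output columns
' (which live in the same pipeline buffer) are never fed back into the hash.
' ColumnA/ColumnB/ColumnC are the columns from my test spreadsheet.
Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
    Dim values As New StringBuilder

    If Not Row.ColumnA_IsNull Then values.Append(Row.ColumnA)
    values.Append("|")  ' delimiter so e.g. "AB" + "C" and "A" + "BC" hash differently
    If Not Row.ColumnB_IsNull Then values.Append(Row.ColumnB)
    values.Append("|")
    If Not Row.ColumnC_IsNull Then values.Append(Row.ColumnC)

    Row.Concat = values.ToString()
    Row.HashValue = CreateHash(values.ToString())
End Sub

The obvious downside is that the column list is hard-coded, so the script has to be edited whenever the set of columns to hash changes.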
October 30, 2012 at 9:11 am
I have implemented the code suggested, and I am running into some trouble: only every 3rd or 4th time I run the package do I get the same hash for columns that have not changed. Shouldn't I get the same hash "every" time I run the package if the columns have not changed?
These tables are outside of any OLTP environment, so there is no change in the source data.
I appreciate any help.