January 5, 2009 at 7:26 am
Hi Sachin
Being a DW developer, I am always on the lookout for the simplest way of doing things. Another suggestion/question would be (if I understand the scenario correctly): why not add an intelligent key column, something like CHECKSUM(col1, col2, col3, ..., coln)? Then just add the same key logic to your source data query, and in the lookup compare a single set of keys. That way you move the logic out of SSIS and into T-SQL, where the queries will be more malleable and agile.
Great article though... Please post more of these "out-of-the-box" articles!!
Thanks.
January 5, 2009 at 7:52 am
Hi Simon
I am not a .NET guy, so there may be a better approach for caching the reference data, such as the Dictionary object you mentioned.
You may want to refer to http://www.eggheadcafe.com/articles/20030830.asp for some input on the Find method.
HTH!
Regards,
Sachin Dedhia
January 5, 2009 at 7:57 am
Zanonil,
Can you elaborate more on CHECKSUM? An example would help.
Regards,
Sachin Dedhia
January 5, 2009 at 8:16 am
Hi dphillips,
I appreciate your observation on the "nomenclature"! And as far as the article is concerned, thanks to Steve for his valuable comments.
I am not sure why this would be an RBAR approach. In almost all cases, ETL involves transformations that work on row-level data.
While I agree with you on having a normalized database structure, my requirement does not have much to do with normalized data. If it were normalized, I would have to de-normalize the data again to generate a file as a feed to another application.
Regards,
Sachin Dedhia
January 5, 2009 at 9:24 am
Sachin Dedhia (1/5/2009)
I am not sure why this would be an RBAR approach. In almost all cases, ETL involves transformations that work on row-level data.
Being an RBAR process is strictly separate from whether this is the fastest execution path for your given needs. I was not speaking to speed, but rather to flexibility and reusability. Your code component is fast, but not dynamic.
An ETL plan doesn't have to process via SSIS one row at a time, nor one column at a time, especially if you are dealing with something that can and may change (such as the number of course columns). It is usually more prudent to seek an approach where one does not have to modify code and re-deploy to the package server to process the next file, or, worse yet, the nasty scenario of having a different package for each source when all are doing the same type of task.
If your incoming format will remain stable (a rare thing indeed), then it likely doesn't matter. However, if it can change, then ETL-ing into a database, and using the SQL engine to process the matches by making SSIS call the appropriate procedures, etc., is far more flexible, and still stays within the realm of SSIS ETL.
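To sketch what that could look like (the table, column, and procedure names here are hypothetical, not taken from the article): SSIS bulk-loads the raw file into a staging table and then calls a procedure that does the matching set-based, something along these lines:
/* Hypothetical sketch: set-based matching done in the database, called from SSIS */
Create procedure dbo.usp_MatchStagedCourses
as
begin
/* Enrich the staged feed by joining once against the reference table */
Update s
Set CourseName = r.CourseName
from dbo.Staging_Feed s
join dbo.Ref_Courses r
on r.CourseCode = s.CourseCode
end
A format change then usually means altering the staging table and the procedure, not re-coding and re-deploying the package.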
I have also learned (the hard way) that the data is better kept for other purposes. Rare indeed are the projects I've come across that serve an "enrichment" or "counts" only purpose (one example of enrichment is data cleansing against a third-party tool while importing and normalizing incoming lead data).
Your case may be just such, however, so your mileage may vary. But in principle, the approach as it stands has custom code, making it flex only by brute code force, even for simple changes such as more (or fewer) columns of the same type, giving it a high probability of failure over time.
Output (usually the fastest step in the process) doesn't preclude appending to the current file, writing a new one in-process, or even an SSIS task that stuffs data from the database to a file.
I have to admit that dumping notes about all the courses in the row en masse makes me shudder and question the process as a whole.
I do think the article is useful on several points, however, as it was about the use of the component and not necessarily the particular data process.
January 5, 2009 at 11:47 pm
Hi Sachin
As requested, here is an example, but first, just a bit of waffling...
In the code, I create a dummy temp table and put some random values in. My last column, however, is a computed column using CHECKSUM, which generates a key to do the joining on. It does not have to be a checksum - you can use whatever logic you like to build a key (CHECKSUM has some flaws, in that there are cases where it may actually return duplicates). For instance, in the example below, you could just as well have concatenated the strings instead of using CHECKSUM.
The code doesn't have to be this simple either - you can write update statements to generate a key instead if the logic is quite complex. The main thing is that you must be sure the values you are using to create the key are unique. That way, you will always have to join on just one column instead of many.
Here's the code (looks a bit better if you copy it into SSMS):
/* Check if the temp table already exists*/
if OBJECT_ID('tempdb..#temp') is not null
Drop table #Temp
/* Create a temp table */
Create table #Temp
(
RowID int identity (1,1)
,Comment varchar(200) NULL
,Column1 varchar(200) NULL
,Column2 varchar(200) NULL
,Column3 varchar(200) NULL
,Key_Column as Checksum(Column1, Column2, Column3)
)
go
/* Insert Random Values */
Insert into #Temp
(
Comment
,Column1
)
Select
'Show example of value for column 1 only'
,'Some Value'
go
Insert into #Temp
(
Comment
,Column1
)
Select
'Show example of another value for column 1 only'
,'Some other value'
go
Insert into #Temp
(
Comment
,Column3
)
Select
'Show example of value for column 3 only'
,'Some value'
go
Insert into #Temp
(
Comment
,Column1
,Column2
,Column3
)
Select
'Show example of values for all columns'
,'Some Value'
,'Another Value'
,'Yet Another Value'
go
/* Get the data */
Select *
from #Temp
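To round off the example with the joining part (the #Source table and its sample row below are made up purely for illustration, not from the original post): the same CHECKSUM logic is applied to the incoming data, and the lookup then joins on Key_Column alone.
/* Hypothetical source data with the same key logic applied */
if OBJECT_ID('tempdb..#Source') is not null
Drop table #Source
Create table #Source
(
Column1 varchar(200) NULL
,Column2 varchar(200) NULL
,Column3 varchar(200) NULL
,Key_Column as Checksum(Column1, Column2, Column3)
)
go
Insert into #Source
(
Column1
,Column2
,Column3
)
Select
'Some Value'
,'Another Value'
,'Yet Another Value'
go
/* The lookup now compares a single key column instead of three */
Select s.*, t.RowID, t.Comment
from #Source s
join #Temp t
on t.Key_Column = s.Key_Column
If collisions are a concern, the key expression can be swapped for the concatenation approach mentioned above without changing the join.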
January 8, 2009 at 10:34 am
Great article.
Very nicely explained and a great concept.
Rob.