November 10, 2012 at 10:32 am
I have a scenario where an Excel file has thousands of records with, say, 20 columns. One of the columns needs to be split while inserting into the main table, and I also need to perform other operations as well.
I am using a staging table to import the data from Excel and then performing the rest of the operations on the staging table before inserting the records into the main table.
So which would be the better approach from a performance point of view - a Derived Column transformation, or splitting the column in a stored procedure that handles the other operations as well?
Please advise.
November 11, 2012 at 7:40 pm
rhythm.varshney (11/10/2012)
I have a scenario where an Excel file has thousands of records with, say, 20 columns. One of the columns needs to be split while inserting into the main table, and I also need to perform other operations as well. I am using a staging table to import the data from Excel and then performing the rest of the operations on the staging table before inserting the records into the main table.
So which would be the better approach from a performance point of view - a Derived Column transformation, or splitting the column in a stored procedure that handles the other operations as well?
Please advise.
No advice 🙁
We need more information.
November 11, 2012 at 9:25 pm
So you are importing the column to your staging table in its native format and then splitting it from there? On what basis are you splitting it (i.e., at a certain column position, a certain character, etc.)? We need some more info to answer, but in general I would think SSIS will be better at this kind of operation than T-SQL.
November 12, 2012 at 1:00 am
It depends on the operations, as Randy mentioned - the split needs to be possible in the SSIS expression language - but if it is, SSIS will be better, since you do everything in one step. With the staging table you do everything in two steps, so that will probably be more time-consuming.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
November 12, 2012 at 10:45 am
Randy Knight (11/11/2012)
So you are importing the column to your staging table in its native format and then splitting it from there? On what basis are you splitting it (i.e., at a certain column position, a certain character, etc.)? We need some more info to answer, but in general I would think SSIS will be better at this kind of operation than T-SQL.
Yes. I am importing the data into the staging table as-is, and there I am splitting a column - say, Name - into first name and last name on the basis of the space between the two.
So you mean to say a Derived Column would be the better approach?
November 12, 2012 at 1:14 pm
Yes. Such a split is easy to do in the derived column and it saves you the staging step.
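As a rough sketch, the Derived Column expressions might look something like this (Name stands in for your actual column; the conditional guards against rows with no space, where FINDSTRING returns 0):

```
FirstName:  FINDSTRING(Name, " ", 1) == 0 ? Name
            : SUBSTRING(Name, 1, FINDSTRING(Name, " ", 1) - 1)

LastName:   FINDSTRING(Name, " ", 1) == 0 ? ""
            : SUBSTRING(Name, FINDSTRING(Name, " ", 1) + 1,
                        LEN(Name) - FINDSTRING(Name, " ", 1))
```

Note this splits on the first space only; names with middle initials or multiple spaces would need a different rule.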
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
November 12, 2012 at 2:27 pm
rhythm.varshney (11/12/2012)
Yes. I am importing the data into the staging table as-is, and there I am splitting a column - say, Name - into first name and last name on the basis of the space between the two. So you mean to say a Derived Column would be the better approach?
Yes. The SSIS expression language is going to be much easier to do this in. And while it is a bit of a paradigm shift, get away from the staging table idea. We want to be doing ETL (Extract, Transform, Load) rather than ELT (Extract, Load, Transform). SSIS excels at doing these things in memory, in the pipeline. Writing to staging table(s) in the interim slows things down considerably. When things get overly complex for a single data flow, I will often write to raw files for the interim step rather than a db table ... still much faster.
November 12, 2012 at 2:34 pm
Randy Knight (11/12/2012)
rhythm.varshney (11/12/2012)
Yes. I am importing the data into the staging table as-is, and there I am splitting a column - say, Name - into first name and last name on the basis of the space between the two. So you mean to say a Derived Column would be the better approach?
Yes. The SSIS expression language is going to be much easier to do this in. And while it is a bit of a paradigm shift, get away from the staging table idea. We want to be doing ETL (Extract, Transform, Load) rather than ELT (Extract, Load, Transform). SSIS excels at doing these things in memory, in the pipeline. Writing to staging table(s) in the interim slows things down considerably. When things get overly complex for a single data flow, I will often write to raw files for the interim step rather than a db table ... still much faster.
Right up until you need to sort data for aggregations or the like, I agree with you. When you need to do that, T-SQL tends to blow the data flow away. I agree with the general principle, though. With a staging table you need to write at least twice (once in, once transformed), whereas if you transform before you write, it's a single drop-off.
However, don't completely throw away the idea of a staging table. You cannot UPDATE or DELETE from an existing table from SSIS in bulk. If you're doing a MERGE equivalent and not merely delivering new rows to the existing table, deliver to staging and then MERGE it in. That's still part of the Load component of ETL, though, so I may just be splitting hairs with what Randy was recommending.
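That staging-then-MERGE pattern could be sketched roughly like this (the table, key, and column names here are all hypothetical placeholders):

```sql
MERGE dbo.MainTable AS tgt
USING dbo.Staging   AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    UPDATE SET tgt.FirstName = src.FirstName,
               tgt.LastName  = src.LastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, FirstName, LastName)
    VALUES (src.BusinessKey, src.FirstName, src.LastName);
```

The data flow bulk-loads the staging table, and an Execute SQL Task then runs the MERGE as a single set-based statement.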
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 12, 2012 at 2:37 pm
Evil Kraig F (11/12/2012)
Right up until you need to sort data for aggregations or the like, I agree with you. When you need to do that, T-SQL tends to blow the data flow away. I agree with the general principle, though. With a staging table you need to write at least twice (once in, once transformed), whereas if you transform before you write, it's a single drop-off. However, don't completely throw away the idea of a staging table. You cannot UPDATE or DELETE from an existing table from SSIS in bulk. If you're doing a MERGE equivalent and not merely delivering new rows to the existing table, deliver to staging and then MERGE it in. That's still part of the Load component of ETL, though, so I may just be splitting hairs with what Randy was recommending.
No disagreement here. In fact, one of the things I run into a lot is BI developers who forget about using the strengths of the relational engine (or more likely don't know them). You know what they say about generalizations ...
November 13, 2012 at 6:46 pm
Evil Kraig F (11/12/2012)
Randy Knight (11/12/2012)
rhythm.varshney (11/12/2012)
Yes. I am importing the data into the staging table as-is, and there I am splitting a column - say, Name - into first name and last name on the basis of the space between the two. So you mean to say a Derived Column would be the better approach?
Yes. The SSIS expression language is going to be much easier to do this in. And while it is a bit of a paradigm shift, get away from the staging table idea. We want to be doing ETL (Extract, Transform, Load) rather than ELT (Extract, Load, Transform). SSIS excels at doing these things in memory, in the pipeline. Writing to staging table(s) in the interim slows things down considerably. When things get overly complex for a single data flow, I will often write to raw files for the interim step rather than a db table ... still much faster.
Right up until you need to sort data for aggregations or the like, I agree with you. When you need to do that, T-SQL tends to blow the data flow away. I agree with the general principle, though. With a staging table you need to write at least twice (once in, once transformed), whereas if you transform before you write, it's a single drop-off.
However, don't completely throw away the idea of a staging table. You cannot UPDATE or DELETE from an existing table from SSIS in bulk. If you're doing a MERGE equivalent and not merely delivering new rows to the existing table, deliver to staging and then MERGE it in. That's still part of the Load component of ETL, though, so I may just be splitting hairs with what Randy was recommending.
Yes - actually, if you see my earlier comments, I mentioned that I am performing upsert logic based on the staging table, which holds around 50K records on a daily basis. So I was thinking to handle the split there as well. Since I already had the staging table, I did not want to add derived columns on top of it. But I am not sure about the performance.
November 13, 2012 at 7:59 pm
rhythm.varshney (11/13/2012)
Yes - actually, if you see my earlier comments, I mentioned that I am performing upsert logic based on the staging table, which holds around 50K records on a daily basis. So I was thinking to handle the split there as well. Since I already had the staging table, I did not want to add derived columns on top of it. But I am not sure about the performance.
I would still use the data flow for the inserts. Do a conditional split to a staging table with just the rows that need to be updated and go ahead and do your inserts in the pipeline. Then you can use an Execute SQL Task to do a set-based update. The alternative would be an OLE DB Command to do your updates but that will be RBAR. So this is one case where a staging table makes a lot of sense.
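The set-based update from that Execute SQL Task could look roughly like this (table and key names are hypothetical placeholders; the Conditional Split would have routed only the update rows into the staging table):

```sql
UPDATE tgt
SET    tgt.FirstName = stg.FirstName,
       tgt.LastName  = stg.LastName
FROM   dbo.MainTable AS tgt
       INNER JOIN dbo.UpdateStaging AS stg
           ON stg.BusinessKey = tgt.BusinessKey;
```

Truncating the staging table at the start of each run (or right after the update) keeps it from accumulating rows across daily loads.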
November 13, 2012 at 11:53 pm
Randy Knight (11/13/2012)
rhythm.varshney (11/13/2012)
Yes - actually, if you see my earlier comments, I mentioned that I am performing upsert logic based on the staging table, which holds around 50K records on a daily basis. So I was thinking to handle the split there as well. Since I already had the staging table, I did not want to add derived columns on top of it. But I am not sure about the performance.
I would still use the data flow for the inserts. Do a conditional split to a staging table with just the rows that need to be updated and go ahead and do your inserts in the pipeline. Then you can use an Execute SQL Task to do a set-based update. The alternative would be an OLE DB Command to do your updates but that will be RBAR. So this is one case where a staging table makes a lot of sense.
That is the case where I cannot use a set-based approach, as the updates may be on different columns and follow different patterns, so I will have to use RBAR. So what would be the recommendation for my requirement, i.e. I need to process 50K records daily from an Excel file into a database table, and I need upsert logic along with a few splits on certain columns?
So far I have used a staging table for this and handled both on it. However, I am looking for a better approach, like a Derived Column transformation, but at the same time I do not want to increase the overhead on the SQL engine.
November 14, 2012 at 9:15 am
You can do everything you want to in the data flow. Use a Conditional Split to separate the inserts and updates, and if you truly do need to do your updates RBAR, then an OLE DB Command for those.