April 20, 2010 at 7:55 am
Hi all,
I'm writing a package to upload CSV files into my SQL Server 2008 database, and while doing so I'm performing some manipulation on the contents of the files to get them cleaned up before they go into the DB.
One of the things I'm doing is some string manipulation, specifically removing substrings using a formula like:
REPLACE(ColumnA, ".end", "")
which I am using as a Derived Column expression in the "Derived Column" editor in Visual Studio. In the final upload, I then map "Derived ColumnA" to the column named "ColumnA".
The problem I encounter is that this takes ColumnA, which was a non-Unicode varchar, and defaults the derived column to a Unicode nvarchar. The upload then fails to complete because ColumnA is defined in the database as a non-Unicode varchar.
I have a couple of questions that arise from this:
1 - I had thought that, when possible, you wanted to use non-Unicode strings because of space and performance. Is this true? When you get to tables with millions and millions of rows, each of which has a few strings stored as Unicode instead of non-Unicode, is there going to be a real impact?
2 - If there is a real performance difference between non-Unicode and Unicode, is there any way to tell the Derived Column editor in Visual Studio that I want the output string to be non-Unicode? After I fill in the expression above, it defaults the Data Type column to "Unicode String [DT_WSTR]".
Any and all insight and suggestions are appreciated!
Thanks! =)
April 20, 2010 at 9:01 am
First of all, there is a penalty to be paid for using nvarchar vs. varchar. It takes twice as much space in tables, buffers, disk I/O, and any indexes, so for very large tables you will take a performance hit. Too bad SQL Server doesn't support UTF-8 encoding, which would be far more efficient in terms of storage, but I digress...
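The doubling is easy to see outside SQL Server too. A minimal Python sketch (an illustration of the encodings involved, not SQL Server itself): varchar data in a single-byte code page takes one byte per character, while nvarchar's UCS-2/UTF-16 takes two.

```python
# Illustration only: compare single-byte code page storage (varchar-style)
# against UTF-16 storage (nvarchar-style) for the same string.
value = "cleaned-up string"

varchar_size = len(value.encode("cp1252"))      # one byte per character
nvarchar_size = len(value.encode("utf-16-le"))  # two bytes per character

print(varchar_size, nvarchar_size)  # 17 34
```

Every buffer, page, and index entry holding that column pays the same two-for-one price.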
I would use CAST or CONVERT to set the data type of your derived column; that should do it.
The probability of survival is inversely proportional to the angle of arrival.
April 20, 2010 at 9:18 am
Thanks for the input - I went ahead and added a "Data Conversion" component to the Data Flow and converted my derived column back to a non-Unicode string before uploading it.
If anyone else has this same problem: CAST and CONVERT don't seem to be valid functions in the "Derived Column" editor, but the work-around I found was to put a "Data Conversion" component downstream of the derived column, converting all my strings back to what I wanted them to be.
Also - SSIS is a bit annoying here because the "Data Conversion" editor doesn't spell this out: as far as I can tell, the data type DT_WSTR is a Unicode string (nvarchar) and DT_STR is a non-Unicode string (varchar). If there's a good reason to use a completely different naming convention, it's lost on me.
Anyhow - looks like I've got a working solution, but am still curious on what the full impact of unicode vs non-unicode is in terms of performance/speed. Does anyone have some stats here? Just curious!
Thanks again!
=)
April 20, 2010 at 10:26 am
The easier and more efficient way to accomplish what you want isn't by using a separate Data Conversion component. Just change the expression a little in the Derived Column to cast it to a DT_STR rather than a DT_WSTR (the default for string literals and return value for string functions). Like so:
(DT_STR, <length>, 1252)REPLACE(ColumnA, ".end", "")
To answer your other questions:
The performance impact will vary greatly depending on what your package is doing, how much data it processes, and how much headroom your server has. The same recommendation you'd follow in any data project holds here: use the smallest data type you can get away with.
SSIS created a whole new type system because it's designed to work with data beyond what SQL Server provides. There are other type systems out there - from other data providers, and in .Net (the scripting languages you can use) - so it needs a separate type system that maps the "native" types of those endpoints to a common framework it can use to get work done. Unfortunately, it's close enough to SQL Server's typing to get people comfortable, but different enough to cause confusion when it doesn't behave the same.
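As a rough illustration of that mapping, here is a partial correspondence between SSIS and SQL Server types, written as a Python dict (a few entries from memory, not an exhaustive list; the official "Integration Services Data Types" documentation has the full table):

```python
# Partial SSIS-to-SQL-Server type correspondence (not exhaustive; see the
# official Integration Services Data Types docs for the full mapping).
ssis_to_sql = {
    "DT_STR":         "varchar",    # non-Unicode string
    "DT_WSTR":        "nvarchar",   # Unicode string
    "DT_I4":          "int",        # 4-byte signed integer
    "DT_NUMERIC":     "numeric",
    "DT_DBTIMESTAMP": "datetime",
}

print(ssis_to_sql["DT_WSTR"])  # nvarchar
```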
Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.
April 20, 2010 at 12:03 pm
Thanks so much for the insight into the unicode/non unicode as well as the hint on avoiding a Data Conversion component. Just curious, what is the "1252" code for in the expression:
(DT_STR, <length>, 1252)REPLACE(ColumnA, ".end", "")
Thanks again!
-Brigid
April 20, 2010 at 12:27 pm
It's the code page to use for interpreting the internal binary representations of the characters. Since you're speaking English, it's a sure bet that 1252 (the default) is fine for you.
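To make the code-page point concrete, here's a small Python sketch (illustrative only, not SSIS internals) showing that the very same stored byte decodes to different characters under different code pages, which is why the DT_STR cast has to name one:

```python
# The byte 0xE9 is 'é' under Windows-1252 but the Cyrillic 'й' under
# Windows-1251: a non-Unicode string is meaningless without its code page.
raw = b"\xe9"

print(raw.decode("cp1252"))  # é
print(raw.decode("cp1251"))  # й
```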
Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.
August 2, 2010 at 1:02 pm
Todd--I'm trying to use the Derived Column transform with the replacement syntax you mentioned, but it's still failing on non-Unicode versus Unicode once I try to get it into my SQL table. (I'm using an Excel source, then a Derived Column transform, and finally an OLE DB destination.) In the Derived Column transform, I have the derived column name (StudentKey) set to the same as the original, with "Replace StudentKey" in the Derived Column entry, and I used your syntax, (DT_STR,20,1252)REPLACE(studentkey,".end",""), in the Expression field. I don't know what I'm doing wrong. Can you help? See attachment for screenshot.
August 2, 2010 at 11:17 pm
The problem is that you're trying to replace the value in the "studentkey" column. That column is already defined as a DT_WSTR. Since it's an existing column that already has a data type - you can't change it. The data type it was originally defined with sticks with it forever. What you'll need to do is pull down the "Derived Column" value and select "add as new column". Then type a new column name in "Derived Column Name".
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components
August 3, 2010 at 8:36 am
There are 187 columns in the Excel file I'm trying to import. Am I going to have to add a derived column for every single one of them? They're all failing with the Unicode error. Is there a better/faster way? I would have to create 187 derived columns and then map them to the OLE DB destination. There's got to be something faster I can do, right?
August 3, 2010 at 9:44 am
Not typically, no. You're hosed. That's life when reading and writing to Excel.
The best "quick" fix I can offer may not be something you want to use, and my advice may be suspect. Check the Replacing Data Conversion component[/url] at CodePlex.
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components
September 7, 2010 at 10:36 am
OK, I probably seem really dense to you, but I've been reading about this Unicode/non-Unicode issue all morning and now I'm really confused. I am now trying to export just a few columns to an Excel spreadsheet so a user can check them for errors. To avoid having to do a conversion, I created my error table with all nvarchar columns. My SQL table is all nvarchar, and I dragged in an OLE DB source and an Excel destination. I'm still getting the conversion problem. What am I doing wrong? Does it not matter that my source is Unicode (nvarchar), the same as the Excel destination? Do I still have to do a conversion? If so, do I still have to do a Derived Column transform and give the new column a new name? I just don't understand.
September 8, 2010 at 9:18 am
Did you refresh your data source in the Data Flow? If you just changed the data type - no column name changes - then I don't think the source component will reload the metadata for the flow. Try this:
Right click on the source, select Advanced Editor.
Press the "refresh" button, then OK.
Address the reported metadata errors in the next component.
Hopefully that fixes things for you.
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components
September 8, 2010 at 9:20 am
For anyone happening through this thread, these articles may be useful for you:
Conversion Between Unicode and Non-Unicode Data Flow Pattern
Convert Several Columns in SSIS With Fewer Clicks
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components
September 8, 2010 at 11:59 am
Todd--You were exactly right!!!! I had not refreshed my data source. Once I did, I no longer got the Unicode error. I have read both of the articles you referenced and am about to attempt the first solution. Thank you SO much for your help. I hate this implicit/explicit conversion issue. It is such a pain when you do ETL for large quantities of data.
Quick, rather unrelated question: my output destination is an Excel file, and I want to overwrite that file each time my package is run. For some reason, it is appending to the file rather than overwriting it. I know there is a setting somewhere, but I can't find it. I've looked all over the properties of the connection manager and the data flow component itself. I'm stumped.
September 8, 2010 at 5:17 pm
I'm not sure about that one - my initial thought would be to use a File System Task to delete the file before your Data Flow executes.
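The delete-before-write idea, sketched outside SSIS in Python (in the package itself the File System Task, or a Script Task, would do this step; the file name here is hypothetical):

```python
# Sketch of the delete-before-write pattern: remove the output file if it
# exists so the subsequent export starts fresh instead of appending.
import os

output_path = "error_report.xlsx"  # hypothetical output file

if os.path.exists(output_path):
    os.remove(output_path)
# ...the export (the Data Flow, in SSIS terms) then recreates the file
```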
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components