Better way to filter/check data?

  • The source system is a DB2 data warehouse which pulls data from an iSeries. I then take that DB2 data and pull it into SQL Server via SSIS. The issue is that somewhere along the line, high-end Unicode characters are getting tacked onto fields where we don't expect them. For downstream reasons I need to filter these out. I've done a lot of research, testing, and trial and error, and came up with a couple of approaches.

    My preferred solution would be to use custom code to validate the data. Using a function I saw posted here on SSC for testing ASCII values as my base, I came up with an inline function that did what I wanted. Unfortunately it ran really, really slowly, and I have millions of rows to process. So I tried a table-valued function and APPLY. That was a little better, but still far too slow (it benchmarked at about 1,000 records per hour).

    So for now I'm using the SSIS data conversion component with an error branch. This works and performs well, but it has limitations I'm not happy with. The main one is that I have no control over what gets judged "good" or "bad" beyond whether the value fits the conversion data type. It also feels rather kludgey to me.

    For the curious, here's the table-valued function I tried:

    CREATE FUNCTION [dbo].[IsHighUnicodeTbl]
    (
        @OriginalText NVARCHAR(MAX)
    )
    RETURNS @Check TABLE ( subloc INT, isvalid INT )
    BEGIN
        WITH Tally ( N )
        AS ( SELECT TOP 1000000
                    ROW_NUMBER() OVER ( ORDER BY sc1.id )
             FROM Master.dbo.SysColumns sc1
             CROSS JOIN Master.dbo.SysColumns sc2
           )
        INSERT INTO @Check ( subloc, isvalid )
        SELECT Tally.N AS subloc,
               CASE WHEN UNICODE(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 32 AND 126
                    THEN 0
                    ELSE 1
               END AS isvalid
        FROM Tally
        WHERE Tally.N <= LEN(@OriginalText);

        RETURN;
    END
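
    For what it's worth, the multi-statement form above has to materialize the @Check table variable on every call, which tends to dominate the runtime at millions of rows. A single-statement inline TVF lets the optimizer expand the logic into the calling query instead. Here's a minimal sketch of that rewrite (same output shape, same 1,000,000-character cap, placeholder names; untested against the actual workload):

    CREATE FUNCTION dbo.IsHighUnicodeInline ( @OriginalText NVARCHAR(MAX) )
    RETURNS TABLE
    AS
    RETURN
        -- Build the tally from row constructors instead of syscolumns:
        -- E1 = 10 rows, E2 = 100, E4 = 10,000, E6 = 1,000,000.
        WITH E1(N) AS ( SELECT 1 FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS v(N) ),
             E2(N) AS ( SELECT 1 FROM E1 a CROSS JOIN E1 b ),
             E4(N) AS ( SELECT 1 FROM E2 a CROSS JOIN E2 b ),
             E6(N) AS ( SELECT 1 FROM E4 a CROSS JOIN E2 b ),
             Tally(N) AS ( SELECT TOP ( ISNULL(LEN(@OriginalText), 0) )
                                  ROW_NUMBER() OVER ( ORDER BY (SELECT NULL) )
                           FROM E6 )
        SELECT Tally.N AS subloc,
               CASE WHEN UNICODE(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 32 AND 126
                    THEN 0
                    ELSE 1
               END AS isvalid
        FROM Tally;

    It would be called the same way as the original, e.g. CROSS APPLY dbo.IsHighUnicodeInline(t.SomeColumn), where t.SomeColumn stands in for whichever column is being checked.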

  • Not sure if it would speed you up, but an inline script component that checks the field as it passes by and routes rows to two output streams depending on what it finds would give you the same functionality within the data flow.

    Using a conditional split to fire the rows into separate streams, each running the same code, and recombining them afterwards would allow forced parallelism.

    There's really nothing I can think of in the calculated columns or other pre-built components that will let you do a character-by-character review of the inbound data inside a single column.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Do you really need to do a character-by-character review of the given column, or just find columns that contain invalid characters? Perhaps you just need to return the first instance of a high-ASCII character, which you can easily do with PATINDEX().

    SELECT PATINDEX('%[^ -~]%', YourField)

    FROM YourTable
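
    If the goal is just to flag or quarantine the offending rows, that same expression can sit straight in a WHERE clause (YourTable and YourField are the placeholder names from the snippet above):

    -- Any nonzero PATINDEX means the field contains at least one
    -- character outside the printable range (space through tilde, 32-126).
    SELECT *
    FROM YourTable
    WHERE PATINDEX('%[^ -~]%', YourField) > 0;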

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • It's not ASCII, it's Unicode. The values go up to 65533 (decimal), which, by the way, we do get.

    Random sampling via PATINDEX is no good.
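
    As an aside, 65533 (decimal) is U+FFFD, the Unicode REPLACEMENT CHARACTER, which conversion routines typically substitute when they hit a byte sequence they cannot map, so seeing it suggests the corruption happens upstream rather than in SSIS. If that's the case, a targeted check is possible (YourTable and YourField are placeholders):

    -- Flag rows containing U+FFFD. The binary collation matters:
    -- dictionary collations can give undefined code points no sort
    -- weight, which would make CHARINDEX miss them.
    SELECT *
    FROM YourTable
    WHERE CHARINDEX(NCHAR(65533) COLLATE Latin1_General_BIN, YourField) > 0;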

  • Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

  • ZZartin (4/28/2011)


    Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

    While not a bad idea, the primary problem here is the location of bad data, not the correction of it.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Craig Farrell (4/28/2011)


    ZZartin (4/28/2011)


    Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

    While not a bad idea, the primary problem here is the location of bad data, not the correction of it.

    Ah, good point. If all you want to do is exclude the rows, you might be able to use a script component to parse them and only add the valid rows to the output.

  • Parsing the rows is the issue: the only way I can find is to check every character in the string for each column that might contain the unusable characters. I'm already doing a convert to get things into a friendlier, more usable format, but that convert is what causes the error.

  • Pam Brisjar (4/28/2011)


    It's not ASCII, it's Unicode. The values go up to 65533 (decimal), which, by the way, we do get.

    Random sampling via PATINDEX is no good.

    PATINDEX is NOT a random sampling. It searches all possible matches until it finds one or exhausts the search.

    PATINDEX will work on Unicode strings as well as on ASCII strings.

    If you need to get a result for every single character in the string, then PATINDEX will not work, but if you only need to know if a string contains any occurrences of certain characters, then PATINDEX could very well work.
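
    One collation caveat, though: in dictionary collations the range inside [^ -~] follows sort order rather than code-point order, so on NVARCHAR data a binary collation is the safer way to express "outside 32-126". A sketch, again with placeholder names:

    -- Force code-point semantics for the range check and report
    -- where the first offending character sits and what it is.
    SELECT YourField,
           PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField) AS bad_pos,
           SUBSTRING(YourField,
                     PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField),
                     1) AS bad_char
    FROM YourTable
    WHERE PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField) > 0;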

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA
