Better way to filter/check data?

  • The source system is a DB2 data warehouse which pulls data from an iSeries. I then take that DB2 data and pull it into SQL Server via SSIS. The issue is that somewhere along the line, high-end Unicode characters are getting tacked onto fields where we don't expect them. For downstream reasons I need to filter these out. I've done a lot of research, testing, and trial and error, and came up with a couple of approaches.

    My preferred solution would be to use custom code to validate the data. Using a function I saw posted here on SSC for testing ASCII values as my base, I came up with an inline function that did what I wanted. Unfortunately it ran really, really slowly, and I have millions of rows to process. So I tried a table-valued function and APPLY. That was a little better, but still far too slow (it benchmarked at about 1,000 records per hour).

    So for now I'm using the SSIS data conversion component with an error branch. This works and performs well, but it has limitations I'm not happy with. The main one is that I have no control over what gets judged "good" or "bad" beyond whether the value fits the conversion data type. It also feels rather kludgey to me.

    For the curious, here's the table-valued function I tried:

    CREATE FUNCTION [dbo].[IsHighUnicodeTbl]
    (
        @OriginalText NVARCHAR(MAX)
    )
    RETURNS @Check TABLE ( subloc INT, isvalid INT )
    BEGIN
        WITH Tally ( N )
        AS ( SELECT TOP 1000000
                    ROW_NUMBER() OVER ( ORDER BY sc1.id )
             FROM Master.dbo.SysColumns sc1
             CROSS JOIN Master.dbo.SysColumns sc2
           )
        INSERT INTO @Check ( subloc, isvalid )
        SELECT Tally.N AS subloc,
               CASE WHEN UNICODE(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 32 AND 126
                    THEN 0
                    ELSE 1
               END AS isvalid
        FROM Tally
        WHERE Tally.N <= LEN(@OriginalText);

        RETURN;
    END
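
    For what it's worth, the multi-statement form above has to materialize the @Check table variable on every call, which tends to dominate the runtime at millions of rows. A single-statement inline TVF lets the optimizer expand the logic into the calling query instead. Here's a minimal sketch of that rewrite (same output shape, same 1,000,000-character cap, placeholder names; untested against the actual workload):

    CREATE FUNCTION dbo.IsHighUnicodeInline ( @OriginalText NVARCHAR(MAX) )
    RETURNS TABLE
    AS
    RETURN
        -- Build the tally from row constructors instead of syscolumns:
        -- E1 = 10 rows, E2 = 100, E4 = 10,000, E6 = 1,000,000.
        WITH E1(N) AS ( SELECT 1 FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS v(N) ),
             E2(N) AS ( SELECT 1 FROM E1 a CROSS JOIN E1 b ),
             E4(N) AS ( SELECT 1 FROM E2 a CROSS JOIN E2 b ),
             E6(N) AS ( SELECT 1 FROM E4 a CROSS JOIN E2 b ),
             Tally(N) AS ( SELECT TOP ( ISNULL(LEN(@OriginalText), 0) )
                                  ROW_NUMBER() OVER ( ORDER BY (SELECT NULL) )
                           FROM E6 )
        SELECT Tally.N AS subloc,
               CASE WHEN UNICODE(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 32 AND 126
                    THEN 0
                    ELSE 1
               END AS isvalid
        FROM Tally;

    It would be called the same way as the original, e.g. CROSS APPLY dbo.IsHighUnicodeInline(t.SomeColumn), where t.SomeColumn stands in for whichever column is being checked.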

  • Not sure if it would speed you up, but an inline script component that checks the field as it passes by and routes rows to two output streams depending on what it finds would give you the same functionality within the data flow.

    Using a conditional split to fire the rows into separate streams, each running the same code, and recombining them afterwards would allow forced parallelism.

    There's really nothing I can think of in the calculated columns or other pre-built components that will let you do a character-by-character review of the inbound data inside a single column.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Do you really need to do a character-by-character review of the given column, or just find columns that contain invalid characters? Perhaps you just need to return the first instance of a high-ASCII character, which you can easily do with PATINDEX().

    SELECT PATINDEX('%[^ -~]%', YourField)

    FROM YourTable
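
    If the goal is just to flag or quarantine the offending rows, that same expression can sit straight in a WHERE clause (YourTable and YourField are the placeholder names from the snippet above):

    -- Any nonzero PATINDEX means the field contains at least one
    -- character outside the printable range (space through tilde, 32-126).
    SELECT *
    FROM YourTable
    WHERE PATINDEX('%[^ -~]%', YourField) > 0;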

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • It's not ASCII, it's Unicode. The values go up to 65533 (decimal), which, by the way, we do get.

    Random sampling via PATINDEX is no good.
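
    As an aside, 65533 (decimal) is U+FFFD, the Unicode REPLACEMENT CHARACTER, which conversion routines typically substitute when they hit a byte sequence they cannot map, so seeing it suggests the corruption happens upstream rather than in SSIS. If that's the case, a targeted check is possible (YourTable and YourField are placeholders):

    -- Flag rows containing U+FFFD. The binary collation matters:
    -- dictionary collations can give undefined code points no sort
    -- weight, which would make CHARINDEX miss them.
    SELECT *
    FROM YourTable
    WHERE CHARINDEX(NCHAR(65533) COLLATE Latin1_General_BIN, YourField) > 0;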

  • Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

  • ZZartin (4/28/2011)


    Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

    While not a bad idea, the primary problem here is the location of bad data, not the correction of it.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Craig Farrell (4/28/2011)


    ZZartin (4/28/2011)


    Have you tried using a derived column in the data flow to replace the columns you're getting bad data in with cleaned data?

    While not a bad idea, the primary problem here is the location of bad data, not the correction of it.

    Ah, good point. If all you want to do is exclude the rows, you might be able to use a script component to parse them and only add the valid rows to the output.

  • Parsing the rows is the issue: the only way I can find is to check every character in the string for each column that might contain the unusable characters. I'm already doing a convert to get things into a friendlier, more usable format, but that convert is what causes the error.

  • Pam Brisjar (4/28/2011)


    It's not ASCII, it's Unicode. The values go up to 65533 (decimal), which, by the way, we do get.

    Random sampling via PATINDEX is no good.

    PATINDEX is NOT a random sampling. It searches all possible matches until it finds one or exhausts the search.

    PATINDEX will work on Unicode strings as well as on ASCII strings.

    If you need to get a result for every single character in the string, then PATINDEX will not work, but if you only need to know if a string contains any occurrences of certain characters, then PATINDEX could very well work.
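
    One collation caveat, though: in dictionary collations the range inside [^ -~] follows sort order rather than code-point order, so on NVARCHAR data a binary collation is the safer way to express "outside 32-126". A sketch, again with placeholder names:

    -- Force code-point semantics for the range check and report
    -- where the first offending character sits and what it is.
    SELECT YourField,
           PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField) AS bad_pos,
           SUBSTRING(YourField,
                     PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField),
                     1) AS bad_char
    FROM YourTable
    WHERE PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN, YourField) > 0;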

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA
