Easy way to filter out invalid characters?

  • Just like the subject says - we have data coming from other systems where people have copied/pasted data from various programs, emails, web sites, etc and those include characters that are not found on your standard US keyboard layout. While that isn't always a problem, some of those characters cause exceptions in our code because they are not handled properly. While I realize that we can update the code at some point to handle these characters more gracefully, is there any relatively easy way to scrub all char/varchar columns in the database to replace items such as Vertical Tabs or similar entries with a space?

    Anyone done this before and willing to share the T-SQL? We have a little luxury here in being able to scrub incoming data if the process is efficient enough to fit in our pipeline. We just need to somehow do it prior to import into the production tables or otherwise "fix" the data in those tables.

    Thanks in advance for the help.

    -Peter

  • yes there is an easy efficient way;

    look at this post:

    http://www.sqlservercentral.com/Forums/Topic398428-338-1.aspx

    there, someone wanted to replace extended ascii with the html encode; ie '‘Special Digital Data Service Obligation’ got some characters stripped out, but left other items in place; your problem is exactly the same, except you'd substitute an empty string for the value instead.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Possibly the "most straightforward" might be to build a CLR UDF implementation of Regex.replace.

    Imports System

    Imports System.Data

    Imports System.Data.SqlClient

    Imports System.Data.SqlTypes

    Imports Microsoft.SqlServer.Server

    Imports System.Text.RegularExpressions

    Partial Public Class UserDefinedFunctions

    <Microsoft.SqlServer.Server.SqlFunction()>_

    Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString

    ' Add your code here

    Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value, optionS)

    Return New SqlString(rex.Replace(New String(input.Value), rep.Value))

    End Function

    End Class

    Publish that to your server - and you can then use syntax like:

    select dbo.regexreplace('a cat jumped over the fence','\x0B',' ')

    where (if my ASCII isn't too rusty) \x0B= ASCII 11 = vertical tab

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply