Consultant about like operator

  • I'm dealing with filtering system which should control whether input string contains bad words or not. In this regard there is table which contains bad words like below :

    CREATE TABLE [dbo].[BadWords](

    [Id] [int] NOT NULL,

    [Word] [nvarchar](50) NOT NULL,

    CONSTRAINT [PK_BadWords] PRIMARY KEY CLUSTERED

    (

    [Id] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    First of all, all delimiters should be replaced in input string due to support alternative shape of word for instance consider 'bcd' is bad word, is should determine 'abc,de' is bad word(all delimiters removed).

    Next, check if input string contains any bad word or not?

    DECLARE @Input NVARCHAR(640) = 'AAB,CDE'

    SET @Input =REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Input ,',',''),' ',''),'-',''),'_',''),'.','')

    IF NOT EXISTS(SELECT TOP(1) 1 FROM [dbo].[BadWords] WITH(NOLOCK) WHERE @Input LIKE N'%'+Word+N'%')

    as it is obvious owing to % sql wont use index on Word and make it response slowly. Do you have any solution to optimize it for example other way or sth else like full text search ?

    Tnx

  • You could break your string into its individual words, using something like Jeff Moden's splitter[/url] and then match those against your bad words table. That should run fairly quickly.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I cannot determine individual word because user may manipulate it for example consider this sentence : i'm de,vil

    If devil is bad word you cannot detect it! In this regard I replace delimiters and then compare with bad words data

  • So do the replace first, then the split.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • The point is I cannot determine word ! As I said consider i'm dev,il

    After replace it turns to "imdevil" ! Seprating charachters need artificial intelligence which isnot related to my question.

  • farax_x (3/11/2015)


    The point is I cannot determine word ! As I said consider i'm dev,il

    After replace it turns to "imdevil" ! Seprating charachters need artificial intelligence which isnot related to my question.

    You seem to be contradicting yourself. In your original post you talk about replacing certain characters with other so you can look for these offending words. But now you say you can't to it.

    As a side note, if your input contains full sentences your logic is going to be far more challenging because you are removing commas and periods.

    Another side note...why the nolock hint on your BadWords table?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (3/11/2015)


    farax_x (3/11/2015)


    The point is I cannot determine word ! As I said consider i'm dev,il

    After replace it turns to "imdevil" ! Seprating charachters need artificial intelligence which isnot related to my question.

    You seem to be contradicting yourself. In your original post you talk about replacing certain characters with other so you can look for these offending words. But now you say you can't to it.

    As a side note, if your input contains full sentences your logic is going to be far more challenging because you are removing commas and periods.

    Another side note...why the nolock hint on your BadWords table?

    I should clarify issue. Consider a table content like below

    INSERT INTO dbo.BadWords

    ( Id, Word )

    VALUES ( 1, N'badword' )

    and a sentence such as @input ='this is bad,word'.

    I want to check if @input contain any bad words. As it is clear if I use original value of @input it wont have result

    SELECT TOP ( 1 )

    *

    FROM [dbo].[BadWords] WITH ( NOLOCK )

    WHERE @Input LIKE N'%' + Word + N'%'

    ;thus, to get exact result I should replace all delimiters to have one pure sentence

    SET @Input =REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Input ,',',''),' ',''),'-',''),'_',''),'.','')

    therefor 'this is bad,word' changes to 'thisisbadword' and if use previous select it would have result.

    The issue is performance of this query and algorithm ! Let me know if the is better way to check if text like 'thisisbadword' contains any match in table !! The problem of my Select is % which cause Sql not uses index on it.

    I use NOLOCK due to reduce locking on this table additionally dirty read is not important in this case.

  • farax_x (3/11/2015)


    Sean Lange (3/11/2015)


    farax_x (3/11/2015)


    The point is I cannot determine word ! As I said consider i'm dev,il

    After replace it turns to "imdevil" ! Seprating charachters need artificial intelligence which isnot related to my question.

    You seem to be contradicting yourself. In your original post you talk about replacing certain characters with other so you can look for these offending words. But now you say you can't to it.

    As a side note, if your input contains full sentences your logic is going to be far more challenging because you are removing commas and periods.

    Another side note...why the nolock hint on your BadWords table?

    I should clarify issue. Consider a table content like below

    INSERT INTO dbo.BadWords

    ( Id, Word )

    VALUES ( 1, N'badword' )

    and a sentence such as @input ='this is bad,word'.

    I want to check if @input contain any bad words. As it is clear if I use original value of @input it wont have result

    SELECT TOP ( 1 )

    *

    FROM [dbo].[BadWords] WITH ( NOLOCK )

    WHERE @Input LIKE N'%' + Word + N'%'

    ;thus, to get exact result I should replace all delimiters to have one pure sentence

    SET @Input =REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Input ,',',''),' ',''),'-',''),'_',''),'.','')

    therefor 'this is bad,word' changes to 'thisisbadword' and if use previous select it would have result.

    The issue is performance of this query and algorithm ! Let me know if the is better way to check if text like 'thisisbadword' contains any match in table !! The problem of my Select is % which cause Sql not uses index on it.

    So since you are trying to deal with sentences your logic is flawed. What about something like this...

    "It smells. Except on Tuesday". This clearly has no bad words right...but using your logic of removing all punctuation and spaces this previously harmless sentence now looks like this: "itsmellsexceptontuesday"

    Suddenly a new bad word appears out of nowhere. Now if you mean the input is not intended to be sentences and instead is something more like passwords and/or usernames that are supposed to be a single word and you are trying to potentially bad patterns that is a little different. However even that is a bit challenging. I once tried to use the name "Bubbles Bodine" in a game and was told I couldn't use that name because it contains bad words. Can you spot the bad word? It took me a LONG time to figure that out.

    I use NOLOCK due to reduce locking on this table additionally dirty read is not important in this case.

    This BadWords table appears to be a pretty static table so locking is highly unlikely in the first place. If it is a table that is in motion dirty reads are only a portion of the problem. Missing and/or duplicate data is possible. All too often this hint is thrown around because it "makes the query faster". That speed has the potential for some serious cost that most people don't understand.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Just as a side bar, I worked in a travel company with a rather aggressive profanity filter on the mail server, and bounce legitimate emails that contained the word Bastardise(d), which was annoying when talking about changing a route, or having to create a solution.

    Also you have to consider, that there are names especially in European Languages like Dutch, French, Spanish, Itallian, Dutch that could fall foul of your rules, eg Abraham Benjamin de Villiers (A South African Cricketer), or what about someone talking about 101 Dalmations, and Cruella De Vil, that also would fall foul of the example you gave.

    Then there are also 'profanities' within words, Scunthorpe, Cummulative, Cumberland, Essex, etc.

    So you also then need an exception filter for words that have profanities, but are actually regular words.

    How do you handle typo's like Cuntry?

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • It's still not clear whether or not you want to reject "made vile" as unacceptable input for containing a bad word.

    If it's acceptable input, then the recommendations about stripping out separators other than white space, then extracting individual words, come into play.

    If it contains a bad word then you must continue to replace the space character.

  • I think other people here have merits in pointing out the shortcomings of your system in terms of it's infallibility 🙂 ,

    but

    here is a function for you to clean up the string.

    I have found here,(I think), and fix up a bit to work as I needed.

    it should be easier to work with than a row of replaces, more flexible too.

    called like this

    SELECT dbo.fn_RegExCleanString('123 FAIR''VIEW AVE; HUDSON: N.Y. 12534-0000',DEFAULT,DEFAULT,DEFAULT)

    it will strip the input string for all but AlphaNumeric characters

    but will not trim the outcome, changing last "DEFAULT" to 1 will also trim the output string.

    using a RegEx as second parameter let you create your own filter as needed.

    works very well for me.

    CREATE FUNCTION [dbo].[fn_RegExCleanString]

    (

    @Input VARCHAR(MAX)

    ,@Expression VARCHAR(500) = '%[^a-zA-Z0-9]%'

    ,@ReplaceWith VARCHAR(100) = ''

    ,@TrimOutput BIT = 0

    )

    RETURNS VARCHAR(MAX)

    AS

    BEGIN

    -- Make sure we use default values when needed.

    SELECT @Expression = COALESCE(@Expression,'%[^a-zA-Z0-9]%')

    ,@ReplaceWith = COALESCE(@ReplaceWith,'')

    declare @ExpLen int

    set @ExpLen = CASE WHEN LEN(@ReplaceWith)>0 THEN LEN(@ReplaceWith) ELSE 1 END

    WHILE PATINDEX(@Expression, @Input) > 0

    BEGIN

    SET @Input = STUFF(@Input, PATINDEX(@Expression, @Input), @ExpLen, @ReplaceWith)

    END

    IF(@TrimOutput > 0)RETURN LTRIM(RTRIM(@Input))

    RETURN @Input

    END

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply