substring comparison for last 2 characters

  • I have a field with data like below I need to join on

    patid

    123453

    124344ME

    323390

    3233MS

    I need to remove the last 2 charcters if they are not numeric. So my results for the join should be

    patid

    123453

    124344

    323390

    3233

  • timscronin (7/1/2013)


    I have a field with data like below I need to join on

    patid

    123453

    124344ME

    323390

    3233MS

    I need to remove the last 2 charcters if they are not numeric. So my results for the join should be

    patid

    123453

    124344

    323390

    3233

    What would you do with "1234C5" and "12345C" ??


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • wont happen, the records with the 2 characters at the end are state abbreviations added on

  • A possible solution would be like this. I'm worried about performance though. Couldn't you correct the data in the first place?

    CREATE TABLE #Table1( patid varchar(10))

    CREATE TABLE #Table2( patid varchar(10))

    INSERT INTO #Table1 VALUES('123453'),('124344ME'),('323390'),('3233MS')

    INSERT INTO #Table2 VALUES('123453'),('124344'),('323390'),('3233')

    SELECT *

    FROM #Table1 a

    JOIN #Table2 b ON CASE WHEN a.patid NOT LIKE '%[^0-9]%' THEN a.patid ELSE LEFT(a.patid, LEN(a.patid) - 2) END = b.patid

    DROP TABLE #Table1

    DROP TABLE #Table2

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Here is one more way:

    create table Demo (Col1 varchar(10))

    go

    insert into Demo(Col1) values ('123453'),('124344ME'),('123453'),('124344ME'),('323390'),('3233MS')

    select Col1, substring(Col1,1, CASE WHEN Col1 like '%[aA-zZ]' THEN len(Col1)-2 else len(Col1) end)

    from Demo

    --------------------------------------------------------------
    To know how to ask questions and increase the chances of getting asnwers:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • problem is with data like this (my bad should have included it)

    1233IL

    22M33

    Would exclude 2nd record

  • Create some sample data:

    IF OBJECT_ID('tempdb..#SampleData') IS NOT NULL

    DROP TABLE #SampleData

    CREATE TABLE #SampleData (

    [ID] INT IDENTITY(1,1) NOT NULL,

    [strVal] NVARCHAR(50) NULL,

    PRIMARY KEY (ID))

    INSERT INTO #SampleData

    VALUES

    ('123453ME'),('124344'),('323390'),('32339826455MS'),

    ('345678'),('39WA'),('986545'),('1234AK'),

    ('345XX8'),('398644GA'),('F986545'),('98234AZ')

    Now query the data and strip off the unwanted suffixes

    as well as eliminating any entries with an embedded alpha char.

    SELECT

    r.strVal

    FROM

    (

    SELECT

    (CASE

    WHEN PATINDEX('%[A-Za-z]%',RIGHT(strVal,2)) > 0

    THEN REPLACE(strVal,RIGHT(strVal,2),'')

    ELSE strVal

    END) AS strVal

    FROM

    #SampleData AS sd

    ) r

    WHERE

    PATINDEX('%[A-Za-z]%',strVal) = 0

     

  • SELECT

    mt.##col1##, ..., jt.##co11##, ...

    FROM dbo.maintable mt

    LEFT OUTER JOIN jointable jt ON

    --ignore last two bytes of patid if they are alpha, because it's a state abbrev, not part of key value

    jt.key = LEFT(mt.patid, LEN(mt.patid) - CASE WHEN RIGHT(mt.patid, 2) LIKE '[a-z][a-z]' THEN 2 ELSE 0 END)

    Edit: Added sql code tags to colorize code.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • timscronin (7/1/2013)


    wont happen, the records with the 2 characters at the end are state abbreviations added on

    Then it's easy... Build a "State" table (you should have one anyway) and the your WHERE clause would look like the following...

    WHERE RIGHT(PatID,2) IN (SELECT StateAbbv FROM dbo.State)

    Be advised that no matter what you do, performance is going to stink because of the PatID column being wrapped in a function. What I'd recommend doing if you need to do this a lot is to make the RIGHT(Pat,2) a PERSISTED calculated column in the table and index it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

    select dbo.ExtractInteger(patid) as [patid]

    from table

    USE [Sample]

    GO

    /****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))

    RETURNS VARCHAR(1000)

    AS

    BEGIN

    DECLARE @Count INT

    DECLARE @IntNumbers VARCHAR(1000)

    SET @Count = 0

    SET @IntNumbers = ''

    WHILE @Count <= LEN(@String)

    BEGIN

    IF SUBSTRING(@String,@Count,1) >= '0'

    AND SUBSTRING(@String,@Count,1) <= '9'

    BEGIN

    SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)

    END

    SET @Count = @Count + 1

    END

    RETURN @IntNumbers

    END

  • erikd (7/2/2013)


    I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

    select dbo.ExtractInteger(patid) as [patid]

    from table

    USE [Sample]

    GO

    /****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))

    RETURNS VARCHAR(1000)

    AS

    BEGIN

    DECLARE @Count INT

    DECLARE @IntNumbers VARCHAR(1000)

    SET @Count = 0

    SET @IntNumbers = ''

    WHILE @Count <= LEN(@String)

    BEGIN

    IF SUBSTRING(@String,@Count,1) >= '0'

    AND SUBSTRING(@String,@Count,1) <= '9'

    BEGIN

    SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)

    END

    SET @Count = @Count + 1

    END

    RETURN @IntNumbers

    END

    SVFs are notoriouly poor for performance, even when measured correctly.

    Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.

    WITH Phones (phone) AS (

    SELECT '66814382803' UNION ALL SELECT '123A123456789'

    UNION ALL SELECT '0814382803'

    )

    SELECT Item

    FROM Phones

    CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')

    WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11

    I am curious what your function would return for that second phone number in my list.

    PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.

    Note that PatternSplitCM would also work for the OP's case.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/2/2013)


    erikd (7/2/2013)


    I use the following SVF to extract integers only from phone number columns. So, you would run the following and it would return only the numbers from your patid column.

    select dbo.ExtractInteger(patid) as [patid]

    from table

    USE [Sample]

    GO

    /****** Object: UserDefinedFunction [dbo].[ExtractInteger] Script Date: 07/02/2013 11:17:35 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))

    RETURNS VARCHAR(1000)

    AS

    BEGIN

    DECLARE @Count INT

    DECLARE @IntNumbers VARCHAR(1000)

    SET @Count = 0

    SET @IntNumbers = ''

    WHILE @Count <= LEN(@String)

    BEGIN

    IF SUBSTRING(@String,@Count,1) >= '0'

    AND SUBSTRING(@String,@Count,1) <= '9'

    BEGIN

    SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)

    END

    SET @Count = @Count + 1

    END

    RETURN @IntNumbers

    END

    SVFs are notoriouly poor for performance, even when measured correctly.

    Initially I'd do it like this. Until that is I had the time to go back and determine the fastest method.

    WITH Phones (phone) AS (

    SELECT '66814382803' UNION ALL SELECT '123A123456789'

    UNION ALL SELECT '0814382803'

    )

    SELECT Item

    FROM Phones

    CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')

    WHERE [Matched] = 1 AND LEN(Item) BETWEEN 7 AND 11

    I am curious what your function would return for that second phone number in my list.

    PatternSplitCM is a iTVF that can be found in the 4th article in my signature links. Since it is a general purpose function, much faster ways could probably be devised that would get you to the same result.

    Note that PatternSplitCM would also work for the OP's case.

    Hi Dwain,

    I work with a lot of tables, but none of them are very large. The largest one I currently deal with is for a monthly tracker that is recently up to around 300k rows, so I don't notice a lot (if any) performance increases that I see people talk about. So, for my purposes, the SVF works well.

    Here's what it returns from the second number you posted:

    select dbo.extractinteger('123A123456789') as [For Dwain]

    For Dwain

    123123456789

    I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.

  • erikd (7/2/2013)


    I work with a lot of tables, but none of them are very large.

    Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.

    erikd (7/2/2013)


    Here's what it returns from the second number you posted:

    select dbo.extractinteger('123A123456789') as [For Dwain]

    For Dwain

    123123456789

    I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.

    I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.

    I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.

    PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.

    Did I just say that? 😉


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/2/2013)


    erikd (7/2/2013)


    I work with a lot of tables, but none of them are very large.

    Please don't get me wrong when I say this. That is today. Once your application is successful beyond your wildest dreams and your tables start having many millions of rows, you'll probably be forced to rethink that statement. Performance should rarely be measured as "good enough," mainly because by the time it becomes "not good enough" it will be a serious challenge to make it good enough once again.

    erikd (7/2/2013)


    Here's what it returns from the second number you posted:

    select dbo.extractinteger('123A123456789') as [For Dwain]

    For Dwain

    123123456789

    I have your string splitter link open in a new tab. Will read and get back to you when I've tried it out a bit.

    I knew the answer but what I was wondering is does that seem like the right thing to do? For example suppose the string was 123456789x1231 where x denotes an extension.

    I hope you find value in the article. Like I said, I consider PatternSplitCM to be sort of a stopgap even though for the task it does it performs reasonably well. Most of the time, you'll find you can construct a more direct approach to a particular problem that will run faster.

    PatternSplitCM will get you to working code faster but with a little more effort you can get your code working faster.

    Did I just say that? 😉

    No, you're right. Good enough isn't good enough. What I was getting at is that I don't have a table large enough to test good enough against better or best. I don't even have a particularly good testing environment (a SQL 2005 instance on a rather ancient and rickety server that half a dozen people RDP into at a time). But, a guy can dream, when he's done dreaming about Johnnie Walker Blue and rare steaks.

    Phone numbers with extensions are screened out, because they can't be auto dialed, so it's GENERALLY not an issue, though it's caused a few headaches when clients send over their contact lists with everything plopped into one column.

    I am going to try to thumb wrestle your pattern splitter to do what mine does and just return the cleaned result.

  • Erik - Just because your Prod system doesn't have a million rows to test against, doesn't mean you can't test against a million rows!

    The test harness in the article demonstrates the generic approach for this (which I do not take credit for by the way).

    This would directly compare PatternSplitCM against your SVF.

    CREATE TABLE #Phones (phone VARCHAR(12));

    WITH Tally (n) AS (

    SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b)

    INSERT INTO #Phones

    SELECT 1000000+ABS(CHECKSUM(NEWID()))%1000000

    FROM Tally;

    DECLARE @Phone VARCHAR(12), @StartDT DATETIME;

    PRINT 'PatternSplitCM';

    SELECT @StartDT = GETDATE();

    SET STATISTICS TIME ON;

    SELECT @Phone=Item

    FROM #Phones

    CROSS APPLY dbo.PatternSplitCM(phone, '[0-9]')

    WHERE [Matched]=1;

    SET STATISTICS TIME OFF;

    SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

    PRINT 'extractinteger';

    SELECT @StartDT = GETDATE();

    SET STATISTICS TIME ON;

    SELECT @Phone=dbo.extractinteger(phone)

    FROM #Phones;

    SET STATISTICS TIME OFF;

    SELECT ElapsedMS=DATEDIFF(millisecond, @StartDT, GETDATE());

    GO

    DROP TABLE #Phones;

    Edit: I modified it to SELECT the ElapsedMS of each run, which is needed due to a bias that is introduced by applying STATISTICS TIME to an SVF as demonstrated here: http://www.sqlservercentral.com/articles/T-SQL/91724/


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply