Split a string on Commas - except when it has a numeric character immediately on either side of it

  • I have a string splitting function I am using but now I am running into an issue where people put commas into numbers in a varchar field, and the string is being split on the commas in the number field.

    The current function is as follows .

    [Code]

    GO

    /****** Object: UserDefinedFunction [dbo].[Split] Script Date: 09/14/16 2:39:48 PM ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER FUNCTION [dbo].[Split] (

    @InputString VARCHAR(8000),

    @Delimiter VARCHAR(50)

    )

    RETURNS @Items TABLE (

    Item VARCHAR(8000)

    )

    AS

    BEGIN

    IF @Delimiter = ' '

    BEGIN

    SET @Delimiter = ','

    SET @InputString = REPLACE(@InputString, ' ', @Delimiter)

    END

    IF (@Delimiter IS NULL OR @Delimiter = '')

    SET @Delimiter = ','

    --INSERT INTO @Items VALUES (@Delimiter) -- Diagnostic

    --INSERT INTO @Items VALUES (@InputString) -- Diagnostic

    DECLARE @Item VARCHAR(8000)

    DECLARE @ItemList VARCHAR(8000)

    DECLARE @DelimIndex INT

    SET @ItemList = @InputString

    SET @DelimIndex = CHARINDEX(@Delimiter, @ItemList, 0)

    WHILE (@DelimIndex != 0)

    BEGIN

    SET @Item = SUBSTRING(@ItemList, 0, @DelimIndex)

    INSERT INTO @Items VALUES (@Item)

    -- Set @ItemList = @ItemList minus one less item

    SET @ItemList = SUBSTRING(@ItemList, @DelimIndex+1, LEN(@ItemList)-@DelimIndex)

    SET @DelimIndex = CHARINDEX(@Delimiter, @ItemList, 0)

    END -- End WHILE

    IF @Item IS NOT NULL -- At least one delimiter was encountered in @InputString

    BEGIN

    SET @Item = @ItemList

    INSERT INTO @Items VALUES (@Item)

    END

    -- No delimiters were encountered in @InputString, so just return @InputString

    ELSE INSERT INTO @Items VALUES (@InputString)

    RETURN

    END -- End Function

    [/code]

    I would like to add something into this function to ignore the delimiter if it is in a situation where there is a number on either side . I know for a LIKE statement, I could do something along the lines of LIKE '%[0-9],[0-9]%'. However, I am not really sure how I could add a check of that nature to this function.

  • Clean up your data before submitting it to your function.

    DECLARE @input VARCHAR(250) = 'abc,def,ghi,123,jkl,345,678,mno,901,234,567,pqr'

    WHILE PATINDEX('%[0-9],[0-9]%', @input) > 0

    SET @input = STUFF(@input, PATINDEX('%[0-9],[0-9]%', @input) + 1, 1, '')

    You should also change your function to something that performs much better. Search this site for DelimitedSplit8K.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • You might also look at the pattern-based splitter function found HERE[/url].

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • drew.allen (9/14/2016)


    Clean up your data before submitting it to your function.

    DECLARE @input VARCHAR(250) = 'abc,def,ghi,123,jkl,345,678,mno,901,234,567,pqr'

    WHILE PATINDEX('%[0-9],[0-9]%', @input) > 0

    SET @input = STUFF(@input, PATINDEX('%[0-9],[0-9]%', @input) + 1, 1, '')

    You should also change your function to something that performs much better. Search this site for DelimitedSplit8K.

    Drew

    I don't really understand how this works. The only part I really get is that the PATINDEX looks for that pattern, but I am not really sure how the STUFF portion of this works . I'd like to have a better understanding of what I am doing here.

  • Steven.Grzybowski (9/14/2016)


    drew.allen (9/14/2016)


    Clean up your data before submitting it to your function.

    DECLARE @input VARCHAR(250) = 'abc,def,ghi,123,jkl,345,678,mno,901,234,567,pqr'

    WHILE PATINDEX('%[0-9],[0-9]%', @input) > 0

    SET @input = STUFF(@input, PATINDEX('%[0-9],[0-9]%', @input) + 1, 1, '')

    You should also change your function to something that performs much better. Search this site for DelimitedSplit8K.

    Drew

    I don't really understand how this works. The only part I really get is that the PATINDEX looks for that pattern, but I am not really sure how the STUFF portion of this works . I'd like to have a better understanding of what I am doing here.

    The code searches for the pattern {digit}{comma}{digit} and returns the position in the string where the pattern is found. The STUFF function replaces the character at the found position + 1 with an empty string (so it removes the comma). The WHILE bit makes sure that all of the commas which match the search pattern are removed.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Alternatively, you could create a specialized version of DelimitedSplit8K_LEAD[/url] that handles this (I'll let you come up with a better name;-)). My modification is bold/underlined.

    CREATE FUNCTION [dbo].[DelimitedSplit8K_LEAD_XX]

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table” produces values from 0 up to 10,000...

    -- enough to cover VARCHAR(8000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT 0 UNION ALL

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1

    FROM cteTally t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    AND (SUBSTRING(@pString,t.N-1,3) NOT LIKE CONCAT('[0-9]',@pDelimiter,'[0-9]'))

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))

    FROM cteStart s;

    Example:

    DECLARE @input VARCHAR(250) = 'abc,def,ghi,123,jkl,345,678,mno,901,234,567,pqr';

    SELECT * FROM dbo.DelimitedSplit8K_LEAD_XX(@input,',');

    Results:

    ItemNumber Item

    -------------------- -------------

    1 abc

    2 def

    3 ghi

    4 123

    5 jkl

    6 345,678

    7 mno

    8 901,234,567

    9 pqr

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Best way to fix this is to have a pork chop dinner with the people that are providing the data. πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Some questions, since you submitted no sample data.

    The problem involves two or more numbers being strung together like this:

    23,346

    246,802

    7,890,123

    23,346,246,802,7,890,123

    Questions:

    First, does this varchar column contain ONLY numbers?

    Second, do all the numbers contain decimals as well as commas?

    Third, can you go back to the source of this data and tell them they have to use another delimiter instead of commas? "|" might be a good one.

    Edited to add: The best answer is #3. If they say no to that then pork chops should definitely be on the menu. The data entry is responsible for making the data digestible. Garbage in, garbage out.

    Edited to add question #4: Are all the numbers the same number of digits?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Jeff Moden (9/14/2016)


    Best way to fix this is to have a pork chop dinner with the people that are providing the data. πŸ˜‰

    I wish I could. The problem is that the comma delimited data are a result of a CRM, and I am pretty sure I would have a revolt by the account executives if I told them to not put commas in their numbers.

  • Steven.Grzybowski (9/15/2016)


    Jeff Moden (9/14/2016)


    Best way to fix this is to have a pork chop dinner with the people that are providing the data. πŸ˜‰

    I wish I could. The problem is that the comma delimited data are a result of a CRM, and I am pretty sure I would have a revolt by the account executives if I told them to not put commas in their numbers.

    So change the column delimiter to something which does not otherwise appear. | is often a good candidate.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (9/15/2016)


    Steven.Grzybowski (9/15/2016)


    Jeff Moden (9/14/2016)


    Best way to fix this is to have a pork chop dinner with the people that are providing the data. πŸ˜‰

    I wish I could. The problem is that the comma delimited data are a result of a CRM, and I am pretty sure I would have a revolt by the account executives if I told them to not put commas in their numbers.

    So change the column delimiter to something which does not otherwise appear. | is often a good candidate.

    +1

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hey, I want a +1 too !!

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • So change the column delimiter to something which does not otherwise appear. | is often a good candidate.

    or better yet fix the input routines to strip out the superfluous commas before they even get to the database. If they need to display them use an appropriate format.

    Store numeric data in numeric columns not character ones.

    I suppose you should be grateful you don't need to handle values formatted for France as well. i.e. 1.000.000,000.000 :w00t:

  • crmitchell (9/20/2016)


    I suppose you should be grateful you don't need to handle values formatted for France as well. i.e. 1.000.000,000.000 :w00t:

    I never understand numbers that have multiple periods in the middle of them...

    I mean, i gave you the number "123.456.789" what number is that?

    1. One hundred and twenty three million, four hundred and fifty six thousand, seven hundred and eighty nine

    2. One hundred and twenty three thousand, four hundred and fifty six, point seven eight nine

    3. One hundred and twenty three, point four five six seven eight nine

    4. It's not a number, of some kind of GPS co-ordinate

    :pinch:

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Just for the fun of it... Performance is about the same as what Alan posted previously...

    DECLARE

    @String VARCHAR(250) = '123,346,abc,def,ghi,123,jkl,345,678,mno,901,234,567,pqr',

    @Delimiter CHAR(1) = ',';

    WITH

    n (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),

    cte_Tally (n) AS (

    SELECT TOP (DATALENGTH(@String))

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM n n1, n n2, n n3, n n4

    ),

    cte_Split AS (

    SELECT

    t.N,

    Val = SUBSTRING(@String, t.N, 1),

    LagLead = CONCAT(SUBSTRING(@String, t.N - 1, 1), SUBSTRING(@String, t.N + 1, 1))

    FROM

    cte_Tally t

    ),

    cte_Starts AS (

    SELECT N = 0

    UNION ALL

    SELECT

    s.N

    FROM

    cte_Split s

    WHERE

    s.Val = @Delimiter

    AND TRY_CAST(s.LagLead AS INT) IS NULL

    )

    SELECT

    ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N),

    Item = SUBSTRING(@String, s.n +1, ISNULL(LEAD(s.n, 1) OVER (ORDER BY s.N), 8000) - s.n -1)

    FROM

    cte_Starts s;

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply