How do I extract information from a string

  • How do I extract the following information from a string. The string varies depending if it is an invoice or a PO. Also, the Invoice number is not always the same field length.

    Any help would be appreciated.

    This is an example of the string:

    V# H019978 I# 95492175 R# 854954

    V# H020945 PO# 852194 ACCRUAL

    I need a column for Vendor, InvoiceNum, PONum, ReturnNum

    If there is no invoice # I need the field to be blank for that column. The same for the PO and return.

    So the columns should be like:

    Vendor InvoiceNum PONum Return

    H019978 95492175 854954

    H020945 852194

    Wendy

  • Hi Wendy

    Sure that is possible with SQL, but I would advice to do it with a client application or SSIS. How do you get this string?

    If you really want to do this by SQL give a short feedback.

    Greets

    Flo

  • Thanks for Responding.

    I am using a data repository to create reports that I write for our Hospital Information Systems that utilizes Visual Studio 2005.

    I am working under the DATA tab where I am creating my query.

    The field is TxnDescription in the dbo.GLBATCHMoneyBatchTransactions table that holds this string.

    I am fairly new with utilizing this software and learining SQL at the same time.

    Wendy

  • [font="Verdana"]Here's some example code.

    Note that I have assumed that there is always a space between the "V#" and the actual vendor number. I've also assumed that vendor numbers are always in the form "H[0-9]+" (an H followed by one or more digits). I've made similar assumptions about the other fields. You can vary the code to cope if any of those assumptions aren't true.

    declare @workedExample table(

    LineNumber int identity(1,1) not null primary key,

    Line varchar(200) not null

    );

    insert into @workedExample(

    Line

    )

    select 'V# H019978 I# 95492175 R# 854954' as Line union all

    select 'V# H020945 PO# 852194 ACCRUAL' as Line union all

    select 'V# H0123' as Line;

    with

    LineDetails as (

    select LineNumber,

    --

    -- add a space on the end of the line to ensure there is a

    -- non-digit character to find after the last field

    --

    Line + ' ' as Line,

    charindex('V#', Line) as VendorPosition,

    charindex('I#', Line) as InvoicePosition,

    charindex('PO#', Line) as POPosition,

    charindex('R#', Line) as ReturnPosition

    from @workedExample

    )

    select LineNumber,

    Line,

    --

    -- show the part of the line where we've skipped past the 'V# ' part

    --

    substring(

    Line,

    VendorPosition + 3,

    200

    ) as Line_StartWithVendor,

    --

    -- skip past not only the V#, but also the leading H. then find the first

    -- non digit. that should give us the length of the vendor number. note

    -- that because we skip past the leading H, but actually include it in the

    -- returned string, we don't need the -1 on the length that we do for the

    -- other fields (since we're including an extra character)

    --

    patindex(

    '%[^0-9]%',

    substring(

    Line,

    VendorPosition + 4,

    200

    )) as VendorLength,

    --

    -- now put them all together to show the finished product: just the vendor

    -- number

    --

    case VendorPosition

    when 0 then null

    else

    substring(

    Line,

    VendorPosition + 3,

    patindex('%[^0-9]%', substring(Line, VendorPosition + 4, 200))

    )

    end as Vendor,

    --

    -- repeat the exercise for the other numbers

    --

    case InvoicePosition

    when 0 then null

    else

    substring(

    Line,

    InvoicePosition + 3,

    patindex('%[^0-9]%', substring(Line, InvoicePosition + 3, 200)) - 1

    )

    end as InvoiceNum,

    case POPosition

    when 0 then null

    else

    substring(

    Line,

    POPosition + 4,

    patindex('%[^0-9]%', substring(Line, POPosition + 4, 200)) - 1

    )

    end as PONum,

    case ReturnPosition

    when 0 then null

    else

    substring(

    Line,

    ReturnPosition + 3,

    patindex('%[^0-9]%', substring(Line, ReturnPosition + 3, 200)) - 1

    )

    end as [Return]

    from LineDetails;

    The XML solution GSquared has presented below is probably the more elegant solution, but I'm not that much of an XML guru with SQL Server. So this approach just uses plain old SQL functions. If you are using SQL Server 2000 or earlier, you can replace the part after the with (a common table expression, or CTE) with an inline view.

    Note: extensive string manipulation in SQL isn't the easiest thing in the world. You can do it, but you should seriously consider whether you should do it. You may be better using a routine coded in C# or VB and linked in via the CLR, or you may be better writing an SSIS package to do it.

    [/font]

  • Here's one way to work with this kind of thing. This solution assumes you have a Numbers table, and that you are working in SQL 2005 (as per the forum you posted in). If you don't have a Numbers table, you can either look up how to create one (there are many examples on this site), or ask and I'll tell you.

    declare @String varchar(1000), @String2 varchar(1000), @XML XML;

    select @String = 'V# H019978 I# 95492175 R# 854954';

    select @String2 = coalesce(@String2 + ' ' +

    replace(substring(@String, number-(number-charindex(' ', ' ' + @String, number-2)), charindex(' ', @String+' ', number+2)+2-number), '# ', '="')+'"',

    replace(substring(@String, number-(number-charindex(' ', ' ' + @String, number-2)), charindex(' ', @String+' ', number+2)+2-number), '# ', '="')+'"')

    from dbo.Numbers

    where number <= len(@String)

    and substring(@String, number, 1) = '#';

    select @XML = ' ';

    select

    @XML.value('(row/@V)[1]','varchar(100)') as Vendor,

    @XML.value('(row/@I)[1]','varchar(100)') as InvoiceNum,

    @XML.value('(row/@R)[1]','varchar(100)') as [Return],

    @XML.value('(row/@PO)[1]','varchar(100)') as PONum;

    select @String2 = null;

    select @String = 'V# H020945 PO# 852194 ACCRUAL';

    select @String2 = coalesce(@String2 + ' ' +

    replace(substring(@String, number-(number-charindex(' ', ' ' + @String, number-2)), charindex(' ', @String+' ', number+2)+2-number), '# ', '="')+'"',

    replace(substring(@String, number-(number-charindex(' ', ' ' + @String, number-2)), charindex(' ', @String+' ', number+2)+2-number), '# ', '="')+'"')

    from dbo.Numbers

    where number <= len(@String)

    and substring(@String, number, 1) = '#';

    select @XML = ' ';

    select

    @XML.value('(row/@V)[1]','varchar(100)') as Vendor,

    @XML.value('(row/@I)[1]','varchar(100)') as InvoiceNum,

    @XML.value('(row/@R)[1]','varchar(100)') as [Return],

    @XML.value('(row/@PO)[1]','varchar(100)') as PONum;

    You'll have to modify it to work with your set-up, but the string manipulations and XML creation and query are the core of the solution.

    If you have a table structure you need to work with, provide the create script for that, and a few lines of inserts for sample data, and we can help modify this to work with that.

    Edit: Looks like there was a change in the way that code blocks work on the page since I posted the above. It removed a lot of my line-breaks and blank lines. Trying to put them back in, we'll see how I do on that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Wendy

    If you are using Visual Studio you should either create a CLR function or work with SSIS.

    Here an example with C# and RegEx (use a form and a button):

    private void button1_Click_1(object sender, EventArgs e)

    {

    try

    {

    string input = "V# H019978 I# 95492175 PO# 854954";

    string pattern = @"([VIR]|PO)\#\s([a-zA-Z0-9]*)";

    Regex regex = new Regex(pattern);

    MatchCollection mc = regex.Matches(input);

    foreach (Match m in mc)

    {

    if (m.Groups.Count == 3)

    {

    string type = m.Groups[1].Value;

    string value = m.Groups[2].Value;

    if (type == "V")

    {

    MessageBox.Show(string.Format("Vendor: {0}", value));

    }

    else if (type == "I")

    {

    MessageBox.Show(string.Format("Invoice: {0}", value));

    }

    else if (type == "PO")

    {

    MessageBox.Show(string.Format("PONum: {0}", value));

    }

    else if (type == "R")

    {

    MessageBox.Show(string.Format("Return: {0}", value));

    }

    else

    {

    MessageBox.Show(string.Format("Error! Unknown type: {0}", type));

    }

    }

    }

    }

    catch (Exception ex)

    {

    MessageBox.Show(ex.Message);

    }

    }

    If you want to do this with SQL you can use this:

    [font="Courier New"]IF (OBJECT_ID('t_test') IS NOT NULL)

       DROP TABLE t_test

    CREATE TABLE t_test (txt VARCHAR(100))

    GO

    INSERT INTO t_test VALUES ('V# H019978 I# 95492175 R# 854954')

    INSERT INTO t_test VALUES ('V# H020945 PO# 852194 ACCRUAL')

    GO

    IF (OBJECT_ID('udf_split_reference') IS NOT NULL)

       DROP FUNCTION udf_split_reference

    GO

    CREATE FUNCTION udf_split_reference (@prefix VARCHAR(10), @txt VARCHAR(100))

       RETURNS VARCHAR(100)

    AS

    BEGIN

       DECLARE @pos INT

       DECLARE @len INT

       SET @pos = CHARINDEX(@prefix, @txt) + LEN(@prefix) + 1

       SET @len = CHARINDEX(' ', @txt, @pos)

      

       IF (@len = 0)

          SET @len = LEN(@txt) - @pos + 1

       ELSE

          SET @len = @len - @pos

       RETURN SUBSTRING(@txt, @pos, @len)

    END

    GO

    SELECT *,

          dbo.udf_split_reference('V# ', txt) Vendor,

          dbo.udf_split_reference('I# ', txt) Invoice,

          dbo.udf_split_reference('R#', txt) Ret,

          dbo.udf_split_reference('PO#', txt) PONum

       FROM t_test

    [/font]

    Greets

    Flo

  • If the formatting of each element is pretty much cast in stone, then a straight forward method may be in order...

    Here's the test data...

    --===== Create a demonstration table (This is NOT part of the solution)

    IF OBJECT_ID('TempDB..#DemoTable') IS NOT NULL

    DROP TABLE #DemoTable

    CREATE Table #DemoTable (Line VARCHAR(200))

    INSERT INTO #DemoTable (Line)

    SELECT 'V# H019978 I# 95492175 R# 854954' UNION ALL

    SELECT 'V# H020945 PO# 852194 ACCRUAL' UNION ALL

    SELECT 'I# 95492175 R# 854954 V# H019978 ' UNION ALL

    SELECT 'PO# 852194 I# 95492175 R# 854954 V# H019978 ' UNION ALL

    SELECT 'V# H0123'

    ... and here's the parsing code...

    ;WITH

    cteStart AS

    (--==== Find the starting position of each element and add a trailing delimiter

    -- to the data

    SELECT NULLIF(CHARINDEX('V# ' ,Line),0)+3 AS VendorStart,

    NULLIF(CHARINDEX('I# ' ,Line),0)+3 AS InvoiceStart,

    NULLIF(CHARINDEX('R# ' ,Line),0)+3 AS ReturnStart,

    NULLIF(CHARINDEX('PO# ',Line),0)+4 AS PoStart,

    Line + ' ' AS Line

    FROM #DemoTable

    )--==== Split the element positions

    SELECT SUBSTRING(Line,VendorStart ,CHARINDEX(' ',Line,VendorStart)-VendorStart) AS Vendor,

    SUBSTRING(Line,InvoiceStart,CHARINDEX(' ',Line,InvoiceStart)-InvoiceStart) AS Invoice,

    SUBSTRING(Line,PoStart ,CHARINDEX(' ',Line,PoStart)-PoStart) AS PoNum,

    SUBSTRING(Line,ReturnStart ,CHARINDEX(' ',Line,ReturnStart)-ReturnStart) AS [Return]

    FROM cteStart

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply