Parsing data

  • I've imported a large flat file that I need to clean up.  Each record should begin with a 4 digit integer and then a string following the integer.  A new line would start with another 4 digit integer and string.  Currently the records look like this
    1375 Order includes
    All of the items you ordered plus
    a few
    other items that you did not order
    1376 There are no orders in this entry
    1377 We always shop for apples and
    oranges, but we do not like grapes

    I would like to get it to be 3 individual records starting with the integer followed by the string.

    1375 Order includes All of the items you ordered plus a few other items that you did not order
    1376 There are no orders in this entry
    1377 We always shop for apples and oranges, but we do not like grapes

  • jon.wilson - Friday, August 10, 2018 6:01 AM

    I've imported a large flat file that I need to clean up.  Each record should begin with a 4 digit integer and then a string following the integer.  A new line would start with another 4 digit integer and string.  Currently the records look like this
    1375 Order includes
    All of the items you ordered plus
    a few
    other items that you did not order
    1376 There are no orders in this entry
    1377 We always shop for apples and
    oranges, but we do not like grapes

    I would like to get it to be 3 individual records starting with the integer followed by the string.

    1375 Order includes All of the items you ordered plus a few other items that you did not order
    1376 There are no orders in this entry
    1377 We always shop for apples and oranges, but we do not like grapes

    OK, so is your question: "Can you assist me in writing a query to remove all rows which do not begin with "nnnn "? If not, what is your question?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Sorry, I would like to have the rows that don't begin with nnnn to be attached to the previous row.  The data source I recieved has a Item number (nnnn) and then a description.  However, most of the descriptions after the item number start on a new row or rows.  The next item starts with a new 4 digit integer.  So, the final output would be 
    1234 String with the description

  • jon.wilson - Friday, August 10, 2018 6:23 AM

    Sorry, I would like to have the rows that don't begin with nnnn to be attached to the previous row.  The data source I recieved has a Item number (nnnn) and then a description.  However, most of the descriptions after the item number start on a new row or rows.  The next item starts with a new 4 digit integer.  So, the final output would be 
    1234 String with the description

    If I had read your first post more carefully, I would have noticed that, sorry!

    How have you guaranteed that the rows are in the correct order after performing the import? Is there some identity column in your import table which can be used to ensure order?

    Or are you looking to modify the import itself, such that the rows are imported 'correctly' without requiring a subsequent clean-up?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hopefully this will help, right now there are 7 records that need to be combined into 2 records each beginning with the 4 digit integer and the corresponding string.


    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [dbo].

    (
        [Column 0] [varchar](5000) NULL
    ) ON [PRIMARY]

    GO
    SET ANSI_PADDING OFF
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'1376 Beginning of string ''A''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'The rest of string ''A''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'1377 Part of string ''B''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'more of string ''B''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'even more of string ''B''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'still more of string ''B''')
    GO
    INSERT [dbo].

    ([Column 0]) VALUES (N'the last part of string ''B''')
    GO

  • The rows are in the correct order from what I can tell

  • The flat file has the data in the correct order, every record does begin with the 4 digit integer and then the description, but there are carrage returns at the end of many of the strings creating essentially a new record.

  • jon.wilson - Friday, August 10, 2018 6:46 AM

    The rows are in the correct order from what I can tell

    Tables in SQL Server have no natural order. If you have the time/inclination, you may find this interesting.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • jon.wilson - Friday, August 10, 2018 6:50 AM

    The flat file has the data in the correct order, every record does begin with the 4 digit integer and then the description, but there are carrage returns at the end of many of the strings creating essentially a new record.

    Understood. Here are two ways to fix this:

    1) Pre-process the file (use PoSh / C# / Python or whatever language you feel comfortable with) to put it in the desired format prior to import.

    2) Use SSIS to perform the import. A Script Component source would be able to read the file in its current form and output the file in its tidied form (after writing some code).

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Ideally, I would like to get this into 2 columns.  One for the 4 digit integer (which would actually be my key) and then the description

  • I'll take a look at these two paths, thanks for your suggestions

  • jon.wilson - Friday, August 10, 2018 6:56 AM

    I'll take a look at these two paths, thanks for your suggestions

    If you're not already familiar with SSIS, option 1 is likely to be the faster option, as there's a bit of a learning curve.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • There is a third option if you can add an identity column to your staging table.  Is that possible?

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Yes, the staging table can be anything right now

  • I've updated the staging table definition.  I used a temp table here, but you should use your regular table.

    CREATE TABLE #table(
        id INT IDENTITY,
        [Column 0] [varchar](5000) NULL
    ) ON [PRIMARY]

    INSERT #table ([Column 0])
    VALUES
        (N'1376 Beginning of string ''A''')
    ,   (N'The rest of string ''A''')
    ,   (N'1377 Part of string ''B''')
    ,   (N'more of string ''B''')
    ,   (N'even more of string ''B''')
    ,   (N'still more of string ''B''')
    ,   (N'the last part of string ''B''')
    ;

    WITH grouped AS
    (
        SELECT t.id, t.[Column 0], i.is_start, SUM(i.is_start)OVER(ORDER BY t.id ROWS UNBOUNDED PRECEDING) grp
        FROM #table t
        CROSS APPLY (VALUES(CASE WHEN t.[Column 0] LIKE '[0-9][0-9][0-9][0-9]%' THEN 1 ELSE 0 END)) i(is_start)
    )
    SELECT
    (
        SELECT sub.[Column 0] + ' '
        FROM grouped sub
        WHERE sub.grp = g.grp
        ORDER BY sub.id
        FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(5000)') AS Column0
    FROM grouped g
    WHERE g.is_start = 1
    ORDER BY g.id

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply