Add Group ID value for set of sequential rows

  • Hi
    To be honest I wasn't sure how to phrase this subject, I hope it makes sense after I describes my needs. I have a file(s) that need to be imported where the format is a single column of data where each row is a data field. The example below is an illustration as the actual layout is more complex, there's headers and field definitions but this should provide enough info.

    The first 2 characters, this File Row ID, identify the content for the rest of the data, e.g. character 3 >. For example, this File Row ID '00' means this is the Start of the RecordID, with the content defining the type of action (Insert,Update,Delete). All subsequent rows all relate to that '00' row until the next '00' row is hit.

    Using the sample data below , we have the following File Row IDs
    00 - Starting Row and Action to Take - Insert/Update/Delete
    01 - Primary Key Value
    12 - First Name
    03 - Middle Name
    A4 - Last Name

    00I         
    011570         
    12Mary  
    03Jane  
    A4Smith 
    00I
    01501564    
    12Bob  
    A4Jones 
    00D
    019045654
    00U
    014547
    03Harry

    On a side note, is a there a name for this type of file format? Would make googling easier. 

    While trying to avoid RBAR, I would like add a column with a 'group id' that would tell me which rows belong to which group. So I would end up with something like:

    100I         
    1011570         
    112Mary  
    103Jane  
    1A4Smith 
    200I
    201501564    
    212Bob  
    2A4Jones 
    300D
    3019045654
    400U
    4014547
    403Harry

    I was thinking along the lines of ROW_NUMBER OVER PARTITION BY , but there is nothing to group by. Is this possible with a record set process or am I looking at looping at such.

    I should add this for SQL Server Express so SSIS is not an option.

    The following creates a sample data set. I already have it to the point where I am separating the File Row ID
       CREATE TABLE FileData (FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','1570');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Mary');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Jane');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Smith');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','501564');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Bob');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Jones');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','D');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','9045654');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','U');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','4547');         
        INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Harry');         

    Any and all thought welcome.
    Cheers,
    Steve

  • You need to use a technique which maintains the physical order of the rows in the file in order to be able to add any sort of grouping identifier.
    The rows in your Filedata table have no order as it stands.
    To get the data into a table, while maintaining the ability to select it in the same order as in the file, without using SSIS, there is a bcp technique detailed in this thread.
    Once you have this, the grouping query is not difficult to write, though I do not have time to go into detail right now.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks Phil, makes sense though have to admit that even with a row number column I think the SQL is beyond me but I will give it a go.
    Cheers
    Steve

  • Have an identity column in your staging table - to keep track of original row numbers in the file
    CREATE TABLE FileData (RowNo int IDENTITY(1,1) PRIMARY KEY NOT NULL, FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);

    And use that BULK INSERT technic referenced by Phil

    _____________
    Code for TallyGenerator

  • Thanks Sergiy. When I said SQL I should have said the Grouping Query. I am guessing since I now have a ID I can find the 'Next Starting record' and work from there. Do that in a SELECT is what makes my head  swim 😉
    Cheers,
    Steve

  • SteveD SQL - Tuesday, February 27, 2018 6:29 AM

    Thanks Sergiy. When I said SQL I should have said the Grouping Query. I am guessing since I now have a ID I can find the 'Next Starting record' and work from there. Do that in a SELECT is what makes my head  swim 😉
    Cheers,
    Steve

    Steve, please provide another version of your FileData DDL and sample data inserts which includes a populated integer column called RowNo.
    Then it should be easy enough for others to use that to build the query you are looking for.

    Though wouldn't it be better to have a query which had (PK, Action, FirstName, MiddleName, LastName) as its resultset?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Heya Phil,
    New code below.
    You're right of course, ultimately I will be going for the select to pivot the columns but I am just trying to get head around this step first, e.g. adding the grouping id. All the data column might include more than one piece based on its code and again was just trying to keep things simple at first. Me'mind function better at simple 😉 

    CREATE TABLE FileData

    RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
     ,FileRowID CHAR(2) NOT NULL
     ,FileRowData VARCHAR(20) NULL
     ,GroupID INT NULL
    );
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '1570');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Mary');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Jane');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Smith');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '501564');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Bob');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Jones');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'D');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '9045654');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'U');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '4547');
    INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Harry'); 
     
    Cheers,
    Steve

  • Here's one way to find the groups.  You could also use COUNT and leave off the ELSE clause.


    SELECT *, SUM(CASE WHEN FileRowID = '00' THEN 1 ELSE 0 END) OVER(ORDER BY RowNo ROWS UNBOUNDED PRECEDING) AS Grp 
    FROM FileData

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Try running this


    CREATE TABLE #FileData
    (
      RowNo   INT   IDENTITY(1, 1) PRIMARY KEY NOT NULL
    , FileRowID CHAR(2)  NOT NULL
    , FileRowData VARCHAR(20) NULL
    , GroupID  INT   NULL
    );

    INSERT #FileData
    (
      FileRowID
    , FileRowData
    )
    VALUES
    (
      '00', 'I'
    )
    ,(
      '01', '1570'
    )
    ,(
      '12', 'Mary'
    )
    ,(
      '03', 'Jane'
    )
    ,(
      'A4', 'Smith'
    )
    ,(
      '00', 'I'
    )
    ,(
      '01', '501564'
    )
    ,(
      '12', 'Bob'
    )
    ,(
      'A4', 'Jones'
    )
    ,(
      '00', 'D'
    )
    ,(
      '01', '9045654'
    )
    ,(
      '00', 'U'
    )
    ,(
      '01', '4547'
    )
    ,(
      '03', 'Harry'
    );

    DECLARE @Counter INT;

    SELECT @Counter = 0;

    UPDATE #FileData
    SET  @Counter=GroupID = (CASE
               WHEN FileRowID = '00' THEN
                @Counter + 1
               ELSE
                @Counter
             END
             )
    FROM #FileData WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SELECT *
    FROM #FileData fd;

    SELECT
       fd.GroupID
    ,   Action  = MAX(IIF(fd.FileRowID = '00', fd.FileRowData, NULL))
    ,   PK   = MAX(IIF(fd.FileRowID = '01', fd.FileRowData, NULL))
    ,   FirstName = MAX(IIF(fd.FileRowID = '12', fd.FileRowData, NULL))
    ,   MiddleName = MAX(IIF(fd.FileRowID = '03', fd.FileRowData, NULL))
    ,   LastName = MAX(IIF(fd.FileRowID = 'A4', fd.FileRowData, NULL))
    FROM  #FileData fd
    GROUP BY fd.GroupID;

    Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Tuesday, February 27, 2018 9:02 AM

    Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.

    I would avoid using the "quirky update" if there is another performant solution available.  There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Tuesday, February 27, 2018 9:59 AM

    Phil Parkin - Tuesday, February 27, 2018 9:02 AM

    Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.

    I would avoid using the "quirky update" if there is another performant solution available.  There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.

    Drew

    Your caution is sensible, but what did I miss?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Tuesday, February 27, 2018 10:12 AM

    drew.allen - Tuesday, February 27, 2018 9:59 AM

    Phil Parkin - Tuesday, February 27, 2018 9:02 AM

    Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.

    I would avoid using the "quirky update" if there is another performant solution available.  There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.

    Drew

    Your caution is sensible, but what did I miss?

    I didn't say that YOU missed something.  This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Assuming '00' row is mandatory for every group, this should work:

    if OBJECT_ID('tempdb..#FileRowGroups') is not null DROP TABLE #FileRowGroups

    select IDENTITY(int, 1,1) GroupID, F.RowNo GroupStart, ISNULL(FL.GroupEnd, [LastRow]) GroupEnd
    INTO #FileRowGroups
    from FileData F
        cross join (select top 1 RowNo [LastRow] FROM FileData order by RowNo DESC) TR
        OUTER APPLY (
            select top 1 RowNo - 1 GroupEnd
            from FileData LN
            where LN.FileRowID = '00'
                and LN.RowNo > F.RowNo
            order by LN.RowNo
            ) FL

    WHERE FileRowID = '00'
    ORDER BY F.RowNo

    select * from #FileRowGroups

    Also, this piece could be useful performance wise:
    create index FileRowID on FileData(FileRowID)

    The rest should be easy.

    _____________
    Code for TallyGenerator

  • drew.allen - Tuesday, February 27, 2018 11:12 AM

    Phil Parkin - Tuesday, February 27, 2018 10:12 AM

    drew.allen - Tuesday, February 27, 2018 9:59 AM

    Phil Parkin - Tuesday, February 27, 2018 9:02 AM

    Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.

    I would avoid using the "quirky update" if there is another performant solution available.  There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.

    Drew

    Your caution is sensible, but what did I miss?

    I didn't say that YOU missed something.  This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".

    Drew

    You should learn how to use it properly before making such recommendations. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to all for your help. I will be spending time today just working through the options. 
    I should say that considering the complexity of the files and the potential for millions of rows, I have talked to the client about having some C# components for this stuff.
    Regardless of the direction the project take getting the data into SQL server at this level allows for some deeper analysis of the contents. Also I have learned some very useful techniques so thanks again!  
    Cheers
    Steve

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply