February 27, 2018 at 4:25 am
Hi
To be honest I wasn't sure how to phrase this subject; I hope it makes sense after I describe my needs. I have a file (or files) that needs to be imported, where the format is a single column of data and each row is a data field. The example below is an illustration, as the actual layout is more complex; there are headers and field definitions, but this should provide enough info.
The first 2 characters, the File Row ID, identify the content of the rest of the data, i.e. character 3 onwards. For example, the File Row ID '00' means this is the start of a record, with the content defining the type of action (Insert, Update, Delete). All subsequent rows relate to that '00' row until the next '00' row is hit.
Using the sample data below, we have the following File Row IDs:
00 - Starting Row and Action to Take - Insert/Update/Delete
01 - Primary Key Value
12 - First Name
03 - Middle Name
A4 - Last Name
00I
011570
12Mary
03Jane
A4Smith
00I
01501564
12Bob
A4Jones
00D
019045654
00U
014547
03Harry
On a side note, is there a name for this type of file format? It would make googling easier.
While trying to avoid RBAR, I would like to add a column with a 'group id' that would tell me which rows belong to which group. So I would end up with something like:
100I
1011570
112Mary
103Jane
1A4Smith
200I
201501564
212Bob
2A4Jones
300D
3019045654
400U
4014547
403Harry
I was thinking along the lines of ROW_NUMBER() OVER (PARTITION BY ...), but there is nothing to partition by. Is this possible with a set-based process, or am I looking at looping or suchlike?
I should add that this is for SQL Server Express, so SSIS is not an option.
The following creates a sample data set. I already have it to the point where I am separating the File Row ID.
CREATE TABLE FileData (FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','1570');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Mary');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Jane');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Smith');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','501564');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Bob');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Jones');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','D');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','9045654');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','U');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','4547');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Harry');
Any and all thoughts welcome.
Cheers,
Steve
February 27, 2018 at 5:55 am
You need to use a technique which maintains the physical order of the rows in the file in order to be able to add any sort of grouping identifier.
The rows in your FileData table have no order as things stand.
To get the data into a table, while maintaining the ability to select it in the same order as in the file, without using SSIS, there is a bcp technique detailed in this thread.
Once you have this, the grouping query is not difficult to write, though I do not have time to go into detail right now.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 6:17 am
Thanks Phil, makes sense, though I have to admit that even with a row number column I think the SQL is beyond me. But I will give it a go.
Cheers
Steve
February 27, 2018 at 6:19 am
Have an identity column in your staging table to keep track of original row numbers in the file:
CREATE TABLE FileData (RowNo INT IDENTITY(1,1) PRIMARY KEY NOT NULL, FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);
And use that BULK INSERT technique referenced by Phil.
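Something like this, purely as a sketch - the file path and raw-line width are made up for illustration, and the ordering caveats are covered in the thread Phil linked:
CREATE TABLE dbo.FileStaging
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
,RawLine VARCHAR(22) NULL
);
GO
-- The view exposes only the data column, so BULK INSERT
-- does not attempt to load into the IDENTITY column.
CREATE VIEW dbo.FileStagingImport
AS
SELECT RawLine FROM dbo.FileStaging;
GO
BULK INSERT dbo.FileStagingImport
FROM 'C:\Import\SampleFile.txt' -- hypothetical path
WITH (ROWTERMINATOR = '\n');
-- Split the 2-character File Row ID from the rest of the line;
-- the ORDER BY makes the new IDENTITY values follow the load order.
INSERT INTO dbo.FileData (FileRowID, FileRowData)
SELECT LEFT(RawLine, 2), SUBSTRING(RawLine, 3, 20)
FROM dbo.FileStaging
ORDER BY RowNo;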
_____________
Code for TallyGenerator
February 27, 2018 at 6:29 am
Thanks Sergiy. When I said SQL I should have said the grouping query. I am guessing that since I now have an ID I can find the 'next starting record' and work from there. Doing that in a SELECT is what makes my head swim 😉
Cheers,
Steve
February 27, 2018 at 6:47 am
SteveD SQL - Tuesday, February 27, 2018 6:29 AM
Thanks Sergiy. When I said SQL I should have said the grouping query. I am guessing that since I now have an ID I can find the 'next starting record' and work from there. Doing that in a SELECT is what makes my head swim 😉
Cheers,
Steve
Steve, please provide another version of your FileData DDL and sample data inserts which includes a populated integer column called RowNo.
Then it should be easy enough for others to use that to build the query you are looking for.
Though wouldn't it be better to have a query which had (PK, Action, FirstName, MiddleName, LastName) as its resultset?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 7:08 am
Heya Phil,
New code below.
You're right, of course; ultimately I will be going for the SELECT to pivot the columns, but I am just trying to get my head around this step first, i.e. adding the grouping ID. Also, the data column might include more than one piece of data based on its code, and again I was just trying to keep things simple at first. Me mind functions better at simple 😉
CREATE TABLE FileData
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
,FileRowID CHAR(2) NOT NULL
,FileRowData VARCHAR(20) NULL
,GroupID INT NULL
);
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '1570');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Mary');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Jane');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Smith');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '501564');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Bob');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Jones');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'D');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '9045654');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'U');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '4547');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Harry');
Cheers,
Steve
February 27, 2018 at 8:40 am
Here's one way to find the groups. You could also use COUNT and leave off the ELSE clause.
SELECT *, SUM(CASE WHEN FileRowID = '00' THEN 1 ELSE 0 END) OVER(ORDER BY RowNo ROWS UNBOUNDED PRECEDING) AS Grp
FROM FileData
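If you want to persist the value into the GroupID column, the COUNT variant can be wrapped in an updatable CTE - a sketch, assuming SQL Server 2012 or later for the window frame:
WITH Grouped AS
(
SELECT GroupID
, COUNT(CASE WHEN FileRowID = '00' THEN 1 END) OVER(ORDER BY RowNo ROWS UNBOUNDED PRECEDING) AS Grp
FROM FileData
)
UPDATE Grouped
SET GroupID = Grp; -- COUNT ignores the NULLs, so only '00' rows increment the running total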
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 9:02 am
Try running this
CREATE TABLE #FileData
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
, FileRowID CHAR(2) NOT NULL
, FileRowData VARCHAR(20) NULL
, GroupID INT NULL
);
INSERT #FileData
(
FileRowID
, FileRowData
)
VALUES
(
'00', 'I'
)
,(
'01', '1570'
)
,(
'12', 'Mary'
)
,(
'03', 'Jane'
)
,(
'A4', 'Smith'
)
,(
'00', 'I'
)
,(
'01', '501564'
)
,(
'12', 'Bob'
)
,(
'A4', 'Jones'
)
,(
'00', 'D'
)
,(
'01', '9045654'
)
,(
'00', 'U'
)
,(
'01', '4547'
)
,(
'03', 'Harry'
);
DECLARE @Counter INT;
SELECT @Counter = 0;
UPDATE #FileData
SET @Counter=GroupID = (CASE
WHEN FileRowID = '00' THEN
@Counter + 1
ELSE
@Counter
END
)
FROM #FileData WITH (TABLOCKX)
OPTION (MAXDOP 1);
SELECT *
FROM #FileData fd;
SELECT
fd.GroupID
, Action = MAX(IIF(fd.FileRowID = '00', fd.FileRowData, NULL))
, PK = MAX(IIF(fd.FileRowID = '01', fd.FileRowData, NULL))
, FirstName = MAX(IIF(fd.FileRowID = '12', fd.FileRowData, NULL))
, MiddleName = MAX(IIF(fd.FileRowID = '03', fd.FileRowData, NULL))
, LastName = MAX(IIF(fd.FileRowID = 'A4', fd.FileRowData, NULL))
FROM #FileData fd
GROUP BY fd.GroupID;
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 9:59 am
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 10:12 am
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 11:12 am
Phil Parkin - Tuesday, February 27, 2018 10:12 AM
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
I didn't say that YOU missed something. This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 3:15 pm
Assuming a '00' row is mandatory for every group, this should work:
if OBJECT_ID('tempdb..#FileRowGroups') is not null DROP TABLE #FileRowGroups
select IDENTITY(int, 1,1) GroupID, F.RowNo GroupStart, ISNULL(FL.GroupEnd, [LastRow]) GroupEnd
INTO #FileRowGroups
from FileData F
cross join (select top 1 RowNo [LastRow] FROM FileData order by RowNo DESC) TR
OUTER APPLY (
select top 1 RowNo - 1 GroupEnd
from FileData LN
where LN.FileRowID = '00'
and LN.RowNo > F.RowNo
order by LN.RowNo
) FL
WHERE FileRowID = '00'
ORDER BY F.RowNo
select * from #FileRowGroups
Also, this piece could be useful performance-wise:
CREATE INDEX FileRowID ON FileData (FileRowID);
The rest should be easy.
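For example, stamping the group number back onto the source rows is just a range join - a sketch, using the #FileRowGroups table built above:
UPDATE F
SET GroupID = G.GroupID
FROM FileData F
INNER JOIN #FileRowGroups G
ON F.RowNo BETWEEN G.GroupStart AND G.GroupEnd;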
_____________
Code for TallyGenerator
February 27, 2018 at 5:55 pm
drew.allen - Tuesday, February 27, 2018 11:12 AM
Phil Parkin - Tuesday, February 27, 2018 10:12 AM
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
I didn't say that YOU missed something. This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".
Drew
You should learn how to use it properly before making such recommendations. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2018 at 11:18 pm
Thanks to all for your help. I will be spending time today just working through the options.
I should say that considering the complexity of the files and the potential for millions of rows, I have talked to the client about having some C# components for this stuff.
Regardless of the direction the project takes, getting the data into SQL Server at this level allows for some deeper analysis of the contents. Also, I have learned some very useful techniques, so thanks again!
Cheers
Steve