February 27, 2018 at 4:25 am
Hi
To be honest I wasn't sure how to phrase this subject; I hope it makes sense after I describe my needs. I have a file (or files) that needs to be imported, where the format is a single column of data and each row is a data field. The example below is an illustration, as the actual layout is more complex; there are headers and field definitions, but this should provide enough info.
The first 2 characters, the File Row ID, identify the content of the rest of the data, i.e. character 3 onwards. For example, the File Row ID '00' means this is the start of a record, with the content defining the type of action (Insert, Update, Delete). All subsequent rows relate to that '00' row until the next '00' row is hit.
Using the sample data below, we have the following File Row IDs:
00 - Starting Row and Action to Take - Insert/Update/Delete
01 - Primary Key Value
12 - First Name
03 - Middle Name
A4 - Last Name
00I
011570
12Mary
03Jane
A4Smith
00I
01501564
12Bob
A4Jones
00D
019045654
00U
014547
03Harry
On a side note, is there a name for this type of file format? It would make googling easier.
While trying to avoid RBAR, I would like to add a column with a 'group id' that would tell me which rows belong to which group. So I would end up with something like:
100I
1011570
112Mary
103Jane
1A4Smith
200I
201501564
212Bob
2A4Jones
300D
3019045654
400U
4014547
403Harry
I was thinking along the lines of ROW_NUMBER() OVER (PARTITION BY ...), but there is nothing to partition by. Is this possible with a set-based process, or am I looking at looping or suchlike?
I should add that this is for SQL Server Express, so SSIS is not an option.
The following creates a sample data set. I already have it to the point where I am separating the File Row ID.
CREATE TABLE FileData (FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','1570');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Mary');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Jane');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Smith');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','I');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','501564');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('12','Bob');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('A4','Jones');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','D');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','9045654');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('00','U');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('01','4547');
INSERT INTO FileData (FileRowID,FileRowData) VALUES ('03','Harry');
Any and all thoughts welcome.
Cheers,
Steve
February 27, 2018 at 5:55 am
You need to use a technique which maintains the physical order of the rows in the file in order to be able to add any sort of grouping identifier.
The rows in your FileData table have no order as things stand.
To get the data into a table, while maintaining the ability to select it in the same order as in the file, without using SSIS, there is a bcp technique detailed in this thread.
Once you have this, the grouping query is not difficult to write, though I do not have time to go into detail right now.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 6:17 am
Thanks Phil, makes sense, though I have to admit that even with a row number column I think the SQL is beyond me. But I will give it a go.
Cheers
Steve
February 27, 2018 at 6:19 am
Have an identity column in your staging table to keep track of original row numbers in the file:
CREATE TABLE FileData (RowNo INT IDENTITY(1,1) PRIMARY KEY NOT NULL, FileRowID CHAR(2) NOT NULL, FileRowData VARCHAR(20) NULL, GroupID INT NULL);
And use that BULK INSERT technique referenced by Phil.
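Something like this, purely as a sketch - the file path and raw-line width are made up for illustration, and the ordering caveats are covered in the thread Phil linked:
CREATE TABLE dbo.FileStaging
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
,RawLine VARCHAR(22) NULL
);
GO
-- The view exposes only the data column, so BULK INSERT
-- does not attempt to load into the IDENTITY column.
CREATE VIEW dbo.FileStagingImport
AS
SELECT RawLine FROM dbo.FileStaging;
GO
BULK INSERT dbo.FileStagingImport
FROM 'C:\Import\SampleFile.txt' -- hypothetical path
WITH (ROWTERMINATOR = '\n');
-- Split the 2-character File Row ID from the rest of the line;
-- the ORDER BY makes the new IDENTITY values follow the load order.
INSERT INTO dbo.FileData (FileRowID, FileRowData)
SELECT LEFT(RawLine, 2), SUBSTRING(RawLine, 3, 20)
FROM dbo.FileStaging
ORDER BY RowNo;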
_____________
Code for TallyGenerator
February 27, 2018 at 6:29 am
Thanks Sergiy. When I said SQL I should have said the grouping query. I am guessing that since I now have an ID I can find the 'next starting record' and work from there. Doing that in a SELECT is what makes my head swim 😉
Cheers,
Steve
February 27, 2018 at 6:47 am
SteveD SQL - Tuesday, February 27, 2018 6:29 AM
Thanks Sergiy. When I said SQL I should have said the grouping query. I am guessing that since I now have an ID I can find the 'next starting record' and work from there. Doing that in a SELECT is what makes my head swim 😉
Cheers,
Steve
Steve, please provide another version of your FileData DDL and sample data inserts which includes a populated integer column called RowNo.
Then it should be easy enough for others to use that to build the query you are looking for.
Though wouldn't it be better to have a query which had (PK, Action, FirstName, MiddleName, LastName) as its resultset?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 7:08 am
Heya Phil,
New code below.
You're right, of course; ultimately I will be going for the SELECT to pivot the columns, but I am just trying to get my head around this step first, i.e. adding the grouping ID. Also, the data column might include more than one piece of data based on its code, and again I was just trying to keep things simple at first. Me mind functions better at simple 😉
CREATE TABLE FileData
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
,FileRowID CHAR(2) NOT NULL
,FileRowData VARCHAR(20) NULL
,GroupID INT NULL
);
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '1570');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Mary');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Jane');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Smith');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'I');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '501564');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('12', 'Bob');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('A4', 'Jones');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'D');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '9045654');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('00', 'U');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('01', '4547');
INSERT INTO dbo.FileData (FileRowID, FileRowData) VALUES ('03', 'Harry');
Cheers,
Steve
February 27, 2018 at 8:40 am
Here's one way to find the groups. You could also use COUNT and leave off the ELSE clause.
SELECT *, SUM(CASE WHEN FileRowID = '00' THEN 1 ELSE 0 END) OVER(ORDER BY RowNo ROWS UNBOUNDED PRECEDING) AS Grp
FROM FileData
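If you want to persist the value into the GroupID column, the COUNT variant can be wrapped in an updatable CTE - a sketch, assuming SQL Server 2012 or later for the window frame:
WITH Grouped AS
(
SELECT GroupID
, COUNT(CASE WHEN FileRowID = '00' THEN 1 END) OVER(ORDER BY RowNo ROWS UNBOUNDED PRECEDING) AS Grp
FROM FileData
)
UPDATE Grouped
SET GroupID = Grp; -- COUNT ignores the NULLs, so only '00' rows increment the running total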
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 9:02 am
Try running this
CREATE TABLE #FileData
(
RowNo INT IDENTITY(1, 1) PRIMARY KEY NOT NULL
, FileRowID CHAR(2) NOT NULL
, FileRowData VARCHAR(20) NULL
, GroupID INT NULL
);
INSERT #FileData
(
FileRowID
, FileRowData
)
VALUES
(
'00', 'I'
)
,(
'01', '1570'
)
,(
'12', 'Mary'
)
,(
'03', 'Jane'
)
,(
'A4', 'Smith'
)
,(
'00', 'I'
)
,(
'01', '501564'
)
,(
'12', 'Bob'
)
,(
'A4', 'Jones'
)
,(
'00', 'D'
)
,(
'01', '9045654'
)
,(
'00', 'U'
)
,(
'01', '4547'
)
,(
'03', 'Harry'
);
DECLARE @Counter INT;
SELECT @Counter = 0;
UPDATE #FileData
SET @Counter=GroupID = (CASE
WHEN FileRowID = '00' THEN
@Counter + 1
ELSE
@Counter
END
)
FROM #FileData WITH (TABLOCKX)
OPTION (MAXDOP 1);
SELECT *
FROM #FileData fd;
SELECT
fd.GroupID
, Action = MAX(IIF(fd.FileRowID = '00', fd.FileRowData, NULL))
, PK = MAX(IIF(fd.FileRowID = '01', fd.FileRowData, NULL))
, FirstName = MAX(IIF(fd.FileRowID = '12', fd.FileRowData, NULL))
, MiddleName = MAX(IIF(fd.FileRowID = '03', fd.FileRowData, NULL))
, LastName = MAX(IIF(fd.FileRowID = 'A4', fd.FileRowData, NULL))
FROM #FileData fd
GROUP BY fd.GroupID;
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 9:59 am
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 10:12 am
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
February 27, 2018 at 11:12 am
Phil Parkin - Tuesday, February 27, 2018 10:12 AM
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
I didn't say that YOU missed something. This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 27, 2018 at 3:15 pm
Assuming a '00' row is mandatory for every group, this should work:
if OBJECT_ID('tempdb..#FileRowGroups') is not null DROP TABLE #FileRowGroups
select IDENTITY(int, 1,1) GroupID, F.RowNo GroupStart, ISNULL(FL.GroupEnd, [LastRow]) GroupEnd
INTO #FileRowGroups
from FileData F
cross join (select top 1 RowNo [LastRow] FROM FileData order by RowNo DESC) TR
OUTER APPLY (
select top 1 RowNo - 1 GroupEnd
from FileData LN
where LN.FileRowID = '00'
and LN.RowNo > F.RowNo
order by LN.RowNo
) FL
WHERE FileRowID = '00'
ORDER BY F.RowNo
select * from #FileRowGroups
Also, this piece could be useful performance-wise:
CREATE INDEX FileRowID ON FileData (FileRowID);
The rest should be easy.
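For example, stamping the group number back onto the source rows is just a range join - a sketch, using the #FileRowGroups table built above:
UPDATE F
SET GroupID = G.GroupID
FROM FileData F
INNER JOIN #FileRowGroups G
ON F.RowNo BETWEEN G.GroupStart AND G.GroupEnd;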
_____________
Code for TallyGenerator
February 27, 2018 at 5:55 pm
drew.allen - Tuesday, February 27, 2018 11:12 AM
Phil Parkin - Tuesday, February 27, 2018 10:12 AM
drew.allen - Tuesday, February 27, 2018 9:59 AM
Phil Parkin - Tuesday, February 27, 2018 9:02 AM
Note that it uses a technique called the 'quirky update' to update the group number, which relies on there being a CLUSTERED PRIMARY KEY on the IDENTITY column.
I would avoid using the "quirky update" if there is another performant solution available. There are a lot of factors that go into getting the quirky update right and it's too easy to miss something.
Drew
Your caution is sensible, but what did I miss?
I didn't say that YOU missed something. This was more a comment for anyone reading this thread that might be inclined to try the "quirky update".
Drew
You should learn how to use it properly before making such recommendations. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2018 at 11:18 pm
Thanks to all for your help. I will be spending time today just working through the options.
I should say that considering the complexity of the files and the potential for millions of rows, I have talked to the client about having some C# components for this stuff.
Regardless of the direction the project takes, getting the data into SQL Server at this level allows for some deeper analysis of the contents. Also, I have learned some very useful techniques, so thanks again!
Cheers
Steve