September 21, 2016 at 3:27 am
Morning All,
I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.
The only identifier I have for each row is that its first value begins with "B12C...."; the other issue is that the number of columns varies from row to row.
Below is a sample dataset
CREATE TABLE [dbo].[DataTEST](
[ID] [int] IDENTITY(1,1) NOT NULL,
Data [varchar](255) NULL
) ON [PRIMARY]
INSERT INTO dbo.[DataTEST]
(Data)
VALUES
('TaskID'),
('CustomeridID'),
('AccountID'),
('EnquiryName'),
('Source'),
('EventDate'),
('EventDays'),
('EnquiryCreatedDate'),
('Description'),
('b12c0000008gowM'),
('003b000001DolFS'),
('001b000003KtIjG'),
('Conference and Team building'),
('Word of Mouth'),
('08/07/2016'),
('0'),
('31/03/2016'),
('8th or 14th. £57ddr'),
('Edwards £3500 or Smithson £1500'),
('b12c0000008gomb'),
('003b000001STRE'),
('004r000004tes5'),
('Flight Training'),
('Dave Gorman'),
('08/09/2016'),
('0'),
('12/04/2016'),
('Dave to confrim numbers'),
('Currently using Suite 1')
September 21, 2016 at 3:34 am
I realised that I could use the LEAD function to pull each of the following rows into a separate column:
SELECT *
FROM
(
    SELECT ID,
           [Data],
           LEAD([Data], 1, 0) OVER (ORDER BY ID) AS A,
           LEAD([Data], 2, 0) OVER (ORDER BY ID) AS B,
           LEAD([Data], 3, 0) OVER (ORDER BY ID) AS C,
           LEAD([Data], 4, 0) OVER (ORDER BY ID) AS D,
           LEAD([Data], 5, 0) OVER (ORDER BY ID) AS E,
           LEAD([Data], 6, 0) OVER (ORDER BY ID) AS F,
           LEAD([Data], 7, 0) OVER (ORDER BY ID) AS G,
           LEAD([Data], 8, 0) OVER (ORDER BY ID) AS H,
           LEAD([Data], 9, 0) OVER (ORDER BY ID) AS I
    FROM dbo.[DataTEST]
) A
WHERE [Data] LIKE 'B12c%'
September 21, 2016 at 3:59 am
Although this works in your example, what if a "column" is blank? Does the file contain a blank line in that case, or does it skip straight to the next column? Just want to make sure you don't run into any unexpected errors.
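If blank lines can occur and carry no meaning, one defensive option (just a sketch; it assumes blanks should simply be discarded, and note it only helps approaches that order by ID rather than doing ID arithmetic) is to clear them out before any positional logic runs:
DELETE FROM dbo.DataTEST
WHERE Data IS NULL
   OR LTRIM(RTRIM(Data)) = '';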
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
September 21, 2016 at 4:18 am
Here is a quick suggestion for filtering the data rows from the set
😎
IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;
CREATE TABLE [dbo].[DataTEST](
[ID] [int] IDENTITY(1,1) NOT NULL,
Data [varchar](255) NULL
) ON [PRIMARY]
INSERT INTO dbo.[DataTEST]
(Data)
VALUES
('TaskID'),
('CustomeridID'),
('AccountID'),
('EnquiryName'),
('Source'),
('EventDate'),
('EventDays'),
('EnquiryCreatedDate'),
('Description'),
('b12c0000008gowM'),
('003b000001DolFS'),
('001b000003KtIjG'),
('Conference and Team building'),
('Word of Mouth'),
('08/07/2016'),
('0'),
('31/03/2016'),
('8th or 14th. £57ddr'),
('Edwards £3500 or Smithson £1500'),
('b12c0000008gomb'),
('003b000001STRE'),
('004r000004tes5'),
('Flight Training'),
('Dave Gorman'),
('08/09/2016'),
('0'),
('12/04/2016'),
('Dave to confrim numbers'),
('Currently using Suite 1');
DECLARE @NUM_DATA_ROWS INT = 10;
;WITH BASE_DATA AS
(
SELECT
DT.ID
,DT.Data
-- a running count of the 'b12c' marker rows assigns each record a group number
,SUM(CASE
WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
ELSE 0
END) OVER
(
ORDER BY DT.ID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS GROUP_ID
FROM dbo.DataTEST DT
)
,NUMBERED_LINE_SET AS
(
SELECT
BD.ID
,ROW_NUMBER() OVER
(
PARTITION BY BD.GROUP_ID
ORDER BY BD.ID -- order by ID rather than the partition key so the numbering is deterministic
) AS LINE_RID
,BD.GROUP_ID
,BD.Data
FROM BASE_DATA BD
WHERE BD.GROUP_ID > 0
)
SELECT
NLS.ID
,NLS.GROUP_ID
,NLS.LINE_RID
,NLS.Data
FROM NUMBERED_LINE_SET NLS
WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
;
Output from the sample data
ID GROUP_ID LINE_RID Data
----------- ----------- -------------------- --------------------------------
10 1 1 b12c0000008gowM
11 1 2 003b000001DolFS
12 1 3 001b000003KtIjG
13 1 4 Conference and Team building
14 1 5 Word of Mouth
15 1 6 08/07/2016
16 1 7 0
17 1 8 31/03/2016
18 1 9 8th or 14th. £57ddr
19 1 10 Edwards £3500 or Smithson £1500
20 2 1 b12c0000008gomb
21 2 2 003b000001STRE
22 2 3 004r000004tes5
23 2 4 Flight Training
24 2 5 Dave Gorman
25 2 6 08/09/2016
26 2 7 0
27 2 8 12/04/2016
28 2 9 Dave to confrim numbers
29 2 10 Currently using Suite 1
September 21, 2016 at 4:26 am
This will pivot the data sets into columns:
😎
IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;
CREATE TABLE [dbo].[DataTEST](
[ID] [int] IDENTITY(1,1) NOT NULL,
Data [varchar](255) NULL
) ON [PRIMARY]
INSERT INTO dbo.[DataTEST]
(Data)
VALUES
('TaskID'),
('CustomeridID'),
('AccountID'),
('EnquiryName'),
('Source'),
('EventDate'),
('EventDays'),
('EnquiryCreatedDate'),
('Description'),
('b12c0000008gowM'),
('003b000001DolFS'),
('001b000003KtIjG'),
('Conference and Team building'),
('Word of Mouth'),
('08/07/2016'),
('0'),
('31/03/2016'),
('8th or 14th. £57ddr'),
('Edwards £3500 or Smithson £1500'),
('b12c0000008gomb'),
('003b000001STRE'),
('004r000004tes5'),
('Flight Training'),
('Dave Gorman'),
('08/09/2016'),
('0'),
('12/04/2016'),
('Dave to confrim numbers'),
('Currently using Suite 1');
DECLARE @NUM_DATA_ROWS INT = 10;
;WITH BASE_DATA AS
(
SELECT
DT.ID
,DT.Data
-- a running count of the 'b12c' marker rows assigns each record a group number
,SUM(CASE
WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
ELSE 0
END) OVER
(
ORDER BY DT.ID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS GROUP_ID
FROM dbo.DataTEST DT
)
,NUMBERED_LINE_SET AS
(
SELECT
BD.ID
,ROW_NUMBER() OVER
(
PARTITION BY BD.GROUP_ID
ORDER BY BD.ID -- order by ID rather than the partition key so the numbering is deterministic
) AS LINE_RID
,BD.GROUP_ID
,BD.Data
FROM BASE_DATA BD
WHERE BD.GROUP_ID > 0
)
SELECT
NLS.GROUP_ID
--,NLS.LINE_RID
,MAX(CASE WHEN NLS.LINE_RID = 1 THEN NLS.Data END ) AS COL_001
,MAX(CASE WHEN NLS.LINE_RID = 2 THEN NLS.Data END ) AS COL_002
,MAX(CASE WHEN NLS.LINE_RID = 3 THEN NLS.Data END ) AS COL_003
,MAX(CASE WHEN NLS.LINE_RID = 4 THEN NLS.Data END ) AS COL_004
,MAX(CASE WHEN NLS.LINE_RID = 5 THEN NLS.Data END ) AS COL_005
,MAX(CASE WHEN NLS.LINE_RID = 6 THEN NLS.Data END ) AS COL_006
,MAX(CASE WHEN NLS.LINE_RID = 7 THEN NLS.Data END ) AS COL_007
,MAX(CASE WHEN NLS.LINE_RID = 8 THEN NLS.Data END ) AS COL_008
,MAX(CASE WHEN NLS.LINE_RID = 9 THEN NLS.Data END ) AS COL_009
,MAX(CASE WHEN NLS.LINE_RID = 10 THEN NLS.Data END ) AS COL_010
FROM NUMBERED_LINE_SET NLS
WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
GROUP BY NLS.GROUP_ID
;
Output from the sample data
GROUP_ID COL_001 COL_002 COL_003 COL_004 COL_005 COL_006 COL_007 COL_008 COL_009 COL_010
----------- ---------------- ---------------- ---------------- ----------------------------- -------------- ----------- -------- ----------- ------------------------ --------------------------------
1 b12c0000008gowM 003b000001DolFS 001b000003KtIjG Conference and Team building Word of Mouth 08/07/2016 0 31/03/2016 8th or 14th. £57ddr Edwards £3500 or Smithson £1500
2 b12c0000008gomb 003b000001STRE 004r000004tes5 Flight Training Dave Gorman 08/09/2016 0 12/04/2016 Dave to confrim numbers Currently using Suite 1
September 21, 2016 at 4:59 am
A quick and simple solution with good old joins.
SELECT
    CustomeridID.Data AS CustomeridID,
    AccountID.Data AS AccountID,
    EnquiryName.Data AS EnquiryName,
    Source.Data AS Source,
    EventDate.Data AS EventDate,
    EventDays.Data AS EventDays,
    EnquiryCreatedDate.Data AS EnquiryCreatedDate,
    Description.Data AS Description
FROM dbo.[DataTEST] d
JOIN dbo.[DataTEST] CustomeridID ON CustomeridID.ID = d.id + 1
JOIN dbo.[DataTEST] AccountID ON AccountID.ID = d.id + 2
JOIN dbo.[DataTEST] EnquiryName ON EnquiryName.ID = d.id + 3
JOIN dbo.[DataTEST] Source ON Source.ID = d.id + 4
JOIN dbo.[DataTEST] EventDate ON EventDate.ID = d.id + 5
JOIN dbo.[DataTEST] EventDays ON EventDays.ID = d.id + 6
JOIN dbo.[DataTEST] EnquiryCreatedDate ON EnquiryCreatedDate.ID = d.id + 7
JOIN dbo.[DataTEST] Description ON Description.ID = d.id + 8
    AND (d.id % 10) = 0
September 21, 2016 at 6:38 am
joeroshan (9/21/2016)
A quick and simple solution with good old joins.
SELECT
    CustomeridID.Data AS CustomeridID,
    AccountID.Data AS AccountID,
    EnquiryName.Data AS EnquiryName,
    Source.Data AS Source,
    EventDate.Data AS EventDate,
    EventDays.Data AS EventDays,
    EnquiryCreatedDate.Data AS EnquiryCreatedDate,
    Description.Data AS Description
FROM dbo.[DataTEST] d
JOIN dbo.[DataTEST] CustomeridID ON CustomeridID.ID = d.id + 1
JOIN dbo.[DataTEST] AccountID ON AccountID.ID = d.id + 2
JOIN dbo.[DataTEST] EnquiryName ON EnquiryName.ID = d.id + 3
JOIN dbo.[DataTEST] Source ON Source.ID = d.id + 4
JOIN dbo.[DataTEST] EventDate ON EventDate.ID = d.id + 5
JOIN dbo.[DataTEST] EventDays ON EventDays.ID = d.id + 6
JOIN dbo.[DataTEST] EnquiryCreatedDate ON EnquiryCreatedDate.ID = d.id + 7
JOIN dbo.[DataTEST] Description ON Description.ID = d.id + 8
    AND (d.id % 10) = 0
Careful there, this is very inefficient: the plan has multiple table scans, one per column plus one, for a total of 9 scans in your query.
😎
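For anyone wanting to verify the scan counts themselves, a quick check with a standard facility:
SET STATISTICS IO ON;
-- run the join query here; the Messages tab then reports the scan count
-- and logical reads for DataTEST
SET STATISTICS IO OFF;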
September 21, 2016 at 7:33 am
Can you start by sharing a sample of the file? It might be possible to handle this during the import, but I can't be sure without a sample.
September 21, 2016 at 8:04 am
Eirikur Eiriksson (9/21/2016)
Careful there, this is very inefficient and the plan has multiple table scans, (num columns + 1) or in your query total of 9 scans.
😎
This can be made sargable with a clustered index on ID:
CREATE CLUSTERED INDEX IX_DataTESTID ON DataTEST (ID);
However, the mod operator would still need a scan.
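If relying on fixed ID positions also feels fragile, a variant (just a sketch, not the original poster's query) is to anchor each group on the 'b12c' marker instead of the modulus; it still scans on Data, but no longer depends on records starting at exact multiples of 10:
SELECT
    d.Data AS TaskID,
    CustomeridID.Data AS CustomeridID,
    AccountID.Data AS AccountID
    -- the remaining columns follow the same d.id + n pattern
FROM dbo.[DataTEST] d
JOIN dbo.[DataTEST] CustomeridID ON CustomeridID.ID = d.id + 1
JOIN dbo.[DataTEST] AccountID ON AccountID.ID = d.id + 2
WHERE d.Data LIKE 'b12c%';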
September 21, 2016 at 9:31 am
Thanks for all your responses, guys.
Regarding the multiple-join solution: IO is one of the things we try our hardest to limit as a development team, and having reviewed Eirikur Eiriksson's solution, I found it to work best with the full dataset.
That being said, I'm humbled by everyone's positive responses and contributions.
September 21, 2016 at 10:00 am
😎
Multiple joins need not mean bad performance. I tried with some test data generated by executing:
INSERT INTO dbo.[DataTEST]
SELECT Data
FROM dbo.[DataTEST]
WHERE ID > 9
ORDER BY ID
GO 15
With 65000 rows the join was faster, especially after creating the clustered index.
All said, glad that your problem is solved.
Let us know if you face any challenges later.
September 21, 2016 at 3:04 pm
waxb18 (9/21/2016)
Morning All, I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.
The only identifier I have for each row is that its first value begins with "B12C...."; the other issue is that the number of columns varies from row to row.
Below is a sample dataset
CREATE TABLE [dbo].[DataTEST](
[ID] [int] IDENTITY(1,1) NOT NULL,
Data [varchar](255) NULL
) ON [PRIMARY]
INSERT INTO dbo.[DataTEST]
(Data)
VALUES
('TaskID'),
('CustomeridID'),
('AccountID'),
('EnquiryName'),
('Source'),
('EventDate'),
('EventDays'),
('EnquiryCreatedDate'),
('Description'),
('b12c0000008gowM'),
('003b000001DolFS'),
('001b000003KtIjG'),
('Conference and Team building'),
('Word of Mouth'),
('08/07/2016'),
('0'),
('31/03/2016'),
('8th or 14th. £57ddr'),
('Edwards £3500 or Smithson £1500'),
('b12c0000008gomb'),
('003b000001STRE'),
('004r000004tes5'),
('Flight Training'),
('Dave Gorman'),
('08/09/2016'),
('0'),
('12/04/2016'),
('Dave to confrim numbers'),
('Currently using Suite 1')
Why are there only 9 heading rows when there are 10 rows of data in each group? I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.
It doesn't look like there's any PII in that data. Would it be possible to attach the file? If you compress the file, please use a ZIP file instead of RAR or any other compression method. This approach would also limit IO to the bare minimum: it would occur only during the import, and would be no more than what your current import already incurs.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 22, 2016 at 3:41 am
Jeff Moden (9/21/2016)
I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.
One should be really good with a tool to know when not to use it :-D. It's always great learning from you, Jeff.
September 22, 2016 at 5:48 am
joeroshan (9/22/2016)
Jeff Moden (9/21/2016)
I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.
One should be really good with a tool to know when not to use it :-D. It's always great learning from you, Jeff.
Thanks Joe,
Just to explain a bit more: if the number of header lines consistently equals the number of data lines, and the data lines are consistently in the same order as the header lines, then the file actually does have delimiters, in the form of end-of-line characters for each field. That can easily be defined in a BCP format file, and the "pivot" will auto-magically be accomplished during the input. BCP/BULK INSERT is NOT limited to 1 line per row.
If the file has a blank line between the groups of items (as such files sometimes do), that can easily be incorporated into the BCP format file as part of the delimiter for the last "field".
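To make that concrete: the actual file hasn't been posted, so the field count, widths, names, and paths below are all assumptions, but a non-XML BCP format file that treats each line as one field of a 10-line record might look something like this (the trailing "ExtraValue" stands in for the 10th, unlabelled data line):
12.0
10
1   SQLCHAR   0   255   "\r\n"   1    TaskID               SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   255   "\r\n"   2    CustomeridID         SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   255   "\r\n"   3    AccountID            SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   255   "\r\n"   4    EnquiryName          SQL_Latin1_General_CP1_CI_AS
5   SQLCHAR   0   255   "\r\n"   5    Source               SQL_Latin1_General_CP1_CI_AS
6   SQLCHAR   0   255   "\r\n"   6    EventDate            SQL_Latin1_General_CP1_CI_AS
7   SQLCHAR   0   255   "\r\n"   7    EventDays            SQL_Latin1_General_CP1_CI_AS
8   SQLCHAR   0   255   "\r\n"   8    EnquiryCreatedDate   SQL_Latin1_General_CP1_CI_AS
9   SQLCHAR   0   255   "\r\n"   9    Description          SQL_Latin1_General_CP1_CI_AS
10  SQLCHAR   0   255   "\r\n"   10   ExtraValue           SQL_Latin1_General_CP1_CI_AS
If a blank line separates the groups, the last field's terminator would become "\r\n\r\n". It would then be used along these lines (target table and paths are hypothetical):
BULK INSERT dbo.ImportTarget -- a table with one varchar column per field above
FROM 'C:\import\enquiries.txt'
WITH (FORMATFILE = 'C:\import\enquiries.fmt');
The nine header lines at the top of the file would need to be stripped or skipped first, since they would otherwise be consumed as part of the first record.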
I frequently do the same with the output of PowerShell, because the column widths are unreliable when it's delivered in a horizontal format, but the number of items per group and the order of the items are very reliable when it's returned in a vertical format. A simple split to remove the embedded column header on each line afterwards (I normally import to a table that has a persisted computed column for each item) makes it instantly complete on import. You can even do a bit of data-typing in such computed columns.
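As a sketch of that pattern (the colon-separated "Header : Value" line layout, the names, and the date style are assumptions for illustration, not the actual PowerShell output):
CREATE TABLE dbo.PoShImport
(
    LineID INT IDENTITY(1,1) NOT NULL,
    RawLine VARCHAR(1000) NULL,
    -- persisted computed column: drops everything up to and including the
    -- first colon and trims, so the value is materialised as the row lands
    ItemValue AS LTRIM(SUBSTRING(RawLine, CHARINDEX(':', RawLine) + 1, 1000)) PERSISTED,
    -- a bit of data-typing; style 103 (dd/mm/yyyy) keeps the conversion deterministic
    ItemAsDate AS TRY_CONVERT(date, LTRIM(SUBSTRING(RawLine, CHARINDEX(':', RawLine) + 1, 1000)), 103) PERSISTED
);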
--Jeff Moden
Change is inevitable... Change for the better is not.
September 22, 2016 at 8:17 am
Thanks a lot for the detailed explanation, Jeff. Looks like I need to play around with some flat files this weekend. Hope waxb18 is still watching the thread and benefits from your suggestion.