March 3, 2020 at 2:39 pm
I don't really feel like I should be defending myself here, but
'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P'
is 57 characters long, while the combined length of all those columns is 80 characters. The OP never actually said (at least prior to my reply that Jeff quoted) that it was fixed width, and supplying a 57-character string while saying they have an 80-character one didn't help. This information was drip-fed and had to be worked out through intuition and guesswork, and it certainly wasn't obvious when I made the reply that Jeff quoted, which was my first reply on the thread.
The OP does, in a later reply, say that the value is actually
'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P'
However, that string is also only 74 characters long, though it does help confirm that we are likely dealing with fixed-width data.
I think the file is the CIF "Timetable Information Data Feed Interface Specification"
Specifications here. https://www.raildeliverygroup.com/files/Publications/services/rsp/RSPS5046_timetable_information_data_feed_interface_specification.pdf
The OP will have to split lines differently depending on the "Record Identity" (first two characters of each line).
It's probably a job best performed by an external program. For example, .NET's FileIO has a fixed-width file reader which, when given the field widths, will automatically split each line into an array containing all the required substrings. It shouldn't take more than a couple of minutes to read the file; added to that would be the time taken to insert 8.4 million rows into the database, which, if done as bulk inserts, shouldn't take more than a few minutes.
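For illustration only, here is roughly what splitting by Record Identity could look like in T-SQL once the raw lines are staged; the table names and character positions below are invented placeholders rather than values taken from the spec:

-- Assumed staging table: one raw 80-character line per row.
-- Each record identity ("BS", "LO", "LI", "LT", ...) has its own fixed-width layout,
-- so each needs its own split and, most likely, its own target table.
SELECT  SUBSTRING(Line, 1, 2) AS RecordIdentity,
        SUBSTRING(Line, 4, 6) AS TrainUID       -- illustrative position/width only
FROM    dbo.RawLines
WHERE   LEFT(Line, 2) = 'BS';

SELECT  SUBSTRING(Line, 1, 2) AS RecordIdentity,
        SUBSTRING(Line, 3, 8) AS Location       -- illustrative position/width only
FROM    dbo.RawLines
WHERE   LEFT(Line, 2) = 'LO';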
March 3, 2020 at 2:58 pm
The original question was how to import the data without having to use SUBSTRING to do it. The answer there is to use a BCP Format File along with BULK INSERT.
The rest of it is extra. As we now know, the OP is trying to distribute the data found in a single file to multiple, hopefully normalized, tables, and we can't help there because we simply don't know enough about those tables. Like you, though, I'd be interested in what the OP changed to speed up his existing process.
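As a rough sketch of that approach (the file paths, staging table, and column names below are just placeholders), a non-XML format file can split each 80-character line into the 2-character record identity and the remaining 78 characters:

14.0
2
1  SQLCHAR  0  2   ""      1  RecordIdentity  ""
2  SQLCHAR  0  78  "\r\n"  2  RestOfRecord    ""

The load itself is then a single statement:

BULK INSERT dbo.CIF_Staging
FROM 'C:\Data\timetable.cif'
WITH (FORMATFILE = 'C:\Data\timetable.fmt', TABLOCK);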
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2020 at 3:08 pm
Sorry... Missed Jonathan's post, which appears to be spot on. Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it? Is this something you've dealt with before?
And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file. I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting. Sounds like a BCP Format File on steroids.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2020 at 3:57 pm
Sorry... Missed Jonathan's post, which appears to be spot on. Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it? Is this something you've dealt with before?
And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file. I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting. Sounds like a BCP Format File on steroids.
What gave it away was columns like trainuid, trainservicecode. I used Google search to find the specifications. I have seen and used CIF files before. CIF format is historically the main way timetable information was distributed across the British railways.
March 3, 2020 at 4:04 pm
After 50 replies, I believe forum rules officially declare this a post hijack 😉
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 3, 2020 at 4:42 pm
Here's an XML approach.
declare @t table
(
RowID int identity,
line varchar(80)
)
insert into @t (line)
select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'
;
with
pd as
(
select
row_number() over (partition by x.RowID order by (select (null))) as colID,
x.RowID,
ca.val
from
(
select
RowID,
cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
from
@t as t
)
as x
cross apply
(
select
xrow.D.value('.', 'varchar(max)') as val
from
x.xmlCol.nodes('X') as xrow(D)
)
as ca
)
select
*
from
pd
pivot
(
max(val)
for
colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
)
as p
March 3, 2020 at 5:16 pm
Here's an XML approach.
declare @t table
(
RowID int identity,
line varchar(80)
)
insert into @t (line)
select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'
;
with
pd as
(
select
row_number() over (partition by x.RowID order by (select (null))) as colID,
x.RowID,
ca.val
from
(
select
RowID,
cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
from
@t as t
)
as x
cross apply
(
select
xrow.D.value('.', 'varchar(max)') as val
from
x.xmlCol.nodes('X') as xrow(D)
)
as ca
)
select
*
from
pd
pivot
(
max(val)
for
colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
)
as p
I think you missed the main conversation - the OP is not using the spaces as separators. It's a fixed-length string that needs to be broken up into chunks based on character position.
But I do like some of the ideas in your code - a little over-engineered, but I can see uses for it.
MVDBA
March 3, 2020 at 11:02 pm
The XML approach is one of the slowest mostly because of "ExpAnsive Concatenation".
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 10:40 am
Thanks to one and all for your replies
I have managed to re-code my script and am able to import the 8.4 million rows into their respective tables in 6-7 minutes, but it still uses SUBSTRING to place the data into their tables. So the real problem was down to I/O throughput rather than the SUBSTRING processing.
March 4, 2020 at 2:01 pm
Andrew, can you confirm that it is the railway schedule CIF file(s) you're working with?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 2:04 pm
Jeff, yes it is the railway schedule data
March 4, 2020 at 2:29 pm
Thanks, Andrew. Sounds like you've got it sussed. What did you do to decrease the I/O problems that you were having?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 2:34 pm
Jeff,
Basically, I generated a work table with the basic information about where the datasets begin and end, and then used a direct insert: insert into <table> select substring(fieldname,1,2) as recordident from basetable where recordid = 'BS' and keyid between xxx and yyyy, where xxx and yyyy are keyid values generated when the data is added to basetable. There's also an index on recordident to help things along.
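Roughly along these lines (the object names and character positions here are just placeholders, not my actual objects):

INSERT INTO dbo.BS_Records (recordident, trainuid)      -- plus the other BS columns
SELECT  SUBSTRING(fieldname, 1, 2) AS recordident,
        SUBSTRING(fieldname, 4, 6) AS trainuid           -- illustrative position/width
FROM    dbo.basetable
WHERE   recordident = 'BS'
  AND   keyid BETWEEN @DatasetStart AND @DatasetEnd;     -- range taken from the work table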
Hopefully that made sense!!!
Andrew
March 4, 2020 at 2:47 pm
It does, indeed. Even if you're only extracting the "BS" record types, it might make sense to split out the first two characters as the record type during import using a BCP Format File and, of course, import the other 78 characters into a CHAR(78) column for the other processing you do. That might also speed up your identification of where each dataset begins and ends because it'll be an indexable column. You could skip the BCP Format File if you added a persisted computed column to extract the record type and simply inserted through a view on your staging table that excludes that computed column and whatever dataset "identifier" column(s) you create.
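A minimal sketch of that idea (the object names are made up for the example):

-- Staging table: the raw line plus a persisted computed column for the record type.
CREATE TABLE dbo.CIF_Staging
(
    StagingID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    RawLine     char(80) NOT NULL,
    RecordType  AS CAST(LEFT(RawLine, 2) AS char(2)) PERSISTED
);
CREATE INDEX IX_CIF_Staging_RecordType ON dbo.CIF_Staging (RecordType);
GO

-- View that exposes only the column the import writes to, so the load
-- skips both the identity column and the computed column.
CREATE VIEW dbo.CIF_Staging_Import
AS
SELECT RawLine FROM dbo.CIF_Staging;
GO

-- The import then targets the view instead of the table, e.g.:
-- BULK INSERT dbo.CIF_Staging_Import FROM 'C:\Data\timetable.cif' WITH (TABLOCK);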
--Jeff Moden
Change is inevitable... Change for the better is not.