March 3, 2020 at 2:39 pm
I don't really feel like I should be defending myself here, but
'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P'
is 57 characters long, while the combined length of all those columns is 80 characters. The OP never actually said (at least prior to my reply that Jeff quoted) that it was fixed width, and supplying a 57-character string while saying they have an 80-character one didn't help. This information was drip-fed and had to be worked out through intuition and guesswork, and it certainly wasn't obvious when I made the reply that Jeff quoted, which was my first reply on the thread.
The OP does, in a later reply, say that the value is actually
'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P'
However, that string is also only 74 characters long, though it does help confirm that we are likely dealing with fixed-width data.
I think the file is the CIF "Timetable Information Data Feed Interface Specification"
Specifications here. https://www.raildeliverygroup.com/files/Publications/services/rsp/RSPS5046_timetable_information_data_feed_interface_specification.pdf
The OP will have to split lines differently depending on the "Record Identity" (first two characters of each line).
It's probably a job best performed by an external program. For example, .NET's FileIO has a fixed-width file reader which, when given the field widths, will automatically split each line into an array containing all the required substrings. It shouldn't take more than a couple of minutes to read the file; added to that would be the time taken to insert 8.4 million rows into the database, which, if done as bulk inserts, shouldn't take more than a few minutes.
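For illustration only, here is roughly what splitting by Record Identity could look like in T-SQL once the raw lines are staged; the table names and character positions below are invented placeholders rather than values taken from the spec:

-- Assumed staging table: one raw 80-character line per row.
-- Each record identity ("BS", "LO", "LI", "LT", ...) has its own fixed-width layout,
-- so each needs its own split and, most likely, its own target table.
SELECT  SUBSTRING(Line, 1, 2) AS RecordIdentity,
        SUBSTRING(Line, 4, 6) AS TrainUID       -- illustrative position/width only
FROM    dbo.RawLines
WHERE   LEFT(Line, 2) = 'BS';

SELECT  SUBSTRING(Line, 1, 2) AS RecordIdentity,
        SUBSTRING(Line, 3, 8) AS Location       -- illustrative position/width only
FROM    dbo.RawLines
WHERE   LEFT(Line, 2) = 'LO';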
March 3, 2020 at 2:58 pm
The original question was how to import the data without having to use SUBSTRING to do it. The answer there is to use a BCP Format File along with BULK INSERT.
The rest of it is extra. As we now know, the OP is trying to distribute the data found in a single file to multiple, hopefully normalized, tables, and we can't help there because we simply don't know enough about those tables. Like you, though, I'd be interested in what the OP changed to speed up his existing process.
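As a rough sketch of that approach (the file paths, staging table, and column names below are just placeholders), a non-XML format file can split each 80-character line into the 2-character record identity and the remaining 78 characters:

14.0
2
1  SQLCHAR  0  2   ""      1  RecordIdentity  ""
2  SQLCHAR  0  78  "\r\n"  2  RestOfRecord    ""

The load itself is then a single statement:

BULK INSERT dbo.CIF_Staging
FROM 'C:\Data\timetable.cif'
WITH (FORMATFILE = 'C:\Data\timetable.fmt', TABLOCK);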
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2020 at 3:08 pm
Sorry... Missed Jonathan's post, which appears to be spot on. Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it? Is this something you've dealt with before?
And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file. I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting. Sounds like a BCP Format File on steroids.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2020 at 3:57 pm
Sorry... Missed Jonathan's post, which appears to be spot on. Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it? Is this something you've dealt with before?
And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file. I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting. Sounds like a BCP Format File on steroids.
What gave it away was columns like trainuid, trainservicecode. I used Google search to find the specifications. I have seen and used CIF files before. CIF format is historically the main way timetable information was distributed across the British railways.
March 3, 2020 at 4:04 pm
After 50 replies, I believe forum rules officially declare this a post hijack 😉
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 3, 2020 at 4:42 pm
Here's an XML approach.
declare @t table
(
RowID int identity,
line varchar(80)
)
insert into @t (line)
select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'
;
with
pd as
(
select
row_number() over (partition by x.RowID order by (select (null))) as colID,
x.RowID,
ca.val
from
(
select
RowID,
cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
from
@t as t
)
as x
cross apply
(
select
xrow.D.value('.', 'varchar(max)') as val
from
x.xmlCol.nodes('X') as xrow(D)
)
as ca
)
select
*
from
pd
pivot
(
max(val)
for
colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
)
as p
March 3, 2020 at 5:16 pm
Here's an XML approach.
declare @t table
(
RowID int identity,
line varchar(80)
)
insert into @t (line)
select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'
;
with
pd as
(
select
row_number() over (partition by x.RowID order by (select (null))) as colID,
x.RowID,
ca.val
from
(
select
RowID,
cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
from
@t as t
)
as x
cross apply
(
select
xrow.D.value('.', 'varchar(max)') as val
from
x.xmlCol.nodes('X') as xrow(D)
)
as ca
)
select
*
from
pd
pivot
(
max(val)
for
colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
)
as p
I think you missed the main conversation - the OP is not using the spaces as separators. It's a fixed-length string that needs to be broken up into chunks based on character position.
But I do like some of the ideas in your code - a little over-engineered, but I can see uses for it.
MVDBA
March 3, 2020 at 11:02 pm
The XML approach is one of the slowest mostly because of "ExpAnsive Concatenation".
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 10:40 am
Thanks to one and all for your replies
I have managed to re-code my script and am able to import the 8.4 million rows into their respective tables in 6-7 minutes, but it still uses SUBSTRING to place the data into their tables. So the real problem was down to I/O throughput rather than the SUBSTRING processing.
March 4, 2020 at 2:01 pm
Andrew, can you confirm that it is the railway schedule CIF file(s) you're working with?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 2:04 pm
Jeff, yes it is the railway schedule data
March 4, 2020 at 2:29 pm
Thanks, Andrew. Sounds like you've got it sussed. What did you do to decrease the I/O problems that you were having?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2020 at 2:34 pm
Jeff,
Basically, I generated a work table with the basic information about where the datasets begin and end, and then used a direct insert: insert into <table> select substring(fieldname,1,2) as recordident from basetable where recordid = 'BS' and keyid between xxx and yyyy, where xxx and yyyy are keyid values generated when the data is added to basetable. There's also an index on recordident to help things along.
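Roughly along these lines (the object names and character positions here are just placeholders, not my actual objects):

INSERT INTO dbo.BS_Records (recordident, trainuid)      -- plus the other BS columns
SELECT  SUBSTRING(fieldname, 1, 2) AS recordident,
        SUBSTRING(fieldname, 4, 6) AS trainuid           -- illustrative position/width
FROM    dbo.basetable
WHERE   recordident = 'BS'
  AND   keyid BETWEEN @DatasetStart AND @DatasetEnd;     -- range taken from the work table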
Hopefully that made sense!!!
Andrew
March 4, 2020 at 2:47 pm
It does, indeed. Even if you're only extracting the "BS" record types, it might make sense to split out the first two characters as the record type during import using a BCP Format File and, of course, import the other 78 characters into a CHAR(78) column for the other processing you do. That might also speed up your identification of where each dataset begins and ends because it'll be an indexable column. You could skip the BCP Format File if you added a persisted computed column to extract the record type and simply inserted through a view on your staging table that excludes that computed column and whatever dataset "identifier" column(s) you create.
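A minimal sketch of that idea (the object names are made up for the example):

-- Staging table: the raw line plus a persisted computed column for the record type.
CREATE TABLE dbo.CIF_Staging
(
    StagingID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    RawLine     char(80) NOT NULL,
    RecordType  AS CAST(LEFT(RawLine, 2) AS char(2)) PERSISTED
);
CREATE INDEX IX_CIF_Staging_RecordType ON dbo.CIF_Staging (RecordType);
GO

-- View that exposes only the column the import writes to, so the load
-- skips both the identity column and the computed column.
CREATE VIEW dbo.CIF_Staging_Import
AS
SELECT RawLine FROM dbo.CIF_Staging;
GO

-- The import then targets the view instead of the table, e.g.:
-- BULK INSERT dbo.CIF_Staging_Import FROM 'C:\Data\timetable.cif' WITH (TABLOCK);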
--Jeff Moden
Change is inevitable... Change for the better is not.