March 14, 2018 at 7:44 am
Morning Guys,
I have a CSV file that is not playing fair. I have attached it (the data has been sanitized and trimmed down to 3 lines of actual data) along with a format file that I may or may not want to use. I'm not against the format file, but for what should be a fairly simple load it seems unnecessary.
The source CSV consists of 19 rows of garbage header followed by the actual data, so I specify FIRSTROW = 20 to skip the junk at the start of the file.
The source is Unix and I have confirmed with Notepad++ that each row terminates with an LF. I have tried all variations of \r, \n, \r\n, and the hex 0x0a as the row terminator, to no avail.
The error is consistently: "The bulk load failed. The column is too long in the data file for row 1, column 1."
All the fields in the destination table are nvarchar(max), as follows:
create table frTest
(
f1 nvarchar(max) null,
f2 nvarchar(max) null,
f3 nvarchar(max) null,
f4 nvarchar(max) null,
f5 nvarchar(max) null,
f6 nvarchar(max) null,
f7 nvarchar(max) null,
f8 nvarchar(max) null,
f9 nvarchar(max) null,
f10 nvarchar(max) null,
f11 nvarchar(max) null,
f12 nvarchar(max) null,
f13 nvarchar(max) null,
f14 nvarchar(max) null,
f15 nvarchar(max) null,
f16 nvarchar(max) null,
f17 nvarchar(max) null,
f18 nvarchar(max) null,
f19 nvarchar(max) null,
f20 nvarchar(max) null,
f21 nvarchar(max) null,
f22 nvarchar(max) null,
f23 nvarchar(max) null,
f24 nvarchar(max) null,
f25 nvarchar(max) null,
f26 nvarchar(max) null
)
In terms of what I have tried so far:
BULK INSERT frTest
FROM 'c:\temp\input.csv'
WITH
(
FIRSTROW = 20,
FORMATFILE = 'c:\temp\fr-import-formatfile-n.fmt'
);
BULK INSERT frTest
FROM 'c:\temp\input.csv'
WITH
(
FIELDTERMINATOR =';',
ROWTERMINATOR = ';\n', -- and plenty of variations around this \n\r, \r\n and the hex to no avail
FIRSTROW = 20,
FORMATFILE = 'c:\temp\fr-import-formatfile-n.fmt'
);
And I am running out of hair to pull out now! So any help would keep my head warm, please.
Cheers
Alex
March 14, 2018 at 9:11 am
I notice in your format file you have all 26 columns set to have a prefix length of 8. If I set them all to 0, I am able to load the file without error.
https://docs.microsoft.com/en-us/sql/relational-databases/import-export/non-xml-format-files-sql-server
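To show what I mean, here's a rough sketch of the relevant lines of a non-XML format file with the prefix length (the third column) set to 0. The version line, the "" collation and the ;\n terminator on the last field are assumptions rather than a copy of your attached file, fields 3 through 25 follow the same pattern, and the last terminator may just be \n if your rows don't end with a trailing semicolon:

14.0
26
1     SQLCHAR    0    0    ";"      1    f1     ""
2     SQLCHAR    0    0    ";"      2    f2     ""
...
26    SQLCHAR    0    0    ";\n"    26   f26    ""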
After importing, it looks like the first row is actually column headers, so there are only 2 rows of actual data in the sample file, correct?
March 14, 2018 at 10:10 am
That's correct: about 17-18 rows of header stamp from the source, followed by a column-header row, followed by the data... 2 rows.
When I run either of the T-SQL statements I posted, after setting the prefix lengths to 0 in the fmt file, I get zero rows affected. No errors. Just 0 rows.
March 14, 2018 at 10:27 am
To load your sample file, the only other change I made was in the BULK INSERT statement: FIRSTROW = 2. For the actual production file, yes, you would use FIRSTROW = 20 if there are 18 header-stamp rows plus 1 row of column names.
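In other words, the statement I ran against the sample was roughly this sketch (same paths and format file as yours, just with the prefix lengths zeroed):

BULK INSERT frTest
FROM 'c:\temp\input.csv'
WITH
(
    FIRSTROW = 2,
    FORMATFILE = 'c:\temp\fr-import-formatfile-n.fmt'
);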
March 14, 2018 at 11:08 am
If I open the input.txt file in Notepad++ it looks like this:
In Notepad, though, it only shows 3-4 lines because it's not breaking on the LF.
Oh, is that my problem: I am specifying that the load starts from line 20, but it's not seeing a line 20, because the first 19 lines actually appear on one line in Notepad... because of the lack of a CR?
March 14, 2018 at 11:11 am
Hmmm, my comment above seems to match what I am seeing. If I specify row 2 as the starting row then it loads the rest in fine (from a quick glance).
But that data appears to be on line 20 when viewed in Notepad++, which I suppose honours the LF as a newline.
Urgh!
March 14, 2018 at 12:02 pm
OK, wait, I think I see the problem. It may be related to the number of columns in the first "rows": I think it's trying to read 26 columns before the first EOL, so the first real data row gets lost when I set FIRSTROW = 2. It seems to treat the junk rows as part of one big row?
March 15, 2018 at 2:40 am
Chris, yes, that ties in with what I am seeing.
This is a hassle!
I am trying to avoid having to clean the file beforehand, but it's not looking great 🙁
March 15, 2018 at 10:32 am
So, to conclude: I ended up cleaning the file using PowerShell, extracting only the lines starting with the token I needed and writing them to a new file:
Get-ChildItem "c:\temp\fr\" -Filter *.txt | Select-String -Pattern '^Activ' | Select-Object -ExpandProperty Line | Set-Content "c:\temp\fr\output.txt"
This also made the import a bit easier downstream by combining multiple source files into a single output.txt, which is then consumed by the BULK INSERT.
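For completeness, the downstream load is now just a plain BULK INSERT over the cleaned file. A sketch, assuming the filter also drops the column-header row so that every line in output.txt is data, and still using the same format file with the prefix lengths set to 0:

BULK INSERT frTest
FROM 'c:\temp\fr\output.txt'
WITH
(
    FIRSTROW = 1,
    FORMATFILE = 'c:\temp\fr-import-formatfile-n.fmt'
);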
Cheers all for your help with this.
Alex
March 15, 2018 at 11:12 am
alex.sqldba - Wednesday, March 14, 2018 7:44 AM
The trouble with Bulk Insert and similar things in T-SQL is that they are NOT "line sensitive". They are "DELIMITER sensitive". Since you defined a row as having 25 semicolons and a line feed, the FIRSTROW=20 thing isn't working the way you think it is.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2018 at 11:13 am
For anyone else who wants to see what I mean, here's what the file looks like in Notepad++.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2018 at 11:35 am
That's right, it's not. But seeing as I was specifying what the row delimiter was, I assumed that FIRSTROW = 20 would ultimately mean the 20th linefeed, and that it would break the file up based on that.
March 15, 2018 at 1:34 pm
alex.sqldba - Thursday, March 15, 2018 11:35 AM: That's right, it's not. But seeing as I was specifying what the row delimiter was, I assumed that FIRSTROW = 20 would ultimately mean the 20th linefeed, and that it would break the file up based on that.
The "row delimiter" is just the "last" delimiter and does NOT actually specify the end of a line. The thing that specifies a "row" is the number and type of field delimiters AND that supposed "row delimiter", which is actually just the last "field delimiter" and given a different incorrect name to screw with people's minds. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2018 at 12:13 am
Mind=Blown.
I thought the row delimiter WAS the LF. Or, at a push, the field delimiter plus the row delimiter.
I need to lie down whilst working out how to ban CSVs from this office.
March 16, 2018 at 6:04 am
alex.sqldba - Friday, March 16, 2018 12:13 AM: Mind=Blown. I thought the row delimiter WAS the LF. Or, at a push, the field delimiter plus the row delimiter.
I need to lie down whilst working out how to ban CSVs from this office.
It's not CSVs that should be banned. Compared to XML and JSON, CSVs (and their kin, TSVs) are fairly compact and nasty fast. The trouble is with the likes of BCP and BULK INSERT. And, just like XML and JSON, they do need to be "well formed". Just like a table in a database, they need to be "normalized". The trouble occurs when they're not, as in when the rows are inconsistent. Although the likes of BCP and BULK INSERT work quite well on both CSVs and TSVs, they were never designed to be a panacea for poorly formed data, as in files with mixed row content like the one you posted.
There IS a way to work this file using either but, just like poorly formed XML or JSON, you do have to do something different to overcome the "devils in the data". You've found one of those ways by pre-processing the file with PowerShell to remove the non-normalized header rows from the file prior to attempting the import.
A method I've used in the past is to load the file as a single column blob and either split the data in place (starting with the first row after the header) or export the data (like you're doing with PowerShell) to a file without the non-normalized rows and then re-import.
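A rough sketch of that single-column approach, assuming the file contains no tab characters (the default field terminator) and borrowing the LF terminator and the 'Activ' token from earlier in this thread; the staging table name is made up for illustration:

CREATE TABLE frStaging (RawLine nvarchar(max) NULL);

BULK INSERT frStaging
FROM 'c:\temp\input.csv'
WITH
(
    ROWTERMINATOR = '0x0a' -- plain LF; with no semicolon field terminator, each line lands in the single column
);

-- Throw away the non-normalized header rows, keeping only the real data rows,
-- then split RawLine on the semicolons into the 26 columns of frTest.
DELETE FROM frStaging
WHERE RawLine NOT LIKE 'Activ%';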
The key is that it's not CSVs or other delimited files that should be banned. Don't ban great tools just because the data providers misunderstand them. Like any other tool, people need to know how to create them correctly and, when they don't, people have to do a workaround like you did. Me? I'd be on the provider of the data like white on rice to normalize it into a "Manifest" file with all that non-normalized header stuff (including the name of the related data file, its number of rows, etc.) plus a normalized data file.
--Jeff Moden
Change is inevitable... Change for the better is not.