November 12, 2023 at 1:11 am
First off, I know how to spell BCP, and that's about it. I'm trying to import a stupid huge CSV file into an existing table, and I basically get out-of-memory exception errors. Do I have to use SSIS to do this so that the data can be chunked and imported a section of rows at a time? (The file is like 700MB or bigger.) I created a terrible table with terrible constraints because I just wanted to see if I could get the thing to import at all (yes, I know this is a terrible way to create a table):
CREATE TABLE DimProduct5 (
ProductKey VARCHAR(100) default null,
ProductName VARCHAR(25) default null,
VendorKey CHAR(10) default null,
BusinessArea VARCHAR(20) default null,
Category VARCHAR(50) default null,
Division VARCHAR(50) default null,
Department VARCHAR(50) default null,
[Group] VARCHAR(50) default null,
SubGroup VARCHAR(50) default null,
PrimaryBarcode VARCHAR(50) default null, -- 10
ProductNumber CHAR(10) default null,
[Description] VARCHAR(50) default null,
SeasonCode VARCHAR(20) default null,
RetailStandardPriceIncVAT VARCHAR(100) default null,
CurrentUnitCostPrice VARCHAR(50) default null,
ExitPricingStance VARCHAR(50) default null,
RangeTheme VARCHAR(50) default null,
Brand VARCHAR(50) default null,
VendorProductRef VARCHAR(50) default null,
GLPostingGroup VARCHAR(50) default null,
SalesTaxRate VARCHAR(50) default null,
SalesTaxGroup VARCHAR(50) default null,
PurchaseTaxRate VARCHAR(50) default null,
PurchaseTaxGroup VARCHAR(50) default null,
BNMBrand VARCHAR(50) default null,
UnitMovingAverageCostWeighted VARCHAR(50) default null
);
go
BULK INSERT OSPRetail.dbo.DimProduct5
FROM 'C:\Users\User\Downloads\data feed.csv'
WITH ( FORMAT = 'CSV',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n' );
I still get this error:
Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
Or should I be using SSIS for this?
(Really I want to do this in Azure, but one problem at a time!)
November 12, 2023 at 8:42 am
whenever I see these errors on BCP or BULK INSERT it's one of the following: the field/row terminators don't match the file, the file doesn't have the same number of columns as the table, or a value is too long for its column.
the issue is never the size of the file - and 700MB is a small file
one possible way for you to validate that it is a well-formed CSV file is to use PowerShell:
import-csv "C:\Users\User\Downloads\data feed.csv" |export-csv "C:\Users\User\Downloads\data feed_valid.csv" -NoTypeInformation
Assuming it works, look at the output file and make sure it has the same number of columns as the source.
and if the file has fewer than 1 million rows, open it in Excel and check the max size of each column as well.
regarding the table - it is OK to define it like that for a staging table, NOT for a final table, so I would not call it DimXxx.
your process should load the data into a staging table, validate it and convert the datatypes as required, and then load it into the final table.
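Something like this rough sketch (table and column names are just placeholders - adjust them to match your actual file):
CREATE TABLE dbo.StgProduct (          -- staging: every field as varchar
    ProductKey   VARCHAR(100) NULL,
    ProductName  VARCHAR(200) NULL,
    SalesTaxRate VARCHAR(50)  NULL
    -- ...one varchar column per field in the CSV
);
BULK INSERT dbo.StgProduct
FROM 'C:\Users\User\Downloads\data feed.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK);     -- FIRSTROW = 2 only if row 1 is a header
-- validate/convert, then load the real table
INSERT INTO dbo.DimProduct (ProductKey, ProductName, SalesTaxRate)
SELECT TRY_CONVERT(INT, ProductKey),
       ProductName,
       TRY_CONVERT(DECIMAL(9,4), SalesTaxRate)
FROM   dbo.StgProduct
WHERE  TRY_CONVERT(INT, ProductKey) IS NOT NULL;  -- reject rows where the key doesn't convert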
November 12, 2023 at 4:35 pm
Oh, sorry. Left a lot of stuff out. The files are getting imported into PowerBI (not in Fabric, or whatever MSFT calls it these days), and they do all the parsing/cleaning etc there. I think their final output is clean. (Sorry, I just started this gig two weeks ago, and last week my supervisor wasn't around, so I'm a bit lost.) What I would like to do is basically copy & paste the at least reasonably working PowerQuery into a step in Azure Data Factory or whatever, and land it instead into a staging table, so that I can create a composite model in PowerBI and the performance of this thing will not be horrendous like it is now. (They have a matrix on one page that takes like a minute to render - partly because it has like 8 measures in it, and partly because it has a product hierarchy that's like 10 levels deep.)
Oh, and they for some reason had ProductName from the DimProduct in the fact table. Not a big deal when you have hundreds of millions of rows in your fact table, right?
Whoa. I just dug in a bit further, and there are columns in more than one fact table that should only be in the dimension tables. Big problem. (That would explain in part why their model is so slooowwwww.) (I think there are maybe 9 million rows in their Sales fact table, and you'd think there were orders of magnitude more.)
Should I just use something like Bravo to figure out what this schema should look like before doing anything else? I just feel like I'm building on sand if I don't basically go back to first principles. (And the problem I'm trying to solve is "this report is slow".) So kind of start over with the ERD and make sure there are only numeric columns and foreign keys in the fact table, and that the dimension tables don't have anything crazy in them.
How long should the PowerShell trick take? (Just wondering... hasn't crashed, but it looks like it's parked there... if it matters, I'm using the PoSh ISE.) It's been going for like 10 minutes already.
FWIW, here's the PowerQuery that's grabbing the data (the blob isn't called "MyStore", but the rest is the same):
let
Source = AzureStorage.Blobs("MyStore"),
#"MyStore-sales-data1" = Source{[Name="MyStore-sales-data"]}[Data],
#"https://MyStore blob core windows net/MyStore-sales-data/_MyStore Data Files/DimProduct csv" = #"MyStore-sales-data1"{[#"Folder Path"="https://MyStore.blob.core.windows.net/MyStore-sales-data/",Name="MyStore Data Files/DimProduct.csv"]}[Content],
#"Imported CSV" = Csv.Document(#"https://MyStore blob core windows net/MyStore-sales-data/_MyStore Data Files/DimProduct csv",[Delimiter=",", Columns=27, Encoding=65001, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
#"Removed Duplicates" = Table.Distinct(#"Promoted Headers", {"ProductKey"}),
#"Removed Blank Rows" = Table.SelectRows(#"Removed Duplicates", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Blank Rows",{{"UnitMovingAverageCostWeighted", Currency.Type}, {"PurchaseTaxRate", type number}, {"SalesTaxRate", type number}, {"UnitMovingAverageCost", type number}, {"CurrentUnitCostPrice", type number}, {"RetailStandardPriceIncVat", Currency.Type}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"ProductKey"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Errors",{{"ProductKey", Int64.Type}}),
#"Removed Errors1" = Table.RemoveRowsWithErrors(#"Changed Type1", {"ProductKey"}),
#"Sorted Rows" = Table.Sort(#"Removed Errors1",{{"ProductKey", Order.Ascending}}),
#"Replaced Value" = Table.ReplaceValue(#"Sorted Rows","NULL","Blank",Replacer.ReplaceText,{"SeasonCode"}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","","Blank",Replacer.ReplaceValue,{"SeasonCode"}),
#"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","0","Blank",Replacer.ReplaceText,{"SeasonCode"})
in
#"Replaced Value2"
Seems to me that the data, at the very least, should be redirected into a staging table like you said, and then I can write some simple queries to create the fact and dimension tables. THEN I can likely use Aggregations to speed this up, but right now, since there's no database behind this stuff, performance is awful.
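For example, I'm guessing the cleanup steps in that Power Query could be redone in T-SQL against a staging table roughly like this (table names are placeholders, and I haven't tested it):
;WITH Deduped AS
(   -- Table.Distinct on ProductKey, plus dropping rows with a blank key
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ProductKey ORDER BY ProductKey) AS rn
    FROM   dbo.StgDimProduct
    WHERE  NULLIF(LTRIM(RTRIM(ProductKey)), '') IS NOT NULL
)
INSERT INTO dbo.DimProduct (ProductKey, SeasonCode, SalesTaxRate, RetailStandardPriceIncVAT)
SELECT TRY_CONVERT(BIGINT, ProductKey),                        -- "Changed Type1"/"Removed Errors"
       CASE WHEN SeasonCode IN ('NULL', '', '0') THEN 'Blank'  -- the three ReplaceValue steps
            ELSE SeasonCode END,
       TRY_CONVERT(DECIMAL(9,4), SalesTaxRate),
       TRY_CONVERT(MONEY, RetailStandardPriceIncVAT)
FROM   Deduped
WHERE  rn = 1;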
"I'm all lost in the supermarket
I can no longer shop happily
I came in here for that special offer..."
November 13, 2023 at 2:53 am
This minor change (IIRC)...
BULK INSERT OSPRetail.dbo.DimProduct5
FROM 'C:\Users\User\Downloads\data feed.csv'
WITH ( FORMAT = 'CSV',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a' );
Also, read up a bit on BULK INSERT and learn the value of the 'TABLOCK' setting... if you're NOT in the FULL Recovery model and the target table is empty, it'll allow for some nasty fast "Minimal Logging", even if the target table has a Clustered Index on it. To ensure that it's empty, use TRUNCATE and not DELETE. DELETE does NOT do the same resets that TRUNCATE does, and I'm not talking about IDENTITY columns or other things like that here.
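Something like this, if I remember the options right (FIRSTROW = 2 assumes the first row of the file is a header - drop it if not):
-- TRUNCATE (not DELETE) so the table counts as empty for minimal logging
TRUNCATE TABLE OSPRetail.dbo.DimProduct5;
BULK INSERT OSPRetail.dbo.DimProduct5
FROM 'C:\Users\User\Downloads\data feed.csv'
WITH ( FORMAT = 'CSV',
       FIRSTROW = 2,            -- assumes row 1 holds column names
       ROWTERMINATOR = '0x0a',
       TABLOCK );               -- allows minimal logging outside the FULL recovery model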
My disclaimer... I have to look at the "Good Book of Boy Howdy" every time I use BULK INSERT and so I don't know if all of it is correct.
Also, does the first row of the file contain column names?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 13, 2023 at 3:01 pm
Can I do the whole import without using the Wizard, which creates (I think) an SSIS package? Because that's what crashes with the out-of-memory errors.
For grins, (and because I'm used to it), I used Abscess to do a dummy import and get a quick table definition. Then I built that in SQL Server. Then I tried the import. And I got the memory errors.
I tried it with a crazy tiny piece of the CSV file (like the top 20 records), and it worked a champ. Can I create a linked server or something to the CSV file and import it in chunks? The other thing is that this is only one of many 700MB files.
This whole mess should *really* be happening in Azure, but I'm not sure what I'm doing, so I'm testing locally first. Should I need large amounts of RAM to do this on my local machine?
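Here's roughly what I had in mind for the chunking, if it's even needed - a loop over FIRSTROW/LASTROW (staging table name and chunk size are placeholders, and I realize BULK INSERT may not need this at all since it streams the file):
DECLARE @First     BIGINT = 2;        -- skip the header row
DECLARE @ChunkSize BIGINT = 100000;   -- rows per chunk
DECLARE @Rows      INT    = 1;
DECLARE @sql       NVARCHAR(MAX);
WHILE @Rows > 0
BEGIN
    -- FIRSTROW/LASTROW can't take variables, so build the statement dynamically
    SET @sql = N'BULK INSERT dbo.DataFeedStaging
                 FROM ''C:\Users\User\Downloads\data feed.csv''
                 WITH (FORMAT = ''CSV'',
                       FIRSTROW = ' + CAST(@First AS NVARCHAR(20)) + N',
                       LASTROW  = ' + CAST(@First + @ChunkSize - 1 AS NVARCHAR(20)) + N',
                       TABLOCK);
                 SET @cnt = @@ROWCOUNT;';                -- stop when a chunk loads no rows
    EXEC sys.sp_executesql @sql, N'@cnt INT OUTPUT', @cnt = @Rows OUTPUT;
    SET @First = @First + @ChunkSize;
END;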
November 13, 2023 at 4:10 pm
first - while you mentioned you had "memory errors", the errors you supplied had nothing to do with memory, only with a file format error when using BULK INSERT - and for that I gave you something to look at.
and then you got back to us with something totally unrelated to your original issue - that does not help anyone give you answers.
and from the looks of it you have an existing process, implemented in a way that may be less than desirable, and you are asking for random help on things you are trying to do to potentially replace some of the steps - it really does not work like that.
you need to sit down with your team members/managers and decide what you are going to do with the WHOLE process - from start to end - not just "fix this particular report" when that report does NOT, from what you said, use a proper database.
November 13, 2023 at 4:13 pm
We "talked" on LinkedIn. I provided a link to "BULK INSERT" and introduced "BatchSize" and "Tablock" to you. You also found "Example f.", which is about how to do things with Azure.
If this is something that must be done on a regular basis (and it appears that it will be), they should have a machine that can handle it all at once. The file sizes are less than 1GB, and BULK INSERT should be able to handle that easily.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 13, 2023 at 6:42 pm
Okay, I'm working on a smaller copy (just the top like 1000 rows), and it's working, except every value gets imported with the double quotes still wrapped around it. (Nice and fast, though...) After that I gotta change the field types of the table. (Sorry, one headache at a time.)
TRUNCATE TABLE DataFeedSnippet3;
BULK INSERT
CSVImportTest.dbo.DataFeedSnippet3
FROM 'C:\Users\User\Downloads\data feed_valid.csv'
WITH
(
BATCHSIZE = 10000,
-- [ [ , ] CHECK_CONSTRAINTS ]
CODEPAGE = 'RAW'
-- [ [ , ] DATAFILETYPE =
-- { 'char' | 'native' | 'widechar' | 'widenative' } ]
, FIRSTROW = 2
, KEEPNULLS
, KILOBYTES_PER_BATCH = 10000
, LASTROW = 50000
, MAXERRORS = 100
, ROWS_PER_BATCH = 500
, TABLOCK
-- input file format options
,FORMAT = 'CSV'
, FIELDQUOTE = ''
, FIELDTERMINATOR = ','
,ROWTERMINATOR = '\n'
);
Okay, the above worked. But when I tried it with a "proper" table design, like this:
CREATE TABLE TestFact4(
FY TINYINT,
FW TINYINT,
StoreKey CHAR(4),
ProductKey CHAR(6),
SalesUnits INT,
SalesNet MONEY,
CostAmount MONEY,
DiscountAmount MONEY,
TaxAmount MONEY,
SalesGross MONEY,
NetMargin MONEY,
LoyaltyDiscount DECIMAL(5,3),
PromoDiscount DECIMAL(5,3),
ReturnFlag BIT,
RecordID INT IDENTITY);
I get this error, so I guess it has something to do with the data type conversion somewhere... Oh, that means I should be using a format file... guess I gotta go figure out how to do that.
Msg 7301, Level 16, State 2, Line 24 Cannot obtain the required interface ("IID_IColumnsInfo") from OLE DB provider "BULK" for linked server "(null)".
November 13, 2023 at 6:54 pm
the reason you got the "double quotes" is because you set the field quote to an empty string - try the basic import,
e.g. remove the following 2 options
, FIELDQUOTE = ''
, FIELDTERMINATOR = ','
with regards to the second error - while you are testing, ENSURE you do it into a table with all fields set to varchar(xx), where xx is big enough to hold the contents of each column.
DO NOT use the identity column as you have there - it assumes that your input file HAS that column, and it would also require the "KEEPIDENTITY" option to work (which you don't have).
then you get to the point where you MAY be able to replace some of the datatypes with the correct ones - but you are also likely to bump into cases where an apparently valid numeric value fails when you convert the datatype - this typically happens when there are spaces in the fields, when the sign is on the right of the value, or when the data has a thousands separator.
my advice for these is always to load into a staging table with varchars, and then do another step to insert into the final table, doing the required validations/conversions of format there.
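For example, something along these lines (table and column names are placeholders) will show you which staging rows won't convert cleanly before you touch the final table:
-- find staging rows whose "numeric" columns won't actually convert
SELECT *
FROM   dbo.StgDataFeed
WHERE  (SalesNet   IS NOT NULL AND TRY_CONVERT(MONEY, LTRIM(RTRIM(SalesNet))) IS NULL)
   OR  (SalesUnits IS NOT NULL AND TRY_CONVERT(INT, REPLACE(SalesUnits, ',', '')) IS NULL);
-- LTRIM/RTRIM deals with stray spaces, REPLACE strips a thousands separator;
-- once the bad rows are fixed or filtered out, use the same TRY_CONVERT
-- expressions in the INSERT...SELECT into the final table.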