July 3, 2018 at 8:00 am
Thanks.
July 3, 2018 at 8:15 am
Stewart "Arturius" Campbell - Tuesday, July 3, 2018 6:24 AMssc 21649 - Tuesday, July 3, 2018 5:28 AMThis is cool!If a quote-delimited string looks like this:
"1313 Mockingbird Lane, Detroit, MI 48690"
(two commas), does this logic still work?
Same question for quote-delimited strings like this:"This is a ""test"" of parsing"
It does not matter how many comma's are encountered within the quotes, the logic will still handle it correctly.
even double-delimiting will work correctly.
Well, until someone decides to put "," inside one of the fields. It's helpful, but ultimately, if you have to do this kind of pre-processing, chances are you've already lost.
July 3, 2018 at 8:22 am
Very nice and useful script. A couple of questions:
1) How could it show the lines imported and records loaded to the table, for verification?
2) For records that fail, can it detect the line number in the data file and the reason why?
Thanks again
July 3, 2018 at 8:25 am
Jerry Kelk - Tuesday, July 3, 2018 2:43 AM
This has been a personal annoyance for years.
Surely, the import engine should properly respect the double quotes (or whatever specified character) as string delimiters, if so directed. We would not then have to hack at the user's text by removing commas or any such workarounds.
It would definitely be nice if MS would recognize the importance of CSV with respect to BCP and Bulk Insert. Instead, MS publishes that CSV isn't actually supported by either. I believe there's a "connect" item (I can't remember the touchy-feely name they now call it) on the subject so we might actually see such support sometime around 2035. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2018 at 8:59 am
Stewart "Arturius" Campbell - Monday, July 2, 2018 11:04 PMComments posted to this topic are about the item Importing Tricky CSV Files with PowerShell
Thanks for the code and taking time to share it in an article, Stewart.
One of the biggest problems I've seen occurs when people try to import CSVs that were created by spreadsheets. As you already know, BCP and BULK INSERT are NOT sensitive to "lines" of data in the file. They're sensitive to the type and number of delimiters. The delimiters MUST be consistent throughout the file, including the "column headings" in the file. This kills the ability to easily do an import using BCP or BULK INSERT, because Excel (and many other products) will "text qualify" (with double quotes) only those elements that contain delimiters (such as commas) as part of the intended text in the element.
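For example (made-up data), a spreadsheet export typically comes out looking like this, with quotes only around the elements that happen to contain commas:
Name,Address,City
Smith,123 Main St,Detroit
"Jones, Inc.","1313 Mockingbird Lane, Detroit, MI 48690",Detroit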
There is a tool in PowerShell that can easily handle this and, under the covers, I'm sure that it operates pretty much as your good code does but at machine language speeds instead of the interpretive speeds of a loop in PowerShell. I don't have quantitative performance measurements on it (but will make the effort to provide that in the near future) but my anecdotal qualitative observation is that it's quite fast.
The method involves Import-Csv followed by an immediate Export-Csv. What it does is text qualify (with double quotes) all elements based on the desired delimiter. It does not add extra qualifiers to any element (or column of elements) that has already been text qualified with quotes. The other cool part is that it also text qualifies a delimited header row if one exists.
After doing the Import-Csv/Export-Csv thing, the file is now 100% consistent for delimiters (","). The only "rub" is that the first double quote simply needs to be ignored, and the last delimiter (what people refer to as the "row terminator") needs to include the trailing delimiter (double quote in this case) for the last element on each line of the file. That means that you must use a BCP format file, which isn't so bad because that also helps performance a bit.
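In rough terms, the round trip looks like this (a minimal sketch; the file paths are just placeholders):
Import-Csv -Path 'C:\Temp\Original.csv' |
    Export-Csv -Path 'C:\Temp\Consistent.csv' -NoTypeInformation
Every element, including the header row, comes out double quoted, so the delimiters are consistent for BCP or BULK INSERT.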
The put-off for some is that Microsoft does not actually support CSV imports or exports using BCP or BULK INSERT. I suspect that may be part of the reason why they don't document how to use double quotes as part of a delimiter, even though it has worked in all known versions of both BCP and BULK INSERT (use \" for those that want to know, especially in a BCP format file). Because it's not supported, people fear that MS may someday change it. While that is a possibility, I just can't see MS doing that, because I'm not sure that even the current generation of people who support the code are aware the capability exists, and removing it would actually require a code change.
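For the curious, a non-XML format file for a fully quoted file with three data columns might look roughly like this (column names and lengths are made up, and the version number on the first line should match your bcp; the first entry just consumes the leading double quote and maps to server column 0, so it is discarded):
12.0
4
1   SQLCHAR   0   2     "\""       0   LeadingQuote   ""
2   SQLCHAR   0   100   "\",\""    1   Name           SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   200   "\",\""    2   Address        SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   100   "\"\r\n"   3   City           SQL_Latin1_General_CP1_CI_AS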
If they ever do remove the capability, there will be a very angry old man in full postal garb making an in person trip to Redmond with a serious 4 banded pork chop launcher and a freezer full of frozen "bone in" pork chops. 😉
Thank you again for this article. I wish more people would publish about how to handle CSVs, TSVs, and spreadsheets.
I also agree with ZZMartin about having to do this type of thing to begin with. Data is important and, if at all possible, the receivers of the data should make it painfully clear to the providers that the data is important and must be both easy to use and consistent in form at all times. The providers really shouldn't have to be told this, either.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 3, 2018 at 9:52 am
I have been doing something similar for the last few months. Thank you for a clear and easily understood script.
John Kingery
July 3, 2018 at 10:20 am
Jeff Moden - Tuesday, July 3, 2018 8:25 AM
Jerry Kelk - Tuesday, July 3, 2018 2:43 AM
This has been a personal annoyance for years.
Surely, the import engine should properly respect the double quotes (or whatever specified character) as string delimiters, if so directed. We would not then have to hack at the user's text by removing commas or any such workarounds.
It would definitely be nice if MS would recognize the importance of CSV with respect to BCP and Bulk Insert. Instead, MS publishes that CSV isn't actually supported by either. I believe there's a "connect" item (I can't remember the touchy-feely name they now call it) on the subject so we might actually see such support sometime around 2035. 😉
I shall be long retired and nearing a wooden box by then. Enjoy.
July 3, 2018 at 10:24 am
Nice piece Stewart, thanks for this!
😎
I'll compare this with the methods I'm using and ping back with my findings.
July 3, 2018 at 1:06 pm
A non-visible terminator could be a good solution: \0 (the non-visible null, a.k.a. CHAR(0)) could work... but certain systems may treat the null as the end of a string or file, which could be awkward (if you need to work with the CSV file in other systems).
\t (tab) might work if you don't have embedded tabs.
But it looks like our choice of terminators is limited: https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-2017
We cannot use arbitrary non-printable characters such as CHAR(28), which is traditionally FS (File Separator)... which is too bad.
But we can use a "string of up to 10 printable characters, including some or all of the terminators listed earlier".
I haven't used it, but something like this .NET package could also be helpful for validating and pre-processing the CSV: https://joshclose.github.io/CsvHelper
July 3, 2018 at 1:16 pm
David Rueter - Tuesday, July 3, 2018 1:06 PM
A non-visible terminator could be a good solution: \0 (the non-visible null, a.k.a. CHAR(0)) could work... but certain systems may treat the null as the end of a string or file, which could be awkward (if you need to work with the CSV file in other systems).
\t (tab) might work if you don't have embedded tabs.
But it looks like our choice of terminators is limited: https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-2017
We cannot use arbitrary non-printable characters such as CHAR(28), which is traditionally FS (File Separator)... which is too bad.
But we can use a "string of up to 10 printable characters, including some or all of the terminators listed earlier".
I haven't used it, but something like this .NET package could also be helpful for validating and pre-processing the CSV: https://joshclose.github.io/CsvHelper
Hmm... I've used some pretty odd terminators in BCP; right now we have a job using 0xDCBF as the column separator.
July 3, 2018 at 6:41 pm
Bruin - Tuesday, July 3, 2018 8:22 AM
Very nice and useful script. A couple of questions:
1) How could it show the lines imported and records loaded to the table, for verification?
2) For records that fail, can it detect the line number in the data file and the reason why?
Thanks again
Any thoughts on these issues when dealing with loading CSV files?
Thanks
July 4, 2018 at 2:14 pm
The PowerShell built-in cmdlet Import-Csv handles all the nonsense of commas within the quoted strings very nicely for us.
Try this for size.
It should also be noted about the script in its current form that, when the columns are added to the datatable, the default data type of all columns is assumed to be .NET String (i.e. nvarchar), so the table being inserted into must have column types that will CAST implicitly from nvarchar or you'll get errors. Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table.
Would be interested to see how this performs using the author's original input.
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
$Date = $args[0]
$ConnectionString = $args[1]
$Directory = $args[2]
$Table = $args[3]
$file = "$Directory\$Table.csv" #This could be anything, the file just happens to be the same as the table name in this instance
$batchsize = 75000 #This could, theoretically, be any number - i found that this has the least drain on cpu & resources
$i = 0
try
{
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($Connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $Table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize
$datatable = New-Object System.Data.DataTable
# Let Import-Csv deal with delimiter nonsense!
Import-Csv -Path $file |
ForEach-Object {
if ($i -eq 0)
{
# Add columns
$_.PSObject.Properties.Name |
ForEach-Object {
[void]$datatable.Columns.Add($_)
}
}
[void]$datatable.Rows.Add($_.PSObject.Properties.Value)
$i++
if (($i % $batchsize) -eq 0)
{
# Write batch
$bulkcopy.WriteToServer($datatable)
$datatable.Clear()
}
}
if ($datatable.Rows.Count -gt 0)
{
# Write remaining rows
$bulkcopy.WriteToServer($datatable)
$datatable.Clear()
}
}
catch
{
# A good script will handle exceptions here,
# e.g. failure to connect to SQL server, incorrect/un-castable datatype for column etc.
# for now, just re-throw
throw
}
finally
{
# Clean up in finally block - ensures resources are released even in event of errors.
($bulkcopy, $datatable) |
Where-Object { $_ -ne $null } |
ForEach-Object {
$_.Dispose()
}
}
Update - Only just noticed that this is a variation on an earlier post by layton 35542; however, this version does not pre-load the entire CSV file, which would be problematic for HUGE imports.
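If it helps anyone, the script expects four positional arguments, so (assuming it were saved as ImportCsv.ps1 - the file name is made up) the call would look something like this; note that the first argument isn't actually used in the snippet above:
.\ImportCsv.ps1 '2018-07-04' 'Server=MyServer;Database=MyDb;Integrated Security=SSPI' 'C:\Data' 'MyTable'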
July 5, 2018 at 4:05 am
"Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table."
Can you drop an example of how this is done?
Another issue I always run into is field lengths in the CSV versus the SQL definition. How can you:
1) detect a mismatch and write it to an error log, or
2) truncate on insert and log the offending data row?
I was testing using your example and ran into this condition:
Exception calling "WriteToServer" with "1" argument(s): "Received an invalid column length from the bcp client
Thanks. Very interested in how people handle these conditions without affecting the performance of a large import.
July 5, 2018 at 4:26 am
Hi Bruin
If you and others are interested, I'll develop a full-fat version of this and post it later.
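In the meantime, the schema-query idea I mentioned boils down to something like this rough, untested sketch. It would replace the "Add columns" step in my earlier post, reusing its $ConnectionString, $Table and $datatable variables; the type map only covers the common cases and anything unrecognised falls back to String:
$typeMap = @{
    'int' = [int]; 'bigint' = [long]; 'smallint' = [int16]; 'tinyint' = [byte]
    'bit' = [bool]; 'decimal' = [decimal]; 'numeric' = [decimal]; 'money' = [decimal]
    'float' = [double]; 'real' = [single]
    'date' = [datetime]; 'datetime' = [datetime]; 'datetime2' = [datetime]
    'uniqueidentifier' = [guid]
}
$conn = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @Table ORDER BY ORDINAL_POSITION"
[void]$cmd.Parameters.AddWithValue('@Table', $Table)
$reader = $cmd.ExecuteReader()
while ($reader.Read())
{
    $dataType = [string]$reader['DATA_TYPE']
    $netType = if ($typeMap.ContainsKey($dataType)) { $typeMap[$dataType] } else { [string] }
    # Typed columns make the DataTable convert the CSV's string values as each row is added
    [void]$datatable.Columns.Add([string]$reader['COLUMN_NAME'], $netType)
}
$reader.Close()
$conn.Close()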
July 5, 2018 at 6:16 am
Sounds good !!!!
Thanks.