Every so often, one is required to process a comma-delimited file (.csv) from an external source (e.g. a supplier's price list).
In general, this is relatively easy to do using a variety of tools: BCP, BULK INSERT and SSIS, to name but a few.
One way is to use PowerShell, making use of the SqlBulkCopy class, as shown here:
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")

$Date = $args[0]
$ConnectionString = $args[1]
$Directory = $args[2]
$Table = $args[3]

$file = "$Directory\$Table.csv"
$delim = ","
$FirstRowColumnNames = $true
$batchsize = 75000 #This could, theoretically, be any number - I found that this has the least drain on CPU & memory

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($ConnectionString, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $Table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize

$datatable = New-Object System.Data.DataTable
$reader = New-Object System.IO.StreamReader($file)

$columns = (Get-Content $file -First 1).Split($delim)
if ($FirstRowColumnNames -eq $true) { $null = $reader.ReadLine() }
foreach ($column in $columns) { $null = $datatable.Columns.Add() }

$i = 0
while (($line = $reader.ReadLine()) -ne $null) {
    $null = $datatable.Rows.Add($line.Split($delim))
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
    }
}

if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

$reader.Close()
$reader.Dispose()
$bulkcopy.Close()
$bulkcopy.Dispose()
$datatable.Dispose()
This script should import several million rows in a matter of minutes, with minimal load on the server.
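By way of illustration, the script could be invoked along these lines (the script name, connection string, folder and table name below are purely hypothetical, not part of the original post):

# Hypothetical example - substitute your own script name, connection string, folder and table
.\BulkImport.ps1 "20170601" "Server=DEVSQL01;Database=Staging;Integrated Security=True" "D:\Imports" "SupplierPrices"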
However, every so often one encounters an issue with the import process, for example where the file contains a narrative or description field which, although delimited by quotes, contains embedded commas:
ProductKey,Code,Description,Narrative,Comment
123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"
345,fgh,product2,"Product withdrawn, issue with supplier","Product no longer available, issue logged re supplier by Soap, Joe"
These fields will cause the script above to break, complaining about unexpected columns or about the end of row not being in the expected location. We therefore need a means of identifying the commas that sit within double-quoted text and ignoring them during the import process. The simplest way is to replace the commas that are meant to be column delimiters with another delimiter (e.g. a pipe).
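To illustrate why the plain split breaks, take the first data row above: splitting it on every comma produces seven fields rather than the five columns the DataTable expects (a quick check for illustration, not part of the original script):

$line = '123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"'
$line.Split(",").Count   # 7 - two of the commas sit inside quoted text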
To begin, we need to replace the $delim variable with two new variables, e.g.
$olddelimiter = "," $newdelimiter = "|"
The first issue is to correctly distinguish the commas that are meant to be column delimiters from those that are part of a text field. We do this by converting each line into an array of characters:
[char[]]$larr = $line
We then iterate through this array, incrementing a counter each time a quotation mark is encountered:
if($item -eq """"){$sp++}
We then run through every character, replacing a comma with a pipe only when the quote counter modulo two equals zero, i.e. when the character lies outside a quoted field:
if($item -eq $olddelimiter){
    if ($sp%2 -eq 0) {
        $line = $line.Remove($point,1).Insert($point,$newdelimiter)
    }
}
We can then clean up the line by removing the now-unnecessary quotation marks:
$Line = $line.Replace("""","")
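Putting these pieces together, the per-line transformation can be tried in isolation on one of the sample rows (a standalone sketch for illustration; variable names follow the snippets above):

$olddelimiter = ","
$newdelimiter = "|"
$line = '123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"'

$sp = 0      # running count of quotation marks seen so far
$point = 0   # current character position within the line
[char[]]$larr = $line
foreach ($item in $larr) {
    if ($item -eq """") { $sp++ }                  # entering or leaving a quoted field
    if ($item -eq $olddelimiter) {
        if ($sp % 2 -eq 0) {                       # even count = outside quotes
            $line = $line.Remove($point,1).Insert($point,$newdelimiter)
        }
    }
    $point++
}
$line = $line.Replace("""","")
$line    # 123|abc|product1|Product reclassified, now freely available|reclassification by Bloggs, Bill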
The entire script herewith:
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")

$Date = $args[0]
$ConnectionString = $args[1]
$Directory = $args[2]
$Table = $args[3]

$file = "$Directory\$Table.csv" #This could be anything; the file just happens to have the same name as the table in this instance
$oldDelimiter = ","
$newDelimiter = "|"
$FirstRowColumnNames = $true
$batchsize = 75000 #This could, theoretically, be any number - I found that this has the least drain on CPU & resources

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($ConnectionString, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $Table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize

$datatable = New-Object System.Data.DataTable
$reader = New-Object System.IO.StreamReader($file)

$columns = (Get-Content $file -First 1).Split($oldDelimiter)
if ($FirstRowColumnNames -eq $true) { $null = $reader.ReadLine() }
foreach ($column in $columns) { $null = $datatable.Columns.Add() }

$i = 0
while (($line = $reader.ReadLine()) -ne $null) {
    $sp = 0
    $point = 0
    [char[]]$larr = $line                                                   #create a character array of the specific line
    foreach ($item in $larr) {                                              #iterate through the array to locate all quotes and commas
        if ($item -eq """") { $sp++ }                                       #increment whenever a quotation mark is found
        if ($item -eq $oldDelimiter) {                                      #is a comma encountered?
            if ($sp % 2 -eq 0) {                                            #the character is a comma and the quote counter is even (outside quotes)
                $line = $line.Remove($point,1).Insert($point,$newDelimiter) #replace the specific character with the new delimiter
            }
        }
        $point++
    }
    $line = $line.Replace("""","")                                          #clean up unnecessary quotation marks in the line of data
    $null = $datatable.Rows.Add($line.Split($newDelimiter))
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
    }
}

if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

$reader.Close()
$reader.Dispose()
$bulkcopy.Close()
$bulkcopy.Dispose()
$datatable.Dispose()
This is perhaps not the most elegant way of eliminating issues with comma-separated fields, but it works, and it is not excessively slow: it increased the run time on 9 million rows to circa 40 minutes on the dev server.
I am sure someone will be able to post a drastically improved method.
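One possible alternative, offered here only as an untested sketch, is to let the .NET TextFieldParser class (in the Microsoft.VisualBasic assembly that ships with the framework) handle the quoted fields, rather than walking the characters by hand:

[void][Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic")

$parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($file)
$parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$parser.SetDelimiters(",")
$parser.HasFieldsEnclosedInQuotes = $true     # quoted commas are treated as data, not delimiters

while (-not $parser.EndOfData) {
    $fields = $parser.ReadFields()            # string[] with the quotes already stripped
    $null = $datatable.Rows.Add($fields)
    # ...batch to SqlBulkCopy exactly as in the script above...
}
$parser.Close()

Whether this turns out faster than the character-by-character approach would need to be measured against the same 9 million row file.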
All comments and feedback welcome.