July 7, 2018 at 12:44 pm
Ali M - Saturday, July 7, 2018 7:03 AM
Bruin - Friday, July 6, 2018 12:04 PM
Does the order of the CSV have to match the SQL schema, and what about if the schema has 8 fields and the CSV has 4 fields?
Many thanks again!!!
- The order of the columns in the CSV must match that of the table schema
- The number of columns in the CSV must also match the table schema.
- If, as previously discussed, you use a staging table for the import, then the two constraints above are satisfied
- If the CSV is without headers and you modify my script as detailed, then it should still work fine. The DataTable columns are populated by the select top 1 * on the table you're importing to
- Note also that if you want a delimiter other than a comma, there is a -Delimiter argument to Import-Csv to achieve this
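For example, importing a semicolon-delimited file is just (a minimal sketch; the file path is hypothetical):

# -Delimiter handles non-comma separators
Import-Csv -Path 'C:\data\input.csv' -Delimiter ';'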
Thanks!!!!
July 8, 2018 at 11:34 pm
Been away (out of mobile range) for a few days. This has been an extremely enlightening discussion.
Thanks, everyone, for contributing.
____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
"libera tute vulgaris ex"
July 9, 2018 at 6:30 am
Stewart "Arturius" Campbell - Sunday, July 8, 2018 11:34 PMbeen away (out of mobile range) for a few days. This has been an extremely enlightening discussion.
thanks everyone for contributing
Good Sir, thank you for kicking it off with a great article!
--Jeff Moden
Change is inevitable... Change for the better is not.
July 9, 2018 at 9:16 am
If it's not too much trouble, could you post the entire script with the piece for headers added in? I tried to add the extra code myself, but I have a syntax problem (maybe a copy/paste issue).
Thanks again.
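In the meantime, the headers piece on its own boils down to this minimal sketch (the file path and column count are made up): generate as many dummy names as there are columns and pass them via -Header:

# Dummy headers for a headerless file
$headings = for ($i = 0; $i -lt 4; $i++) { "Dummy$($i)" }
Import-Csv -Path 'C:\data\noheader.csv' -Header $headings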
July 9, 2018 at 11:28 am
I was testing: if I use the -TruncateLongValues switch, I get this error:
ForEach-Object : Cannot index into a null array.
July 9, 2018 at 11:42 am
Thanks for sharing this. It reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.
July 9, 2018 at 11:56 am
HighPlainsDBA - Monday, July 9, 2018 11:42 AM
Thanks for sharing this. It reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.
Nope... not the worst idea. That would be XML. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
July 9, 2018 at 11:59 am
How would the script handle input files where the absence of a value is indicated by an empty string? We get raw files that look like:
"Some String";value1;value2;;;;;value7;value8
Thanks again.
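(For reference, PowerShell's CSV cmdlets render empty fields as empty strings, which is easy to verify with made-up data:)

# Made-up sample matching the format above; the empty fields arrive as empty strings
$row = '"Some String";v1;v2;;;;;v7;v8' |
    ConvertFrom-Csv -Delimiter ';' -Header (1..9 | ForEach-Object { "Col$_" })
$row.Col4 -eq ''   # True - an absent value is an empty string, not $null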
July 9, 2018 at 2:10 pm
Jeff Moden - Monday, July 9, 2018 11:56 AM
HighPlainsDBA - Monday, July 9, 2018 11:42 AM
Thanks for sharing this. It reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.
Nope... not the worst idea. That would be XML. 😛
That's a good one too! I wonder how many countless hours of productivity have been lost to that one as well? What's even worse about XML, IMO, is all the hype that came with it. I think the Segway scooter was only slightly more anticipated at the time, but I don't know... the years all start to blur together after a while. 😛
July 9, 2018 at 2:36 pm
HighPlainsDBA - Monday, July 9, 2018 2:10 PM
Jeff Moden - Monday, July 9, 2018 11:56 AM
HighPlainsDBA - Monday, July 9, 2018 11:42 AM
Thanks for sharing this. It reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.
Nope... not the worst idea. That would be XML. 😛
That's a good one too! I wonder how many countless hours of productivity have been lost to that one as well? What's even worse about XML, IMO, is all the hype that came with it. I think the Segway scooter was only slightly more anticipated at the time, but I don't know... the years all start to blur together after a while. 😛
Heh... while we're on the subject... EDI has to be the 2nd worst, and Adobe PostScript has to be tied with PDF for the worst. And, yeah, I agree... I've seen a whole lotta whiz-bang stuff come and go over the decades and wish some of the rest of it would go. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
July 9, 2018 at 2:58 pm
Bruin - Monday, July 9, 2018 11:59 AM
How would the script handle input files where the absence of a value is indicated by an empty string? We get raw files that look like:
"Some String";value1;value2;;;;;value7;value8
Thanks again.
Bear in mind that the more logic goes into the PowerShell, the more it will affect performance!
The updated script below should now handle this:
param
(
    [string]$Path,
    [string]$ConnectionString,
    [string]$Table,
    [string]$Delimiter = ',',
    [int]$BatchSize = 75000,
    [switch]$StopOnError,
    [switch]$TruncateLongValues,
    [switch]$NoHeaders,
    [switch]$EmptyValueIsNull
)

$ErrorActionPreference = 'Stop'

$columnLengthSql = @"
select
    c.[max_length]
from sys.columns c
inner join sys.objects o
    on c.object_id = o.object_id and o.type = 'U'
where o.Name = N'{0}'
order by c.column_id
"@

try
{
    # Connect to SQL Server
    $conn = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
    $conn.Open()

    # Read the column widths, which we will use to check string (char type) columns
    $columnWidths = Invoke-Command -NoNewScope -ScriptBlock {

        try
        {
            $cmd = $conn.CreateCommand()
            $cmd.CommandText = [string]::Format($columnLengthSql, $Table)
            $cmd.CommandType = 'Text'
            $rdr = $cmd.ExecuteReader()

            while ($rdr.Read())
            {
                # Emit width into array
                [int]$rdr[0]
            }
        }
        finally
        {
            ($rdr, $cmd) |
                Where-Object { $_ -ne $null } |
                ForEach-Object {
                    $_.Dispose()
                }
        }
    }

    # Number of columns in the target table
    $columnCount = $columnWidths.Count

    # Set up BCP stuff
    $bcpOption = ([System.Data.SqlClient.SqlBulkCopyOptions]::TableLock -bor [System.Data.SqlClient.SqlBulkCopyOptions]::UseInternalTransaction)
    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($conn, $bcpOption, $null)
    $bulkcopy.DestinationTableName = $Table
    $bulkcopy.BulkCopyTimeout = 0
    $bulkcopy.BatchSize = $BatchSize
    $datatable = New-Object System.Data.DataTable

    # Get the column data for the given table
    # Sneakily selecting 1 row from the table puts the schema into the datatable
    try
    {
        $sql = 'select top 1 * from [' + $Table + ']'
        $dad = New-Object System.Data.SqlClient.SqlDataAdapter($sql, $conn)
        [void]$dad.Fill($datatable)
    }
    finally
    {
        $dad.Dispose()
    }

    # If we read a row, clear it.
    $datatable.Clear()

    # Init row counter
    $i = 0

    # Headers, or not?
    $headerArgument = @{}

    if ($NoHeaders)
    {
        # If the user specifies -NoHeaders, generate dummy headers - as many as there are columns
        $headings = Invoke-Command -NoNewScope -ScriptBlock {
            for ($i = 0; $i -lt $columnWidths.Length; $i++)
            {
                "Dummy$($i)"
            }
        }

        $headerArgument = @{ Header = $headings }
    }

    # Let Import-Csv deal with delimiter nonsense!
    Import-Csv -Path $Path -Delimiter $Delimiter @headerArgument |
        ForEach-Object {

            try
            {
                # Validate input column count.
                # Import-Csv *ignores* extra columns in the input, so we will never know about them!
                # Empty columns, e.g. blah,,blah are rendered as empty strings.
                # If there are too few values, the remaining columns are rendered as $null (.NET null, not database null)
                $populatedColumnCount = ($_.PSObject.Properties.Value | Where-Object { $_ -ne $null } | Measure-Object).Count

                if ($populatedColumnCount -ne $columnCount)
                {
                    throw "Incorrect number of columns in input. Got $($populatedColumnCount), expected $columnCount"
                }

                if ($TruncateLongValues -or $EmptyValueIsNull)
                {
                    # Check columns - this will slow things down somewhat
                    for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
                    {
                        $inputValue = $_.PSObject.Properties.Value[$col].ToString()

                        if ($EmptyValueIsNull -and [string]::IsNullOrEmpty($inputValue))
                        {
                            $_."$($_.PSObject.Properties.Name[$col])" = [System.DBNull]::Value
                        }
                        elseif ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and $inputValue.Length -gt $columnWidths[$col])
                        {
                            Write-Warning "Row $($i + 1), Col $($col + 1): Value truncated"
                            $_."$($_.PSObject.Properties.Name[$col])" = $inputValue.Substring(0, $columnWidths[$col])
                        }
                    }
                }

                [void]$datatable.Rows.Add($_.PSObject.Properties.Value)
            }
            catch
            {
                # Column datatype mismatch
                if ($StopOnError)
                {
                    # Stop immediately
                    throw
                }

                # Warn the user a row didn't go in and continue
                Write-Warning "Row $($i + 1): $($_.Exception.Message)"
            }

            if ((++$i % $BatchSize) -eq 0)
            {
                # Write batch
                $bulkcopy.WriteToServer($datatable)
                $datatable.Clear()
            }
        }

    if ($datatable.Rows.Count -gt 0)
    {
        # Write remaining rows
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
    }
}
catch
{
    # A good script will handle exceptions here,
    # e.g. failure to connect to SQL Server, incorrect/un-castable datatype for column etc.
    # For now, just re-throw
    throw
}
finally
{
    # Clean up in finally block - ensures resources are released even in the event of errors
    ($bulkcopy, $datatable, $conn) |
        Where-Object { $_ -ne $null } |
        ForEach-Object {
            $_.Dispose()
        }
}
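For anyone following along, a hypothetical invocation might look like this (the script name, file path, connection string and table name are all assumptions):

# Assumes the script above was saved as Import-CsvTable.ps1
.\Import-CsvTable.ps1 -Path 'C:\data\input.csv' `
    -ConnectionString 'Server=.;Database=Staging;Integrated Security=SSPI' `
    -Table 'CsvStaging' `
    -Delimiter ';' `
    -TruncateLongValues -EmptyValueIsNull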
July 9, 2018 at 5:45 pm
Awesome!!!
Will run some test cases.
Thanks again.
July 9, 2018 at 7:34 pm
I created a view so my CSV matches my SQL schema (staging table). I used these switches:
-TruncateLongValues -EmptyValueIsNull
The script gave me this message for all rows:
WARNING: Row 1: Incorrect number of columns in input. Got 168, expected 0
The actual staging table has 169 fields, but the view has 168 and matches the CSV.
Have I got something set up wrong?
Thanks for your help
The script is working great!!! The error above happened because the query that finds the column lengths was looking for a Table ('U').
I just changed it to look for a View ('V'):
select
    c.[max_length]
from sys.columns c
inner join sys.objects o
    on c.object_id = o.object_id and o.type = 'V' -- View instead of Table 'U'
where o.Name = N'{0}'
order by c.column_id
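For what it's worth, a small tweak (a sketch, not from the thread's author) would make the same lookup work for either a table or a view, so the query needn't be edited per object type:

select
    c.[max_length]
from sys.columns c
inner join sys.objects o
    on c.object_id = o.object_id and o.type in ('U', 'V') -- Table or View
where o.Name = N'{0}'
order by c.column_id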
Many Thanks again!!!
July 9, 2018 at 10:42 pm
Jeff Moden - Monday, July 9, 2018 11:56 AM
HighPlainsDBA - Monday, July 9, 2018 11:42 AM
Thanks for sharing this. It reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.
Nope... not the worst idea. That would be XML. 😛
I must definitely agree with you here, Jeff
____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
"libera tute vulgaris ex"
July 10, 2018 at 10:47 am
When using -TruncateLongValues or -EmptyValueIsNull, the process slows way down. I have a 168-column CSV that loads perfectly, but is there anything that could speed up the process?
Thanks again.
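One possibility, offered as an untested sketch rather than the author's fix: much of the extra per-row cost comes from re-evaluating $_.PSObject.Properties several times for every column, so capturing the property collection once per row and indexing into it may help, e.g. inside the ForEach-Object block:

# Untested sketch: capture the property collection once per row
$props = @($_.PSObject.Properties)

for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
{
    $inputValue = $props[$col].Value.ToString()

    if ($EmptyValueIsNull -and [string]::IsNullOrEmpty($inputValue))
    {
        # Setting the note property's Value updates the row object directly
        $props[$col].Value = [System.DBNull]::Value
    }
    elseif ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and $inputValue.Length -gt $columnWidths[$col])
    {
        Write-Warning "Row $($i + 1), Col $($col + 1): Value truncated"
        $props[$col].Value = $inputValue.Substring(0, $columnWidths[$col])
    }
}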