Importing Tricky CSV Files with PowerShell

  • Ali M - Saturday, July 7, 2018 7:03 AM

    Bruin - Friday, July 6, 2018 12:04 PM

    Does the order of the CSV have to match the SQL schema? And what happens if the schema has 8 fields and the CSV has 4 fields?

    Many thanks again!!!

    • The order of the columns in the CSV must match that of the table schema.
    • The number of columns in the CSV must also match the table schema.
    • If, as previously discussed, you use a staging table for the import, then the two constraints above are satisfied.
    • If the CSV is without headers and you modify my script as detailed, it should still work fine. The datatable columns are populated by the select top 1 * on the table you're importing to.
    • Note also that if you want a delimiter other than comma, Import-Csv has a -Delimiter argument to achieve this (see the sketch after this list).
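
    A minimal sketch of the last two points (the file name and column names here are hypothetical):

    # Headerless, semicolon-delimited file: supply dummy headers and a delimiter
    $headings = 'Col1', 'Col2', 'Col3', 'Col4'
    Import-Csv -Path .\input.csv -Delimiter ';' -Header $headings |
      Select-Object -First 3   # peek at the first few parsed rows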

    Thanks!!!!

  • Been away (out of mobile range) for a few days. This has been an extremely enlightening discussion.
    Thanks, everyone, for contributing.

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    β€œlibera tute vulgaris ex”

  • Stewart "Arturius" Campbell - Sunday, July 8, 2018 11:34 PM

    Been away (out of mobile range) for a few days. This has been an extremely enlightening discussion.
    Thanks, everyone, for contributing.

    Good Sir, thank you for kicking it off with a great article!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If it's not too much trouble, could you post the entire script with the piece for headers added in? I tried to add the extra code myself, but perhaps due to a copy/paste problem I have
    a syntax error.

    Thanks again.

  • I was testing, and if I use the -TruncateLongValues switch I get this error:

    ForEach-Object : Cannot index into a null array.

  • Thanks for sharing this. Reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.

  • HighPlainsDBA - Monday, July 9, 2018 11:42 AM

    Thanks for sharing this. Reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.

    Nope... not the worst idea.  That would be XML. πŸ˜‰

    --Jeff Moden



  • How would the script handle input files where the absence of a value is indicated by an empty string? We get raw files that look like:

    "Some String";value1;value2;;;;;value7;value8

    Thanks again.

  • Jeff Moden - Monday, July 9, 2018 11:56 AM

    HighPlainsDBA - Monday, July 9, 2018 11:42 AM

    Thanks for sharing this. Reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.

    Nope... not the worst idea.  That would be XML. πŸ˜‰

    That's a good one too!  I wonder how many countless hours of productivity have been lost to that one as well?  What's even worse about XML, IMO, is all the hype that came with it. I think the Segway scooter was only slightly more anticipated at the time, but I don't know... the years all start to blur together after a while. πŸ™‚

  • HighPlainsDBA - Monday, July 9, 2018 2:10 PM

    Jeff Moden - Monday, July 9, 2018 11:56 AM

    HighPlainsDBA - Monday, July 9, 2018 11:42 AM

    Thanks for sharing this. Reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.

    Nope... not the worst idea.  That would be XML. πŸ˜‰

    That's a good one too!  I wonder how many countless hours of productivity have been lost to that one as well?  What's even worse about XML, IMO, is all the hype that came with it. I think the Segway scooter was only slightly more anticipated at the time, but I don't know... the years all start to blur together after a while. πŸ™‚

    Heh... while we're on the subject... EDI has to be the 2nd worst, and Adobe PostScript has to be tied with PDF for the worst.  And, yeah, I agree... I've seen a whole lotta whiz-bang stuff come and go over the decades and wish some of the rest of it would go. πŸ˜€

    --Jeff Moden



  • Bruin - Monday, July 9, 2018 11:59 AM

    How would the script handle input files where the absence of a value is indicated by an empty string? We get raw files that look like:

    "Some String";value1;value2;;;;;value7;value8

    Thanks again.

    Bear in mind that the more logic goes into the PowerShell, the more it will affect performance!

    The script should now handle the following:

    • Empty columns in the input are loaded as NULL (the staging table schema must have nullable columns): -EmptyValueIsNull
    • Alternative delimiters (the default is comma): -Delimiter
    • Input files without column headers: -NoHeaders
    • Detection of missing columns in an input row (see the comments in the code)

    param
    (
      [string]$Path,
      [string]$ConnectionString,
      [string]$Table,
      [string]$Delimiter = ',',
      [int]$BatchSize = 75000,
      [switch]$StopOnError,
      [switch]$TruncateLongValues,
      [switch]$NoHeaders,
      [switch]$EmptyValueIsNull
    )

    $ErrorActionPreference = 'Stop'

    $columnLengthSql = @"
      select
      c.[max_length]
      from sys.columns c
      inner join sys.objects o
      on c.object_id = o.object_id and o.type = 'U'
      where o.Name = N'{0}'
      order by c.column_id

    "@
    try
    {
      # Connect to SQL Server
      $conn = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
      $conn.Open()

      # Read the column widths, which we will use to check string (char type) columns
      $columnWidths = Invoke-Command -NoNewScope -ScriptBlock {

       try
       {
        $cmd = $conn.CreateCommand()
        $cmd.CommandText = [string]::Format($columnLengthSql, $Table)
        $cmd.CommandType = 'Text'
        $rdr = $cmd.ExecuteReader()

        while ($rdr.Read())
        {
          # Emit each column width; Invoke-Command collects them into an array
          [int]$rdr[0]
        }
       }
       finally
       {
        ($rdr, $cmd) |
          ForEach-Object {
          $_.Dispose()
        }
       }
      }

      # Number of columns in the target table
      $columnCount = $columnWidths.Count

      # Set up BCP stuff
      $bcpOption = ([System.Data.SqlClient.SqlBulkCopyOptions]::TableLock -bor [System.Data.SqlClient.SqlBulkCopyOptions]::UseInternalTransaction)
      $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($conn, $bcpOption, $null)
      $bulkcopy.DestinationTableName = $Table
      $bulkcopy.bulkcopyTimeout = 0
      $bulkcopy.batchsize = $BatchSize
      $datatable = New-Object System.Data.DataTable

      # Get the column data for the given table
      # Sneakily selecting 1 row from the table puts the schema into the datatable
      try
      {
       $sql = 'select top 1 * from [' + $Table + ']'
       $dad = New-Object System.Data.SqlClient.SqlDataAdapter($sql, $conn)
       [void]$dad.Fill($datatable)
      }
      finally
      {
       $dad.Dispose() 
      }

      # If we read a row, clear it.
      $datatable.Clear()

      # Init row counter
      $i = 0
     
      # Headers, or not?
      $headerArgument = @{}
     
      if ($NoHeaders)
      {
       # If user specifies -NoHeaders, generate dummy headers - as many as there are columns
       $headings = Invoke-Command -NoNewScope -ScriptBlock {
        
        # Use a distinct counter variable here; with -NoNewScope this block runs
        # in the current scope, so reusing $i would clobber the row counter above
        for ($j = 0; $j -lt $columnWidths.Length; $j++)
        {
          "Dummy$($j)"
        }
       }

       $headerArgument = @{ Header = $headings }
      }
     
      # Let Import-Csv deal with delimiter nonsense!
      Import-Csv -Path $Path -Delimiter $Delimiter @headerArgument |
       ForEach-Object {

       try
       {
         # Validate input column count
        # Import-Csv *ignores* extra columns in the input, so we will never know about them!
        # Empty columns, e.g. blah,,blah are rendered as empty strings.
        # If there are too few values, the remaining columns are rendered as $null (.NET null, not database null)

        $populatedColumnCount = ($_.PSObject.Properties.Value | Where-Object { $_ -ne $null } | Measure-Object).Count

        if ($populatedColumnCount -ne $columnCount)
        {
          throw "Incorrect number of columns in input. Got $($populatedColumnCount), expected $columnCount"
        }

        if ($TruncateLongValues -or $EmptyValueIsNull)
        {
          # Check columns - this will slow things down somewhat

          for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
          {
           $inputValue = $_.PSObject.Properties.Value[$col].ToString()

           if ($EmptyValueIsNull -and [string]::IsNullOrEmpty($inputValue))
           {
            $_."$($_.PSObject.Properties.Name[$col])" = [System.DBNull]::Value
           }
           elseif ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and $inputValue.Length -gt $columnWidths[$col])
           {
            Write-Warning "Row $($i + 1), Col $($col + 1): Value truncated"
            $_."$($_.PSObject.Properties.Name[$col])" = $inputValue.Substring(0, $columnWidths[$col])
           }
          }
        }

        [void]$datatable.Rows.Add($_.PSObject.Properties.Value)
       }
       catch
       {
        # Column datatype mismatch
        if ($StopOnError)
        {
          # Stop immediately
          throw
        }

        # Warn the user a row didn't go in and continue
        Write-Warning "Row $($i + 1): $($_.Exception.Message)"
       }

       if ((++$i % $BatchSize) -eq 0)
       {
        # Write batch
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
       }
      }

      if ($datatable.Rows.Count -gt 0)
      {
       # Write remaining rows
       $bulkcopy.WriteToServer($datatable)
       $datatable.Clear()
      }
    }
    catch
    {
      # A good script will handle exceptions here,
      # e.g. failure to connect to SQL server, incorrect/un-castable datatype for column etc.
      # for now, just re-throw
      throw
    }
    finally
    {
      # Clean up in finally block - ensures resources are released even in event of errors.
      ($bulkcopy, $datatable, $conn) |
       Where-Object { $_ -ne $null } |
       ForEach-Object {
     
       $_.Dispose()
      }
    }
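
    For reference, an invocation that exercises the new switches might look like this (a sketch only; the script file name Import-CsvTable.ps1, the connection string and the table name are hypothetical placeholders):

    .\Import-CsvTable.ps1 -Path .\input.csv `
      -ConnectionString 'Server=.;Database=Staging;Integrated Security=True' `
      -Table 'MyStagingTable' `
      -Delimiter ';' `
      -NoHeaders `
      -EmptyValueIsNull `
      -TruncateLongValues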

  • Awesome!!!

    Will run some test cases.

    Thanks again.

  • I created a view so my CSV matches my SQL schema (staging table). I used these switches:
    -TruncateLongValues -EmptyValueIsNull

    The message I got from the script, for every row:

    WARNING: Row 1: Incorrect number of columns in input. Got 168, expected 0

    The actual staging table has 169 fields, but the view has 168 and matches the CSV.

    Have I got something set up wrong?

    Thanks for your help.

    Script is working great!!! The error above happened when the script went to find the column lengths: the query was looking for a table ('U'). I just changed it to look for the view ('V'):
    select
    c.[max_length]
    from sys.columns c
    inner join sys.objects o
    on c.object_id = o.object_id and o.type = 'V'  -- View instead of Table 'U'
    where o.Name = N'{0}'
    order by c.column_id
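
    If you want the same script to work against both tables and views without editing it each time, a more general tweak (an untested sketch) is to accept both object types in the $columnLengthSql query:

    select
    c.[max_length]
    from sys.columns c
    inner join sys.objects o
    on c.object_id = o.object_id and o.type in ('U', 'V')  -- Table or View
    where o.Name = N'{0}'
    order by c.column_id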

    Many Thanks again!!!

  • Jeff Moden - Monday, July 9, 2018 11:56 AM

    HighPlainsDBA - Monday, July 9, 2018 11:42 AM

    Thanks for sharing this. Reminded me how much I hate comma-delimited files, arguably the worst idea ever to take off in data processing. I mean, seriously, of all the delimiters to choose from, someone chose commas?? And now the years of inertia behind it continue to slow down many an IT worker.

    Nope... not the worst idea.  That would be XML. πŸ˜‰

    I must definitely agree with you here, Jeff.

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    β€œlibera tute vulgaris ex”

  • When using -TruncateLongValues or -EmptyValueIsNull, the process slows way down. I have a 168-row CSV that loads perfectly, but
    is there anything that could speed up the process?

     Thanks again.

