Better sampling setting

  • Hi,

    I have to import CSV files with hundreds of columns and only a few thousand records.  The flat file import wizard does not do full sampling, so it may identify a column whose longest value is 60 characters as nvarchar(50), or one whose longest value is 1,950 characters as nvarchar(1800).  When the import fails, it does not say which column's data was truncated, so I have to run formulas in Excel to determine the maximum length of each column, which is quite annoying.  Is there a setting to tell the Import Flat File wizard to do full sampling of the file, so that the data types and lengths are determined correctly?

    Thanks.

     

  • Are you processing the same files each day/week/month?  Or is every single file completely different from previous files processed?

    If this is a process that repeats with a new file submission, then you really should just use SSDT to build a package for each type of file.  Then, if a column length increases, it is very easy to open that package, change the column definition and the destination table definition, and rerun the package.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Roust_m wrote:

    Hi,

    I have to import CSV files with hundreds of columns and only a few thousand records.  The flat file import wizard does not do full sampling, so it may identify a column whose longest value is 60 characters as nvarchar(50), or one whose longest value is 1,950 characters as nvarchar(1800).  When the import fails, it does not say which column's data was truncated, so I have to run formulas in Excel to determine the maximum length of each column, which is quite annoying.  Is there a setting to tell the Import Flat File wizard to do full sampling of the file, so that the data types and lengths are determined correctly?

    Thanks.

    Ok... so let's say that you had such a thing... What would you do when the data grows bigger than the columns in your permanent tables?

    The real key here is that you're getting data from someone or something.  There should be some knowledge of what the maximum size will ever be for any column, just as there should be an assigned datatype.

    If no such knowledge or documentation is available, then set everything to as big as possible and go from there with the understanding that you will someday break the data in permanent tables.

    And this is just as fast as doing anything else because a scan of the data is a scan of the data. You might as well do it all in one scan and call it a day.
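
    And once the data has been landed that way, finding the real maximum length of every column is itself a single scan.  A minimal sketch, assuming a hypothetical dbo.Staging table with oversized string columns (STRING_AGG needs SQL Server 2017 or later):

        -- Compare the declared length with the actual longest value for every
        -- string column of dbo.Staging (hypothetical table name).
        DECLARE @sql nvarchar(max),
                @sep nvarchar(20) = CHAR(10) + N'UNION ALL' + CHAR(10);

        SELECT @sql = STRING_AGG(CAST(
                     'SELECT ''' + COLUMN_NAME + ''' AS column_name, '
                   + CAST(CHARACTER_MAXIMUM_LENGTH AS varchar(10)) + ' AS declared_length, '   -- -1 = (max)
                   + 'MAX(LEN(' + QUOTENAME(COLUMN_NAME) + ')) AS actual_max_length '
                   + 'FROM dbo.Staging' AS nvarchar(max)), @sep)
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = 'dbo'
          AND TABLE_NAME   = 'Staging'
          AND CHARACTER_MAXIMUM_LENGTH IS NOT NULL;   -- string columns only

        EXEC sys.sp_executesql @sql;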

    As for it not telling you which column(s) are being truncated: which version of SQL Server are you using?  A good ol' fashioned BULK INSERT in later versions will tell you.
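
    A minimal sketch of that, assuming SQL Server 2017 or later (FORMAT = 'CSV' only exists from 2017) and a staging table that already exists; the path and names are made up:

        BULK INSERT dbo.Staging
        FROM 'C:\Import\somefile.csv'
        WITH (
            FORMAT          = 'CSV',    -- SQL Server 2017+, handles quoted fields
            FIRSTROW        = 2,        -- skip the header row
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '0x0a'
        );
        -- A truncation failure names the offending row and column, e.g.:
        -- Bulk load data conversion error (truncation) for row 12, column 37 (SomeColumn).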

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeffrey Williams wrote:

    Are you processing the same files each day/week/month?  Or is every single file completely different from previous files processed?

    If this is a process that repeats with a new file submission, then you really should just use SSDT to build a package for each type of file.  Then, if a column length increases, it is very easy to open that package, change the column definition and the destination table definition, and rerun the package.

    It is mostly different files each time; most of them are one-off imports.

  • Jeff Moden wrote:

    Roust_m wrote:

    Hi,

    I have to import CSV files with hundreds of columns and only a few thousand records.  The flat file import wizard does not do full sampling, so it may identify a column whose longest value is 60 characters as nvarchar(50), or one whose longest value is 1,950 characters as nvarchar(1800).  When the import fails, it does not say which column's data was truncated, so I have to run formulas in Excel to determine the maximum length of each column, which is quite annoying.  Is there a setting to tell the Import Flat File wizard to do full sampling of the file, so that the data types and lengths are determined correctly?

    Thanks.

    Ok... so let's say that you had such a thing... What would you do when the data grows bigger than the columns in your permanent tables?

    The real key here is that you're getting data from someone or something.  There should be some knowledge of what the maximum size will ever be for any column, just as there should be an assigned datatype.

    If no such knowledge or documentation is available, then set everything to as big as possible and go from there with the understanding that you will someday break the data in permanent tables.

    And this is just as fast as doing anything else because a scan of the data is a scan of the data. You might as well do it all in one scan and call it a day.

    As for it not telling you which column(s) are being truncated: which version of SQL Server are you using?  A good ol' fashioned BULK INSERT in later versions will tell you.

    No such knowledge is available.  If a file has, let's say, 500 columns, the import wizard determines the right size 95% of the time.  Setting 500 columns to a large length is a tedious task, which is why I was thinking of limiting the work: if I knew which columns are too narrow, I would just change those columns instead of all of them.

    I am not worried about data growing beyond the column lengths in the future, as most of the imports are one-off.

    Also, as I understand it, BULK INSERT would require an existing table to be created first, which is a big task for a 500-column file.
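
    For what it's worth, that 500-column CREATE TABLE does not have to be typed by hand.  A rough sketch that builds it from the file's header row, assuming SQL Server 2017 or later; the path, table name and nvarchar(4000) width are all hypothetical, and a plain comma split will not cope with quoted header names that contain commas:

        -- Read the whole file as one string and keep only the header line.
        DECLARE @raw nvarchar(max), @header nvarchar(max), @sql nvarchar(max),
                @sep nvarchar(10) = ',' + CHAR(10);

        SELECT @raw = BulkColumn
        FROM OPENROWSET(BULK 'C:\Import\somefile.csv', SINGLE_CLOB) AS f;

        SET @header = REPLACE(LEFT(@raw, CHARINDEX(CHAR(10), @raw + CHAR(10)) - 1), CHAR(13), '');

        -- Turn every header name into a generously sized nvarchar column.
        -- STRING_SPLIT does not guarantee ordinal order before SQL Server 2022,
        -- so eyeball the generated statement before running it.
        SELECT @sql = 'CREATE TABLE dbo.Staging (' + CHAR(10)
                    + STRING_AGG(CAST('    ' + QUOTENAME(LTRIM(RTRIM(value)))
                                      + ' nvarchar(4000) NULL' AS nvarchar(max)), @sep)
                    + CHAR(10) + ');'
        FROM STRING_SPLIT(@header, ',');

        SELECT @sql AS create_statement;    -- review it, then EXEC (@sql);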

  • Some background information from you would help illuminate the problem, I think.

    So far, it seems that you periodically receive one-off files which contain hundreds of columns, which you would like to import into SQL Server. However, you don't know the structure of these files (without opening them and looking for yourself). You don't know the column names, data types or maximum column widths. You just want to import the files into a new table, allowing SQL Server to make a best guess at the datatypes. Is that the requirement?

    I'd like to know what these new tables are subsequently being used for, given that no one in your organisation has any knowledge of their structure.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I've never tried this, but if you use the (oxymoronically named) ACE driver to read the text file, using a connection string like this:

    Provider=Microsoft.ACE.OLEDB.12.0;Data Source=<filepath here>;Extended Properties="TEXT;HDR=YES;IMEX=1";

    after setting TypeGuessRows in your Registry to 0, you may be able to increase the number of rows which get sampled (to a maximum of 16,384 allegedly).
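
    From T-SQL, that would look something like the sketch below (also untested, and the details are guesses): it assumes the 64-bit ACE provider is installed, 'Ad Hoc Distributed Queries' is enabled, and that for text files the Data Source/Database part is the folder while the file name acts as the table name.  The folder and file names are made up:

        -- SELECT ... INTO also creates the destination table from the result set.
        SELECT *
        INTO   dbo.Staging
        FROM   OPENROWSET(
                   'Microsoft.ACE.OLEDB.12.0',
                   'Text;Database=C:\Import\;HDR=YES;IMEX=1',
                   'SELECT * FROM [somefile.csv]'
               ) AS src;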

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
