Is it possible to determine the max length of a field in a CSV file?

  • This has been discussed before but I couldn't find a case/answer that might apply to my situation:

    From time to time I have raw data in text form to import into SQL Server. In almost every case I must try several times, because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which field (the first one) was truncated, and then I increase that field's size.

    There might be more than one field that needs its size increased, and the SSIS wizard reports only one error each time it encounters a truncation. As you can see, this is very tedious, so I want a way to quickly inspect the data first and determine the max size of each field.

    I came across an old post at http://stackoverflow.com/questions/30696776/calculate-max-length-of-each-field-in-csv-file

    Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).

    I don't really expect there to be a way to do this, but I still want to post my question here hoping to get some clues.

    Thank you very much.

  • Well, if you really don't know the lengths, just import everything as varchar(max) and then do some analysis from there, along the lines of the sketch below.
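
    A minimal sketch of that follow-up analysis, assuming a hypothetical staging table dbo.ImportStaging whose columns were all created as varchar(max); the table and column names are placeholders:

    SELECT
        MaxLenCol1 = MAX(LEN(Col1)),
        MaxLenCol2 = MAX(LEN(Col2)),
        MaxLenCol3 = MAX(LEN(Col3))
    FROM dbo.ImportStaging;

    Even on a 10-million-row table this is a single scan, so it's usually quick enough for a one-off profiling pass.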

  • This is the kind of annoying project that makes you reach for your nearest programming language;

    SSIS is handicapped when it comes to dynamic data lengths.

    I've got a Generic CSV Importer SSIS package featuring a fancy script task that does this for well-formed data;

    In my case, I don't even try to find the shape of the CSV data beforehand.

    The script task can optionally assume column names are in the first row (or not).

    It reads each line in the document as a string array, split on the delimiters and quote identifiers I select, and sticks the data, row by row (RBAR style), into a DataTable.

    After the DataTable is filled, I can find the MAX(length) of each column.

    With that info, I build the destination table and bulk insert into it.

    Since the DataTable has the lengths available, I've used this to create a table, or even to issue ALTER TABLE ALTER COLUMN commands to make my varchars bigger prior to a bulk insert (see the sketch below).
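
    A minimal, hypothetical example of that widening step (the table, column, and new length are placeholders; in the real package they come from the DataTable's measured lengths):

    -- widen the destination column so the subsequent bulk insert can't truncate
    ALTER TABLE dbo.ImportTarget ALTER COLUMN CustomerName varchar(500) NULL;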

    I've used this on the NPPES data file, which is at least a gig or two in size (all physicians/providers in the US, in a wide, ugly table with 100 columns or so).

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • A way to get more accurate results is to use the Advanced tab of the flat file connection.

    There's a button labeled "Suggest Types..."; there you can specify the number of rows to analyze to get an estimate of what your data might look like. It also includes an option to pad string columns, which helps prevent additional truncation in the rows that weren't analyzed.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

  • Thanks for all the input; can anyone recommend a good ETL tool for tasks like this?

  • You can accomplish this using NGrams2B (the download link is at the end of the article it was published with). I don't know how you are bringing in your CSV file data, so I'm using a variable for the examples below. Here's how you'd do this using NGrams2B:

    DECLARE @sampleCSV varchar(max) = 'xx,yyyy,zzzzzzz,abc';

    WITH
    points AS
    (
        SELECT position = 0, token = ',' UNION ALL
        SELECT position, token FROM dbo.NGrams2B(@sampleCSV, 1) WHERE token = ','
    ),
    lengths(l) AS
    (
        SELECT (ISNULL(LEAD(position, 1) OVER (ORDER BY position), LEN(@sampleCSV)+1) - position) - 1
        FROM points
    )
    SELECT MAX(l)
    FROM lengths;

    I don't know what your row and column delimiters are but, let's pretend your row delimiter was char(13)+char(10); you would replace it with your column delimiter (a comma in this example).

    DECLARE @sampleCSV varchar(max) = 'xx,yyyy,zzzzzzz,abc'+char(13)+char(10)+'x,x,x,zzz';

    WITH
    points AS
    (
        SELECT position = 0, token = ',' UNION ALL
        SELECT position, token
        FROM dbo.NGrams2B(REPLACE(@sampleCSV,char(13)+char(10),','),1)
        WHERE token = ','
    ),
    lengths(l) AS
    (
        SELECT (ISNULL(LEAD(position, 1) OVER (ORDER BY position), LEN(REPLACE(@sampleCSV,char(13)+char(10),','))+1) - position) - 1
        FROM points
    )
    SELECT MAX(l)
    FROM lengths;

    It's worth noting that this is a purely set-based solution, and I'll put its performance up against a CLR or external script any day. :-P

    edit: got row and column mixed up. :hehe:

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (11/8/2016)


    You can accomplish this using NGrams2B (the download link is at the end of the article it was published with). I don't know how you are bringing in your CSV file data, so I'm using a variable for the examples below. Here's how you'd do this using NGrams2B:

    edit: got row and column mixed up. :hehe:

    Thank you very much Alan, I will definitely look into it later.

  • Alan.B (11/8/2016)


    You can accomplish this using NGrams2B (the download link is at the end of the article it was published with).

    I must be tired and I'm not seeing what you're talking about. Can you be more specific?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sometimes I use this ad hoc method, which is very simple but works like a charm.

    😎

    First, a staging table:

    IF OBJECT_ID(N'dbo.TBL_FLAT_FILE_IMPORT_STAGING') IS NULL
    BEGIN
        CREATE TABLE dbo.TBL_FLAT_FILE_IMPORT_STAGING
        (
             FFIS_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_FLAT_FILE_IMPORT_STAGING_FFIS_ID PRIMARY KEY CLUSTERED
            ,FFIS_DATA NVARCHAR(MAX) NULL
        );
    END

    Next, an import of the file, lines not split:

    TRUNCATE TABLE dbo.TBL_FLAT_FILE_IMPORT_STAGING;

    INSERT INTO dbo.TBL_FLAT_FILE_IMPORT_STAGING WITH (TABLOCK) (FFIS_DATA)
    EXEC XP_CMDSHELL 'TYPE C:\EE_DATA\Incoming\IMPORT_TEST_FILE.csv';

    Then a function to locate the delimiters:

    CREATE FUNCTION [dbo].[ITVFN_DELIMITER_N_POS_TO_10_FULL_BIN_NULL]
    (
         @INPUT_STRING NVARCHAR(4000)
        ,@DELIM        NCHAR(1)
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
    WITH DELIMITER_COUNT(INP_LEN,DLMCNT,INP_STR) AS
    (
        SELECT
             (LEN(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3)) - 2) AS INP_LEN
            ,LEN(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3)) - LEN(REPLACE(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3),@DELIM,N'')) AS DLMCNT
            ,@INPUT_STRING AS INP_STR
    )
    SELECT
         DC.DLMCNT
        ,DC.INP_LEN
        ,DC.INP_STR
        ,CASE WHEN DC.DLMCNT > 0 THEN CI1.CI  ELSE NULL END AS I1
        ,CASE WHEN DC.DLMCNT > 1 THEN CI2.CI  ELSE NULL END AS I2
        ,CASE WHEN DC.DLMCNT > 2 THEN CI3.CI  ELSE NULL END AS I3
        ,CASE WHEN DC.DLMCNT > 3 THEN CI4.CI  ELSE NULL END AS I4
        ,CASE WHEN DC.DLMCNT > 4 THEN CI5.CI  ELSE NULL END AS I5
        ,CASE WHEN DC.DLMCNT > 5 THEN CI6.CI  ELSE NULL END AS I6
        ,CASE WHEN DC.DLMCNT > 6 THEN CI7.CI  ELSE NULL END AS I7
        ,CASE WHEN DC.DLMCNT > 7 THEN CI8.CI  ELSE NULL END AS I8
        ,CASE WHEN DC.DLMCNT > 8 THEN CI9.CI  ELSE NULL END AS I9
        ,CASE WHEN DC.DLMCNT > 9 THEN CI10.CI ELSE NULL END AS I10
    FROM DELIMITER_COUNT DC
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, 1)          ) CI1  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI1.CI + 1) ) CI2  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI2.CI + 1) ) CI3  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI3.CI + 1) ) CI4  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI4.CI + 1) ) CI5  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI5.CI + 1) ) CI6  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI6.CI + 1) ) CI7  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI7.CI + 1) ) CI8  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI8.CI + 1) ) CI9  (CI)
    CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI9.CI + 1) ) CI10 (CI)
    ;

    Finally, find the widest entry in each column:

    SELECT
        -- subtract 1 so the delimiter itself isn't counted in each column's width
         MAX(CP.I1  -         1) MXL_I1
        ,MAX(CP.I2  - CP.I1 - 1) MXL_I2
        ,MAX(CP.I3  - CP.I2 - 1) MXL_I3
        ,MAX(CP.I4  - CP.I3 - 1) MXL_I4
        ,MAX(CP.I5  - CP.I4 - 1) MXL_I5
        ,MAX(CP.I6  - CP.I5 - 1) MXL_I6
        ,MAX(CP.I7  - CP.I6 - 1) MXL_I7
        ,MAX(CP.I8  - CP.I7 - 1) MXL_I8
        ,MAX(CP.I9  - CP.I8 - 1) MXL_I9
        ,MAX(CP.I10 - CP.I9 - 1) MXL_I10
        -- note: the text after a line's last delimiter (the final column) isn't measured here;
        -- CP.INP_LEN - CP.I10 would cover it when a line has all ten delimiters
    FROM dbo.TBL_FLAT_FILE_IMPORT_STAGING FFIS
    CROSS APPLY dbo.ITVFN_DELIMITER_N_POS_TO_10_FULL_BIN_NULL(FFIS.FFIS_DATA,NCHAR(44)) CP;

  • halifaxdal (11/8/2016)


    From time to time I have raw data in text form to import into SQL Server... I want a way to quickly inspect the data first and determine the max size of each field... my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).

    Personally, I wouldn't do this with SSIS.

    I'd use BULK INSERT to read just the first line of the file and get a count of the delimiters (the number of delimiters + 1 is the number of columns). If it's a header row, you can also derive the column names from that same row using something like the DelimitedSplit8K function, which also gives you a column count.

    Use that information to dynamically create a table (with all VARCHAR(8000) or VARCHAR(MAX) columns {not my first choice, btw}) and import the data. Again, using the column names previously captured, write some dynamic SQL to calculate the MAX(LEN()) of each column. Then use THAT data along with the column names to create what you need for a final staging table for future use, and tweak it manually to change datatypes as a first form of data validation for future imports. A sketch of these steps is below.
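
    A minimal sketch of the delimiter-count and dynamic MAX(LEN()) steps, with hypothetical names throughout (#FirstRow, #Import, and the file path are placeholders, and 0x0b is simply a character assumed not to appear in the data):

    -- read just the first line of the file to count the delimiters
    CREATE TABLE #FirstRow (HeaderLine varchar(8000));

    BULK INSERT #FirstRow
    FROM 'C:\Import\MyFile.csv'
    WITH (ROWTERMINATOR = '0x0a',    -- line feed
          FIELDTERMINATOR = '0x0b',  -- vertical tab: assumed absent, so each line lands in one column
          LASTROW = 1);

    -- number of columns = number of delimiters + 1
    SELECT ColumnCount = LEN(HeaderLine) - LEN(REPLACE(HeaderLine, ',', '')) + 1
    FROM #FirstRow;

    -- after the wide-varchar import into a table such as #Import,
    -- generate the MAX(LEN()) query dynamically from the column metadata
    DECLARE @sql nvarchar(max);

    SELECT @sql = STUFF((
        SELECT ', MAX(LEN(' + QUOTENAME(name) + ')) AS ' + QUOTENAME(name)
        FROM tempdb.sys.columns
        WHERE object_id = OBJECT_ID('tempdb..#Import')
        FOR XML PATH('')), 1, 2, 'SELECT ');

    SET @sql += ' FROM #Import;';
    EXEC (@sql);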

    --Jeff Moden



  • Jeff Moden (11/8/2016)


    I must be tired and I'm not seeing what you're talking about. Can you be more specific?

    I'm not ignoring your question - I'm sick and dealing with a sick kid :sick: :sick:. I'll have a more specific explanation of what I was talking about tomorrow.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (11/9/2016)


    I'm not ignoring your question - I'm sick and dealing with a sick kid :sick: :sick:. I'll have a more specific explanation of what I was talking about tomorrow.

    Been there and done that. Glad it's not me. Hope you both feel better soon.

    --Jeff Moden



  • Jeff Moden (11/9/2016)


    Been there and done that. Glad it's not me. Hope you both feel better soon.

    Ok, tough week but I'm back.

    For context, what I posted was intended to be a SQL Server-only alternative to the link the OP posted: http://stackoverflow.com/questions/30696776/calculate-max-length-of-each-field-in-csv-file. Looking back, what I posted was not complete, and I don't know if it's a great idea - it was just a thought...

    Taking a .csv file that looks like this:

    1,alan burstein,abc,123123123
    2,fred jones,xxx,123123123123
    3,sue smith,sssssss,51224123

    You could pull it into a variable like this:

    CREATE TABLE #test (string varchar(max));

    BULK INSERT #test
    FROM 'C:\Users\aburstein\Desktop\test.csv'
    WITH (FIELDTERMINATOR = '|||', ROWTERMINATOR = '|||'); -- neither terminator appears in the file, so the text comes in "as is" (one row, one column)

    DECLARE @string varchar(max) = (SELECT string FROM #test);

    Then you could determine the max length for the values in each column like this:

    -- below is what your variable would look like
    DECLARE @string varchar(max);
    SET @string = '1,alan burstein,abc,123123123'  + char(13)+char(10)
                + '2,fred jones,xxx,123123123123'  + char(13)+char(10)
                + '3,sue smith,sssssss,51224123';

    -- turn the row delimiters into column delimiters
    SET @string = REPLACE(@string,char(13)+char(10),',');

    -- Ngrams; the ISNULL end-sentinel ensures the last value in the string is measured too
    SELECT colNbr, MaxLen = MAX(gap)
    FROM
    (
        SELECT ISNULL(LEAD(position,1) OVER (ORDER BY x.position), LEN(@string)+1) - x.position - 1, x.position, (rn%4)+1
        FROM
        (
            SELECT 0, 0 UNION ALL
            SELECT ROW_NUMBER() OVER (ORDER BY position), position
            FROM dbo.NGrams2B(@string,1)
            WHERE token = ','
        ) x(rn, position)
    ) x (gap, position, colNbr)
    GROUP BY colNbr;

    Alternatively, you could do the same thing with DelimitedSplit8K like this:

    -- below is what your variable would look like
    DECLARE @string varchar(max);
    SET @string = '1,alan burstein,abc,123123123'  + char(13)+char(10)
                + '2,fred jones,xxx,123123123123'  + char(13)+char(10)
                + '3,sue smith,sssssss,51224123';

    SET @string = REPLACE(@string,char(13)+char(10),',');

    -- DelimitedSplit8K solution (the function takes varchar(8000), which is fine for a sample this size)
    SELECT
        ColNbr = (ItemNumber-1)%4+1,
        MaxLen = MAX(LEN(Item))
    FROM dbo.DelimitedSplit8K(@string,',')
    GROUP BY (ItemNumber-1)%4;

    Hopefully that makes more sense.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001
