Best approach to import text file?

  • Jeff Moden (10/3/2011)

    The BCP format file would use similar delimiters and row terminators as you have in your BULK INSERT example (which is very nicely done, BTW). Perhaps if the OP attaches a data file, we can give it a whirl although I believe that the BULK INSERT you've created will be just as fast, if not faster. It's just the nature of BULK INSERT to be fast.

    Thank you for your opinion. I wasn't aware that BULK INSERT was fast but I try to use it whenever possible and it really shows great performance with large text files.

    Also would an SSIS package be executed as fast?

    The good thing about it is that it could be saved for later reuse and executed from DTEXEC.

    I'd love to see some benchmark from those different techniques but I don't have all those tools installed at home so I can't test all of them properly.

    Would be interesting to know though. 🙂

    Best regards,

    Best regards,

    Andre Guerreiro Neto

    Database Analyst
    http://www.softplan.com.br
    MCITPx1/MCTSx2/MCSE/MCSA

  • Here is the table:

    IF OBJECT_ID('tBufOMIM') IS NULL

    BEGIN

    CREATE TABLE tBufOMIM (

    NOINT NOT NULL PRIMARY KEY,

    TIVARCHAR(MAX)NULL,

    TXVARCHAR(MAX)NULL,

    SAVARCHAR(MAX)NULL,

    CSVARCHAR(MAX)NULL,

    CNVARCHAR(MAX)NULL,

    RFVARCHAR(MAX)NULL,

    CDVARCHAR(MAX)NULL,

    EDVARCHAR(MAX)NULL,

    MNVARCHAR(MAX)NULL,

    AVVARCHAR(MAX)NULL

    )

    PRINT 'Table tBufOMIM was created'

    END

    GO

    Format file:

    <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <RECORD>

    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="*RECORD**FIELD* NO"/>

    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="*FIELD* TI"/>

    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="*FIELD* TX"/>

    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="*FIELD* CN"/>

    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="*FIELD* SA"/>

    <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="*FIELD* RF"/>

    <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="*FIELD* CS"/>

    <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="*FIELD* CD"/>

    <FIELD ID="9" xsi:type="CharTerm" TERMINATOR="*FIELD* ED"/>

    <FIELD ID="10" xsi:type="CharTerm" TERMINATOR="*FIELD* MN"/>

    <FIELD ID="11" xsi:type="CharTerm" TERMINATOR=""/>

    </RECORD>

    <ROW>

    <COLUMN SOURCE="1" NAME="NO" xsi:type="SQLINT"/>

    <COLUMN SOURCE="2" NAME="TI" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="3" NAME="TX" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="4" NAME="CN" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="5" NAME="SA" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="6" NAME="RF" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="7" NAME="CS" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="8" NAME="CD" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="9" NAME="ED" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="10" NAME="MN" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="11" NAME="AV" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    </ROW>

    </BCPFORMAT>

    And attached a part of my file. The whole file is 154 mb.

  • codebyo (10/3/2011)


    Jeff Moden (10/3/2011)

    The BCP format file would use similar delimiters and row terminators as you have in your BULK INSERT example (which is very nicely done, BTW). Perhaps if the OP attaches a data file, we can give it a whirl although I believe that the BULK INSERT you've created will be just as fast, if not faster. It's just the nature of BULK INSERT to be fast.

    Thank you for your opinion. I wasn't aware that BULK INSERT was fast but I try to use it whenever possible and it really shows great performance with large text files.

    Also would an SSIS package be executed as fast?

    The good thing about it is that it could be saved for later reuse and executed from DTEXEC.

    I'd love to see some benchmark from those different techniques but I don't have all those tools installed at home so I can't test all of them properly.

    Would be interesting to know though. 🙂

    Best regards,

    IIRC, the Bulk Insert "Task" of SSIS uses the same bulk insert "engine" behind the scenes and is reportedly just as fast as BULK INSERT in T-SQL. I haven't actually tested it because I don't use SSIS for such things... and that's a whole 'nuther story. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • WaitingWonder2 (10/3/2011)


    Here is the table:

    IF OBJECT_ID('tBufOMIM') IS NULL

    BEGIN

    CREATE TABLE tBufOMIM (

    NOINT NOT NULL PRIMARY KEY,

    TIVARCHAR(MAX)NULL,

    TXVARCHAR(MAX)NULL,

    SAVARCHAR(MAX)NULL,

    CSVARCHAR(MAX)NULL,

    CNVARCHAR(MAX)NULL,

    RFVARCHAR(MAX)NULL,

    CDVARCHAR(MAX)NULL,

    EDVARCHAR(MAX)NULL,

    MNVARCHAR(MAX)NULL,

    AVVARCHAR(MAX)NULL

    )

    PRINT 'Table tBufOMIM was created'

    END

    GO

    Format file:

    <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <RECORD>

    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="*RECORD**FIELD* NO"/>

    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="*FIELD* TI"/>

    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="*FIELD* TX"/>

    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="*FIELD* CN"/>

    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="*FIELD* SA"/>

    <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="*FIELD* RF"/>

    <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="*FIELD* CS"/>

    <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="*FIELD* CD"/>

    <FIELD ID="9" xsi:type="CharTerm" TERMINATOR="*FIELD* ED"/>

    <FIELD ID="10" xsi:type="CharTerm" TERMINATOR="*FIELD* MN"/>

    <FIELD ID="11" xsi:type="CharTerm" TERMINATOR=""/>

    </RECORD>

    <ROW>

    <COLUMN SOURCE="1" NAME="NO" xsi:type="SQLINT"/>

    <COLUMN SOURCE="2" NAME="TI" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="3" NAME="TX" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="4" NAME="CN" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="5" NAME="SA" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="6" NAME="RF" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="7" NAME="CS" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="8" NAME="CD" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="9" NAME="ED" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="10" NAME="MN" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    <COLUMN SOURCE="11" NAME="AV" xsi:type="SQLVARYCHAR" NULLABLE="YES"/>

    </ROW>

    </BCPFORMAT>

    And attached a part of my file. The whole file is 154 mb.

    I've not used this particular form (xmlns) of a BCP format file before but I believe I can see one of the problems right off and it's one of the "problems" with BCP. Look at the first "Terminator" you've defined... now, look at it in the file. That's not actually a terminator, is it? In fact, the whole problem with this particular format file is that what you're calling "Terminators" are actually "field prefixes" and neither BCP or BULK INSERT can be made to work that way.

    Instead, you have to "teach" BCP though the format file to ignore the first field prefix using a fixed length format and then use the PREFIX of the NEXT field as the terminator of the current field. You also have to account for the CrLf characters on each field line in the table.

    So far as the file you've attached goes, I did ask that you post something without violating any private information. The file contains personal information for individuals... please delete it from your post immediately. You can repost it once you've "de-personalized" it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for the explanation, Jeff.

    I was getting errors when trying to make use of BCP or BULK INSERT with that format and I wondered what I was doing wrong.

    Best regards,

    Best regards,

    Andre Guerreiro Neto

    Database Analyst
    http://www.softplan.com.br
    MCITPx1/MCTSx2/MCSE/MCSA

  • Hello, I had to import from a lot of text files that varied in formats and had things like page numbers in that made pre-processing necessary

    I used SSIS but with a VB script first that processed the input file into one acceptable to SSIS flat file

    The code strips out headers/pages/blank lines and turns multi-line comments into a single string that goes into a text column, things of this nature

    It is very fast to run and I control it all from code, I use a scripted data transformation task as well so I have exact control over this part too.

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply