Parallel Load of Thousands of CSV Files

  • Hi,

    What is the fastest way to load thousands of CSV files with the same structure into an on-prem SQL Server database (staging), day by day? Parallelization would be ideal. The files have the customer number in the file name. Candidate tools: SSIS, BULK INSERT, stored procedures, Data Factory, partitioning ...

    Thanks

    Regards

    Nicole

  • Are they all going into the same target table? If so, you'll need to be careful with parallelism, because of the likelihood of blocking occurring during INSERTs.

    I'd use SSIS for this. There are many others who prefer to stay totally in the realm of SQL Server and would use BCP. A lot depends on where your skills lie and what sort of environment you have.

    BCP has less overhead than SSIS and will likely run faster, though getting parallelism, logging and error handling well configured will (in my opinion) require more effort in BCP than SSIS.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I'm one of those folks that Phil is speaking of.  He's correct about BCP but BULK INSERT is the fastest and, done correctly, it's nasty fast.  A BCP format file will also work wonders for performance.  It isn't actually all that much more complicated than doing it in SSIS although both can be made to work well.  "It Depends" on your druthers.  I happen to prefer the non-SSIS methods for doing the actual imports and it's not horribly difficult to get parallelism especially if you use staging tables.
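
    As a rough idea of what I mean, here's a minimal single-file BULK INSERT sketch. The path, staging table name, code page and delimiter are placeholders/assumptions for illustration, not details from this thread:

    BULK INSERT dbo.OrdersStaging                        -- placeholder staging table
    FROM 'C:\Import\12345_SalesReport.csv'               -- placeholder path and file
    WITH (
          FIELDTERMINATOR = ';'                          -- adjust to the actual delimiter
        , ROWTERMINATOR   = '\n'                         -- or '0x0d0a' for CRLF files
        , CODEPAGE        = '65001'                      -- assumption: UTF-8 encoded files
        , FIRSTROW        = 2                            -- skip a header row, if there is one
        , TABLOCK                                        -- allows minimal logging on a heap
     -- , FORMATFILE      = 'C:\Import\SalesReport.fmt'  -- optional BCP format file
    );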

    You talk about having a file naming convention.  That's not a big issue and it's not difficult to extract the customer number and other embedded information from the file names.
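
    For example, if the names start with the customer number followed by an underscore (the file name below is made up for illustration), pulling it out is a one-liner:

    DECLARE @FileName nvarchar(260) = N'12345_SalesReport.csv';  -- hypothetical file name

    SELECT CustomerNumber = LEFT(@FileName, CHARINDEX(N'_', @FileName) - 1);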

    I'd need a bit more information on what you want done for file handling (example... what do you want done with files that have been imported?), what the file structure is, and what the final destination table looks like in order to be able to help more.  It would be handy (provided there's no sensitive or PII info in it) if you could provide the first 10 rows of a file or three along with what the expected record layout is.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    thank you for this information!

    During the night, the flat files from around 500 customers arrive on an FTP server, one after another. The SSIS job currently starts at 08:00 AM and loads the files one by one; this takes about 3 hours. The question is whether and how these files can be loaded into one or more staging target tables faster, without blocking, or merged beforehand. This target table is then the basis for the core DWH layers. It doesn't matter which tools are used for this.

    Event-based processing when files arrive is also an option...

    Regards

    Nicole

     

    If they all have the same structure, and depending on the size and on whether quality is guaranteed (e.g. they are real CSV files, not just arbitrary delimited files), then you could also consider using a small PowerShell script to "combine" all the files into one before inserting into the final table (using Import-Csv and Export-Csv within a foreach loop). Note that in this case it is possible to add a new column to the exported CSV file containing the file name of the input file; this can be used further down the ETL to distinguish which customer the data is from.

    This normally results in a HUGE performance increase, depending on the size of the CSV files.

    Another possible option, though from experience slower than the CSV combine, is to use the MULTIFLATFILE connection manager; again, this is a lot faster than a loop through each file.

  • info 58414 wrote:

    Hi,

    thank you for this information!

    During the night, the flat files from around 500 customers arrive on an FTP server, one after another. The SSIS job currently starts at 08:00 AM and loads the files one by one; this takes about 3 hours. The question is whether and how these files can be loaded into one or more staging target tables faster, without blocking, or merged beforehand. This target table is then the basis for the core DWH layers. It doesn't matter which tools are used for this.

    Event-based processing when files arrive is also an option...

    Regards

    Nicole

    You've left out a whole bunch of information about the files, Nicole.  For starters...

    1. How many "records" does each file have?  An average will be fine.

    2. You said previously that all of the files are formatted the same.  How many "fields" do they have?

    3. Are the files "fixed field width" or delimited and, if delimited, what is the delimiter?

    4. Any quoted string identifiers in the file?

    5. If delimited, any delimiters embedded in the "fields"?

    6. What is the average width in bytes for the "records"?

    7. What is the format of the file names that you  previously mentioned?

    8. What does the ultimate destination table look like?  Or is there more than one destination table?

    9. What do you want done with the files after they've been loaded?

    10. Does the source directory of the files change every day?

    11. What kind of validation do you need done at the record and field level of the files?  For example, do you have to do a "merge/upsert" or is all of the data guaranteed to be "new"?

    I'm sure that I've missed a thing or two but that's a good start.

    --Jeff Moden



  • Hi Jeff,

    1. How many "records" does each file have. An average will be fine.

    => about 2000 per File

    2. You said previously that all of the files are formatted the same. How many "fields" do they have.

    => 15

    3. Are the files "fixed field width" or delimited and, if delimited, what is the delimiter?

    => Semicolon

    4. Any quoted string identifiers in the file?

    => No

    5. If delimited, any delimiters embedded in the "fields"?

    => No

    6. What is the average width in bytes for the "records"?

    => about 600 Bytes

    7. What is the format of the file names that you previously mentioned?

    => <CustID>_SalesReport.csv

    8. What does the ultimate destination table look like? Or is there more than one destination table?

    => 1 destination Table with same Structure + technical Cols

    9. What do you want done with the files after they've been loaded?

    => Load to Cleansing => load to Core DWH (3NF)

    10. Does the source directory of the files change every day?

    => no (always empty after processing)

    11. What kind of validation do you need done at the record and field level of the files? For example, do you have to do a "merge/upsert" or is all of the data guaranteed to be "new"?

    => no validation. Validation with T-SQL in Cleansing Area   🙂

    The goal is to bring the raw data into the SQL cleansing area as quickly (ideally in parallel) and as simply as possible. It's a DWH/OLAP project.

    Thanks and Regards

    Nicole

     

     

  • Adding to Jeff's questions:

    When you load each file, do you need to know, within the database, which file was loaded?

    You mention that the file has the CustID in the name - is that CustID also in the contents of the file?

    Assuming it matters.

    And with regard to your current method of loading - are you using a loop with a flat file connection, or are you using a MULTIFLATFILE connection without the loop? If the first, have you tried the second, and if so, what performance improvement did you get? And if you didn't try it, could you do it now and see whether performance is acceptable? (I would expect 10% or less of the time vs. the loop option.)

     

  • info 58414 wrote:

    Hi Jeff,

    1. How many "records" does each file have. An average will be fine.

    => about 2000 per File

    2. You said previously that all of the files are formatted the same. How many "fields" do they have.

    => 15

    3. Are the files "fixed field width" or delimited and, if delimited, what is the delimiter?

    => Semicolon

    4. Any quoted string identifiers in the file?

    => No

    5. If delimited, any delimiters embedded in the "fields"?

    => No

    6. What is the average width in bytes for the "records"?

    => about 600 Bytes

    7. What is the format of the file names that you previously mentioned?

    => <CustID>_SalesReport.csv

    8. What does the ultimate destination table look like? Or is there more than one destination table?

    => 1 destination Table with same Structure + technical Cols

    9. What do you want done with the files after they've been loaded?

    => Load to Cleansing => load to Core DWH (3NF)

    10. Does the source directory of the files change every day?

    => no (always empty after processing)

    11. What kind of validation do you need done at the record and field level of the files? For example, do you have to do a "merge/upsert" or is all of the data guaranteed to be "new"?

    => no validation. Validation with T-SQL in Cleansing Area   🙂

    The goal is to bring the raw data into the SQL cleansing area as quickly (ideally in parallel) and as simply as possible. It's a DWH/OLAP project.

    Thanks and Regards

    Nicole

    Excellent.  Almost there.  For Question #8, I have nothing to go on because I don't know what the record layout for the files is.  Can you attach, say, the first 10 lines of one file (including any header(s), but please make sure there's no PII), or at least the column name and type list plus the CREATE TABLE statement for the first table to load into, including any indexes, constraints (don't forget FKs, if present), etc. for that table, please?

    It would also be helpful if we knew what the "cleansing" process is because we might be able to do a lot of that during the load.

    According to some of the specs you posted, this is only a million record load and only 600MB in total size.  We should be able to write something that will blast through the initial loading of all the files in just a minute or two even if we do it serially.  I just need the record layout information for the files and what the loaded CREATE TABLE statement is with the indexes, constraints, extra columns, etc.
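
    As a rough sketch of the kind of thing I have in mind (the staging table name, import directory and file list below are made-up placeholders until we have the real DDL), a plain serial loop over the files could look like this:

    -- Assumption: the file names have already been captured into a work table,
    -- e.g. from the directory listing produced by whatever moves the files off the FTP.
    DECLARE @FileList TABLE (FileName nvarchar(260) NOT NULL);
    INSERT @FileList VALUES (N'10001_SalesReport.csv'), (N'10002_SalesReport.csv');  -- placeholders

    DECLARE @FileName nvarchar(260), @sql nvarchar(MAX);

    DECLARE cFiles CURSOR LOCAL FAST_FORWARD FOR SELECT FileName FROM @FileList;
    OPEN cFiles;
    FETCH NEXT FROM cFiles INTO @FileName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'BULK INSERT dbo.OrdersStaging '          -- placeholder table name
                 + N'FROM ''C:\Import\' + @FileName + N''' '  -- placeholder directory
                 + N'WITH (FIELDTERMINATOR = '';'', ROWTERMINATOR = ''\n'', CODEPAGE = ''65001'', TABLOCK);';
        EXEC (@sql);

        FETCH NEXT FROM cFiles INTO @FileName;
    END;

    CLOSE cFiles;
    DEALLOCATE cFiles;

    Stamping the source file name (and the customer number parsed from it) into the rows can then be done with an UPDATE right after each file's BULK INSERT; how best to do that depends on the final table layout.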

    --Jeff Moden



  • Hi Jeff,

    The script for the staging table is below. All cleansing activities are done in the cleansing area - no cleansing, check constraints, PKs, FKs, or indexes in Stage. Technical columns start with "tc".

    Regards Nicole

    CREATE TABLE [sap].[Orders](
        [id] [nvarchar](50) NULL,
        [sap_order_id] [nvarchar](50) NULL,
        [external_system_id] [nvarchar](50) NULL,
        [order_date] [nvarchar](50) NULL,
        [company_id] [nvarchar](50) NULL,
        [currency] [nvarchar](50) NULL,
        [accumulated_price] [nvarchar](50) NULL,
        [shipping_costs] [nvarchar](50) NULL,
        [order_status] [nvarchar](50) NULL,
        [purchase_order_indicator] [nvarchar](50) NULL,
        [sap_invoice_customer_id] [nvarchar](50) NULL,
        [sap_invoice_address_id] [nvarchar](50) NULL,
        [sap_delivery_customer_id] [nvarchar](50) NULL,
        [sap_delivery_address_id] [nvarchar](50) NULL,
        [discount_percentage] [nvarchar](50) NULL,
        [source_url] [nvarchar](50) NULL,
        [affiliate_id] [nvarchar](50) NULL,
        [voucher_code] [nvarchar](50) NULL,
        [payment_method] [nvarchar](50) NULL,
        [tcInsSourceFileName] [nvarchar](20) NULL,
        [tcInsTimestamp] [datetime] NULL,
        [tcInsAdfPipelineRunId] [nvarchar](20) NULL
    ) ON [PRIMARY]
    GO

     

  • Thanks, Nicole.  I'm sorry I missed this yesterday and today.  I'm prepping for an SQL|Saturday that I'm driving to tomorrow and won't be back until Sunday.  I'll give it a shot for you then.  In the meantime, is there no chance of getting the first 10 rows of one of the files?

    --Jeff Moden



  • Ah... finally found this post again.  Like I said in my last post, is there any chance of you attaching the first 10 or so lines of one of the files so I can do a quick import test?

    --Jeff Moden


