Loading more than 1 billion records from a CSV file

  • xsevensinzx - Thursday, July 19, 2018 2:45 AM

    Yep, BULK INSERT is a great option. I used it as the sole way to slam 50 million+ records into a staging table and then used stored procedures to transform the data. The method you describe with row numbers is technically the same approach I take with Python. The only difference would be the added overhead of reading up to that row number for each process/thread/etc. In Python, the file is already open and has been read up to the row number, so as each row is reached it's partitioned out without actually closing the file and reopening it.
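
    A minimal sketch of that single-pass idea in Python (the file name, chunk size, and loader call are illustrative, not from the original workflow):

    ```python
    import csv

    def stream_chunks(path, rows_per_chunk=1_000_000):
        """Read a large CSV once and yield it in row-count chunks.

        The file stays open for the whole pass; each chunk is a list of
        parsed rows that a worker can load (e.g. into a staging table).
        """
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader)          # keep the header for every chunk
            chunk = []
            for row in reader:
                chunk.append(row)
                if len(chunk) == rows_per_chunk:
                    yield header, chunk
                    chunk = []
            if chunk:                      # trailing partial chunk
                yield header, chunk

    # Example: hand each chunk to whatever loader/transform you use.
    # for header, rows in stream_chunks("big_file.csv"):
    #     load_into_staging(header, rows)   # hypothetical loader
    ```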

    I only recommend chunking the data on disk because that's the approach recommended by the Kimball methods: do as much on disk as possible before bringing the data up to the service. But there is no hard rule on that, because it doesn't account for which tools you have available.

    For example, I just decompressed 235 GB across 3,500 files to load 2.15 TB of data, 3.07 billion records, into Azure. The files were already partitioned and were then loaded in parallel with up to 15 threads. This was a slow trickle load, as the data was being decompressed and re-encoded to UTF-8 on the way in. It took about 10 hours, loading overnight. So even in that case it took time for just 3 billion records, but there is a lot of overhead here.
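
    A per-file decompress-and-re-encode step can be sketched in Python along these lines (gzip input and a Latin-1 source encoding are assumptions for illustration; the post doesn't say which compression or source encoding was involved):

    ```python
    import gzip

    def gz_to_utf8_csv(src_gz, dest_csv, source_encoding="latin-1"):
        """Decompress one gzipped file and rewrite it as UTF-8 text.

        Streams line by line so a multi-GB file never has to fit in memory.
        """
        with gzip.open(src_gz, "rt", encoding=source_encoding, newline="") as src, \
             open(dest_csv, "w", encoding="utf-8", newline="") as dest:
            for line in src:
                dest.write(line)

    # gz_to_utf8_csv("part_0001.csv.gz", "part_0001.utf8.csv")  # illustrative names
    ```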

    Awesome...  thanks for the stats and the feedback.  (I love this community and always have.  You don't get this kind of information anywhere else!)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • xsevensinzx - Thursday, July 19, 2018 5:39 AM

    As mentioned, Python is one way; SSIS has options with script tasks or conditional splits; BULK INSERT has an option to start at a given row number; and PowerShell probably has an option too.

    Python is pretty easy .....
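
    As a hedged sketch of the BULK INSERT start-at-row-number option mentioned above, issued from Python via pyodbc (the table name, file path, connection string, and row range are invented for illustration, and the file path must be visible to the SQL Server instance):

    ```python
    import pyodbc

    def bulk_insert_range(conn_str, table, file_path, first_row, last_row):
        """Load one slice of a CSV using BULK INSERT's FIRSTROW/LASTROW options."""
        sql = f"""
            BULK INSERT {table}
            FROM '{file_path}'
            WITH (FIRSTROW = {first_row},
                  LASTROW  = {last_row},
                  FIELDTERMINATOR = ',',
                  ROWTERMINATOR   = '\\n');
        """
        with pyodbc.connect(conn_str, autocommit=True) as conn:
            conn.execute(sql)

    # bulk_insert_range(conn_str, "dbo.Staging", r"C:\data\big_file.csv", 2, 1_000_001)
    ```

    Note that, as mentioned above, the server still has to scan the file up to FIRSTROW for each slice, which is the extra overhead compared with keeping the file open in a single Python pass.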

    Thank you for your answer. In the past I used GSplit, but GSplit did not handle all CSV files correctly and sometimes split in the middle of a field.

    And there are plenty of SDKs with which a splitter could be built, but this is such a generic problem that I expect there are solutions online. And although building a solution is fine, I think using an existing one is more efficient and superior. (And I lack the environment and knowledge to build one with such an SDK :crazy:)
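
    For reference, a field-aware split is only a few lines of Python, because the csv module tracks quoted fields that contain embedded delimiters or newlines; the file names and chunk size below are illustrative:

    ```python
    import csv

    def split_csv(src_path, rows_per_file=5_000_000, prefix="chunk"):
        """Split a CSV into numbered files on row boundaries, never mid-field.

        csv.reader/csv.writer handle quoted fields with embedded commas or
        newlines, so a logical row is never cut in half the way a byte-based
        splitter can cut it.
        """
        with open(src_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader)
            part, rows_written, writer, out = 0, 0, None, None
            for row in reader:
                if writer is None or rows_written == rows_per_file:
                    if out:
                        out.close()
                    part += 1
                    out = open(f"{prefix}_{part:04d}.csv", "w",
                               newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    rows_written = 0
                writer.writerow(row)
                rows_written += 1
            if out:
                out.close()

    # split_csv("big_file.csv")   # produces chunk_0001.csv, chunk_0002.csv, ...
    ```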

    Thanks,
    Ben

  • Jeff Moden - Wednesday, July 18, 2018 8:00 PM

    Haven't tried loading a billion rows, that's for sure.  I don't actually know anyone that has.  What's the largest number of rows and widest row length you've tried?  The most I've ever done is 10 million rows and a crazy 800 columns wide (not in the same load, thankfully).  The 10 million rows was about 50 columns wide (I don't remember the exact number) and the 800 column wide stuff was typically 30,000 rows (from Double-Click.net "spotlight" files).  Most of the stuff I work with now is 40-60 columns and between 100,000 and 500,000 rows.

    Guilty here - and not just once.
    Multiple migrations for multiple clients.
    In one case, a single table: 1.2 billion rows, nearly 800 GB final size with compression on, over 50 columns.
    In another case, a 1 TB Oracle schema: 16 tables, 1.5 billion rows, of which 600 million were in a single table, with tables varying between 20 and 50 columns.
    The very last one I did (last year): 1 table, 700+ million rows, 750 GB in size, 15 columns including one big LOB column, partitioned per quarter, with data going back to 2012.

    In all cases the files were split into acceptable chunks based on partition ranges. Each chunk was loaded into its own table and then partition-switched onto the final master table.
    Parallel loads were done 10 threads at a time, with automatic thread management so that as soon as one finished the next one kicked off, until all files were loaded.
    One of the loads had over 250 files.
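
    The bounded-pool pattern described here (10 loads in flight, and the next file starting the moment a slot frees up) can be sketched with Python's concurrent.futures; the loader body and directory name are placeholders, not the actual SSIS/C# implementation used:

    ```python
    import concurrent.futures as cf
    from pathlib import Path

    def load_one_file(path):
        """Placeholder for the real per-file load (e.g. BULK INSERT into the
        chunk's own staging table, followed later by a partition switch)."""
        print(f"loading {path} ...")
        return path

    def load_all(files, max_workers=10):
        """Keep at most `max_workers` loads in flight; as soon as one file
        finishes, the pool starts the next one, until every file is loaded."""
        with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(load_one_file, f): f for f in files}
            for done in cf.as_completed(futures):
                done.result()   # re-raise any load error immediately

    # load_all(sorted(Path("chunks").glob("*.csv")))   # illustrative directory
    ```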

    Due to volume constraints none of the input files was ordered in the desired clustering-key order, but all were, or could be, split on the desired partition range values.
    The last case above was SQL to SQL; the others were extracts to flat files, then loads into SQL Server.
    All destination tables were heaps at load time, with indexes created after each table was loaded.
    SSIS on the first two cases, standard C# on the last one (a must, as SSIS is quite bad with LOB data).

    In all cases buffer sizes and commit/batch sizes were determined on a per-table basis, as the optimal value always depends on row size and data types.

    Destination servers all had 64 GB of RAM; the last case was a 6-core server and the other cases a 12-core server, all SQL Server 2008 R2 Enterprise.

  • frederico_fonseca - Thursday, July 19, 2018 7:32 AM

    You did well with that amount of memory available.  And, you've hit on a pretty good reason to chunk things up even if it were to make things slower (and I'm not saying it does in this case)... indexes and partitioned destination tables.

    Again... gotta love this community.  There's some pretty good stuff coming out of what started as a simple request..  Thank all of you!

    --Jeff Moden


  • Jeff Moden - Thursday, July 19, 2018 8:44 AM

    You did well with that amount of memory available.  And, you've hit on a pretty good reason to chunk things up even if it were to make things slower (and I'm not saying it does in this case)... indexes and partitioned destination tables.

    Again... gotta love this community.  There's some pretty good stuff coming out of what started as a simple request..  Thank all of you!

    Dividing a large file into smaller parts used to be the recommended method, and I have done that. The ‘Balanced Data Distributor’ transform in SSIS splits the source into multiple buffers that are sent through multiple destinations (connections) to the database, into one or many tables. This works with tables with or without clustered keys, with partitions, or with clustered columnstore indexes. When defining the connections, use the fast load option but do not select the table lock option. The constraints are the I/O system and the CPU, and both can easily be overloaded if too many connections are used.

    When presented with a billion-row file on an unknown server, the first task is to determine how fast the file can be read; that sets the minimum elapsed time to load the file. If the file is not on the server, the network speed will usually be the limiting factor.
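
    A quick baseline check along those lines (a sketch; the path and block size are illustrative):

    ```python
    import time

    def raw_read_rate(path, block_size=8 * 1024 * 1024):
        """Time a straight sequential read of the file and report MB/s."""
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
        elapsed = time.perf_counter() - start
        return total / (1024 * 1024) / elapsed

    # print(f"{raw_read_rate('big_file.csv'):.1f} MB/s")
    ```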

    Here is an example of the possible performance when loading 100 million rows into a table with a clustered index on two columns, FACT_ID and PRTN_ID (the combination is unique), plus a non-indexed identity column defined as a bigint. There are 30 columns with an average row length of 202 bytes. With one connection the load completes in 25 minutes, with three connections in 8 minutes, and with six connections in 7 minutes. The limitation is the rate at which the source file can be read: increasing the number of connections increases the I/O rate but also increases the response time. The identity column does add some locking overhead. PRTN_ID is not used in this example, but it allows the same data and table definition to be used to test a partitioned-table load, which shows similar times without moving tables.
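
    Working the arithmetic through (using the stated 100 million rows and 202-byte average row length, in decimal megabytes; the figures are only what the quoted times imply):

    ```python
    # Effective source-data rate implied by the quoted load times.
    rows, avg_row_bytes = 100_000_000, 202
    total_mb = rows * avg_row_bytes / 1_000_000          # ~20,200 MB of row data

    for connections, minutes in [(1, 25), (3, 8), (6, 7)]:
        rate_mb_per_s = total_mb / (minutes * 60)
        print(f"{connections} connection(s): {minutes} min -> ~{rate_mb_per_s:.0f} MB/s")

    # Prints roughly 13, 42, and 48 MB/s, which is why adding connections
    # beyond three buys little: the source file can't be read much faster.
    ```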

    The VM ran Windows Server 2016 Standard with SQL Server 2016 Developer and Visual Studio 2017, configured with 8 processors, 32 GB of memory, and 3 virtual disks, all defined on the same physical 10 TB RAID 1 device. SQL Server max memory was set to 28 GB. The destination table was defined on a filegroup with 3 files on one virtual disk; the log file was on a different virtual disk along with the tempdb files (unused in this example). At the start of each test run, SQL Server was reset by clearing the buffers and cache and setting max memory to 4 GB, waiting 1 minute, then setting it back to 28 GB. A similar VM defined on SSDs with SQL Server 2017 and 64 GB of memory supports nine connections and loads the data slightly faster.
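
    The reset between runs could be scripted roughly like this (a sketch assuming sysadmin rights and pyodbc; the memory values match the ones described, everything else is illustrative):

    ```python
    import time
    import pyodbc

    RESET_SQL = [
        "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;",
        "DBCC FREEPROCCACHE;",
        "DBCC DROPCLEANBUFFERS;",
        "EXEC sp_configure 'max server memory (MB)', 4096; RECONFIGURE;",
    ]

    def reset_sql_server(conn_str, final_max_mem_mb=28672, wait_seconds=60):
        """Clear caches, squeeze max memory down, wait, then restore it."""
        with pyodbc.connect(conn_str, autocommit=True) as conn:
            for stmt in RESET_SQL:
                conn.execute(stmt)
            time.sleep(wait_seconds)
            conn.execute(
                f"EXEC sp_configure 'max server memory (MB)', {final_max_mem_mb}; RECONFIGURE;"
            )
    ```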
