How to speed up creating Clustered ColumnStore Index??

  • Hi

    we have a table in our Azure SQLDB with 35 billion rows, partitioned on a date. Indeed, a lot of rows.

    This table is already in production.

    We started creating a clustered column store index on it, but after 7 days (!) we got a time-out.

    Now we have created an empty copy of the original table, put the partition and clustered columnstore index on the copy and started copying day by day. But this will also take a lot of time. Average time for a day (in between 8 and 12 million record) is 1 hour.

    Does anybody know if there is a faster way of handling this??

    Regards

    Ron

  • what is the spec of that database instance? that will have an impact on it regardless of how you do it

    and were you creating the index online or offline? - if online its normal to take a lot, really a lot of time.

    with regards to the option to create a copy of the table and inserting into it - speed can vary depending on how you do it and again depending on the spec of the database instance.

    main thing that affects performance for a insert into select from on a columnstore index is the use of "tablock" so it can do it in parallel - but.... DO NOT do it within a transaction as there is a bug on multiple versions of SQL Server where this leads to a thread deadlock

  • Hi Frederico, exactly whisch specs do you want me to share?

  • for an azure sqldb its either how many vcores you allocated or how many dtu's depending on the model you chosen.

    and how much ram and what are the disk types chosen

    and if basic, standard or premium.

     

  • Hi Frederico

    we use Hyperscale with 32V Cores

    The index is created online. If we do it offline, how much time will it then take to put it online again?

  • How busy is that server?

    for that spec the inserts into new table should not take 1 hour for a pitiful 12 million rows unless server is nearly maxed.

    regarding table

    - are old records ever updated/deleted or static. and how many new records per day added/updated/deleted.

    - are old records required for day to day querying or they are randomly queried for example just a quarter/year end

    partitions - what is the row count of each partition on existing table

    indexes - do you have a single index on that table or multiple - if multiple are they partitioned aligned or are there some that are not.

    using the new table approach (which may not be the best option) would you be on a position to test the following - ensuring that the new table only contains the columnstore index, so we can take a baseline duration for the operation.

    insert into newtable with (tablock) select * from old table where $partition.<partitionfunctionname>(<partitioncolumn>) = <partitionnumber>

    where partitionnumber = one of the average count partitions on the main table

  • Hi see the answers below.

    Our main goal is try to copy huge volume (30 billion+) data from table A (row-store) to table B (column store), some way we are looking for a solution to make this process much faster.

    How busy is that server?

    Server is 70 to 80 % busy during midnight until 8 AM, during this period all the batches (Data Ingestion into Warehouse) executes.

    for that spec the inserts into new table should not take 1 hour for a pitiful 12 million rows unless server is nearly maxed.

    Per day volume is 10 million to 25 million average.

    regarding table

    - are old records ever updated/deleted or static. and how many new records per day added/updated/deleted.

    In this particular scenario only insert, there is no update or delete.

    - are old records required for day to day querying or they are randomly queried for example just a quarter/year end

    Old records getting queried on an aggregated basis only, no ad-hoc or random date select scenario.

    partitions - what is the row count of each partition on existing table

    partition Is based on day – as mentioned above Per day volume is 10 million to 25 million average.

    indexes - do you have a single index on that table or multiple - if multiple are they partitioned aligned or are there some that are not.

    Only single index – clustered partitioned aligned indx.

    using the new table approach (which may not be the best option) would you be on a position to test the following - ensuring that the new table only contains the columnstore index, so we can take a baseline duration for the operation.

    insert into newtable with (tablock) select * from old table where $partition.<partitionfunctionname>(<partitioncolumn>) = <partitionnumber>

    where partitionnumber = one of the average count partitions on the main table

    This we have already tested, hope you are trying to see whether the new table (with column store index) able to insert records into it, its already tested and working fine.

    Please revert if you meant this test for any other purpose in case.

  • thanks for replies - does clarify things a lot.

    regarding the particular test

    insert into newtable with (tablock) select * from old table where $partition.<partitionfunctionname>(<partitioncolumn>) = <partitionnumber>

    where partitionnumber = one of the average count partitions on the main table

    This we have already tested, hope you are trying to see whether the new table (with column store index) able to insert records into it, its already tested and working fine.

    my point was if you had tried exactly as I mentioned. e.g. using the tablock and the $partition where clause. and intention is to see how long that one alone takes - when server isn't busy!!

    I'll get back with more things later (after my work hours)

  • Hi Frederico, the tablock speeded things up. Thnx

    Another question.

    If a have a clustered columnstore index on a table, can it then still be useful to have a non-clustered index on an attribute that is often used in a lookup query, such as accountnumber? If so, why is that benificiary?

     

  • Increasing: The INSERT process will be improved by using an increasing column, such as the IDENTITY column, as a clustered index key, which will directly insert the new values at the logical end of the table.

    spider solitaire 2 suit

  • It sounds like there are multiple years (maybe 6 or so?) stored in the table.  Column Store or not , I'd split it up into multiple tables, at least by year but, maybe even by quarter or month.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply