Partitioning Huge Table (about 650GB)

  • Sorry for butting in; this is not my area of expertise and I do not have a solution. My apologies for that.

    What often happens is that huge amounts of data get stored while only a little information is actually held, and that seems to be what is occurring in the situation described. The information density of the 18 GB table is probably low, and replicating that table 36 times when the number of differences between the copies is small (I assume) compounds the problem considerably.

    So when I notice things like this I normally ask the designers what the function of the data is, and I can very often come up with suggestions that store the same information in far less data. And although the data is 'less' accessible in the 'dense' format, the system often performs far better.

    If volume is no problem, you do not need to solve the 'problem'. But if somebody makes a bad design and tells you that volume is your problem, then I think you are allowed to ask questions.
    In practice I have seen too many times that volume was no problem when the system was designed, and by the time it became a problem the designers shifted it either to a 'bigger' machine or to the DBA, when a design that had incorporated the volume question would have avoided the problem in the first place.

    To me it seems like you have a similar 'problem'.

    One method I have used to show that a design had a problem was demonstrating that I could store exactly the same data in, for example, only half a percent of the file size (exactly the same information). You can show this theoretically, or by using a compression algorithm. A compression algorithm can never produce output smaller than the information content, so if it shrinks the data dramatically, that is a fair way to show that the information content is at most the size of the compressed file.
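
    If you want a concrete number rather than a theoretical argument, one rough way to get it in SQL Server (assuming SQL Server 2008 or later; the schema and table names below are only placeholders) is to ask the engine what page compression would save on the big table:

        EXEC sys.sp_estimate_data_compression_savings
             @schema_name      = 'dbo',
             @object_name      = 'BigSnapshotTable',  -- placeholder for the 18 GB table
             @index_id         = NULL,                -- all indexes
             @partition_number = NULL,                -- all partitions
             @data_compression = 'PAGE';

    If the estimated compressed size comes back at a small fraction of the current size, that is a strong hint that the information content is much smaller than the storage being used for it.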

    Again, I am butting in. I do understand that designers often come up with a solution without considering actual production data, or the data size compared to the information actually stored. I also understand that designers can be stubborn, and that it may be too late to change. But at least I could show where the problem stems from: not from the amount of information, but from the way it was stored in far more data than is actually needed.

    Sorry for Ranting,
    Ben

  • Rosanero4Ever - Thursday, August 9, 2018 4:56 AM

    So, do you think Temporal Table can be useful for my requirements? In summary, I need to create a snapshot  fact table (see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/periodic-snapshot-fact-table/)
    Many thanks again

    It's the sort of problem for which they can be ideal, but whether or not they are a good fit depends a lot on your data. If there are high volumes of changes between the periods at which you would normally take snapshots, they may be less efficient. If you have really wide tables where only very small portions change, you might be better off avoiding full snapshots entirely and instead handling the deltas in a more granular way.

    However, if your snapshot period is a reasonable approximation of how often the data changes (or is even more frequent) then temporal tables will likely be a more efficient approach than taking periodic snapshots - with the added bonus that you have a complete temporal breakdown, so if you need to look at hourly rather than daily changes you can do that too.
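
    For reference, a minimal sketch of what the temporal-table approach looks like in SQL Server 2016 and later; the table and column names here are invented purely for illustration, not taken from your schema:

        CREATE TABLE dbo.FactAccountBalance          -- hypothetical fact table
        (
            AccountId  int            NOT NULL,
            Balance    decimal(19,4)  NOT NULL,
            ValidFrom  datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
            ValidTo    datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
            PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo),
            CONSTRAINT PK_FactAccountBalance PRIMARY KEY CLUSTERED (AccountId)
        )
        WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.FactAccountBalance_History));

        -- Any "snapshot" then becomes a point-in-time query, at whatever grain is needed:
        SELECT AccountId, Balance
        FROM   dbo.FactAccountBalance
        FOR    SYSTEM_TIME AS OF '2018-07-31T23:59:59';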

  • Rosanero4Ever - Thursday, August 9, 2018 4:56 AM

    andycadley - Tuesday, August 7, 2018 10:59 PM

    Yeah, I can see how in that situation there's a lot more incentive to work with something that's a lot more granular in terms of storage, and undoing poor design decisions that have become ingrained in a system is often a difficult case to make when it's working as-is. I have one database where one-to-many relationships are implemented as varchar columns containing CSVs of foreign keys - an abomination that I'd dearly love to pick apart - but as terrible as it is to work with, it does what it needs to (mostly).

    So, do you think Temporal Table can be useful for my requirements? In summary, I need to create a snapshot  fact table (see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/periodic-snapshot-fact-table/)
    Many thanks again

    I'd have to say again, "It Depends".  Are all but the latest 1 or 2 periods totally static?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • How I classify the size of a DB changes with the size of a USB drive. Currently an almost 4TB database will fit on an SSD USB drive that fits in my pocket, so I can take it with me to demo; that is small. Greater than 4TB up to 12TB I consider medium. Larger than 12TB I consider large. 15 years ago this would have been very different. The latest maximum-capacity SSD from Intel is 32TB, although not pocket sized (yet).

    Does the data have a clustered key? The benefit of a partitioned table is that the date of the extract can be used as the partition column. To load data, an un-partitioned staging table with otherwise the same definition as the partitioned table can be created on the same filegroup as the destination partition. The load will not impact the performance of the partitioned table, and once loaded the data can be switched into the partition in a few seconds - the switch is a metadata change, not a physical move. The clustered key of the partitioned table should begin with the extract date, and a check constraint must be defined on the staging table to ensure its dates match the destination partition.
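
    A rough sketch of that load pattern - the function, scheme, and table names (pfExtractDate, psExtractDate, dbo.FactSnapshot, dbo.FactSnapshotStage) and the monthly boundaries are invented for illustration:

        -- Partition function/scheme on the extract date (boundary values are examples)
        CREATE PARTITION FUNCTION pfExtractDate (date)
            AS RANGE RIGHT FOR VALUES ('2018-06-01', '2018-07-01', '2018-08-01');
        CREATE PARTITION SCHEME psExtractDate
            AS PARTITION pfExtractDate ALL TO ([PRIMARY]);

        -- Partitioned table: the clustered key begins with the extract date
        CREATE TABLE dbo.FactSnapshot
        (
            ExtractDate date          NOT NULL,
            AccountId   int           NOT NULL,
            Amount      decimal(19,4) NOT NULL,
            CONSTRAINT PK_FactSnapshot PRIMARY KEY CLUSTERED (ExtractDate, AccountId)
        ) ON psExtractDate (ExtractDate);

        -- Staging table on the destination partition's filegroup, with a CHECK constraint
        -- guaranteeing every row belongs to that partition
        CREATE TABLE dbo.FactSnapshotStage
        (
            ExtractDate date          NOT NULL,
            AccountId   int           NOT NULL,
            Amount      decimal(19,4) NOT NULL,
            CONSTRAINT PK_FactSnapshotStage PRIMARY KEY CLUSTERED (ExtractDate, AccountId),
            CONSTRAINT CK_FactSnapshotStage_Date
                CHECK (ExtractDate >= '2018-08-01' AND ExtractDate < '2018-09-01')
        ) ON [PRIMARY];

        -- ...bulk load dbo.FactSnapshotStage without touching the partitioned table...

        -- Metadata-only switch of the loaded period into the partitioned table
        ALTER TABLE dbo.FactSnapshotStage
            SWITCH TO dbo.FactSnapshot PARTITION $PARTITION.pfExtractDate('2018-08-01');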
    Indexes on a partitioned table are partitioned as well, so queries that pull data from all partitions will take longer. Defining an index on the partitioned table but placing it on a filegroup that is not partitioned may improve query performance by a factor of 2, depending on the indexes available. Once the maximum number of desired periods is reached, the oldest data partition is removed (in seconds) and a new partition is added. Note that all of the partitions may be defined on one filegroup; this is useful for a data mart that is reloaded monthly, where the amount of data in each partition is very different each time.
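
    And a sketch of that sliding-window maintenance, reusing the invented names from the sketch above; the switch-out and boundary changes are metadata operations as long as the partitions involved end up empty:

        -- Switch the oldest period out to an empty archive table on the same filegroup
        CREATE TABLE dbo.FactSnapshotArchive
        (
            ExtractDate date          NOT NULL,
            AccountId   int           NOT NULL,
            Amount      decimal(19,4) NOT NULL,
            CONSTRAINT PK_FactSnapshotArchive PRIMARY KEY CLUSTERED (ExtractDate, AccountId)
        ) ON [PRIMARY];

        ALTER TABLE dbo.FactSnapshot
            SWITCH PARTITION $PARTITION.pfExtractDate('2018-06-01') TO dbo.FactSnapshotArchive;
        TRUNCATE TABLE dbo.FactSnapshotArchive;            -- discard the old rows cheaply

        -- Remove the now-empty boundary, then add one for the next period
        ALTER PARTITION FUNCTION pfExtractDate() MERGE RANGE ('2018-06-01');
        ALTER PARTITION SCHEME   psExtractDate NEXT USED [PRIMARY];
        ALTER PARTITION FUNCTION pfExtractDate() SPLIT RANGE ('2018-09-01');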

    An alternative is to simply load the data (sorted first in the original clustered index order) into one table, using the extract date as the first column of the new clustered index. I can load 100 million rows into a clustered-index table in under 15 minutes using the SSIS Balanced Data Distributor transformation with 3 connections to the table and no table lock, in a VM with 4 virtual processors and 32 GB of memory. The I/O rate will be high and the CPU will be about 80% consumed by the process, so if this has to run on a busy server it may not be the best choice. The following periods take slightly longer, about 18 minutes. The only issue is that when deleting the oldest period the log must be large enough to contain all the rows for that period, and the delete may take an hour or more. The delete can be divided into small chunks to minimize the log size.
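
    A minimal sketch of that chunked delete (the table, column, and batch size are placeholders), which keeps each transaction, and therefore the log growth, small:

        DECLARE @rows int = 1;
        WHILE @rows > 0
        BEGIN
            DELETE TOP (500000)                   -- tune the batch size to your log
            FROM dbo.FactSnapshot
            WHERE ExtractDate = '2018-06-01';     -- the oldest period

            SET @rows = @@ROWCOUNT;
            CHECKPOINT;   -- under SIMPLE recovery; take log backups between batches under FULL
        END;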

    What is the purpose of this data? If the data contains a unique clustered index and the combined data will be used to identify changed data by period, the better alternative is to save one full image followed only by changed data, together with a status code (A - added, C - changed, D - deleted) for each period, produced from the result of a full outer join. A second table is required to hold the period, index, and status from the first table, constructed so that it contains the changed-data index plus all unmatched index values from the previous period that are not marked as deleted. Each period then has a 'virtual' table index as well as a very fast temporal-table style of query ability.
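
    A rough sketch of that change-detection step, assuming a single-column unique key and comparing the current extract against the previous period (the table and column names are placeholders):

        SELECT
            COALESCE(cur.AccountId, prev.AccountId) AS AccountId,
            CASE
                WHEN prev.AccountId IS NULL THEN 'A'   -- added this period
                WHEN cur.AccountId  IS NULL THEN 'D'   -- deleted this period
                ELSE 'C'                               -- changed
            END AS StatusCode,
            cur.Balance
        FROM dbo.Extract_Current  AS cur
        FULL OUTER JOIN dbo.Extract_Previous AS prev
               ON prev.AccountId = cur.AccountId
        WHERE prev.AccountId IS NULL                   -- adds
           OR cur.AccountId  IS NULL                   -- deletes
           OR cur.Balance <> prev.Balance;             -- real changes only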

  • Jeff Moden - Friday, August 10, 2018 6:35 AM

    Rosanero4Ever - Thursday, August 9, 2018 4:56 AM

    andycadley - Tuesday, August 7, 2018 10:59 PM

    Yeah, I can see how in that situation there's a lot more incentive to work with something that's a lot more granular in terms of storage, and undoing poor design decisions that have become ingrained in a system is often a difficult case to make when it's working as-is. I have one database where one-to-many relationships are implemented as varchar columns containing CSVs of foreign keys - an abomination that I'd dearly love to pick apart - but as terrible as it is to work with, it does what it needs to (mostly).

    So, do you think Temporal Table can be useful for my requirements? In summary, I need to create a snapshot  fact table (see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/periodic-snapshot-fact-table/)
    Many thanks again

    I'd have to say again, "It Depends".  Are all but the latest 1 or 2 periods totally static?

    Data in almost 30 periods can effectively change. For this reason I'm considering 36 snapshots with partitioning.
