Query on a large database

  • I store my dates as integers (in YYYYMMDD format) and have an index on them. I also have a DimDate table that has YYYYMMDD as its clustered primary key, with the remaining columns holding the same date in various formats plus other ancillary information (weekday in different formats, week number, quarter, month in different formats, fiscal year, IsFirstDayOfMonth, IsLastDayOfMonth, etc.).

    The advantage of this is that my retrievals are faster since my date is an integer, and I can join it to DimDate to get a lot of other information that is useful for reporting.
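    For illustration, a minimal sketch of that kind of calendar-dimension design (table, column and fact names here are assumptions, not the poster's actual objects):

    -- Hypothetical calendar dimension keyed by the integer date (YYYYMMDD)
    CREATE TABLE dbo.DimDate
    (
        DateKey           int          NOT NULL CONSTRAINT PK_DimDate PRIMARY KEY CLUSTERED, -- e.g. 20151022
        FullDate          date         NOT NULL,
        DayName           varchar(10)  NOT NULL,
        WeekNumber        tinyint      NOT NULL,
        [Quarter]         tinyint      NOT NULL,
        MonthName         varchar(10)  NOT NULL,
        FiscalYear        smallint     NOT NULL,
        IsFirstDayOfMonth bit          NOT NULL,
        IsLastDayOfMonth  bit          NOT NULL
    );

    -- A reporting query joining a (hypothetical) fact table on the integer date key
    SELECT d.MonthName, d.FiscalYear, SUM(f.SalesAmount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate   AS d ON d.DateKey = f.SaleDateKey
    WHERE f.SaleDateKey BETWEEN 20151001 AND 20151031
    GROUP BY d.MonthName, d.FiscalYear;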

    SQLCurious

  • SQLCurious (10/21/2015)


    I store my dates as integers (in YYYYMMDD format) and have an index on them. I also have a DimDate table that has YYYYMMDD as its clustered primary key, with the remaining columns holding the same date in various formats plus other ancillary information (weekday in different formats, week number, quarter, month in different formats, fiscal year, IsFirstDayOfMonth, IsLastDayOfMonth, etc.).

    The advantage of this is that my retrievals are faster since my date is an integer, and I can join it to DimDate to get a lot of other information that is useful for reporting.

    SQLCurious

    I'm curious. How or why are your retrievals faster? How do you measure that? Against what? I would expect an int to be faster than a datetime or a char(8), but slower than a date, just from the number of bytes used. And if you really want to save bytes, you could go for a smallint, which can cover over 85 years of dates. But that would still be a failure in the design, IMHO.
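    For what it's worth, a quick way to see the per-value storage behind that byte argument (a rough check, not a benchmark; the smallint here is assumed to be a day offset from an arbitrary base date):

    -- Bytes used per value by each candidate type
    SELECT
        DATALENGTH(CAST('20151022' AS char(8)))   AS char8_bytes,     -- 8
        DATALENGTH(CAST('20151022' AS datetime))  AS datetime_bytes,  -- 8
        DATALENGTH(CAST(20151022   AS int))       AS int_bytes,       -- 4
        DATALENGTH(CAST('20151022' AS date))      AS date_bytes,      -- 3
        DATALENGTH(CAST(DATEDIFF(DAY, '20000101', '20151022') AS smallint)) AS smallint_bytes; -- 2 (about 89 years of days fit)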

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Partitioning can improve performance mainly in the following ways:

    - For very large tables, partition elimination does provide real benefits.

    - Disk placement of files: striping the data so that different months are placed on different disks, or, if users query just the latest month's data, placing only that data on a fast disk, improves performance. This is probably the most important way partitioning helps performance.

    Sliding windows are one of the reasons partitioning improves ETL performance; read performance is improved by proper disk placement.

    The OP mentioned fact tables, which means they are already following some form of dimensional modeling. The move to SSAS should bring a significant improvement because it uses the same VertiPaq compression and processing algorithms that were later ported to columnstore indexes.

    I'll post a video with a comparison soon.
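    As an illustration of the partition-elimination point above, a minimal sketch of monthly range partitioning (object, filegroup and column names are assumptions):

    -- Monthly boundaries; RANGE RIGHT puts each boundary date in the partition to its right
    CREATE PARTITION FUNCTION pfMonthly (date)
    AS RANGE RIGHT FOR VALUES ('2015-08-01', '2015-09-01', '2015-10-01');

    -- Map every partition to PRIMARY here; in practice months could go to different filegroups/disks
    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactSales
    (
        SaleDate    date          NOT NULL,
        CustomerId  int           NOT NULL,
        SalesAmount decimal(18,2) NOT NULL
    ) ON psMonthly (SaleDate);

    -- A predicate on the partitioning column lets the optimizer touch only the relevant partition(s)
    SELECT SUM(SalesAmount)
    FROM dbo.FactSales
    WHERE SaleDate >= '2015-10-01' AND SaleDate < '2015-11-01';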

    Jayanth Kurup

  • Jayanth_Kurup (10/22/2015)


    Partitioning can improve performance mainly in the following ways:

    - For very large tables, partition elimination does provide real benefits.

    With a correctly chosen clustered index, it does not matter how big the table is.

    Does not matter at all.

    Querying against the clustered index gives you a partition matching your selection criteria.

    And that partition is dynamic: if you need a month it will be a month; if you request a week it will be a week.

    If you need the last week of the previous month and the first week of the current month, you'll get a partition of exactly those two weeks.

    The static partitioning you suggest only adds the overhead of choosing which table to query.
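    To illustrate that idea with a minimal sketch (table and column names are assumptions): with the clustered index on the date column, any date-range predicate becomes a range seek over exactly the rows requested, whatever the boundaries happen to be.

    -- Clustered index on the date column of a (hypothetical) unpartitioned fact table
    CREATE CLUSTERED INDEX CIX_FactSales_SaleDate
        ON dbo.FactSales (SaleDate);

    -- Last week of the previous month plus the first week of the current month:
    -- one contiguous range seek, no static partition boundaries involved
    SELECT SUM(SalesAmount)
    FROM dbo.FactSales
    WHERE SaleDate >= '2015-09-24' AND SaleDate < '2015-10-08';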

    - Disk placement of files: striping the data so that different months are placed on different disks, or, if users query just the latest month's data, placing only that data on a fast disk, improves performance. This is probably the most important way partitioning helps performance.

    Users always work with the latest month; at least 95% of queries go against "the current month".

    Placing other months on other disks won't improve anything, as that data is just sitting on the disk, not taking any resources apart from the disk space.

    If you really want to speed up queries, you'd be better off distributing indexes on the same table over different drives. That's something which can really help with query performance, mainly due to parallel execution.
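    A minimal sketch of that idea, assuming a hypothetical database named SalesDW and a hypothetical file path; the nonclustered index goes on a filegroup whose file sits on a different physical drive than the table's data:

    -- Add a filegroup backed by a file on another drive
    ALTER DATABASE SalesDW ADD FILEGROUP FG_Indexes;
    ALTER DATABASE SalesDW ADD FILE
        (NAME = SalesDW_Idx1, FILENAME = 'E:\SQLData\SalesDW_Idx1.ndf', SIZE = 10GB)
    TO FILEGROUP FG_Indexes;

    -- Place a supporting index on that filegroup so table and index reads hit different spindles
    CREATE NONCLUSTERED INDEX IX_FactSales_CustomerId
        ON dbo.FactSales (CustomerId)
        INCLUDE (SalesAmount)
        ON FG_Indexes;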

    Sliding windows are one of the reasons partitioning improves ETL performance; read performance is improved by proper disk placement.

    Only on badly designed databases.

    _____________
    Code for TallyGenerator

  • Hi Sergiy

    A clustered index only works when searching on the key column; it fails for all other DW use cases. A DW doesn't perform seeks, it performs scans (nobody fetches only one month's data from a DW; usually they aggregate over months, years, and sometimes even decades). In this case a clustered index loses any benefit it provides, since most DW queries will result in an index scan. I have already finished testing the cases and was able to query a table of 8M rows in 16 seconds using just partitioning (no indexes) vs 15 seconds on an unpartitioned table that has a clustered and a nonclustered index, to answer just one business case.

    Will post the blog link shortly.

    Here is the blog link with the video:

    http://enabledbusinesssolutions.com/blog/does-partitioning-improve-performance/

    Thanks

    Jay

    Jayanth Kurup

  • Jayanth_Kurup (10/22/2015)


    Hi Sergiy

    A clustered index only works when searching on the key column; it fails for all other DW use cases. A DW doesn't perform seeks, it performs scans (nobody fetches only one month's data from a DW; usually they aggregate over months, years, and sometimes even decades). In this case a clustered index loses any benefit it provides, since most DW queries will result in an index scan. I have already finished testing the cases and was able to query a table of 8M rows in 16 seconds using just partitioning (no indexes) vs 15 seconds on an unpartitioned table that has a clustered and a nonclustered index, to answer just one business case.

    Will post the blog link shortly.

    Thanks

    Jay

    What kind of partitioning are you talking about?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQLCurious (10/21/2015)


    I store my dates as integers (in YYYYMMDD format) and have an index on them. I also have a DimDate table that has YYYYMMDD as its clustered primary key, with the remaining columns holding the same date in various formats plus other ancillary information (weekday in different formats, week number, quarter, month in different formats, fiscal year, IsFirstDayOfMonth, IsLastDayOfMonth, etc.).

    The advantage of this is that my retrievals are faster since my date is an integer, and I can join it to DimDate to get a lot of other information that is useful for reporting.

    SQLCurious

    Do you have a test to quantify such a comparison in performance?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • J Livingston SQL (10/20/2015)


    sreeya (10/20/2015)


    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    That's a lot of sales a day... being curious, what industry are you involved in?

    Likely retail or similar. Many companies see this kind of volume, depending on their size.

    I personally have 3 fact tables: one with a billion-plus records and the other two with over 100 million each. I'm in the digital marketing industry, and the transactions are mostly digital marketing campaign logs coupled with retail transactions such as sales per day (that we are driving, of course).

    Large fact tables like that are used mostly for reporting. Duplicates are allowed, so no primary keys are used. Time is the primary candidate for clustering across all facts. Other candidates include client identifiers and the columns typically used for reporting, chosen based on usage only (because of the expense, of course).

    How I manage to improve query performance is similar to what many other people have suggested. However, I also do reporting over many months (90 days, to be exact) for consumer path analysis across all facts and 10 dimensions. So, more than 100 million records are fetched, tossed into temporary tables to separate them from the large tables, re-indexed, aggregated, and returned.

    Not exactly 15 seconds fast based on my specs, but good enough due to the table and read sizes.
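    A minimal sketch of that staging pattern, with hypothetical table and column names: pull the 90-day slice into a temp table, index it, then aggregate from the smaller copy.

    -- Stage the recent slice away from the big fact table
    SELECT f.ClientId, f.EventDate, f.Clicks, f.Revenue
    INTO #Recent
    FROM dbo.FactCampaignLog AS f
    WHERE f.EventDate >= DATEADD(DAY, -90, CAST(GETDATE() AS date));

    -- Re-index the much smaller working set
    CREATE CLUSTERED INDEX CIX_Recent ON #Recent (ClientId, EventDate);

    -- Aggregate and return from the staged copy
    SELECT ClientId, EventDate, SUM(Clicks) AS Clicks, SUM(Revenue) AS Revenue
    FROM #Recent
    GROUP BY ClientId, EventDate;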

  • xsevensinzx (10/22/2015)


    J Livingston SQL (10/20/2015)


    sreeya (10/20/2015)


    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    That's a lot of sales a day... being curious, what industry are you involved in?

    Likely retail or similar. Many companies see this kind of volume, depending on their size.

    I personally have 3 fact tables: one with a billion-plus records and the other two with over 100 million each. I'm in the digital marketing industry, and the transactions are mostly digital marketing campaign logs coupled with retail transactions such as sales per day (that we are driving, of course).

    Large fact tables like that are used mostly for reporting. Duplicates are allowed, so no primary keys are used. Time is the primary candidate for clustering across all facts. Other candidates include client identifiers and the columns typically used for reporting, chosen based on usage only (because of the expense, of course).

    How I manage to improve query performance is similar to what many other people have suggested. However, I also do reporting over many months (90 days, to be exact) for consumer path analysis across all facts and 10 dimensions. So, more than 100 million records are fetched, tossed into temporary tables to separate them from the large tables, re-indexed, aggregated, and returned.

    Not exactly 15 seconds fast based on my specs, but good enough due to the table and read sizes.

    Just a suggestion for the sake of rapid reporting...

    It seems to me that for such tables, only today's data is being added and, starting with yesterday's data and going back in time, nothing changes, correct? If so, it would seem that pre-aggregating each day (as it rolls off the calendar) an hour or two after "business midnight" and storing it in mostly permanent tables (perhaps with a rolloff at 90 days) would go a very long way toward making the reports scream.
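    A minimal sketch of that nightly pre-aggregation, with hypothetical table and column names:

    -- Roll yesterday's detail into a permanent daily summary table
    DECLARE @Yesterday date = DATEADD(DAY, -1, CAST(GETDATE() AS date));

    INSERT INTO dbo.FactSalesDaily (SaleDate, ClientId, SalesAmount, TransactionCount)
    SELECT SaleDate, ClientId, SUM(SalesAmount), COUNT(*)
    FROM dbo.FactSales
    WHERE SaleDate = @Yesterday
    GROUP BY SaleDate, ClientId;

    -- Optional rolloff: purge summary rows older than 90 days
    DELETE FROM dbo.FactSalesDaily
    WHERE SaleDate < DATEADD(DAY, -90, CAST(GETDATE() AS date));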

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/22/2015)


    xsevensinzx (10/22/2015)


    J Livingston SQL (10/20/2015)


    sreeya (10/20/2015)


    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    That's a lot of sales a day... being curious, what industry are you involved in?

    Likely retail or similar. Many companies see this kind of volume, depending on their size.

    I personally have 3 fact tables: one with a billion-plus records and the other two with over 100 million each. I'm in the digital marketing industry, and the transactions are mostly digital marketing campaign logs coupled with retail transactions such as sales per day (that we are driving, of course).

    Large fact tables like that are used mostly for reporting. Duplicates are allowed, so no primary keys are used. Time is the primary candidate for clustering across all facts. Other candidates include client identifiers and the columns typically used for reporting, chosen based on usage only (because of the expense, of course).

    How I manage to improve query performance is similar to what many other people have suggested. However, I also do reporting over many months (90 days, to be exact) for consumer path analysis across all facts and 10 dimensions. So, more than 100 million records are fetched, tossed into temporary tables to separate them from the large tables, re-indexed, aggregated, and returned.

    Not exactly 15 seconds fast based on my specs, but good enough due to the table and read sizes.

    Just a suggestion for the sake of rapid reporting...

    It seems to me that for such tables, only today's data is being added and, starting with yesterday's data and going back in time, nothing changes, correct? If so, it would seem that pre-aggregating each day (as it rolls off the calendar) an hour or two after "business midnight" and storing it in mostly permanent tables (perhaps with a rolloff at 90 days) would go a very long way toward making the reports scream.

    I'm moving towards that model in the next 2 weeks, but in a data warehouse -> reporting data mart structure.

    Unfortunately, I cannot just always include the past N days. Everyone wants sliding windows on the data of course.

    However, pre-aggregation is still possible. The data is batch-loaded. Reporting data marts can be populated with the raw data for just that client and then summarized from that separate database into a summary table once the mart is refreshed. So, when it's summarizing, it's technically not pulling from the centralized data warehouse with the large tables; it's pulling from the data mart, which is only a subset of the larger facts.

    The only issue is that, while anything can be pre-aggregated ahead of time for rapid reporting, the end users in my case often have to re-aggregate the data to try different computations on it.

  • xsevensinzx (10/22/2015)


    The only issue is that, while anything can be pre-aggregated ahead of time for rapid reporting, the end users in my case often have to re-aggregate the data to try different computations on it.

    Would the users have to re-aggregate at a finer resolution than per day?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Horizontal table partitioning; link to the performance characteristics document below:

    http://enabledbusinesssolutions.com/blog/does-partitioning-improve-performance/

    Jayanth Kurup

  • Jeff Moden (10/22/2015)


    xsevensinzx (10/22/2015)


    The only issue is that, while anything can be pre-aggregated ahead of time for rapid reporting, the end users in my case often have to re-aggregate the data to try different computations on it.

    Would the users have to re-aggregate at a finer resolution than per day?

    That's not the problem, though. To clarify, this is historical data that's being pulled off a target set of records. The historical data will always be by day, but the target can change. If the target changes, then the historical tables have to change, because they no longer match the history of the target.

    The simplistic approach is to aggregate everything for every target, but that comes at the cost of duplicating your historical data 100x (many-to-many relationships). The data requirements for that simplistic approach are pretty heavy.

  • As has been noted, the tables are definitely clustered on the wrong key, and that is what is causing the bad performance: the "junior error", also noted above, of assuming that the clustering key should be an identity column. Cluster instead by date, plus another column or two if indicated, and performance will improve dramatically.

    Partitioning seems to help performance in these cases because, as part of partitioning, you decide to partition by date, which leads you to cluster by date, so the partitioning key and the clustering key match. But really, just the new clustering by itself would have caused a dramatic gain in performance.
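    For illustration, a minimal sketch of that re-clustering, assuming a hypothetical fact table currently clustered on an identity primary key (and ignoring any foreign keys or replication that would also need attention):

    -- Drop the existing clustered PK on the identity column
    ALTER TABLE dbo.FactSales DROP CONSTRAINT PK_FactSales;

    -- Cluster on date plus a second commonly filtered column
    CREATE CLUSTERED INDEX CIX_FactSales_SaleDate_CustomerId
        ON dbo.FactSales (SaleDate, CustomerId);

    -- Keep the identity as a nonclustered primary key if uniqueness is still needed
    ALTER TABLE dbo.FactSales
        ADD CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (FactSalesId);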

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jayanth_Kurup (10/22/2015)


    A DW doesn't perform seeks, it performs scans (nobody fetches only one month's data from a DW; usually they aggregate over months, years, and sometimes even decades).

    How could splitting a big table into several smaller ones help here?

    _____________
    Code for TallyGenerator
