Total / average function advice

  • Yes, your queries will be vastly faster if you cluster on add datetime rather than identity. Hopefully your queries are then typically like this:

    SELECT

    SUM(...) AS Total_...,

    SUM(...) AS Total_...,

    SUM(...) AS Total_...

    FROM dbo.tablename

    -- right method of comparing date

    WHERE

    add_datetime >= '20120520' AND

    add_datetime < '20120527' --sum all values for the week of 05/20/2012 thru 05/26/2012

    Note that you do not want to use any functions on the add_datetime column, because that will lose you the benefits of the index. For example:

    -- WRONG method of comparing date

    WHERE

    CONVERT(varchar(30), add_datetime, 112) BETWEEN '20120520' AND '20120526'

    Doing that will force SQL to scan the whole table, even though a perfect index exists.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thanks Scott,

    Will be very useful considering the older hardware I'm using as the server

  • michaeljharper (5/31/2012)


    ScottPletcher (5/31/2012)


    *Cluster* the table on the add date/time (rather than on, say, an identity column, which may still be a good idea to have on this table, just don't cluster by it :-)).

    Thanks for the reply, I don't have in depth knowledge of indexes/clusters, I assume using one on the datetime column will increase query efficiency as like you say, many of the queries will be date ranges and things like warmest/coldest temps, highest wind speeds etc etc

    Many thanks,

    Mike

    Mike,

    As you can see, there are some rather heated opinions on what a clustered index should actually be on.

    Since most of your data will be looked up by date and time, putting the clustered index on the column that reflects the date and time of the reading is a really good idea but it's not enough. I say that because the other thing you may not know is that the clustered index is automatically appended to each and every non-clustered index. Uniqueness in indexes is very important to keep the index narrow and fast. If the clustered index isn't unique, it is possible all non-clustered indexes will also be none unique and a rather wide row ID and other junk will be added to the index to make it unique.

    The general rule of thumb that most of the industry experts I know go by is that Clustered indexes should be on a column or columns that are ever-increasing, UNIQUE, and as narrow as possible. Generally, an IDENTITY column, by itself, will fill that bill for many things (I disagree with people that say it's the worst thing you could do for way too many reasons to list on this post) but not for data like that which you're collecting.

    You also need to plan for the future and make your design bullet proof.

    With that thought in mind, let's consider that you could end up with multiple stations all collecting data and that you could end up with duplicate times in the table. So much for a UNIQUE Clustered Index and the performance that will offer.

    The way around this problem is to determine what makes each row UNIQUE in your table while considering future scalability. If you really want this stuff to be fast in the future, then take the time to also make it bullet proof and follow the actual best practices. Just adding a clustered index to the datetime column isn't necessarily the best practice because it may not be unique.

    Following the best practice of making a clustered index ever-increasing, UNIQUE, and narrow as well as making this highly scalable for the future, I'd like to make the following recommendations.

    1. I didn't see it in the rows you posted but it might already be there. If not, add a column for a station ID EVEN IF YOUR PRESENT INTENTION IS TO COLLECT FROM JUST ONE STATION! This should probably be an INT column. Contrary to what has been said on this thread, this number would ideally be FKd to (comes from) a table that uses an IDENTITY column as both the PK and the Clustered index.

    2. When you make your clustered index on your "readings table" (and to repeat yet again), it really should be ever-increasing, Unique, and narrow. So make it on the DateTime column of the reading AND the station ID that you added in (1) above.

    3. IF AND ONLY IF the datetime and station ID columns aren't guaranteed to make a unique combination, then add an IDENTITY column and include it as the 3rd column of the clustered index. If the clustered index isn't unique, then you stand the great chance of losing some performance as the project scales up.

    As a side bar, when folks stand up and nearly rant about not using an IDENTITY column in a table as the clustered index, consider that you'll also find some absolute Ninja's for SQL Server who are much quieter and have used it with great success. Also consider how many people don't actually know about the clustered index being appended to non-clustered indexes and what happens to an index if they're not actually unique. What that means is that people who say that you absolutely must or must not use an IDENTITY column as the clustered index are absolutely wrong either way. As Lynn Pettis has said throughout this thread and is absolutely correct in saying so, "IT DEPENDS" and always will. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I wanted to post this tidbit of information separately so that it wouldn't get lost in the long discussion of my previous post.

    Please see the following video. It was created by one of the foremost industry leaders of knowledge of SQL Server. It's about how and why to choose certain columns as the clustered index and what happens if you chose the wrong columns. My recommendation is to study it deeply. Play it dozen's of times until it's absolutely clear and that you understand it 100% because, so far as I'm concerned, it's the definitive and best presentation there is on the subject.

    http://technet.microsoft.com/en-us/sqlserver/Video/gg508879

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Many thanks for the detailed response, really appreciate it. My University database tutor mentioned the same thing about the possibility of multiple stations logging data(even though I'm only using one).

    So if I'm understanding you correctly, create another table with a station ID and location field, then have the station ID as a foreign key in the main WeatherData table. Finally create a clustered index on the datetime + station ID fields so it will always be unique?

    I have a primary key (weather_id) identity column that increases by 1 for every record, will I still need this?

    I'll look at the video you posted - clustering is still a bit vague!

    Many thanks

    Mike

    Jeff Moden (6/2/2012)


    michaeljharper (5/31/2012)


    ScottPletcher (5/31/2012)


    *Cluster* the table on the add date/time (rather than on, say, an identity column, which may still be a good idea to have on this table, just don't cluster by it :-)).

    Thanks for the reply, I don't have in depth knowledge of indexes/clusters, I assume using one on the datetime column will increase query efficiency as like you say, many of the queries will be date ranges and things like warmest/coldest temps, highest wind speeds etc etc

    Many thanks,

    Mike

    Mike,

    As you can see, there are some rather heated opinions on what a clustered index should actually be on.

    Since most of your data will be looked up by date and time, putting the clustered index on the column that reflects the date and time of the reading is a really good idea but it's not enough. I say that because the other thing you may not know is that the clustered index is automatically appended to each and every non-clustered index. Uniqueness in indexes is very important to keep the index narrow and fast. If the clustered index isn't unique, it is possible all non-clustered indexes will also be none unique and a rather wide row ID and other junk will be added to the index to make it unique.

    The general rule of thumb that most of the industry experts I know go by is that Clustered indexes should be on a column or columns that are ever-increasing, UNIQUE, and as narrow as possible. Generally, an IDENTITY column, by itself, will fill that bill for many things (I disagree with people that say it's the worst thing you could do for way too many reasons to list on this post) but not for data like that which you're collecting.

    You also need to plan for the future and make your design bullet proof.

    With that thought in mind, let's consider that you could end up with multiple stations all collecting data and that you could end up with duplicate times in the table. So much for a UNIQUE Clustered Index and the performance that will offer.

    The way around this problem is to determine what makes each row UNIQUE in your table while considering future scalability. If you really want this stuff to be fast in the future, then take the time to also make it bullet proof and follow the actual best practices. Just adding a clustered index to the datetime column isn't necessarily the best practice because it may not be unique.

    Following the best practice of making a clustered index ever-increasing, UNIQUE, and narrow as well as making this highly scalable for the future, I'd like to make the following recommendations.

    1. I didn't see it in the rows you posted but it might already be there. If not, add a column for a station ID EVEN IF YOUR PRESENT INTENTION IS TO COLLECT FROM JUST ONE STATION! This should probably be an INT column. Contrary to what has been said on this thread, this number would ideally be FKd to (comes from) a table that uses an IDENTITY column as both the PK and the Clustered index.

    2. When you make your clustered index on your "readings table" (and to repeat yet again), it really should be ever-increasing, Unique, and narrow. So make it on the DateTime column of the reading AND the station ID that you added in (1) above.

    3. IF AND ONLY IF the datetime and station ID columns aren't guaranteed to make a unique combination, then add an IDENTITY column and include it as the 3rd column of the clustered index. If the clustered index isn't unique, then you stand the great chance of losing some performance as the project scales up.

    As a side bar, when folks stand up and nearly rant about not using an IDENTITY column in a table as the clustered index, consider that you'll also find some absolute Ninja's for SQL Server who are much quieter and have used it with great success. Also consider how many people don't actually know about the clustered index being appended to non-clustered indexes and what happens to an index if they're not actually unique. What that means is that people who say that you absolutely must or must not use an IDENTITY column as the clustered index are absolutely wrong either way. As Lynn Pettis has said throughout this thread and is absolutely correct in saying so, "IT DEPENDS" and always will. 😉

  • Jeff Moden (6/2/2012)


    ... As Lynn Pettis has said throughout this thread and is absolutely correct in saying so, "IT DEPENDS" and always will. 😉

    Jeff, Thanks for the support. I was getting really frustrated and reading your entire post was refreashing and welcome. It seems this isn't the only thing people get heated up about.

  • michaeljharper (6/2/2012)


    Hi Jeff,

    Many thanks for the detailed response, really appreciate it. My University database tutor mentioned the same thing about the possibility of multiple stations logging data(even though I'm only using one).

    Then I like your tutor. (S)he is thinking ahead.

    So if I'm understanding you correctly, create another table with a station ID and location field, then have the station ID as a foreign key in the main WeatherData table. Finally create a clustered index on the datetime + station ID fields so it will always be unique?

    Yes... except the main WeatherData table would have the foreign key from the new StationID column back to the "Station" table.

    I have a primary key (weather_id) identity column that increases by 1 for every record, will I still need this?

    Heh... "It Depends". In this case, I think the answer would be "no" but, except for a little extra space used, it's not actually going to hurt anything. I've found that it's sometimes adventagous to have such a column to do "other" things that I can't even think of right now insofar as this table goes but, no... I don't believe you'll need it for this table.

    As a side bar, I'll frequently use a "dual keyed" table where the IDENTITY column is the Clustered PK and another set of column(s) makes an "alternate key" but I don't see it being needed here.

    I'll look at the video you posted - clustering is still a bit vague!

    Many thanks

    Mike

    Like I said... watch it more than once. There's a whole lot of information in that video and it seems like everytime I watch it, it provides me with yet another revelation that I'd not thought of.

    Thanks for the feedback on all of this, Mike. I wish you luck in your studies. Sounds like you don't need it, though. You have the right attitude and it sounds like you have an excellent tutor on the subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lynn Pettis (6/2/2012)


    Jeff Moden (6/2/2012)


    ... As Lynn Pettis has said throughout this thread and is absolutely correct in saying so, "IT DEPENDS" and always will. 😉

    Jeff, Thanks for the support. I was getting really frustrated and reading your entire post was refreashing and welcome. It seems this isn't the only thing people get heated up about.

    You were doing the right thing. I just lended a helping hand.

    As yet another sidebar, I just watched the movie again and, much to my dismay, found that I left out one of the "rules" for clustered index selection. It should have been ever-increasing, Unique, narrow, AND NOT NULL. I just flat forgot about the NOT NULL thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi again!

    Just to clarify the difference between a clustered and non-clustered index; a clustered index 'physically' changes the order of the table data to match the index (i.e order by date and time) and thus will improve query speeds (especially in my case of date ranges etc) and a non clustered index is more like a pointer to rows which helps speed up queries that use a certain condition often (i.e if a last name column was going to be used in lots of queries it might be an idea to stick an index on it)

    Thanks,

    Mike

  • That's mostly true but the are some caveats.

    The clustered index has a physical order but it's not necessarily contiguous. Things like page splits can change the physical order.

    Not all non-clustered indexes are simple pointers. If the index has all of the leaf level data in it that's needed by the query, the data will be returned from the index instead of making the extra trip back to the table. Such indexes are called "covering indexes" because they "cover" everything needed by the query. As you might guess, they can be incredibly fast because they're usually a fair bit more narrow (meaning, more rows per page) than the actual data table itself.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff 😀

  • So much for a UNIQUE Clustered Index and the performance that will offer.

    The way around this problem is to determine what makes each row UNIQUE in your table while considering future scalability. If you really want this stuff to be fast in the future, then take the time to also make it bullet proof and follow the actual best practices. Just adding a clustered index to the datetime column isn't necessarily the best practice because it may not be unique.

    When you create a clustered index, SQL itself will make it UNIQUE if necessary. A clustering key in SQL Server is always UNIQUE -- it must be for SQL to function.

    SQL adds another 4 bytes, and this will almost certainly be used on the table in q, but for this table your clustered index should still be datetime. Let SQL worry about making it UNIQUE in this case.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (6/4/2012)


    When you create a clustered index, SQL itself will make it UNIQUE if necessary. A clustering key in SQL Server is always UNIQUE -- it must be for SQL to function.

    Oh..... be careful now, Scott. While that may be true of a PK, it is absolutely NOT true of a Clustered Index. Don't get the two confused. PK's will automatically become the Clustered index if it hasn't been used and PKs are inherently unique. Clustered Indexes are NOT inherently unique and, unless formed by the PK, you must specifically state that the clustered index be unique if you want it to be unique.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/4/2012)


    ScottPletcher (6/4/2012)


    When you create a clustered index, SQL itself will make it UNIQUE if necessary. A clustering key in SQL Server is always UNIQUE -- it must be for SQL to function.

    Oh..... be careful now, Scott. While that may be true of a PK, it is absolutely NOT true of a Clustered Index. Don't get the two confused. PK's will automatically become the Clustered index if it hasn't been used and PKs are inherently unique. Clustered Indexes are NOT inherently unique and, unless formed by the PK, you must specifically state that the clustered index be unique if you want it to be unique.

    Jeff, I believe the difference you're discussing with Scott is the difference between the internal Clustering Key and the external result of the clustering index. Internally, he's correct, so Non-Clustered index key-lookups always return to the correct row.

    From a design standpoint, however, it's not as important as the difference between a Primary Key as a table identifier a user can use and a non-unique clustered index.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Ah... understood and agreed. I just don't want folks to think that it'll be unique on the values in the defined columns of the clustered index. Thanks, Craig.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 32 total)

You must be logged in to reply to this topic. Login to reply