Sins of SQL: The Time Table

  • RonKyle (6/2/2010)


    Not everyone is an expert at every part of SQL Server, and while this person's article does not show the best approach, it is sufficient to point that out and provide alternate suggestions and/or links, as others have. It's not necessary to comment that you'd like to give 0 stars if you could; that isn't being constructively critical of an article, it's just plain mean and rude.

    You're right that not everyone can be an expert at every part of SQL. But it's probably not a good idea to write about something for publication that you don't know well. Now this article is out there, and an inexperienced user may try the technique if they aren't aware of or able to check the discussion. As the date dimension is critical to a successful data warehouse deployment, the chances have just increased that the deployment will fail. There probably should be some basic vetting system for articles published on the site.

    Maybe the comment is also unnecessary, but the strange title did open the door a bit, too. What exactly is the shortcoming of SQL that constitutes the "sin"? The problem of an exploding dimension was caused not by some flaw in SQL but by the author combining the date and time dimensions, which is not a recommended practice for precisely this reason. So while the comment was perhaps unnecessary, calling it plain mean and rude isn't helpful either, as the responder likely did not mean it that way.

    "But it's probably not a good idea to write about something for publication that you don't know well. "

    Not everyone realizes that what they think they know is not in fact something they know well. In our line of work we've all seen bad code written by people with the best of intentions who thought they were doing it right and believed they knew the material.

    "..calling the comment plain mean and rude isn't helpful either, as the responder did not likely mean it that way."

    I don't know about you, but if I come across as mean or rude to someone, I hope someone will let me know, because written text is not good at conveying emotion, and so it is easy to misinterpret what someone meant. This is why I included the "constructive criticism" part instead of just saying you're being mean and rude.

    Kindest Regards,

    Just say No to Facebook!
  • Whilst there may be more efficient mechanisms to generate the data itself (you may do better with a loop in a stored procedure or other evils), it's a lot easier to integrate with Business Intelligence Development Studio. We are somewhat blessed in that time is linear and predictable: there will only be 24 hours tomorrow, the day after, and so on, so it's possible to load these values in batches ahead of time (or incrementally, a small step for every cube data load operation). You'd not re-generate the whole date range of many years every time, any more than you'd re-process every row in a warehouse every time you incrementally populate a cube.

    Any dimension re-processing that involves this CTE can incrementally load a day's worth of values at a time, or more or less as needed. Generating a week's worth of data at 1 row/second resolution takes under 10 seconds on my VM test rig (2 GHz CPU allowance, 1 GB RAM, running SQL 2008 and not much else) when I write it directly to a table. You'd never call the CTE once per fact row or anything crazy like that.

    You can use this to store the data at the native SQL datetime resolution in your datamart/ODS, but process the cube at any arbitrary resolution at or below the native resolution of SQL Server without having to modify your fact data or re-generate lookup tables and re-key your data.

    Regarding some of the other date concerns:

    1) Using non-Gregorian calendars: There are free and widely available mechanisms to map UTC time-spans to/from these calendars. These can be incorporated into the functionality where needed without too much ado.

    2) Designating holidays/special dates - easily achievable by using lookup tables that can be joined to the rowset the CTE generates (see the sketch after this list).

    3) Auditing of the dimension (when incorporating any of the above, or other related change scenarios) can be done readily by calling the function and dumping the data out somewhere for analysis/archiving.
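
    For illustration, a minimal sketch of point 2. The generator function dbo.fn_GenerateTimeRows, its parameters, its TimeValue column, and the HolidayLookup table are hypothetical stand-ins; the article's actual function is not reproduced in this thread.

    -- Hypothetical lookup table of special dates; names and columns are illustrative only.
    CREATE TABLE dbo.HolidayLookup
    (
        HolidayDate date        NOT NULL PRIMARY KEY,
        HolidayName varchar(50) NOT NULL
    );

    INSERT INTO dbo.HolidayLookup (HolidayDate, HolidayName)
    VALUES ('2010-01-01', 'New Year''s Day'),
           ('2010-12-25', 'Christmas Day');

    -- Tag each generated row with its holiday name (NULL for ordinary days).
    SELECT    t.TimeValue,
              h.HolidayName
    FROM      dbo.fn_GenerateTimeRows('2010-01-01', '2010-12-31', 1) AS t   -- assumed signature
    LEFT JOIN dbo.HolidayLookup AS h
           ON h.HolidayDate = CAST(t.TimeValue AS date);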

  • Whilst there may be more efficient mechanisms to generate the data itself (you may do better with a loop in a stored procedure or other evils), it's a lot easier to integrate with Business Intelligence Development Studio. We are somewhat blessed in that time is linear and predictable: there will only be 24 hours tomorrow, the day after, and so on, so it's possible to load these values in batches ahead of time (or incrementally, a small step for every cube data load operation). You'd not re-generate the whole date range of many years every time, any more than you'd re-process every row in a warehouse every time you incrementally populate a cube.

    Steven,

    Not sure you're addressing the central issue. It is almost always better to separate the date and time dimensions. I add "almost" not because I can think of an exception, only because there probably is one. If the time is a separate dimension, then you don't have to add anything at all. My time table is 3602 rows and my date dimension is about 2000 rows. The latter of course grows from time to time. If I were to combine them, the number of rows would become horribly unwieldy and would continue to grow at a fast rate. What issue do you have that doesn't allow you to separate them?

    1) Using non-Gregorian calendars: There are free and widely available mechanisms to map UTC time-spans to/from these calendars. These can be incorporated into the functionality where needed without too much ado.

    2) Designating holidays/special dates - easily achievable by using lookup tables that can be joined to the rowset the CTE generates.

    The mapping for these and any other such item should be in the dimension table. For those who can't shake the OLTP approach, this can be hard to see, as the normal normalization rules don't apply. Doing them this way lets you set up these items as properties, or, in the case of non-Gregorian calendars, as another calendar. In my DW, Type of Day is a property consisting of Workday/Weekend/Holiday (a minimal sketch follows at the end of this post). This breakdown has yielded very interesting patterns. In an OLTP database this would be a lookup table, but in the OLAP model it is two columns in the dimension table, one for the key and one for the name.

    I had a chance to review a DW that had a combined date/time dimension with time intervals of 15 minutes. A single year thus required 35K+ rows. The response time was acceptable, but noticeably slow. When I separated them, not only was I able to provide time granularity down to the minute, which was what was desired, but the query response times were also much faster. It also made it easy to put the date on one axis and the time on the other to see the ebb and flow of things through the days.
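
    A minimal sketch of the two-column idea, with invented names rather than the poster's actual schema:

    -- Type of Day is denormalized straight into the date dimension: one column for the
    -- key, one for the name, instead of a separate lookup table as in an OLTP design.
    CREATE TABLE dbo.DimDate
    (
        DateKey       int         NOT NULL PRIMARY KEY,   -- e.g. 20100604
        DayDate       date        NOT NULL,
        TypeOfDayKey  tinyint     NOT NULL,               -- 1 = Workday, 2 = Weekend, 3 = Holiday
        TypeOfDayName varchar(10) NOT NULL
    );

    INSERT INTO dbo.DimDate (DateKey, DayDate, TypeOfDayKey, TypeOfDayName)
    VALUES (20100604, '2010-06-04', 1, 'Workday'),
           (20100605, '2010-06-05', 2, 'Weekend'),
           (20100705, '2010-07-05', 3, 'Holiday');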

  • Not sure you're addressing the central issue. It is almost always better to separate the date and time dimensions. I add "almost" not because I can think of an exception, only because there probably is one.

    Doing any kind of query on your kind of dual-dimension approach becomes pretty evil. You effectively are unable to do efficient date-range queries (e.g. 01 March 2009 10:00 through to 05 February 2010 15:00) without breaking them down into several different lookups and unifying the result sets. I've seen all kinds of evil used to do this, from dynamic SQL in sprocs that do OPENQUERY across to SSAS, through to .NET code that generates the SSAS query manually - it's difficult to implement, and rarely mastered.

    If the time is a separate dimension, then you don't have to add anything at all. My time table is 3602 rows and my date dimension is about 2000 rows.

    Notwithstanding the difficulty in querying, your figures reveal that you are only storing at most six years' worth of data (assuming day resolution), and little more than minute-level resolution on your time data. If you only go down to that level (in the code I supplied, set the skip-seconds flag and use a six-year date interval), then the approach I've supplied in the article will only occupy a total of ~40 MB for the date records themselves, and fewer than 4 million records, which SSAS won't even break a sweat over - especially since not all levels are specified in all hierarchies. In terms of the number of aggregations, it's near as damn it the same, because SSAS will need to compute them for each permutation of date/time anyway.

    Additionally, SSAS will not actually consume 4 million times as much space, because it only computes the aggregations where facts are present on the hierarchies. I'm not seeing the problem here. The space consumed by even the most spartan of fact tables and attributes on your other dimensions will make this look inconsequential for all but the smallest of data warehouse scenarios. I've used this approach to provide very efficient reporting of year-weekday-hour-minute trends versus year-quarter-weekday-hour-minute trends, etc., by adding a new hierarchy.

    The latter of course grows from time to time. If I were to combine them, the number of rows would become horribly unwieldy and would continue to grow at a fast rate. What issue do you have that doesn't allow you to separate them?

    With the approach I've demonstrated, there's an even easier way to group data by time and date separately - define separate hierarchies for time and date within the dimension, and only expand them to the level of detail you actually use. These will then be space efficient and traversable just as efficiently as your existing approach, and you don't have to change the date storage on your facts at all - ever. The number of date rows grows at a perfectly constant rate, so space usage is easily estimable, and it is possible to re-define both the resolution of the dimension and the hierarchies without ever having to modify or reduce the fidelity of the source data. If you have *huge* fact tables, not having to undertake this chore is something of a win.

    The mapping for these and any other such item should be in the dimension table. For those who can't shake the OLTP approach, this can be hard to see, as the normal normalization rules don't apply. Doing them this way lets you set up these items as properties, or, in the case of non-Gregorian calendars, as another calendar. In my DW, Type of Day is a property consisting of Workday/Weekend/Holiday. This breakdown has yielded very interesting patterns. In an OLTP database this would be a lookup table, but in the OLAP model it is two columns in the dimension table, one for the key and one for the name.

    Easily addressed by joining other elements onto the date table. The advantage of this technique (using date start/end spans on your tagging/attribute tables and left joining them onto the CTE rowsets) is that you can specify a two-week holiday with a single record that covers *any* number of subdivisions in that period (so whether you change the resolution up or down, there's no additional work that needs doing). You can then reference them easily in your time dimensions. (A sketch of the span join follows at the end of this post.)

    I had a chance to review a DW that had a combined date/time dimension with time intervals of 15 minutes. A single year thus required 35K+ rows. The response time was acceptable, but noticeably slow. When I separated them, not only was I able to provide time granularity down to the minute, which was what was desired, but the query response times were also much faster. It also made it easy to put the date on one axis and the time on the other to see the ebb and flow of things through the days.

    I'm not saying it's not possible to commit great evil with this (it's pretty easy to do with just about any technique, including this one) - but I suspect a lot of the comments here immediately go "Bah, CTE = Fail" without having put the code anywhere near a test project. If the one you've seen is slow, I'd suspect there's something else at play there, and would be hesitant to dismiss the technique offhand. (If they'd put all dimension values in a single level of a hierarchy, such as the one generated by default by the designer, that would kill it, for example.)
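
    A minimal sketch of the span join, again using the hypothetical dbo.fn_GenerateTimeRows generator and an invented tagging table:

    -- One record covers the whole two-week period, regardless of the time resolution
    -- of the generated rows.
    CREATE TABLE dbo.DateSpanTag
    (
        TagName   varchar(50) NOT NULL,
        SpanStart datetime    NOT NULL,
        SpanEnd   datetime    NOT NULL
    );

    INSERT INTO dbo.DateSpanTag (TagName, SpanStart, SpanEnd)
    VALUES ('Christmas shutdown', '2010-12-20', '2011-01-03');

    -- Left join the single span record onto however many generated rows fall inside it.
    SELECT    t.TimeValue,
              s.TagName
    FROM      dbo.fn_GenerateTimeRows('2010-12-01', '2011-01-31', 1) AS t   -- assumed signature
    LEFT JOIN dbo.DateSpanTag AS s
           ON t.TimeValue >= s.SpanStart
          AND t.TimeValue <  s.SpanEnd;

    The design point is that the span record, not the calendar rows, carries the attribute, so changing the generated resolution up or down never touches it.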

  • Doing any kind of query on that kind of thing becomes pretty evil. You effectively are unable to do efficient date-range queries (e.g. 01 March 2009 10:00 through to 05 February 2010 15:00) without breaking them down into three or four different queries and unifying the result sets. I've seen all kinds of evil used to do this, from dynamic SQL in sprocs that do OPENQUERY across to SSAS, through to .NET code that generates the SSAS query manually.

    I have no requirement for that kind of query, so I would not want to pay that cost. If that is a normal query for you, that might explain your modeling choice. If I were told to ensure the users could do that, however, I would first very carefully explain to management what the trade-off would be.

    I'm not saying it's not possible to commit great evil with this (it's pretty easy to do it with just about any technique, including this one) - but I suspect a lot of the comments here immediately go "Bah, CTE = Fail" without having put the code anywhere near a test project.

    That's not a fair comment. I've seen firsthand the results of a combined date/time dimension and what happens when they are separated. I'd bet others have, too. There's a reason it's the method Kimball would advocate in most circumstances, if I'm reading his works correctly.

    OK, that's my last comment on this subject. After this, we will just agree to disagree. I am working on my own date dimension and time series articles. If they come out, you may critique me.

  • What most people are saying is that this approach isn't necessarily the correct approach in a DW environment.

    My complaint with the code is the use of a recursive CTE. This is simply RBAR and will not scale well. Please take the time to read the article I referenced in one of my earlier posts for more information.

  • Lynn Pettis (6/2/2010)


    What most people are saying is that this approach isn't necessarily the correct approach in a DW environment.

    My complaint with the code is the use of a recursive CTE. This is simply RBAR and will not scale well. Please take the time to read the article I referenced in one of my earlier posts for more information.

    I'm familiar with the usual issues regarding CTE performance. However, for linear up-front population of a defined time dimension (or occasional re-computing if you decide to change the granularity of the dimension in a way you can't mask with hierarchies), those concerns, whilst valid, are not going to have an impact that's noteworthy relative to the rest of the work going on.

    Even for long ranges of time, it'll be over in seconds or minutes - the CTEs in the example supplied are substantially tweaked (notice there are two: one for the numbers 1-60, the other for the individual days in the range) by repeatedly re-using the same result set, avoiding most of the computation and large intermediate result sets in favor of smaller individual ranges that are effectively cross-joined together.
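
    For illustration, a rough sketch of that cross-join shape - not the article's actual code; the CTE names, the 0-59 numbering, and the date range here are invented:

    -- One small recursive CTE for the numbers 0-59 (re-used for hours, minutes and
    -- seconds) and one stepping a day at a time, cross-joined to give one row per second.
    DECLARE @StartDate datetime = '2010-06-01';
    DECLARE @EndDate   datetime = '2010-06-08';   -- one week

    WITH Numbers AS
    (
        SELECT 0 AS n
        UNION ALL
        SELECT n + 1 FROM Numbers WHERE n < 59
    ),
    Days AS
    (
        SELECT @StartDate AS DayValue
        UNION ALL
        SELECT DATEADD(DAY, 1, DayValue)
        FROM   Days
        WHERE  DayValue < DATEADD(DAY, -1, @EndDate)
    )
    SELECT DATEADD(SECOND, s.n,
           DATEADD(MINUTE, m.n,
           DATEADD(HOUR,   h.n, d.DayValue))) AS TimeValue
    FROM       Days AS d
    CROSS JOIN (SELECT n FROM Numbers WHERE n < 24) AS h   -- hours
    CROSS JOIN Numbers AS m                                -- minutes
    CROSS JOIN Numbers AS s                                -- seconds
    OPTION (MAXRECURSION 0);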

  • Steven James Gray (6/2/2010)


    Lynn Pettis (6/2/2010)


    What most people are saying is that this approach isn't necessarily the correct approach in a DW environment.

    My complaint with the code is the use of a recursive CTE. This is simply RBAR and will not scale well. Please take the time to read the article I referenced in one of my earlier posts for more information.

    I'm familiar with the usual issues regarding CTE performance. However, for linear up-front population of a defined time dimension (or occasional re-computing if you decide to change the granularity of the dimension in a way you can't mask with hierarchies), those concerns, whilst valid, are not going to have an impact that's noteworthy relative to the rest of the work going on.

    Even for long ranges of time, it'll be over in seconds or minutes - the CTEs in the example supplied are substantially tweaked (notice there are two: one for the numbers 1-60, the other for the individual days in the range) by repeatedly re-using the same result set, avoiding most of the computation and large intermediate result sets in favor of smaller individual ranges that are effectively cross-joined together.

    Simple question, have you read the article I suggested you read? If not, please do.

  • Simple question, have you read the article I suggested you read? If not, please do.

    I'd already read it when it was first posted to the thread - but by the point where it actually shows some benefit (the 60-million-row example, particularly), that's about two years' worth of data at second resolution, or more than a century's worth at minute resolution. When you process a century's worth of data like that, a 30-minute wait doesn't seem that bad. With a little tweaking of the code I supplied in the article - moving the 'numbers CTE' out to a table variable, and then only using the CTE for the day-to-day stepping with joins to the time values - it would speed up by orders of magnitude, all whilst using the overall technique described (a sketch of that tweak follows at the end of this post). I'm not married to the CTE itself, just to the principle of not creating surrogate identities for time values - and the function is an example of how this can readily be achieved.

    The advantage of the existing implementation I supplied, though, is that it is incrementally rendered and does not wait for the entire CTE to complete before giving you rows. They come as fast as SQL generates them, which is faster than any server I've ever thrown this at is capable of actually processing the dimension.
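
    A sketch of that tweak, under the same invented names as the earlier sketch: the 0-59 numbers sit in a table variable, so only the day-to-day stepping remains recursive.

    DECLARE @Numbers TABLE (n int NOT NULL PRIMARY KEY);

    -- Fill 0-59 once from any handy row source instead of a recursive CTE.
    INSERT INTO @Numbers (n)
    SELECT TOP (60) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
    FROM sys.all_objects;

    DECLARE @StartDate datetime = '2010-06-01';
    DECLARE @EndDate   datetime = '2010-06-08';

    WITH Days AS
    (
        SELECT @StartDate AS DayValue
        UNION ALL
        SELECT DATEADD(DAY, 1, DayValue)
        FROM   Days
        WHERE  DayValue < DATEADD(DAY, -1, @EndDate)
    )
    SELECT DATEADD(SECOND, s.n,
           DATEADD(MINUTE, m.n,
           DATEADD(HOUR,   h.n, d.DayValue))) AS TimeValue
    FROM       Days AS d
    CROSS JOIN (SELECT n FROM @Numbers WHERE n < 24) AS h   -- hours
    CROSS JOIN @Numbers AS m                                -- minutes
    CROSS JOIN @Numbers AS s                                -- seconds
    OPTION (MAXRECURSION 0);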

  • Doing any kind of query on your kind of dual-dimension approach becomes pretty evil. You effectively are unable to do efficient date-range queries (e.g. 01 March 2009 10:00 through to 05 February 2010 15:00) without breaking them down into several different lookups and unifying the result sets. I've seen all kinds of evil used to do this, from dynamic SQL in sprocs that do OPENQUERY across to SSAS, through to .NET code that generates the SSAS query manually - it's difficult to implement, and rarely mastered.

    With separate date and time dimensions, the query is not bad at all.

    SELECT COUNT(*)
    FROM dbo.Fact fact
    INNER JOIN dbo.DimDate dt_start
        ON fact.DateKey = dt_start.DateKey
    INNER JOIN dbo.DimTime tm_start
        ON fact.TimeKey = tm_start.TimeKey
    INNER JOIN dbo.DimDate dt_end
        ON fact.DateKey = dt_end.DateKey
    INNER JOIN dbo.DimTime tm_end
        ON fact.TimeKey = tm_end.TimeKey
    WHERE (
            dt_start.DayDate > '3/1/2009'
            OR (
                dt_start.DayDate = '3/1/2009'
                AND tm_start.Hour >= 10
            )
        )
        AND (
            dt_end.DayDate < '3/5/2010'
            OR (
                dt_end.DayDate = '3/5/2010'
                AND tm_end.Hour <= 15
            )
        )

    Runs very fast.

    Additionally, SSAS will not actually consume 4 million times as much space, because it only computes the aggregations where facts are present on the hierarchies. I'm not seeing the problem here. The space consumed by even the most spartan of fact tables and attributes on your other dimensions will make this look inconsequential for all but the smallest of data warehouse scenarios. I've used this approach to provide very efficient reporting of year-weekday-hour-minute trends versus year-quarter-weekday-hour-minute trends, etc., by adding a new hierarchy.

    Whoa. Timeout. If you're talking about SSAS, your argument only considers partition processing, NOT dimension or aggregation processing SPACE, TIME, or MEMORY CONSUMPTION. With separate dimensions, the time dimension at second-level granularity is going to have to process 3600 rows - 3600 members at the key level. Your date dimension for 10 years of data at the day grain would be about the same. On the other hand, your composite design with 10 years at second-level granularity would be 13,140,000 records at the key level. Which is the more performant solution: the one that requires processing 3600 * 2 records, or the one that requires processing roughly 3600 squared?

  • I'd like to see a full benchmarked example, with one source file, two ETLs into two competing models, and sample queries against them. SSAS cubes too, if you like.

  • Steven,

    The message that I'm (strongly) getting from you is that "My approach is best." As proof of "My approach is best", you keep bringing up the date range query. Quite frankly, this is starting to look like the issue I've seen over the years where someone gets attached to a theory or approach, and will do anything to defend their position. I.e., they like using a hammer and everything looks like a nail.

    After 29 years of creating analytical applications of myriad types, I've never been called on to do date range queries EXCEPT for operational, forensic reporting. Analytical business questions, especially those where SSAS is the tool of choice, are almost always focused on some measure over a standardized period of time (day, week, etc). When sub-day analysis is necessary, the questions are typically slanted towards determining business performance at times of day. E.g. average store sales per hour of day for the last three years. In this context, a simple and separate "Time" dim is the best approach.

    Your article would be reasonable if it had been titled something like, "Non Time Dimension Methodologies for Handling Range Queries". If you had positioned your approach as an alternative method for handling certain types of problem, there would be mild controversy at worst. Instead, you chose to bash traditional methodologies (Date and Time Dims) and use a single problem as proof of your approach's superiority.

    If your goal is to garner attention, you have succeeded nicely. Unfortunately, it's probably not the kind of attention that you want. Going forward, I will be on the lookout for your articles. And be sure not to read them.

    Have a nice day.

    Dan.

  • If your goal is to garner attention, you have succeeded nicely. Unfortunately, it's probably not the kind of attention that you want. Going forward, I will be on the lookout for your articles. And be sure not to read them.

    And I just wanted to give zero stars! But I also agree with you. When you title your article "The Sins of SQL", I expect at a minimum that you have a very basic understanding of dimensional modeling and the difference between OLAP and OLTP.

  • dan_public (6/2/2010)


    Steven,

    .....[G]oing forward, I will be on the lookout for your articles. And be sure not to read them.

    Have a nice day.

    Dan.

    Not having "29 years of experience" but being keenly interested in learning from those with good ideas &/or years of experience/expertise, I am disappointed when I see comments such as Dan's closing sentences as such mean-spirited words make it harder for me to consider the rest of his post (which seems to raise valid points, otherwise). In this field, as in most that live in shades of gray, a public square full of differing opinions is a sign of a healthy environment....one better maintained by avoiding mean-spirited, rude and ad hominem statements.

    Thank you to the author and those (respectfully) offering differing opinions/reasoning. The back and forth is MUCH more valuable than any article by itself (controversial OR sacrosanct).

  • Steven James Gray (6/2/2010)


    Regarding some of the other date concerns:

    1) Using non-Gregorian calendars: There are free and widely available mechanisms to map UTC time-spans to/from these calendars. These can be incorporated into the functionality where needed without too much ado.

    2) Designating holidays/special dates - easily achievable by using lookup tables that can be joined to the rowset the CTE generates.

    3) Auditing of the dimension (when incorporating any of the above, or other related change scenarios) can be done readily by calling the function and dumping the data out somewhere for analysis/archiving.

    That is quite a strange response. Your article spent time basically explaining how your Gregorian calendar approach using a recursive CTE is the only "good" solution. (After all, you did title it "Sins of SQL", which, by the way, has absolutely nothing whatsoever to do with SQL.) Your solution to each of the holes in the approach is to simply start bolting on additional stuff. So, let me see... Use a recursive CTE that has to generate the set of dates each time and also can't be used to enforce any kind of integrity. Then, when you need holidays, bolt on a lookup table. Wait, need business-defined week ending dates, quarter ending dates, and year ending dates - bolt on another lookup table or tables. Need to vary the start/end of a business week by country, or even change the definition after several years while still preserving the previous definition - bolt on another lookup table. By the time you're done bolting on all of these special-purpose lookup tables just to handle all of the permutations in a single cube that I'm dealing with, you have nothing but a mess that blatantly demonstrates that your assertion about not needing a date table is completely full of holes you never considered. There might be a tiny bit of merit on the time side, but your architecture on a set of dates wouldn't survive 5 minutes in any of the hundreds of environments where I've worked on or designed analytic systems.

    Michael Hotek
