Sins of SQL: The Time Table

  • deleted -- multiple posted due to blackberry issues


  • All, please ignore the multiple posts above by me. Issues posting from my BlackBerry.

    Steve, if you could delete the dups, that would be nice.

    Lynn

  • Lynn,

    OK. Now I'm confused too.

    Dan.

  • dan_public (6/3/2010)


    Lynn,

    OK. Now I'm confused too.

    Dan.

    That's okay. I just went through this thread and I can't find anywhere that I criticized any of your responses. I have a feeling you have confused me with someone else in this thread, unless you can find the post where I did this.

    Lynn

  • Lynn Pettis (6/3/2010)


    dan_public (6/3/2010)


    Lynn,

    OK. Now I'm confused too.

    Dan.

    That's okay. I just went through this thread and I can't find anywhere that I criticized any of your responses. I have a feeling you have confused me with someone else in this thread, unless you can find the post where I did this.

    Lynn

    Lynn,

    My apologies. I got you confused with YSL Guru. The only thing I can think of is that I'm not used to the format of this forum software and simply looked in the wrong place.

    Again, my apologies.

    Dan.

  • I'd like to apologize to the author because, I'll confess, I read the title and the first paragraph and was immediately enraged. This approach should never replace the Date and Time dimension tables in a data warehouse, as the title of the article suggests. However, it could be used to build the date and time dimension tables (though I believe approaches using SSIS to be vastly superior and much, much more flexible). This pure SQL approach is kinda cool, though.

    Enough of the compliments, because I still believe this article is garbage. First, stored date AND time dimensions are critical to a data warehouse project's success. If you don't need to account for holidays, workdays, other special time periods, multiple date/time hierarchies, etc. now, you will later, and without an architecture that's able to adapt to changing business and reporting requirements efficiently and effectively, you'll fail. Word of advice: follow every point about dimensional modeling described in The Data Warehouse Toolkit. Though it doesn't touch on everything you need to know, the foundation of everything is there.

    Second, regarding SSAS, on which this article seemed to be focused, I can't stress enough that the Date and Time dimensions should never, ever, ever, ever be combined into a single dimension in a multidimensional database. More later on why not, if someone doesn't beat me to the issue. Gotta go.

  • To continue from my previous post:

    Reasoning for my second point: you ideally want your multidimensional structures to be as dense as possible, with every combination of every member of every attribute hierarchy referenced somewhere in your partition stores. That's obviously not going to happen; in a retail system, not every customer buys every product on every day (and in Steve's case, every second). But combining date and time just increases the sparsity of the multidimensional database along a dimension that is most likely going to be used to partition a database of any size whatsoever. It would just be a huge mistake to do so. As I said earlier, a combined date and time dimension for 10 years' worth of data at minute resolution results in a dimension with over 5 million (where'd I get 13M from?) stored members, while separate dimensions result in a date dimension with 3,650 members and a time dimension with 1,440 members. I don't need benchmarks to tell me the separate-dimension approach will best the consolidated-dimension approach in every aspect: dimension processing, aggregation processing, ease of maintenance. Everything.
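
    To put rough numbers on that, here's a quick back-of-the-envelope check of the member counts quoted above (illustrative arithmetic only, nothing from the article):

    -- Back-of-the-envelope check of the member counts quoted above.
    SELECT 10 * 365 * 1440 AS CombinedDateTimeMembers,  -- 5,256,000 ("over 5 million")
           10 * 365 + 1440 AS SeparateDimensionMembers; -- 3,650 + 1,440 = 5,090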

    Thirdly, say you have a referenced date in your fact table but no foreign key to your date and time dimensions. You could use Steve's approach to reference a virtual TIME (and DATE) dimension, using recursive CTEs to force it into SSAS. But wait. If I convert a datetime type to an int [CAST(FactDate AS int)], excluding the time portion (assuming the time is AM, since the cast rounds to the nearest day), I get the number of days elapsed since 1/1/1900. Hey, why not use that as the surrogate key of your date dimension! (Which is exactly what I and others do.) I don't even have to do a lookup for a date key at all in my ETL, and if I haven't already, I could create a calculated column in my DSV. Similarly, I could use the number of minutes (or seconds) that have elapsed since midnight as the surrogate key of my time dimension, so that I don't have to do a lookup for the Time Key either. (And could still create a calculated column for that key in my DSV.) And because of star joins and the small size of the date and time dimensions, analysis of events between arbitrary date/time periods is lightning fast. (And not nearly as "evil" as the CTE Steve proposes as the replacement of the time dimension.)
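
    To illustrate that key derivation, here's a minimal sketch (the @FactDate variable is just a stand-in for a fact table's datetime column, not anything from the article):

    -- Sketch only: derive date and time surrogate keys from a datetime value.
    DECLARE @FactDate datetime = '2010-06-04 14:37:00';
    SELECT DATEDIFF(DAY, 0, @FactDate) AS DateKey,  -- whole days elapsed since 1/1/1900
           DATEDIFF(MINUTE, DATEADD(DAY, DATEDIFF(DAY, 0, @FactDate), 0), @FactDate) AS TimeKey;  -- minutes elapsed since midnight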

    So to summarize, I would still give this article zero stars. I think it subtracts from, rather than adds to, the body of knowledge.

  • But combining date and time just increases the sparsity of the multidimensional database along a dimension that is most likely going to be used to partition a database of any size whatsoever.

    The only change this would require for partitioning is that the partitions are built off of the date/time values themselves, not arbitrary key ranges in the time table.

    It would just be a huge mistake to do so. As I said earlier, a combined date and time dimension for 10 years' worth of data at minute resolution results in a dimension with over 5 million (where'd I get 13M from?) stored members, while separate dimensions result in a date dimension with 3,650 members and a time dimension with 1,440 members. I don't need benchmarks to tell me the separate-dimension approach will best the consolidated-dimension approach in every aspect: dimension processing, aggregation processing, ease of maintenance. Everything.

    If you have separate date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of products with 8 things in it, a table of 100 customers, and 5 stores. A cube storing the sums may keep 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy, since perhaps each customer has a 1:1 relationship with a store, then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 5000 cells.

    The same principle holds here for date/time. Whether you have a single dimension or two, the maximum number of values at the lowest level is absolutely the same. When it comes to aggregations etc., you'd only lose out on aggregation design if you needed to group by time values independently of days - and you can keep things as simple or complex internally by using the appropriate hierarchy definitions that suit your querying needs.

    The only argument I've heard that seems to be based on anything is the notion that the large number of members overall in the dimension may cause slowness as the members of the hierarchy are resolved - although I'll admit that in my tests with around ~1B fact rows I haven't had any demonstrable problems when linking them to a pseudo-time dimension covering a range of ten years. Each day the next few records are incrementally loaded into the date dimension and the day's data is pushed in shortly thereafter - it runs just as quickly as it used to beforehand. Part of the reason, I suspect, is the fact that SSAS does not linearly scan a dimension looking for the appropriate key when loading data, but uses a tree. The effort of searching one tree of combined values versus two smaller trees is not materially higher - any basic algorithms class will tell you that (doubling the amount of data in a B+ tree/binary tree does not double search time, since the performance of lookups is always logarithmic).
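
    To put a rough number on that, using the member counts discussed earlier in the thread (illustrative arithmetic only):

    -- Approximate balanced-tree search depth: log2 of the combined member count
    -- equals the sum of the log2 of the separate counts, i.e. log2(n * m) = log2(n) + log2(m).
    SELECT LOG(3650.0 * 1440.0) / LOG(2.0)                 AS CombinedTreeDepth,   -- ~22.3
           LOG(3650.0) / LOG(2.0) + LOG(1440.0) / LOG(2.0) AS SeparateTreesDepth;  -- ~22.3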

    But wait. If I convert a datetime type to an int [CAST(FactDate AS int)], excluding the time portion (assuming the time is AM, since the cast rounds to the nearest day), I get the number of days elapsed since 1/1/1900. Hey, why not use that as the surrogate key of your date dimension! (Which is exactly what I and others do.) I don't even have to do a lookup for a date key at all in my ETL, and if I haven't already, I could create a calculated column in my DSV. Similarly, I could use the number of minutes (or seconds) that have elapsed since midnight as the surrogate key of my time dimension, so that I don't have to do a lookup for the Time Key either.

    (And could still create a calculated column for that key in my DSV.) And because of star joins and the small size of the date and time dimensions, analysis of events between arbitrary date/time periods is lightning fast.

    That approach is valid and workable, and one I've used myself where my reporting periods cover all possible date/time combinations. If you look at how I recommended things in the article, you'll notice this is precisely the mechanism by which the milliseconds etc. are normalized out of the data.

    (And not nearly as "evil" as the CTE Steve proposes as the replacement of the time dimension.)

    I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifications to a single area.

    So to summarize, I would still give this article zero stars. I think it subtracts from, rather than adds to, the body of knowledge.

    To each their own.

  • If you have separate date/time dimensions, the number of possible cells in the multidimensional space that need storing individually will be the same, since the conceptual space is the cross product of the dimensions. To give another example: say I have a table of products with 8 things in it, a table of 100 customers, and 5 stores. A cube storing the sums may keep 4,000 distinct entries. If two of the dimensions are merged to form a hierarchy, since perhaps each customer has a 1:1 relationship with a store, then you'd end up with a Customer (by store) dimension of cardinality 500. This gives, likewise, a dimensional space with up to 5000 (correction - 4000) cells.

    Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything; if the logic is sound, the principle should hold universally. Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.

    I hope we both agree that that design is NOT sound. The reasoning why that design is not sound is the same reasoning why the composite date/time dimension is not sound. (Again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification.) A Time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension; it's related to every member of the date dimension, and the same justification for separating Product, Customer, and Store applies to separating date and time. A date is a date and a time is a time, and they are not the same.

    I'm done. No mas.

  • Steven James Gray (6/4/2010)


    ...

    I've yet to see any demonstration that the CTE itself is unsuitable for synthesising the data on demand - it's not the fastest mechanism (and I've never claimed it would be), but it's also not an order of magnitude worse. Like procedural texture generation in 3D modelling systems, it trades some performance for the ability to switch to arbitrary levels of granularity by modifications to a single area.

    ...

    Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come around, see this code, and say "Great, just what I need for what I am working on!" The problem is that this developer then turns out to be using it on multi-million row data sets and can't figure out why it doesn't work well.

    Also, what happens if you start using this same code on much larger datasets and start seeing performance issues?

    Some very simple testing showed the inefficiencies of the recursive CTE, including the additional CPU time and IO requirements of the method.
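
    To give an idea of the kind of set-based alternative I mean (this is only a simplified sketch, not the actual code from my blog), a cross-joined tally generates the same minute-level range without any recursion at all:

    -- Simplified sketch: a cross-joined tally of system rows produces a number
    -- sequence, and DATEADD turns each number into a minute-level timestamp.
    WITH Tally (N) AS (
        SELECT TOP (DATEDIFF(MINUTE, '20060101', '20060201') + 1)
               CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS int)
        FROM sys.all_columns a CROSS JOIN sys.all_columns b
    )
    SELECT DATEADD(MINUTE, N, '20060101') AS TimeStampValue  -- 44,641 rows for this range
    FROM Tally;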

  • Martin Mason (6/4/2010)


    Hey. Why stop there! If separate date/time dimensions lead to the same number of possible cells in the resultant cube space as a single date/time dimension, that same logic should apply to everything; if the logic is sound, the principle should hold universally. Then why separate customer/product/store dimensions? An equally sound dimensional design would be to have a single fact table related to one dimension with a composite key of product/customer/store and the hierarchies built from that key.

    I never said it was a good idea not to have dimensions - but I think the argument for or against separate/unified date-time dimensions depends on your view of time. When I model things, I tend to view date/time as a single axis. For reporting that needs to look at time valuations irrespective of their dates (not something that tends to be valuable in the work I do), the hierarchy support in SSAS allows those needs to be met. So whilst I'd still maintain separate customer/product/store dimensions, I wouldn't ever opt for separate year/month dimensions.

    I hope we both agree that that design is NOT sound. The reasoning why that design is not sound is the same reasoning why the composite date/time dimension is not sound. (Again, read chapter 2 of The Data Warehouse Toolkit for a lengthy justification.) A Time member, e.g. 12:01 AM, is not uniquely related to only one member of the date dimension; it's related to every member of the date dimension, and the same justification for separating Product, Customer, and Store applies to separating date and time. A date is a date and a time is a time, and they are not the same.

    The example I gave was purely to demonstrate the concept that the overall cardinality of the dimensional space is the same regardless of which horse you ride when it comes to date/time tables, and merging store/product/customer dimensions would never be a real scenario someone would likely implement.

    That particular book sits on my shelf, and whilst it has its views on things, no one book is canon. Time represents subdivisions of a date, and if it were genuinely so valuable to separate date and time, why would you not maintain separate year, month, hour, minute, and second dimensions for the same reasons? There are always 60 seconds per minute, 60 minutes per hour, 12 months per year, etc. - but you'd be hard pressed to find a solution out there that does things that way.

    Lynn Pettis (6/4/2010)


    Here is the problem: you are using an inefficient method of generating the data. Another developer is going to come around, see this code, and say "Great, just what I need for what I am working on!" The problem is that this developer then turns out to be using it on multi-million row data sets and can't figure out why it doesn't work well.

    Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour when working at minute resolution, and without any effort applied to optimise what's there. Relative to the time and effort needed to maintain and keep on top of populating a time table manually, that seems like small fry. You'd not reprocess the whole dimension every time you loaded a fact into the tables either, but rather only incrementally populate ranges as you go.
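
    (For scale, a century at minute resolution works out to roughly 52.6 million rows:)

    -- Rough row count for a century at minute resolution, endpoints inclusive.
    SELECT DATEDIFF(MINUTE, '20000101', '21000101') + 1 AS MinutesInCentury;  -- 52,596,001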

    Also, what happens if you start using this same code on much larger datasets and start seeing performance issues?

    ....

    Some very simple testing showed the inefficiencies of the recursive CTE, including the additional CPU time and IO requirements of the method.

    If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time range - rather than a single second-level recursive process. The query plans in this case take the date/time combinations and do a table spool over them, and then as each day value is generated it is cross-joined to the time combinations. That's a sum total of 36,500 recursions per century! To prove this is the case, try running the following query using the TVF from the article:

    SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)

    This will produce a rowset of 44,641 rows, but never recurses more than 60 levels, as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part of it that actually grows in cost is the recursive step per day. Your MAXRECURSION value needs to be at least 60, and at most the number of distinct days in the range covered. As a result, this is substantially more efficient than your article would imply.
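
    To make that structure concrete, here's an illustrative reconstruction of the pattern (not the exact code of fn_GetTimestampRange from the article): a 0-59 numbers CTE is reused for the minutes and the (filtered) hours, so only the per-day CTE grows with the size of the requested range:

    -- Illustrative pattern only - not the article's exact function.
    WITH Numbers (N) AS (
        SELECT 0
        UNION ALL
        SELECT N + 1 FROM Numbers WHERE N < 59       -- 59 recursion levels, reused twice
    ),
    Days (D) AS (
        SELECT CAST('20060101' AS datetime)
        UNION ALL
        SELECT DATEADD(DAY, 1, D) FROM Days WHERE D < '20060131'   -- one recursion per day
    )
    SELECT DATEADD(MINUTE, h.N * 60 + m.N, d.D) AS TimeStampValue
    FROM Days AS d
    CROSS JOIN Numbers AS h
    CROSS JOIN Numbers AS m
    WHERE h.N < 24
    OPTION (MAXRECURSION 60);   -- 44,640 rows for January 2006; recursion depth never exceeds 60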

    It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations, meaning the recursive cost would shrink further, but I thought that would impact readability and not gain much in performance.

    As it stands, the current design has the advantage of being an inlinable TVF (and most performance work would likely prevent that), which means it can be used directly in a DSV and streams rows as fast as they're produced, rather than waiting for all rows to be computed in a temp table, as a multi-step TVF approach would do.

  • Steven James Gray (6/6/2010)


    Inefficiency is relative. As slow as it is, it's possible to generate enough data to cover a century in less than half an hour when working at minute resolution, and without any effort applied to optimise what's there. Relative to the time and effort needed to maintain and keep on top of populating a time table manually, that seems like small fry. You'd not reprocess the whole dimension every time you loaded a fact into the tables either, but rather only incrementally populate ranges as you go.

    Actually, inefficiency is not relative; it may just not be noticeable. Inefficient is still inefficient if there is another means to generate the same results that is better.

    If you look at the code, the recursive steps in the example given are broken into two recursive elements - the date range and the time range - rather than a single second-level recursive process. The query plans in this case take the date/time combinations and do a table spool over them, and then as each day value is generated it is cross-joined to the time combinations. That's a sum total of 36,500 recursions per century! To prove this is the case, try running the following query using the TVF from the article:

    SELECT * FROM dbo.[fn_GetTimestampRange]('1 January 2006','1 February 2006', 0) OPTION(MAXRECURSION 60)

    This will produce a rowset of 44,641 rows, but never recurses more than 60 levels, as a result of the optimisations obtained by the multiple re-use of the numbers expression within the function. The only part of it that actually grows in cost is the recursive step per day. Your MAXRECURSION value needs to be at least 60, and at most the number of distinct days in the range covered. As a result, this is substantially more efficient than your article would imply.

    It's possible to optimise this even further by doing the same trick when resolving all the possible year/month permutations, meaning the recursive cost would shrink further, but I thought that would impact readability and not gain much in performance.

    Care to prove your statement? I'm pretty sure that my routine (with a slight modification, thanks to you for allowing me to see an inefficiency in my code, which I have fixed on my blog) will consistently beat your code in a head-to-head competition.

    As it stands, the current design has the advantage of being an inlinable TVF (and most performance work would likely prevent that), which means it can be used directly in a DSV and streams rows as fast as they're produced, rather than waiting for all rows to be computed in a temp table, as a multi-step TVF approach would do.

    Have you taken a close look at my code (either version)? It is also written as an inline TVF and can be used in exactly the same fashion as yours. It has the benefit of being more efficient. After making a small modification to my code, I ran both functions to generate 100 years of data at one-minute intervals, writing the output to temporary tables:

    set statistics io on;
    set statistics time on;
    select * into #Test1 from dbo.fn_GetTimestampRangeCTE ('2000-01-01','2100-01-01', 0)
    set statistics time off;
    set statistics io off;

    set statistics io on;
    set statistics time on;
    select * into #Test2 from dbo.fn_GetTimestampRangeRCTE ('2000-01-01','2100-01-01', 0) OPTION (MAXRECURSION 0);
    set statistics time off;
    set statistics io off;

    SQL Server Execution Times:
       CPU time = 432079 ms, elapsed time = 453693 ms.
    (52596001 row(s) affected)

    Table 'Worktable'. Scan count 8, logical reads 2046345, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 1525828 ms, elapsed time = 1553694 ms.
    (52596001 row(s) affected)

    Once again, you'll see that your function, using a recursive CTE, is quite a bit slower. And even generating 52,596,001 rows of data, mine still isn't generating the additional IO that the recursive CTE does.
