October 9, 2018 at 7:13 am
ben.brugman - Tuesday, October 9, 2018 2:03 AM
Why not create a partition table? Put in that table all the 'rules' you use to decide into which partition a row should go (that could be any number of fields), plus a tinyint (if the number of partitions is less than 256) or a smallint (for a larger number of partitions). Then store this partition number in every row of the tables to be partitioned. The number of partitions is small compared to the number of rows, so the partition table hardly takes any space at all. The partition number can be the minimum size needed, so for 500E6 rows you only need 500E6 tinyints (or smallints).
The partition table can be very complex, for example based on a starting time, the period length, or the day of the week. You could even use multiple rows for each partition.
Example (only theoretical).
Partition table (logical content).
Sunday, March 2059, partition 1.
Monday-Friday, March 2059, partition 2.
Saturday, March 2059, partition 200.
Sunday, April 2059, partition 3.
Monday-Friday, April 2059, partition 4.
Saturday, April 2059, partition 200.
So even with a separate partition line for each day of the week, the number of rows in the partition table is relatively small.
The amount of extra data in the rows caused by adding the partition number field is very limited.
Special cases, for example NULL values or special dates (X-mas day), can easily be handled as special cases in 'special' partitions. The same goes for dates which are used as overlap between two periods (for example, the last 3 weeks of December could be treated as also belonging to the next year and get a partition of their own), or Black Friday could have its own partition (???). Once a partition is 'removed', the partition number can be reused with new rules. There are multiple ways to fill the partition number into the data rows. Adding partition rules is fairly simple. Changing partition rules is cumbersome.
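A minimal sketch of what such a rules table could look like (all names here are invented for illustration, not from the thread):

-- Sketch only: a 'rules' table mapping date attributes to a partition number.
-- Table and column names are illustrative.
CREATE TABLE dbo.PartitionRule
(
    DayOfWeekFrom   tinyint NOT NULL, -- 1 = Sunday ... 7 = Saturday
    DayOfWeekTo     tinyint NOT NULL,
    PeriodMonth     int     NOT NULL, -- e.g. 205903 = March 2059
    PartitionNumber tinyint NOT NULL  -- smallint if more than 255 partitions
);
INSERT INTO dbo.PartitionRule VALUES
    (1, 1, 205903,   1), -- Sunday, March 2059
    (2, 6, 205903,   2), -- Monday-Friday, March 2059
    (7, 7, 205903, 200), -- Saturday, March 2059
    (1, 1, 205904,   3), -- Sunday, April 2059
    (2, 6, 205904,   4), -- Monday-Friday, April 2059
    (7, 7, 205904, 200); -- Saturday, April 2059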
Is this idea total madness, or is it worth consideration? (Hope to hear from you all.)
Ben
The OP already stated that he wants to partition and actually knows how to partition. If you read back through this post, the problem is that he wants/needs to partition by two columns and, since you can't actually do that, has to come up with a method for combining two dates. Since he has 500 million rows, he's trying to keep the width of such a column to a minimum even though the dates in the legacy table are based on INTs.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2018 at 7:13 am
ChrisM@Work - Tuesday, October 9, 2018 1:48 AM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
You want to split your data into partitions so you can swap a partition into a "shadow table" on the same filegroup as the partition and operate on that shadow table. Most often, partitioning is based on a date element such as month, and the values in a partition would range from the first of the month inclusive to the first of the next month exclusive. What would the values (of your two columns PERIODYTD and PERIOD) be for the first and last rows of one of your partitions?
As I understand it, it's because his PeriodYTD and Period columns are both needed to define the periods that support his SWITCHes in the most efficient manner. I've run the poor bugger through the gauntlet on this one.
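For anyone following along, the SWITCH pattern being discussed looks roughly like this (object names are hypothetical; the shadow table must be empty, identically structured, and on the same filegroup):

-- Rough sketch of the partition-switch pattern (hypothetical object names).
ALTER TABLE dbo.BigTable SWITCH PARTITION 5 TO dbo.BigTable_Shadow;
-- ... run the heavy UPDATE/DELETE work against dbo.BigTable_Shadow ...
-- (switching back in requires a trusted CHECK constraint on the shadow
--  table matching that partition's boundary range)
ALTER TABLE dbo.BigTable_Shadow SWITCH TO dbo.BigTable PARTITION 5;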
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2018 at 7:42 am
Jeff Moden - Tuesday, October 9, 2018 7:13 AM
The OP already stated that he wants to partition and actually knows how to partition. If you read back through this post, the problem is that he wants/needs to partition by two columns and, since you can't actually do that, has to come up with a method for combining two dates. Since he has 500 million rows, he's trying to keep the width of such a column to a minimum even though the dates in the legacy table are based on INTs.
As I understood it, the OP already has decided that he wants (or needs) an extra column. The column needs to be at least wide enough to contain the max value for the number of partitions. My suggestion is to use the minimum-size integer for that. So if he wants 200 partitions, use a tinyint. If he wants 1,000 partitions, use a smallint for the column.
An extra table is used to resolve which partition number is used for a row. Starting with the data (two dates?) which in the end determines the partition, some manipulation plus the extra table should result in a single partition number for each row. Somewhere in the code this number should be 'assigned' to the row in the 'extra' column. Using such a table, complex rules can be expressed in the table with simple code.
Adding such a table decreases the complexity of the implementation and increases the possibilities.
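To make the 'assignment' step concrete, here is a sketch of stamping rows via a join against such a rules table (this builds on the hypothetical dbo.PartitionRule table sketched earlier; the fact table and its columns are also invented):

-- Sketch only: stamp each data row with its partition number.
-- DATEPART(weekday, ...) depends on SET DATEFIRST; 1 = Sunday assumed here.
UPDATE d
   SET d.PartitionNumber = r.PartitionNumber
  FROM dbo.BigDataTable AS d
  JOIN dbo.PartitionRule AS r
    ON DATEPART(weekday, CONVERT(date, CONVERT(char(8), d.PeriodDate)))
       BETWEEN r.DayOfWeekFrom AND r.DayOfWeekTo
   AND d.PeriodDate / 100 = r.PeriodMonth; -- 20590315 -> 205903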
Ben
October 9, 2018 at 7:57 am
Jeff Moden - Monday, October 8, 2018 8:36 PM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
Understood on #1. I think you're attacking the wrong problem, though. You should be attacking the slow code instead of trying to accommodate it.
On #2, you obviously think that using integers was a good decision to move away from datetime, but you've posted on a 2016 forum. What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?
On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this. The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form. You mentioned hashes but discounted them because they're not human-readable, as well as not being able to reconstitute the original two integer dates.
How do you feel about something that's able to handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates? If you're game for that and you don't mind the code breaking on 2019-06-07 (~61 years from now), then all we have to do is change the epoch. Of course, this would have been a whole lot easier if you had used the DATE datatype, but it can be done as a single INT.
--===== I'm just doing one example. You can replace the variables with column names and get rid
--       of the variable declarations altogether.
DECLARE @PERIODYTD INT = 20180131
,@PERIOD INT = 20160731
;
DECLARE @COMBINED BINARY(4)
;
--===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
+ CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
;
--===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
SELECT PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
,PERIOD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
;
Jeff - 2019-06-07 is 8 months from now. Was that really the date you meant? Sorry, I was just trying to follow along your line of reasoning.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right, what was my emergency again?
October 9, 2018 at 8:11 am
ben.brugman - Friday, October 5, 2018 2:46 AM
sgmunson - Thursday, October 4, 2018 9:46 AM
So, basically, you are recommending a date data type... It should bother you... in some way...
Can you explain why this should bother me?
I do understand that: if possible, use the existing types. But if there is a requirement for compact data storage of two dates in a single column, then using an integer with a small but sufficient range for 'date numbers' is a way to go. (Why the OP wants this is something I have not considered.)
Ben
Easy. Now you have to be in charge of enforcing a data type that has already been invented and that SQL Server already knows how to enforce. Even without any of a number of possible "schemes" to perpetuate this (which are probably bad ideas all by themselves), you're still effectively re-inventing the date data type when it's not at all necessary, and it imposes a significant extra processing burden. If you are going to try and gain something from partitioning, it's rather pointless if you are just going to have to add a crap ton of processing in the middle just to make it possible. The price is too high for the benefit.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
October 9, 2018 at 8:22 am
sgmunson - Tuesday, October 9, 2018 8:11 AM
The price is too high for the benefit.
In general I do agree, but sometimes (maybe not this time, maybe rarely) the benefit is higher than the price.
Thanks for your answer.
Ben
October 9, 2018 at 8:28 am
Matt Miller (4) - Tuesday, October 9, 2018 7:57 AM
Jeff Moden - Monday, October 8, 2018 8:36 PM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
Understood on #1. I think you're attacking the wrong problem, though. You should be attacking the slow code instead of trying to accommodate it.
On #2, you obviously think that using integers was a good decision to move away from datetime, but you've posted on a 2016 forum. What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?
On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this. The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form. You mentioned hashes but discounted them because they're not human-readable, as well as not being able to reconstitute the original two integer dates.
How do you feel about something that's able to handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates? If you're game for that and you don't mind the code breaking on 2019-06-07 (~61 years from now), then all we have to do is change the epoch. Of course, this would have been a whole lot easier if you had used the DATE datatype, but it can be done as a single INT.
--===== I'm just doing one example. You can replace the variables with column names and get rid
--       of the variable declarations altogether.
DECLARE @PERIODYTD INT = 20180131
,@PERIOD INT = 20160731
;
DECLARE @COMBINED BINARY(4)
;
--===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
+ CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
;
--===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
SELECT PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
,PERIOD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
;
Jeff - 2019-06-07 is 8 months from now. Was that really the date you meant? Sorry, I was just trying to follow along your line of reasoning.
No. Thanks for the catch. It should have been 2079-06-07. Really bad typo on my part.
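For completeness, here's a rough idea (an assumption on my part, not something the thread actually settled on) of how the combined BINARY(4) value could feed the partition function itself. BINARY compares byte by byte as unsigned values, so the key sorts first on the PERIODYTD day count, then on the PERIOD day count:

-- Sketch only: partitioning directly on the combined BINARY(4) key.
-- Boundary literals are made-up examples; build them exactly like the key.
CREATE PARTITION FUNCTION pf_CombinedPeriod (BINARY(4))
    AS RANGE RIGHT FOR VALUES (0xA75E9F1E, 0xA77D9F1E); -- hypothetical boundaries
CREATE PARTITION SCHEME ps_CombinedPeriod
    AS PARTITION pf_CombinedPeriod ALL TO ([PRIMARY]);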
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2018 at 8:42 am
Jeff Moden - Tuesday, October 9, 2018 8:28 AM
Matt Miller (4) - Tuesday, October 9, 2018 7:57 AM
Jeff Moden - Monday, October 8, 2018 8:36 PM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
Understood on #1. I think you're attacking the wrong problem, though. You should be attacking the slow code instead of trying to accommodate it.
On #2, you obviously think that using integers was a good decision to move away from datetime, but you've posted on a 2016 forum. What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?
On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this. The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form. You mentioned hashes but discounted them because they're not human-readable, as well as not being able to reconstitute the original two integer dates.
How do you feel about something that's able to handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates? If you're game for that and you don't mind the code breaking on 2019-06-07 (~61 years from now), then all we have to do is change the epoch. Of course, this would have been a whole lot easier if you had used the DATE datatype, but it can be done as a single INT.
--===== I'm just doing one example. You can replace the variables with column names and get rid
--       of the variable declarations altogether.
DECLARE @PERIODYTD INT = 20180131
,@PERIOD INT = 20160731
;
DECLARE @COMBINED BINARY(4)
;
--===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
+ CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
;
--===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
SELECT PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
,PERIOD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
;
Jeff - 2019-06-07 is 8 months from now. Was that really the date you meant? Sorry, I was just trying to follow along your line of reasoning.
No. Thanks for the catch. It should have been 2079-06-07. Really bad typo on my part.
OK, and that makes sense, since that's the date when the two-byte (smallint-sized) value would then roll over/overflow, correct? (65,536 days from 1900-01-01.)
That said, won't the OP run into issues long before then, since you can only have 15,000 partitions?
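Quick sanity check on that rollover date (just arithmetic against the 1900-01-01 epoch):

-- Two bytes hold day offsets 0 through 65535; day 65536 no longer fits.
SELECT LastDayThatFits  = DATEADD(dd, 65535, CONVERT(datetime, 0)), -- 2079-06-06
       FirstOverflowDay = DATEADD(dd, 65536, CONVERT(datetime, 0)); -- 2079-06-07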
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right, what was my emergency again?
October 9, 2018 at 9:20 am
Jeff Moden - Monday, October 8, 2018 8:36 PM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
Understood on #1. I think you're attacking the wrong problem, though. You should be attacking the slow code instead of trying to accommodate it.
On #2, you obviously think that using integers was a good decision to move away from datetime, but you've posted on a 2016 forum. What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?
On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this. The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form. You mentioned hashes but discounted them because they're not human-readable, as well as not being able to reconstitute the original two integer dates.
How do you feel about something that's able to handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates? If you're game for that and you don't mind the code breaking on 2019-06-07 (~61 years from now), then all we have to do is change the epoch. Of course, this would have been a whole lot easier if you had used the DATE datatype, but it can be done as a single INT.
--===== I'm just doing one example. You can replace the variables with column names and get rid
--       of the variable declarations altogether.
DECLARE @PERIODYTD INT = 20180131
,@PERIOD INT = 20160731
;
DECLARE @COMBINED BINARY(4)
;
--===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
+ CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
;
--===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
SELECT PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
,PERIOD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
;
Isn't that exactly the same logic I posted earlier? To quote my earlier post:
The algorithm is:
1) convert each date to the number of days since 1900-01-01 (or 1950-01-01 or other base date; the further you push forward the base date, the longer you can go, and the more positive key values you will have)
2) convert those values to varbinary(2)
3) concatenate the two varbinary(2) values into a single varbinary(4) value
4) convert that varbinary(4) to an int, which will be the date key value
5) to get original values back, reverse the process:
convert the int value to varbinary(4)
split the varbinary(4) into two varbinary(2) values
convert the varbinary(2) values to ints
add those int values to the base date to get the original dates back.
This should work OK for all dates thru 20781231, as far as I can tell.
If you make the base date 1950 rather than 1900, that would take you well into the 22nd century.
Here's sample code that shows the process, using 1950 as the base date:
...SQL code...
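A minimal sketch of the five steps above, using 1950-01-01 as the base date (an illustrative reconstruction; variable names are invented, not the original post's code):

-- Illustrative sketch of the listed steps (names are made up).
DECLARE @BaseDate date = '19500101';
DECLARE @Date1 date = '20180131', @Date2 date = '20160731';
-- Steps 1-4: days since base -> varbinary(2) each -> varbinary(4) -> int key
DECLARE @DateKey int =
    CONVERT(int, CONVERT(varbinary(2), DATEDIFF(dd, @BaseDate, @Date1))
               + CONVERT(varbinary(2), DATEDIFF(dd, @BaseDate, @Date2)));
-- Step 5: reverse the process to recover the original dates
DECLARE @Bytes varbinary(4) = CONVERT(varbinary(4), @DateKey);
SELECT Date1 = DATEADD(dd, CONVERT(int, SUBSTRING(@Bytes, 1, 2)), @BaseDate),
       Date2 = DATEADD(dd, CONVERT(int, SUBSTRING(@Bytes, 3, 2)), @BaseDate);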
SQL DBA, SQL Server MVP (07, 08, 09). "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 9, 2018 at 9:05 pm
ScottPletcher - Tuesday, October 9, 2018 9:20 AM
Jeff Moden - Monday, October 8, 2018 8:36 PM
_rohit_ - Monday, October 8, 2018 5:11 PM
OK, to answer some of the questions:
1) Why are you planning to partition such a small table?
Because the existing month-end processing, which apparently updates this table (anything between 20% and 70%) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.
2) Are you really storing dates as integers?
Yes, though I still don't understand why that is a problem. Instead, I feel it was a good decision to move away from datetime.
3) Why can't you use a partition function on a single date rather than two?
That's because the reason for me to partition, which I explained in point 1 (processing and purging), has to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.
Understood on #1. I think you're attacking the wrong problem, though. You should be attacking the slow code instead of trying to accommodate it.
On #2, you obviously think that using integers was a good decision to move away from datetime, but you've posted on a 2016 forum. What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?
On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this. The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form. You mentioned hashes but discounted them because they're not human-readable, as well as not being able to reconstitute the original two integer dates.
How do you feel about something that's able to handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates? If you're game for that and you don't mind the code breaking on 2019-06-07 (~61 years from now), then all we have to do is change the epoch. Of course, this would have been a whole lot easier if you had used the DATE datatype, but it can be done as a single INT.
--===== I'm just doing one example. You can replace the variables with column names and get rid
--       of the variable declarations altogether.
DECLARE @PERIODYTD INT = 20180131
,@PERIOD INT = 20160731
;
DECLARE @COMBINED BINARY(4)
;
--===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
+ CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
;
--===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
SELECT PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
,PERIOD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
;
Isn't that exactly the same logic I posted earlier? To quote my earlier post:
The algorithm is:
1) convert each date to the number of days since 1900-01-01 (or 1950-01-01 or other base date; the further you push forward the base date, the longer you can go, and the more positive key values you will have)
2) convert those values to varbinary(2)
3) concatenate the two varbinary(2) values into a single varbinary(4) value
4) convert that varbinary(4) to an int, which will be the date key value
5) to get original values back, reverse the process:
convert the int value to varbinary(4)
split the varbinary(4) into two varbinary(2) values
convert the varbinary(2) values to ints
add those int values to the base date to get the original dates back.
This should work OK for all dates thru 20781231, as far as I can tell.
If you make the base date 1950 rather than 1900, that would take you well into the 22nd century.
Here's sample code that shows the process, using 1950 as the base date:
...SQL code...
Dunno... I didn't look at your post before because there was no code. Now that I look at it, I'd have to say no, not exactly the same, because you used VARBINARY 😉 but I will say, yep... it looks very similar.
--Jeff Moden
Change is inevitable... Change for the better is not.