Algorithm to combine 2 integers into one value and then split it back into two

  • ScottPletcher - Friday, October 5, 2018 9:50 AM

    frederico_fonseca - Thursday, October 4, 2018 5:35 PM

    as an alternative method

    constraints:
    - PERIODYTD date is always greater than or equal to PERIOD
    - the difference between the 2 dates is less than or equal to 3647 days -- just shy of 10 years
    - PERIODYTD must be >= 1900-01-01 and less than 2487-12-17 -- these bounds can change, but the interval must stay the same


    declare @ytd int = 20180131
    declare @per int = 20160731

    select t.ytd
      , t.per
      , conv2.*
      , rev2.*
    from (select @ytd as ytd
        , @per as per
    ) t

    outer apply (select datediff(day, '1900-01-01', convert(date, convert(varchar(8), @ytd, 112))) as ytd_p1
          , datediff(day, convert(date, convert(varchar(8), @per, 112)), convert(date, convert(varchar(8), @Ytd, 112))) as per_p1
    ) conv1
    outer apply (select conv1.ytd_p1 * 10000 + conv1.per_p1 as final -- value to use as partition column
    ) conv2
    -- convert back
    outer apply (select conv2.final % 10000 as per_rev1
          , (conv2.final - (conv2.final % 10000)) / 10000 as ytd_rev1
    ) rev1
    outer apply (select convert(int, convert(varchar(8), dateadd(day, rev1.ytd_rev1, '1900-01-01'), 112)) as ytd_rev_final
          , convert(int, convert(varchar(8), dateadd(day, -1 * rev1.per_rev1, dateadd(day, rev1.ytd_rev1, '1900-01-01')), 112)) as per_rev_final
    ) rev2
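
    To make the packing concrete, here is the same arithmetic worked out for the sample values above (a sketch of the steps the query performs, with the intermediate numbers spelled out):

    -- ytd_p1 = datediff(day, '1900-01-01', '2018-01-31') = 43129 -- days since the epoch
    -- per_p1 = datediff(day, '2016-07-31', '2018-01-31') = 549   -- gap between the two dates
    declare @final int = 43129 * 10000 + 549 -- = 431290549, fits comfortably in a 4-byte INT
    select @final % 10000 as per_back  -- 549
         , @final / 10000 as ytd_back  -- 43129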

    I have to say I do not see what you are trying to accomplish by partitioning like this - partitioning on the single date should be enough.
    But maybe you would like to enlighten us on the reasoning behind your decision.

    I don't think the integer concat approach really works too well when trying to recover the dates.  For example, try the above code with:

    declare @ytd int = 20220131 -- future dates, but diff btwn dates < 7 yrs
    declare @per int = 20280731

    Note that this pair has @per later than @ytd, which violates the first stated constraint:
    - PERIODYTD date is always greater than or equal to PERIOD
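
    To spell out the failure mode (a worked sketch using the values just above): with @per later than @ytd, the day gap goes negative, and the modulo split no longer lines up.

    -- ytd_p1 = datediff(day, '1900-01-01', '2022-01-31') = 44590
    -- per_p1 = datediff(day, '2028-07-31', '2022-01-31') = -2373 -- negative gap
    declare @final int = 44590 * 10000 - 2373 -- = 445897627
    select @final % 10000                      as per_back -- 7627, not 2373
         , (@final - (@final % 10000)) / 10000 as ytd_back -- 44589, one day early
    -- so both recovered dates come back wrong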

  • frederico_fonseca - Friday, October 5, 2018 10:11 AM

    ScottPletcher - Friday, October 5, 2018 9:50 AM [full post quoted above]

    But does the actual data always match that constraint?

  • ScottPletcher - Friday, October 5, 2018 11:01 AM

    But does the actual data always match that constraint?

    only the OP knows - this was more of an exercise in how it could be done if the constraints are acceptable.
    Even the "PERIODYTD date is always greater than or equal to PERIOD" constraint could be bypassed by adding a bit of code to deal with negative values - the 10-year one would still need to stay in place, though.
    But doing that would most likely defeat one of the benefits of partitioning by date, as we could have 2 records with the same PERIODYTD sitting on completely different partitions.

  • frederico_fonseca - Friday, October 5, 2018 11:12 AM

    ScottPletcher - Friday, October 5, 2018 11:01 AM [post quoted above]

    Me personally, I would never do an initial design with a constraint like that, since maybe 3-4 months later a need for one value to exceed the other could come up.  If I've designed an algorithm that can't handle that, then to change to a new algorithm, all existing data keys must be recalculated.  Ugh.

  • _rohit_ - Wednesday, October 3, 2018 8:27 PM

    Hi,
    I may not be the first one to ask this, but I'm looking for a way to combine two integers, 20180131 and 20160731, into one integer value and then break that single value back into 2 values.
    I have a huge table (>500 million rows) that I want to partition using these 2 columns: PERIODYTD (e.g. 20180131) and PERIOD (e.g. 20160731). If I concatenate these 2 values into something like 2018013120160731, the new field will have to be a BIGINT and will unnecessarily eat up space.

    I could generate a hash value using CHECKSUM and then partition on that, but that value would be human-unreadable and I wouldn't be able to recover the PERIODYTD and PERIOD values from it without looking up a row in the table.

    Regards,
    R

    I'd just use BIGINT. It might use up a bit more space but it will be readable by humans. If you have a 64-bit CPU and 64-bit SQL Server then the registers will also be 64 bits, so it shouldn't have much effect on internal processing either. If you're worried about space, you can probably save more by using either row or page compression on the table.
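
    For what it's worth, a minimal sketch of the BIGINT route (the variable names are just for illustration); both yyyymmdd values stay directly readable inside the combined number:

    DECLARE @PERIODYTD int = 20180131, @PERIOD int = 20160731;

    -- Pack: shift PERIODYTD left by 8 decimal digits, then add PERIOD.
    DECLARE @combined bigint = CAST(@PERIODYTD AS bigint) * 100000000 + @PERIOD;

    -- Unpack with integer division and modulo.
    SELECT @combined                          AS combined    -- 2018013120160731
         , CAST(@combined / 100000000 AS int) AS PERIODYTD   -- 20180131
         , CAST(@combined % 100000000 AS int) AS PERIOD;     -- 20160731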

  • Jonathan AC Roberts - Saturday, October 6, 2018 4:12 AM

    _rohit_ - Wednesday, October 3, 2018 8:27 PM [original question and the BIGINT reply quoted above]

    Not me.  It seems crazy to partition by both PeriodYTD and Period (which seems to be based on the month-ending date).  Just do the partitioning on the latter of those two and life will be a whole lot easier.  It would get even easier if the period were based on the start of the month and the partition function used RANGE RIGHT.
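
    For illustration, a month-based setup on just that single column might look like this (a sketch only; the boundary values are made up and everything is mapped to one filegroup for brevity):

    -- RANGE RIGHT makes each boundary value (the first of a month, as a yyyymmdd int)
    -- the first value of its own partition.
    CREATE PARTITION FUNCTION pfPeriodMonth (int)
        AS RANGE RIGHT FOR VALUES (20180101, 20180201, 20180301); -- extend as months are added

    CREATE PARTITION SCHEME psPeriodMonth
        AS PARTITION pfPeriodMonth ALL TO ([PRIMARY]);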

  • Jeff Moden - Saturday, October 6, 2018 9:32 AM

    Jonathan AC Roberts - Saturday, October 6, 2018 4:12 AM [post quoted above]

    Won't that reduce the number of partitions? I'm not really sure what the two values indicate, so I just suggested that because the OP wanted to partition on those two columns.

  • Jonathan AC Roberts - Saturday, October 6, 2018 12:16 PM

    Jeff Moden - Saturday, October 6, 2018 9:32 AM [post quoted above]

    Maybe, but that's easy to handle/increase if necessary.  Considering that the OP has 500 million rows spread across at least 2 years (24 months), that's only 20.8 million rows per partition if partitioned by month.

    Shifting gears and to be sure, the following rant isn't directed at you.

    My point is that, according to the original post, the OP is all bent out of shape about using a BIGINT because of storage requirements, but doesn't seem to give a hoot about how bad life can be when you use integers for dates instead of a proper temporal data type.  Compounding that by combining PeriodYTD (whatever that actually is) with Period (which also seems rather undefined at this point) seems a bit insane, especially if someone decides to change how a Period is defined in the future... and don't say that won't happen, because I've been through it too many times: twice for one company and once each for several others.  Add to that, when a company finally figures out that time-of-day is actually part of the period and that doing things by end-of-month date leaves out all but the very first instant of the last day of each month, I start to think this whole partitioning project is being designed by spreadsheet users with part-time curtain-hanging jobs rather than by someone actually concerned with space, ease of maintainability, and flexibility of use.

    And, if space is at such a premium on the system, then use the DATE datatype (3 bytes each), which keeps the ability to do temporal calculations, instead of making the severe mistake of using INTs as a representation of dates - never mind compounding the problem by trying to use two partitioning keys concatenated as one.
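
    If changing the stored columns really is off the table, one possible halfway house (a sketch; the table and column names are hypothetical) is a persisted computed DATE column, which SQL Server allows as a partitioning key:

    -- Derive a real 3-byte DATE from the existing integer column.  Style 112 keeps the
    -- conversion deterministic, which is what PERSISTED (and partitioning) requires.
    ALTER TABLE dbo.BigTable
        ADD PeriodDate AS CONVERT(date, CONVERT(char(8), PERIOD), 112) PERSISTED;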

  • Jonathan AC Roberts - Saturday, October 6, 2018 4:12 AM

    _rohit_ - Wednesday, October 3, 2018 8:27 PM [original question and the BIGINT reply quoted above]

    If it's 500M rows now, it'll be 1B rows soon. 4 bytes per row = 4GB more in the table.  That's quite a bit of space.  Since the number itself is meaningless anyway, humans don't need to be able to read it directly.  Just create a view that translates the number back to dates.  You never have to, nor should you, see the original number when looking at the actual data in the row.
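
    Something along these lines would do it (a sketch with hypothetical object names, assuming the day-offset packing shown earlier in the thread):

    -- Expose the packed partition key as readable dates; nobody needs the raw number.
    CREATE VIEW dbo.vBigTable
    AS
    SELECT t.*
         , DATEADD(day, t.PartKey / 10000, '19000101')                     AS PeriodYTD_Date
         , DATEADD(day, t.PartKey / 10000 - t.PartKey % 10000, '19000101') AS Period_Date
    FROM dbo.BigTable t;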

  • _rohit_ - Wednesday, October 3, 2018 8:27 PM

    [original question quoted above]

    We have several tables in the region of 20,000 million rows, partitioned in chunks of about 500 million rows. They're partitioned on date using a partition function, a fundamental component of partitioning. They're partitioned for ease of maintenance - we can load / remove hundreds of millions of rows in minutes. We may get some query performance lift as a result of partitioning, but it's marginal and irrelevant.
    So: why are you planning to partition such a small table? Are you really storing dates as integers? Why can't you use a partition function on a single date rather than on two?
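
    The "load / remove hundreds of millions of rows in minutes" part relies on metadata-only partition switching, roughly like this (object names hypothetical; the staging table must match the table's structure and live on the partition's filegroup):

    -- Purge one partition's worth of rows almost instantly: switch it out, then truncate.
    ALTER TABLE dbo.BigTable SWITCH PARTITION 5 TO dbo.BigTable_Staging;
    TRUNCATE TABLE dbo.BigTable_Staging;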

  • Ok, to answer some of the questions:
    1) Why are you planning to partition such a small table?
    Because the existing month-end processing, which updates this table (anything between 20% and 70% of it) and purges some data, is too slow at the moment. Having partitions would mean I could run these DMLs in parallel on each partition (via a different table) and merge them back into the main table. I have already tested the parallel processing and the CPU is good enough to accommodate that.

    2) Are you really storing dates as integers?
    Yes, though I still don't understand why that is a problem. If anything, I feel it was a good decision to move away from DATETIME.

    3) Why can't you use a partition function on a single date rather than two?
    Because the processing and purging I explained in point 1 have to happen on smaller datasets which are uniquely identified by the combination of these 2 values. If I create partitions on larger datasets, that won't help me.

  • _rohit_ - Monday, October 8, 2018 5:11 PM

    [answers to the three questions quoted above]

    Understood on #1.  I think you're attacking the wrong problem, though.  You should be attacking the slow code instead of trying to accommodate it.

    On #2, you obviously think that using integers was a good decision to move away from DATETIME, but you've posted on a 2016 forum.  What would be wrong with using the DATE datatype, which is only 3 bytes and still retains the ability to do temporal math without having to jump through conversion hoops?

    On #3, I still think you're attacking the wrong problem, but you're the one that will have to live with all of this.  The key here is that you're going to end up with 8 bytes if you want to keep your INTs in human-readable form.  You mentioned hashes but discounted them because they're not human-readable and the original two integer dates can't be reconstituted from them.

    How do you feel about something that can handle both the "addition" of your two dates and the reconstitution of your (ugh!) integer dates?  If you're game for that, and you don't mind the code breaking after 2079-06-06 (~61 years from now) unless the epoch is changed, then it can be done in a single 4-byte value.  Of course, this would have been a whole lot easier if you had used the DATE datatype.

    --===== I'm just doing one example.  You can replace the variables with column names and get rid
         -- of the variable declarations altogether.
    DECLARE  @PERIODYTD INT = 20180131
            ,@PERIOD    INT = 20160731
    ;
    DECLARE  @COMBINED  BINARY(4)
    ;
    --===== This demonstrates combining the number of days since 1900-01-01 for both dates into only 4 bytes.
     SELECT @COMBINED = CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIODYTD)))
                      + CONVERT(BINARY(2),DATEDIFF(dd,0,CONVERT(CHAR(8),@PERIOD)))
    ;
    --===== This splits the two values apart and converts them back to (ugh!) ints that look like dates.
     SELECT  PERIODYTD = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,1,2)),0),112))
            ,PERIOD    = CONVERT(INT,CONVERT(CHAR(8),DATEADD(dd,CONVERT(INT,SUBSTRING(@COMBINED,3,2)),0),112))
    ;
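
    As a quick sanity check on that ceiling: each 2-byte half maxes out at 65535 days from the 1900-01-01 epoch.

    SELECT DATEADD(dd, 65535, 0) AS LastRepresentableDay; -- 2079-06-06
    -- Dates beyond that silently lose their high-order bits in the CONVERT to BINARY(2).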

  • Jeff Moden - Monday, October 8, 2018 8:36 PM

    _rohit_ - Monday, October 8, 2018 5:11 PM [post and Jeff's reply quoted above]

    I agree, using DATE instead of INT would have been better, but this is some legacy mammoth code written ages ago. The values, right from ETL through processing to warehousing and reporting, are read as INTs with no temporal maths.
    I've been trying to avoid moving to DATE just because of the time I'll have to spend on analysis and testing. However, do you think there would be a significant improvement in index performance if I made the index skinnier (4 bytes to 3)? If yes, then that may be my motivation.
    On point 3, I still want to convince myself to use only one of the existing columns as the partition key, but the use cases speak otherwise. Thanks for your code snippet. I guess I'll have to drop the idea of making it human-readable somehow.

  • _rohit_ - Monday, October 8, 2018 5:11 PM

    [answers to the three questions quoted above]

    You want to split your data into partitions so you can swap a partition into a "shadow table" on the same filegroup as the partition and operate on that shadow table. Most often, partitioning is based on a date element such as month, and the values in a partition range from the first of the month inclusive to the first of the next month exclusive. What would the values of your two columns, PERIODYTD and PERIOD, be for the first and last rows of one of your partitions?

  • Why not create a partition table containing all the 'rules' you use to decide which partition a row should go to - that could be any number of fields - plus a tinyint (if the number of partitions is less than 256) or a smallint (for a larger number of partitions)? Then store this partition number in every row of the tables to be partitioned.

    The number of partitions is small compared to the number of rows, so the partition table hardly takes any space at all. The partition number can be the minimum size needed, so for 500 million rows you only need 500 million tinyints (or smallints).

    The partition table can be very complex, for example based on a starting time, the period length, or the day of the week. You could even use multiple rows for each partition.

    Example (only theoretical).

    Partition table (logical content):
    sunday, march 2059, partition 1.
    monday-friday, march 2059, partition 2.
    saturday, march 2059, partition 200.
    sunday, april 2059, partition 3.
    monday-friday, april 2059, partition 4.
    saturday, april 2059, partition 200.

    So even with a separate partition line for each day of the week, the number of rows in the partition table is relatively small, and the extra field added to every data row costs very little space.
    Special cases, for example NULL values or special dates (Christmas Day), can easily be handled in 'special' partitions. The same goes for dates used as overlap between two periods (for example, the last 3 weeks of December could be treated as also belonging to the next year and get a partition of their own), or Black Friday getting its own partition (???). Once a partition is 'removed', its partition number can be reused with new rules.

    There are multiple ways to fill the partition number into the data rows. Adding partition rules is fairly simple; changing partition rules is cumbersome.
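
    A rough sketch of what that lookup table might look like (hypothetical names; the rule columns are placeholders for whatever rules actually apply):

    -- Each rule row maps a slice of the data to a small partition number,
    -- and that 1-byte number is what gets stored on every data row.
    CREATE TABLE dbo.PartitionRule
    (   RuleId      int     NOT NULL PRIMARY KEY
      , PeriodFrom  int     NOT NULL   -- yyyymmdd lower bound, inclusive
      , PeriodTo    int     NOT NULL   -- yyyymmdd upper bound, exclusive
      , DayOfWeek   tinyint NULL       -- optional extra rule element
      , PartitionNo tinyint NOT NULL   -- reusable once a partition is retired
    );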

    Is this idea total madness, or is it worth consideration? (Hope to hear from you all.)
    Ben
