Choosing the Birthday of a Customer by month?

  • Jason A. Long - Monday, November 27, 2017 11:16 PM

    Here's a quick test, using just over 3.25 million rows of data in both dbo.customer_1 and dbo.customer_2.

    Aside from the persisted computed columns on 2, both tables identical with identical data.

    The results are below...
    -- DROP TABLE dbo.customer_1
    CREATE TABLE dbo.customer_1 (
        cust_id INT NOT NULL IDENTITY(1,1),
        last_name VARCHAR(30) NOT NULL
            CONSTRAINT df_lastlame DEFAULT (''),
        first_name VARCHAR(30) NOT NULL
            CONSTRAINT df_firstname DEFAULT (''),
        date_of_birth DATE NOT NULL
            CONSTRAINT df_dateofbirth DEFAULT ('19000101')
        CONSTRAINT pk_customer PRIMARY KEY (cust_id)
        );
    GO

    INSERT dbo.customer_1 (last_name, first_name, date_of_birth)
    SELECT DISTINCT
        LAG(r.PatientNameLast, 1, 'Doe') OVER (ORDER BY (SELECT NULL)),
        LEAD(r.PatientNameFirst, 1, 'Jane') OVER (ORDER BY (SELECT NULL)),
        r.PatientBirthDate
    FROM
        dbo.Referral r
    WHERE
        r.PatientBirthDate > '19000101';
    GO

    CREATE NONCLUSTERED INDEX ix_dateofbirth ON dbo.customer_1 (date_of_birth)
    INCLUDE (cust_id, last_name, first_name);
    GO

    -- DROP TABLE dbo.customer_2
    CREATE TABLE dbo.customer_2 (
        cust_id INT NOT NULL IDENTITY(1,1),
        last_name VARCHAR(30) NOT NULL
            CONSTRAINT df_lastlame2 DEFAULT (''),
        first_name VARCHAR(30) NOT NULL
            CONSTRAINT df_firstname2 DEFAULT (''),
        date_of_birth DATE NOT NULL
            CONSTRAINT df_dateofbirth2 DEFAULT ('19000101'),
        b_month AS CONVERT(TINYINT, MONTH(date_of_birth)) PERSISTED,
        b_day AS CONVERT(TINYINT, DAY(date_of_birth)) PERSISTED,
        CONSTRAINT pk_customer2 PRIMARY KEY (cust_id)
        );
    GO

    INSERT dbo.customer_2 (last_name, first_name, date_of_birth)
    SELECT
        c1.last_name,
        c1.first_name,
        c1.date_of_birth
    FROM
        dbo.customer_1 c1;
    GO

    CREATE NONCLUSTERED INDEX ix_bday_bmonth_2 ON dbo.customer_2 (b_day, b_month)
    INCLUDE (cust_id, last_name, first_name, date_of_birth)
    GO

    -- DROP TABLE dbo.dob_calendar
    CREATE TABLE dbo.dob_calendar (
        date_of_birth DATE NOT NULL,
        b_month AS CONVERT(TINYINT, MONTH(date_of_birth)) PERSISTED,
        b_day AS CONVERT(TINYINT, DAY(date_of_birth)) PERSISTED,
        CONSTRAINT pk_dobcalendar PRIMARY KEY CLUSTERED (b_day, b_month, date_of_birth)
        );
    GO

    INSERT dbo.dob_calendar (date_of_birth)
    SELECT
        DATEADD(DAY, 0, t.n)
    FROM
        dbo.tfn_Tally(DATEDIFF(DAY, 0, GETDATE()), 0) t;    -- dbo.tfn_Tally is a "ben-gan" style tally/numbers function. I think we've all got one by now, but I'll post the code if anyone wants it...
    GO

    ALTER INDEX pk_dobcalendar ON dbo.dob_calendar REBUILD;
    GO

    --==========================================================================================

    DBCC DROPCLEANBUFFERS;
    GO
    WAITFOR DELAY '00:00:01';
    GO
    DECLARE @_start DATETIME2(7) = SYSDATETIME();
    PRINT(CONCAT('************** start join dob_calendar: ', @_start, ' **************************'));
    SET STATISTICS IO ON;
    ---------------------
        SELECT
            cst.cust_id,
            cst.last_name,
            cst.first_name,
            cst.date_of_birth
        FROM
            dbo.customer_1 cst
            JOIN dbo.dob_calendar dc
                ON cst.date_of_birth = dc.date_of_birth
        WHERE
            dc.b_month = 11
            AND dc.b_day = 27;
    ---------------------
    PRINT(CONCAT('************** finish join dob_calendar... duration in ms: , DATEDIFF(mcs, @_start, SYSDATETIME()) / 1000.0, ' **************************

    '));
    GO

    DBCC DROPCLEANBUFFERS;
    GO
    WAITFOR DELAY '00:00:01';
    GO
    DECLARE @_start DATETIME2(7) = SYSDATETIME();
    PRINT(CONCAT('************** start where exists dob_calendar: ', @_start, ' **************************'));
    SET STATISTICS IO ON;
    ---------------------
        SELECT
            cst.cust_id,
            cst.last_name,
            cst.first_name,
            cst.date_of_birth
        FROM
            dbo.customer_1 cst
        WHERE
            EXISTS (
                    SELECT 1
                    FROM dbo.dob_calendar dc
                    WHERE cst.date_of_birth = dc.date_of_birth
                        AND dc.b_month = 11
                        AND dc.b_day = 27
                    );
    ---------------------
    PRINT(CONCAT('************** finish where exists dob_calendar... duration in ms: ', DATEDIFF(mcs, @_start, SYSDATETIME()) / 1000.0, ' **************************

    '));
    GO

    DBCC DROPCLEANBUFFERS;
    GO
    WAITFOR DELAY '00:00:01';
    GO
    DECLARE @_start DATETIME2(7) = SYSDATETIME();
    PRINT(CONCAT('************** start persisted cols: ', @_start, ' **************************'));
    SET STATISTICS IO ON;
    ---------------------
        SELECT
            cst.cust_id,
            cst.last_name,
            cst.first_name,
            cst.date_of_birth
        FROM
            dbo.customer_2 cst
        WHERE
            cst.b_month = 11
            AND cst.b_day = 27;
    ---------------------
    PRINT(CONCAT('************** finish persisted cols... duration in ms: ', DATEDIFF(mcs, @_start, SYSDATETIME()) / 1000.0, ' **************************

    '));
    GO

    Results... ((query options: discard results after execution)

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    ************** start join dob_calendar: 2017-11-28 00:22:15.5127477 **************************

    (8217 rows affected)
    Table 'customer_1'. Scan count 117, logical reads 910, physical reads 3, read-ahead reads 576, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'dob_calendar'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    ************** finish join dob_calendar... duration in ms: 289.054000 **************************

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    ************** start where exists dob_calendar: 2017-11-28 00:22:17.2769080 **************************

    (8217 rows affected)
    Table 'customer_1'. Scan count 117, logical reads 910, physical reads 3, read-ahead reads 576, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'dob_calendar'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    ************** finish where exists dob_calendar... duration in ms: 275.049000 **************************

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    ************** start persisted cols: 2017-11-28 00:22:19.0060602 **************************

    (8221 rows affected)
    Table 'customer_2'. Scan count 1, logical reads 39, physical reads 2, read-ahead reads 36, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    ************** finish persisted cols... duration in ms: 257.038000 **************************

    All 3 plowed through 3.2+ million rows in ~1/4 second. Given the fact that this isn't likely to be executed more than once or twice a day... me personally... I'd be inclined to go with the "dob_calendar" table option and avoid the negative consequences of having the computed columns. The "dob_calendar" table is cheap to create & populate, easy to maintain and won't impact other queries hitting the original table (yes there is the covering index on date_of_birth... but... AFAIC, that's negated by the fact that a similar index would be required by the persisted computed columns solution).
    That said, Jeff has a long history of finding flaws and/or holes in my logic. So, there's a good chance that tradition will continue and I'll either learn something new or gain a new perspective.

    I hope you never take it negatively.  Especially since I always learn from our discussions. 😉

    A couple of quick questions.
    1.  What would you use to maintain the separate table?
    2.  What are the "negative consequences" of having a persisted computed column, in this particular case?
    3.  Also, the original problem was to find everyone with a given birth-month.  Or, at least that is my impression.  I'm probably missing some logic but why are you looking only for people that have the same day of the same month (11/27)?

    Here's the part of the OP's original post that I'm speaking of.

    Looking to write a simple query that will return the Birthday's, by month if I choose a Date Range in the month spcified
    So this code (below) because I chose January, will ignore the year and day.

    select cust_id, cust_fname, cust_lname, cust_dob from birthdays
    where cust_active = 1 and cust_dob between '1995-01-26 00:00:00.000' and '1995-01-27 00:00:00.000'

    So that query (should) will return all birthdays in the month of Jan, no matter the year or date.
    The obvious is the @MonthStart code above, I just dont know how to lay it out correctly.                                 

    Also, no one can execute your tests because they don't have your "Referral" table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @chef423 ,

    Would you post the CREATE TABLE statement (with all the indexes) for you're Customer table, please?  I'd like to generate some test data for all of this because, as a very wise man once said, "One good test is worth a thousand expert opinions".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, November 28, 2017 5:45 AM

    I hope you never take it negatively.  Especially since I always learn from our discussions. 😉

    A couple of quick questions.
    1.  What would you use to maintain the separate table?
    2.  What are the "negative consequences" of having a persisted computed column, in this particular case?
    3.  Also, the original problem was to find everyone with a given birth-month.  Or, at least that is my impression.  I'm probably missing some logic but why are you looking only for people that have the same day of the same month (11/27)?

    Here's the part of the OP's original post that I'm speaking of.

    Looking to write a simple query that will return the Birthday's, by month if I choose a Date Range in the month spcified
    So this code (below) because I chose January, will ignore the year and day.

    select cust_id, cust_fname, cust_lname, cust_dob from birthdays
    where cust_active = 1 and cust_dob between '1995-01-26 00:00:00.000' and '1995-01-27 00:00:00.000'

    So that query (should) will return all birthdays in the month of Jan, no matter the year or date.
    The obvious is the @MonthStart code above, I just dont know how to lay it out correctly.                                 

    Also, no one can execute your tests because they don't have your "Referral" table.

    I hope you never take it negatively.  Especially since I always learn from our discussions. 😉


    Absolutely not! Don't get me wrong, I like being right more than I like being wrong... but... In general, I try to work with the idea that I might be wrong. 
    That's not "lack of self confidence" thing, nor is it limited to SQL Server... It's been my experience that the biggest blunders are cause by people who, 1) "balls to bones", believe something that simply isn't so. 2) have egos that won't allow them to say, "I don't know", acknowledge any of their own short comings or admit to their mistakes. 3) have simply stopped making an effort to learn.
    The "I might be wrong" thing is just me actively doing my best not to be one of those people...
    So... While my ego enjoys being right, I know that being proven wrong is what will make be better tomorrow than I was yesterday. 

    A couple of quick questions.
    1.  What would you use to maintain the separate table?
    2.  What are the "negative consequences" of having a persisted computed column, in this particular case?
    3.  Also, the original problem was to find everyone with a given birth-month.  Or, at least that is my impression.  I'm probably missing some logic but why are you looking only for people that have the same day of the same month (11/27)?


    a1. I'd be inclined to either run the run it out to the current month and use one of the end of month jobs to add to it over time... or... Given it's size, I might even be open to the idea of running 30 years in advance and let it be a surprise to the guy or gal that has by job 30 years from now.
    a2. I knew when I wrote it I'd get called on it... Now that I am called on it, I probably should have used different phrasing... Yea, it's a 3NF violation (the transitive dependency thing) but I was actually thinking about the problems that come of using scalar UDFs when I wrote it... Clearly not applicable here. That said, I would still prefer to avoid the NF violation...
    a3. I didn't actually go back the start of the thread. I got drawn into the back & forth between you and Scott and that prompted me to knock out a quick & dirty test just to satisfy my own curiosity. It wasn't necessarily intended to be an answer to the OP's actual question or satisfy his specific requirements.
    That said, I did use "b_day" as the left key column due to it's higher selectivity... Looking at it today, swapping the the b_month into the left position would have allowed both "month-day" searches and "moth only" searches... Chalk it up to "midnight tunnel vision"...

    Looking to write a simple query that will return the Birthday's, by month if I choose a Date Range in the month spcified
    So this code (below) because I chose January, will ignore the year and day.

    select cust_id, cust_fname, cust_lname, cust_dob from birthdays
    where cust_active = 1 and cust_dob between '1995-01-26 00:00:00.000' and '1995-01-27 00:00:00.000'

    So that query (should) will return .So that query (should) will return all birthdays in the month of Jan, no matter the year or dateall birthdays in the month of Jan, no matter the year or date.
    The obvious is the @MonthStart code above, I just dont know how to lay it out correctly.                                  


    See "a3" above... My post was more about my interest in the back & forth between you & Scott than the the OP question...

    Also, no one can execute your tests because they don't have your "Referral" table.

    Pure laziness on my part... When I wrote it, I was just trying to satisfy my own curiosity with the fewest possible key strokes. I had no intention of actually getting into the conversation. The only reason I made the post was because I thought  the massive difference in IO compared minute difference in execution time was interesting.

  • As I stated in my comments, IF the queries are (almost) always for birthdays in a month / range of days, a clustered index on a base-dated dob should give the best performance.  And if upcoming birthdays are being viewed, presumably you'd want to see them in month-day order.  So the code is something like:

    where monthday_of_birth >= '19000101' and monthday_of_birth < '19000201'
    order by monthday_of_birth

    No extra indexes required, no sort required (again, given my stated condition that table usage was (almost) always for month/month-day range.

    If you need to limit to certain age(s), the year_of_birth would obviously be compared as well.

    Yes, there will be index fragmentation, but since:
    1) The huge overhead of additional covering indexes is not needed.  If not clus by date, the "tipping point" would be easy to reach (one would expect even 1 month, roughly 8.5% of rows, to easily reach it), forcing a table scans unless you created covering index(es).
    2) All tables have some fragmentation, even "sacred" identity-clustered tables, sometimes even large amounts of fragmentation.  Although, yes, naturally they are less prone to it.  Identity clustering definitely has its place, but it should not be default for any table.
    3) Since SELECTs are orders-of-magnitude more frequent than INSERTs, sometimes by 1M to 1, the overall I/Os should still be less.  Rebuilds are available, and in my case would be online anyway.  The I/O to rebuild periodically should be way less than the I/O to maintain extra indexes and sort the results.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Tuesday, November 28, 2017 11:02 AM

    As I stated in my comments, IF the queries are (almost) always for birthdays in a month / range of days, a clustered index on a base-dated dob should give the best performance.  And if upcoming birthdays are being viewed, presumably you'd want to see them in month-day order.  So the code is something like:

    where monthday_of_birth >= '19000101' and monthday_of_birth < '19000201'
    order by monthday_of_birth

    No extra indexes required, no sort required (again, given my stated condition that table usage was (almost) always for month/month-day range.

    If you need to limit to certain age(s), the year_of_birth would obviously be compared as well.

    Yes, there will be index fragmentation, but since:
    1) The huge overhead of additional covering indexes is not needed.  If not clus by date, the "tipping point" would be easy to reach (one would expect even 1 month, roughly 8.5% of rows, to easily reach it), forcing a table scans unless you created covering index(es).
    2) All tables have some fragmentation, even "sacred" identity-clustered tables, sometimes even large amounts of fragmentation.  Although, yes, naturally they are less prone to it.  Identity clustering definitely has its place, but it should not be default for any table.
    3) Since SELECTs are orders-of-magnitude more frequent than INSERTs, sometimes by 1M to 1, the overall I/Os should still be less.  Rebuilds are available, and in my case would be online anyway.  The I/O to rebuild periodically should be way less than the I/O to maintain extra indexes and sort the results.

    But why would you query the Customer table (almost) always for birthdays in a month/range of days?  Remember, the op stated the actual table is a Customer table.

  • Lynn Pettis - Tuesday, November 28, 2017 11:07 AM

    ScottPletcher - Tuesday, November 28, 2017 11:02 AM

    As I stated in my comments, IF the queries are (almost) always for birthdays in a month / range of days, a clustered index on a base-dated dob should give the best performance.  And if upcoming birthdays are being viewed, presumably you'd want to see them in month-day order.  So the code is something like:

    where monthday_of_birth >= '19000101' and monthday_of_birth < '19000201'
    order by monthday_of_birth

    No extra indexes required, no sort required (again, given my stated condition that table usage was (almost) always for month/month-day range.

    If you need to limit to certain age(s), the year_of_birth would obviously be compared as well.

    Yes, there will be index fragmentation, but since:
    1) The huge overhead of additional covering indexes is not needed.  If not clus by date, the "tipping point" would be easy to reach (one would expect even 1 month, roughly 8.5% of rows, to easily reach it), forcing a table scans unless you created covering index(es).
    2) All tables have some fragmentation, even "sacred" identity-clustered tables, sometimes even large amounts of fragmentation.  Although, yes, naturally they are less prone to it.  Identity clustering definitely has its place, but it should not be default for any table.
    3) Since SELECTs are orders-of-magnitude more frequent than INSERTs, sometimes by 1M to 1, the overall I/Os should still be less.  Rebuilds are available, and in my case would be online anyway.  The I/O to rebuild periodically should be way less than the I/O to maintain extra indexes and sort the results.

    But why would you query the Customer table (almost) always for birthdays in a month/range of days?  Remember, the op stated the actual table is a Customer table.

    Birthday party hosting company? Other special event seller that happens more often on bdays?  The real point is to cluster a table based on its actual needs and not on a predefined,  super-simplistic rule of thumb. There's no problem with a clus key longer than 4 bytes and/or with multiple columns if that's what the table actually needs.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I'd assume that "birthdays next month" is not only kind of queries which might be performed in the company.
    I'd expect also "birthdays next week", "birthdays on Chritmas holidays", etc.
    On top of it - it would be desirable if returned recordsets would be sorted like "neares birthdays first".

    Considering that I'd create "yearless dob" column, which would make it easy to fulfil any possible requirements of those kinds.

    _____________
    Code for TallyGenerator

  • All the speculation is good... I was hoping the OP would return with the definition of the Customer table so that we can demonstrate the performance of our speculations but seems to have flown the coop on this one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 8 posts - 31 through 37 (of 37 total)

You must be logged in to reply to this topic. Login to reply