The Scientific Method: a call to action

  • The article certainly isn't about CHAR v.s. VARCHAR v.s. INT PKs. She wisely picked an example that's not only common and, sometimes, hotly contested, but one that's incredibly simple and easy to understand by people that might have no clue as to how to use the Scientific Method.

    Actually, indirectly it did.  While I'll also comment on the actual article, after Gail shows that integers are indeed faster than strings, which I don't doubt, she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.  The method outlined only deals with performance.  This is a question of data integrity.  You and I have discussed this in other forums before and we seem to be in general agreement.  As much as I respect Gail's technical knowledge, I have experienced too often that performance is put ahead of data integrity.  If we use the scientific method, all we have shown is that integers are faster than strings.  This says nothing about natural keys vs surrogate keys.

  • Jeff Moden wrote:

     It's kind of like having a hypothesis of "Why did the chicken cross the road" and then coming to the conclusion that all chickens cross the road.

    You just spilled my drink again Jeff

    😎

    Are the arriving as chickens or as nuggets on the other side of the road?  My point is that we do accept far too many data nuggets.

  • RonKyle wrote:

    The article certainly isn't about CHAR v.s. VARCHAR v.s. INT PKs. She wisely picked an example that's not only common and, sometimes, hotly contested, but one that's incredibly simple and easy to understand by people that might have no clue as to how to use the Scientific Method.

    Actually, indirectly it did.  While I'll also comment on the actual article, after Gail shows that integers are indeed faster than strings, which I don't doubt, she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.  The method outlined only deals with performance.  This is a question of data integrity.  You and I have discussed this in other forums before and we seem to be in general agreement.  As much as I respect Gail's technical knowledge, I have experienced too often that performance is put ahead of data integrity.  If we use the scientific method, all we have shown is that integers are faster than strings.  This says nothing about natural keys vs surrogate keys.

    I think the reason why this article was so popular here is that its fun to measure easily measured things. The eternal debate between natural keys versus surrogate keys is a harder topic which lacks agreement on how to even apply measurements. Many folks want to measure strictly based on performance and neglect the more subjective areas like suitability for purpose, maintainability of code, clarity of design, but that also mirrors SSC's focus, most folks here do not design or architect software solutions, they write queries, people spend thousands of hours shaving microseconds off of character splitters, etc whereas the relatively subjective areas less easily measured sort of turn into repetitive, spammy nonsense at best.

    While the original thread may seem to have been sidetracked by the "environmental / global warming" thing, in fact the "sidetracking" was just stark evidence illustrating the limitations of the scientific method, some of which also apply to more immediate questions like natural versus surrogate keys. You can only apply the scientific method if you agree on what you intend to measure.

     

  • The really hard part is getting people to agree that there is not always a natural key to be had and that those natural keys that seem adequate are frequently not because they don't meet some of the most basic rules for keys, such as immutability.  Other's insist that you can't have both natural and surrogate keys for the same entity while some actually claim that you need neither.  Because of such understandings/misunderstanding/personal feelings, there will continue to be disagreement and not much can be done to compel a change... until things break.  And I do absolutely agree that data integrity is absolutely paramount but performance does follow a very close second provided that the data integrity has been accomplished.

    As to spending time shaving milliseconds off of things like splitters goes, there are many places where milliseconds (and related Memory I/O) form an eternity in the face of thousands of calls per second.  It nothing else, the practice of spending time learning how to shave milliseconds is good practice for when it actually counts and also acts to reduce the damage caused when someone uses such things as splitters in an unintended and frequently inappropriate manner.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The really hard part is getting people to agree that there is not always a natural key to be had and that those natural keys that seem adequate are frequently not because they don't meet some of the most basic rules for keys, such as immutability.  Other's insist that you can't have both natural and surrogate keys for the same entity while some actually claim that you need neither.  Because of such understandings/misunderstanding/personal feelings, there will continue to be disagreement and not much can be done to compel a change... until things break.  And I do absolutely agree that data integrity is absolutely paramount but performance does follow a very close second provided that the data integrity has been accomplished.

    I know that.  As I said, you and I have discussed this on other forums and seem to be in general agreement.  In OLTP designs, use natural keys if there is a suitable key, otherwise use surrogates.  Data integrity is paramount.  Performance is a close second, but it is second.  Obviously with extra precautions surrogate keys can avoid this.  But it does take more effort.  And that extra effort is not factored into this experiment.  Not to mention times with the presence of a natural key will allow the join to not even have to be made, such as a state code.

    • This reply was modified 5 years ago by  RonKyle.
  • Is it just me or can you no longer get to the original article?  All links to it point to the DISCUSSION, not to the ARTICLE.  Sadly the article was much more meaningful than the sniping that went on a few years back.

    Of course I could have missed how to actually get back to the article itself.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I used this link and found the article with no issues: https://sqlinthewild.co.za/index.php/2011/02/15/are-int-joins-faster-than-string-joins-2/

  • RonKyle wrote:

    I used this link and found the article with no issues: https://sqlinthewild.co.za/index.php/2011/02/15/are-int-joins-faster-than-string-joins-2/%5B/quote%5D

    Thanks - although that's not actually this particular article.  Two new behaviors I didn't expect:

    1.  the first post in this thread used to be the link back to the original article and that seems to have disappeared.
    2.   As I recall - you could get to the article directly from the active threads view under the old format.

    That said - if you try to access the editorial from the homepage (and I assume form the editorials page) you DO get to the original article, which is this:

    https://www.sqlservercentral.com/editorials/the-scientific-method-a-call-to-action

     

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Hey, that approach worked brilliantly on the Chevrolet Corvair though, didn't it ...  um.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • meilenb wrote:

    yet there are still thousands of scientists who argue climate change is not human caused, that it is the natural cycle of the earth. Ice sheets are breaking off, sea level rises, on and on...

    No there aren't.  Find a single reputable, genuine, scientist working in the field, or any associated discipline, who spouts that nonsense.  There is no dissent within the scientific community on climate crisis.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • RonKyle wrote:

    she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.

    And it's not a conclusion I drew. The only conclusion drawn was that joins on integers are faster than joins on strings. The blog post states at the end:

    "While this may not the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/datawarehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys and decide based on the result of those tests."

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail Shaw wrote:

    RonKyle wrote:

    she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.

    And it's not a conclusion I drew. The only conclusion drawn was that joins on integers are faster than joins on strings. The blog post states at the end:

    "While this may not the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/datawarehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys and decide based on the result of those tests."

    C'mon Gail.  You shouldn't have talked about natural keys at all.  It's your paragraph's topic sentence.  You have to at least admit it was a reasonable inference to assume you thought this had some bearing on the statement.  It does not.  As for data warehouse (OLAP) systems, they should always use surrogate keys as a matter of design.  The only exception I make in this regard is for flush-and-fill ETL systems.  Incremental ETLs should avoid this.  The only support I give to natural keys is in OLTP designs.

  • RonKyle wrote:

    Gail Shaw wrote:

    RonKyle wrote:

    she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.

    And it's not a conclusion I drew. The only conclusion drawn was that joins on integers are faster than joins on strings. The blog post states at the end:

    "While this may not the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/datawarehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys and decide based on the result of those tests."

    C'mon Gail.  You shouldn't have talked about natural keys at all.  It's your paragraph's topic sentence.  You have to at least admit it was a reasonable inference to assume you thought this had some bearing on the statement.  It does not.  As for data warehouse (OLAP) systems, they should always use surrogate keys as a matter of design.  The only exception I make in this regard is for flush-and-fill ETL systems.  Incremental ETLs should avoid this.  The only support I give to natural keys is in OLTP designs.

     

    Why the hell not?  It IS STRONGLY related and she did NOT say to avoid the use of natural keys but to take them into consideration.  Based on your original misread of her conclusion, your suggestion seems to be a bit on the side of sour grapes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    RonKyle wrote:

    Gail Shaw wrote:

    RonKyle wrote:

    she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.

    And it's not a conclusion I drew. The only conclusion drawn was that joins on integers are faster than joins on strings. The blog post states at the end:

    "While this may not the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/datawarehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys and decide based on the result of those tests."

    C'mon Gail.  You shouldn't have talked about natural keys at all.  It's your paragraph's topic sentence.  You have to at least admit it was a reasonable inference to assume you thought this had some bearing on the statement.  It does not.  As for data warehouse (OLAP) systems, they should always use surrogate keys as a matter of design.  The only exception I make in this regard is for flush-and-fill ETL systems.  Incremental ETLs should avoid this.  The only support I give to natural keys is in OLTP designs.

    Why the hell not?  It IS STRONGLY related and she did NOT say to avoid the use of natural keys but to take them into consideration.  Based on your original misread of her conclusion, your suggestion seems to be a bit on the side of sour grapes.

    Not at all.  I'm using language.  The hypothesis simply deals with which join is faster.  The conclusion should have stayed in that area.  Had it not been referenced in the topic sentence, it might have been a different matter.  Not sure the response required a "strong" word to make your point.

  • RonKyle wrote:

    Jeff Moden wrote:

    RonKyle wrote:

    Gail Shaw wrote:

    RonKyle wrote:

    she then concludes that these results should all but put the nail in the coffin on integers vs naturals keys.  This is NOT a valid conclusion to draw.

    And it's not a conclusion I drew. The only conclusion drawn was that joins on integers are faster than joins on strings. The blog post states at the end:

    "While this may not the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/datawarehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys and decide based on the result of those tests."

    C'mon Gail.  You shouldn't have talked about natural keys at all.  It's your paragraph's topic sentence.  You have to at least admit it was a reasonable inference to assume you thought this had some bearing on the statement.  It does not.  As for data warehouse (OLAP) systems, they should always use surrogate keys as a matter of design.  The only exception I make in this regard is for flush-and-fill ETL systems.  Incremental ETLs should avoid this.  The only support I give to natural keys is in OLTP designs.

    Why the hell not?  It IS STRONGLY related and she did NOT say to avoid the use of natural keys but to take them into consideration.  Based on your original misread of her conclusion, your suggestion seems to be a bit on the side of sour grapes.

    Not at all.  I'm using language.  The hypothesis simply deals with which join is faster.  The conclusion should have stayed in that area.  Had it not been referenced in the topic sentence, it might have been a different matter.  Not sure the response required a "strong" word to make your point.

    I just think that Gail and Jeff want us to overlook the text indicating that they MAY have purchased a coffin that they're hammering nails into.

    The threadizens predictable attack on perfectly reasonably offered counterpoints just makes this entire thread junk for me personally, I just feel that the "big hitters" here are somehow pathologically terrified at being incorrect in even the slightest degree, and this for me is turning SSC into the LAST place to go when discussing anything that gets the least bit complex and subjective. At least I got a giggle on Jeff's traumatic reaction, maybe I'll send him a "get well" card, maybe I'll send him a fake oscar statue, I dunno, whadya think? I have included a poll, feel free to select 3, my therapist did!

    I should:

    1: Send Jeff a "get well" card

    2: Send Jeff a fake oscar.

    3: Delete my account and get a hobby I would actually enjoy.

     

Viewing 15 posts - 136 through 150 (of 168 total)

You must be logged in to reply to this topic. Login to reply