Using "or" in a join? Whatsupwiththat?!

  • My colleagued just showed me some sql code inherited from a previous developer that contains an "or" in the join -- I've simplified the actual code to illustrate the core of the query:

    SELECT

    C.CLAIM_ID

    ,C.SERVICEDATE

    ,C.DIAG_01

    ,C.DIAG_02

    ,C.DIAG_03

    ,C.DIAG_04

    FROM CLAIMS C

    JOIN RFT_TARGET_DIAGNOSES T

    ON T.DIAG_CODE = C.DIAG_01

    OR T.DIAG_CODE = C.DIAG_02

    OR T.DIAG_CODE = C.DIAG_03

    OR T.DIAG_CODE = C.DIAG_04

    Oddly enough, I've never seen this method of joining tables using "or" -- I think this is referred to as a "conditional join" based on some brief searches in the SSC forums. (I suppose using "or" is somewhat akin to an outer join)

    A preview of the expected execution plan indicates that an index seek will be used; don't find any cluster/table scans. So, the code appears to be OK... but is it? I've not come across articles hailing the benefits of using "or" in a join.

    I did come across one post http://www.sqlservercentral.com/Forums/Topic512291-338-1.aspx?Highlight=conditional+join where Lynn Pettis actually posted a suggested solution that employs an "or" in the join.

    Can anyone point me to an article or detailed discussion thread?

    Thanks in advance,

    Pete

  • Nothing to discuss here.

    Join is not automatically a single = operation. What happens if you have a multiple keys PK?

    also that query could be rewritten as such =>

    JOIN RFT_TARGET_DIAGNOSES T

    ON T.DIAG_CODE IN (C.DIAG_01, C.DIAG_02, C.DIAG_03, C.DIAG_04)

    It makes perfect sense, it's just that the table design is "wrong" and not normalized.

  • It's probably being used to accommodate a poor level of data normalization (should be rows, not columns), but it's not an uncommon thing to do.

    It is, indeed, a "conditional join". There are other variations (like joining to different tables depending on data values), but this is a common enough one.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (11/14/2011)


    It's probably being used to accommodate a poor level of data normalization (should be rows, not columns), but it's not an uncommon thing to do.

    It is, indeed, a "conditional join". There are other variations (like joining to different tables depending on data values), but this is a common enough one.

    Imagine this (even if this exemple makes no real sens in life).

    SELECT * FROM Sales S INNER JOIN dbo.Calendar C ON S.Date BETWEEN C.BeginMonthDt AND C.EndMonthDt

    AND C.Y= 2011 AND C.M = 11

  • Ninja's_RGR'us (11/14/2011)


    that query could be rewritten as such =>

    JOIN RFT_TARGET_DIAGNOSES T

    ON T.DIAG_CODE IN (C.DIAG_01, C.DIAG_02, C.DIAG_03, C.DIAG_04)

    Huh... don't recall ever seeing an "in" claused used in a join quite like that either -- croikey, what rock have I been under? (At least I can say I've seen Case statements used in conditional joins...)

    So, if an execution plan indicates that indexes are being leveraged (seeks), are conditional joins OK to use assuming the query is essentially set-based vs. RBAR (row by row)?

  • Ninja's_RGR'us (11/14/2011)


    GSquared (11/14/2011)


    It's probably being used to accommodate a poor level of data normalization (should be rows, not columns), but it's not an uncommon thing to do.

    It is, indeed, a "conditional join". There are other variations (like joining to different tables depending on data values), but this is a common enough one.

    Imagine this (even if this exemple makes no real sens in life).

    SELECT * FROM Sales S INNER JOIN dbo.Calendar C ON S.Date BETWEEN C.BeginMonthDt AND C.EndMonthDt

    AND C.Y= 2011 AND C.M = 11

    Ok, I can see how this type of range query is also a type of conditional join.

  • peterzeke (11/14/2011)


    Ninja's_RGR'us (11/14/2011)


    that query could be rewritten as such =>

    JOIN RFT_TARGET_DIAGNOSES T

    ON T.DIAG_CODE IN (C.DIAG_01, C.DIAG_02, C.DIAG_03, C.DIAG_04)

    Huh... don't recall ever seeing an "in" claused used in a join quite like that either -- croikey, what rock have I been under? (At least I can say I've seen Case statements used in conditional joins...)

    So, if an execution plan indicates that indexes are being leveraged (seeks), are conditional joins OK to use assuming the query is essentially set-based vs. RBAR (row by row)?

    Performance code touches each row as little times as possible. RBAR is usually a form of loop.

    That code is just denormalized, not RBAR.

  • An "IN ()" statement is just a shorthand way of typing multiple OR statements, so it works the same way.

    Conditional joins like that are fine if they are what's needed to get the right data. They aren't RBAR.

    Seeks vs scans is a different issue. There are times when scans are better, there are (more) times when seeks are better. Same rules apply to those regardless of how the join is written.

    If there's any problem here, it's violation of normal forms. Shouldn't have four columns for that data, in most cases. Should have rows for it. But that's not a problem with the join, it's a problem with the table design. If it's a problem at all. (Can't be sure on that point without know a lot more about both the business and the database.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Ninja's_RGR'us and GSquared -- thanks for the dialogue and examples. I have a better sense of the various forms of conditional joins -- the "or" still looks weird even though permissible.

    As for your concerns about the lack of normalization of the data, while not ideal, it's not my primary concern, presently. Understanding a former employee's legacy code is the priority.

    By the way, if you've ever worked with healthcare billing data, the source data seems to be supplied always as flat as possible (i.e., 20 diagnosis fields on a single row is typical, plus if multiple services were performed, a separate row for each service is created... along with the same 20 diagnoses repeated on each service row. Yuck) We eventually massage the data into a relational form, and then end up in a dimensional data warehouse for reporting.

    Thanks again for your assistance.

    --pete

  • I totally understand prioritizing that way.

    And I used to deal with financial services data going from mainframe flat-files into relational stores, then into analysis cubes, then back to flat files. So, yeah, I feel your pain on that one!

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply