Join using Charindex -- there's got to be a better way?

  • I'm working with healthcare data, specifically with drug data. I've been supplied a reference table that lists common drug names and their drug group.

    For example:

    DrugName DrugGroup

    ======================

    Zoloft Antidepressant

    Citalopram Antidepressant

    Paxil Antidepressant

    The drugname in the healthcare claims data, however, is often a variation of descriptions.

    So, identifying "Citalopram" may require digging through various drugnames in the claims data:

    "CITALOPRAM TABLETS"

    "Tab, Citalopram"

    "Oral Solution Citalopram HBR"

    In order to identify the DrugGroup, I've ended up having to do a join as follows:

    SELECTC.DrugName

    ,DrugGroup = Isnull(R.DrugGroup,'UNK')

    FromDrugClaims C

    left Join rft_DrugXrefDrugClass R

    ON charindex(R.DrugName,C.DrugName)>0

    Awful, isn't it?

    It gets the job done, but takes several minutes. I'd be interested in hearing suggestions for better methods.

    I've pondered parsing the drugname field in the claims data into separate rows, and then doing an inner join between the tables on the actual parsed/extracted string value.

    For example:

    if ClaimID 123 has a drugname ="Oral Solution Citalopram HBR" , I could parse the info as

    ClaimID, DrugNameParsed

    123, "Oral"

    123, "Solution"

    123, "Citalopram"

    123, "HBR"

    ...then an inner join to the reference table would be DrugNameParsed = R.DrugName

    FYI -- leveraging a NDC reference table isn't an option at the moment. And too bad that the US federal government stopped supplying therapeutic drug classes in their drug downloads.

    Thanks in advance,

    Pete

  • Other than full text search you won't find much better.

  • I would use LIKE instead. You'll probably still have to do an Index Scan (assuming that column is indexed on your claim table), but that will probably save you time over the Table Scan that you're currently doing.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Scan is a scan here. The index choice won't change and the '%whatever%' won't allow for a seek in any way.

    As I said before the only possible way out is FTI.

  • I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.

  • Jack Corbett (10/19/2011)


    ...Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE.... Jeff's delimited split would be really fast for that.

    Jack -- you read my mind -- Indeed, I was considering Jeff Moden's split function for the task.

  • No promises that this would be any better, but could you try something like this?

    SELECT C.DrugName, DrugGroup = Isnull(R.DrugGroup, 'UNK')

    FROM DrugClaims C

    LEFT JOIN (SELECT Substring(DrugName, start, length - start - fromEnd) AS DrugName, --This is what we want to join on

    DrugName AS OriginalDrugName, DrugGroup

    FROM (SELECT MAX(PATINDEX(good, DrugName)) AS start,

    MAX(PATINDEX(reverseGood, reverse(DrugName + ' '))) - 1 AS fromEnd,

    LEN(DrugName + ' ') AS [length],DrugName,DrugGroup

    FROM (--This would be a look-up table containing two columns, drug group and the reversed drug group

    VALUES ('%Citalopram%','%marpolatiC%'),('%Zoloft%','%tfoloZ%'),('%Paxil%','%lixaP%')) AS f(good, reverseGood)

    CROSS JOIN (--This would be your table of drugs and drug groups to be cleaned

    VALUES ('Zoloft','Antidepressant'),('Citalopram','Antidepressant'),

    ('CitalopramRandomOtherWord','Antidepressant'),('Paxil','Antidepressant')) AS g(DrugName, DrugGroup)

    GROUP BY DrugName,DrugGroup

    HAVING MAX(PATINDEX(good, reverseGood)) > 0) workQuery) R ON R.DrugName = C.DrugName


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Surely another possibility is building a mapping table, if this is likely to be a repeated process?

    Something like this:

    select 'CITALOPRAM TABLETS' Pseudonym,

    'Citalopram' GenericName

    union

    select 'Tab, Citalopram',

    'Citalopram'

    union

    select 'Oral Solution Citalopram HBR',

    'Citalopram'

    where 'Pseudonym' is the PK.

    Then joining on that to get to the Generic name (apologies - pharmaceutical nomenclature is not my strong point)

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Jack Corbett (10/19/2011)


    I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.

    My guess here is that you have a ray of hope to get a seek going with FTI, which you can't have with any functions.

    I agree there are 1000 ways to go about this one. I just don't have a ready made go fast switch for this one :-).

  • What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (10/20/2011)


    What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

    That's pretty much what I suggested a couple of posts back from yours ...

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin (10/20/2011)


    Grant Fritchey (10/20/2011)


    What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

    That's pretty much what I suggested a couple of posts back from yours ...

    Oops. Sorry, missed that one. Great idea!

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (10/20/2011)


    Phil Parkin (10/20/2011)


    Grant Fritchey (10/20/2011)


    What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

    That's pretty much what I suggested a couple of posts back from yours ...

    Oops. Sorry, missed that one. Great idea!

    Yeah, if this is a recurring process a mapping table is probably the best way to go although it will likely take some manual intervention on a regular basis as you get misspellings and typos.

  • Plz explain to me what FTS does!

    I know, Potato, Pôtato :hehe:

  • Grant Fritchey (10/20/2011)


    Phil Parkin (10/20/2011)


    Grant Fritchey (10/20/2011)


    What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

    That's pretty much what I suggested a couple of posts back from yours ...

    Oops. Sorry, missed that one. Great idea!

    Makes sense to create a mapping table that can be leveraged for performance when doing reporting, but... doesn't the act of building such a mapping/crosswalk table return us back to my initial inquiry (query) of how to join the tables to get the correct mapping in the first place without performance totally dragging even for adding to the mapping table? At the very least, limiting the rows to evaluate being based on the latest insert_date would trim down the pool of data to be considered for appends to the mapping table, but the join itself would still be "flimsy" if charindex() or LIKE% is employed.

    For example, if the mapping table has three variations for "Citalopram", and a nightly run identifies a fourth variation of this drug, what technique/join is employed to identify this new drug description in the first place? It's simple enough to use an outer join to identify the new drug names, but somehow those new drug names need to map back to a key/primary drug description.

    Lastly, before I forget to say it, thanks to all of your input and participation on this topic.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply