Join using Charindex -- there's got to be a better way?

Question

Post reply

Join using Charindex -- there's got to be a better way?

peterzeke

SSCrazy Eights

Points: 9003
More actions
October 19, 2011 at 10:20 am

#245508

I'm working with healthcare data, specifically with drug data. I've been supplied a reference table that lists common drug names and their drug group.
For example:
DrugName DrugGroup
======================
Zoloft Antidepressant
Citalopram Antidepressant
Paxil Antidepressant
The drugname in the healthcare claims data, however, is often a variation of descriptions.
So, identifying "Citalopram" may require digging through various drugnames in the claims data:
"CITALOPRAM TABLETS"
"Tab, Citalopram"
"Oral Solution Citalopram HBR"
In order to identify the DrugGroup, I've ended up having to do a join as follows:
SELECTC.DrugName
,DrugGroup = Isnull(R.DrugGroup,'UNK')
FromDrugClaims C
left Join rft_DrugXrefDrugClass R
ON charindex(R.DrugName,C.DrugName)>0
Awful, isn't it?
It gets the job done, but takes several minutes. I'd be interested in hearing suggestions for better methods.
I've pondered parsing the drugname field in the claims data into separate rows, and then doing an inner join between the tables on the actual parsed/extracted string value.
For example:
if ClaimID 123 has a drugname ="Oral Solution Citalopram HBR" , I could parse the info as
ClaimID, DrugNameParsed
123, "Oral"
123, "Solution"
123, "Citalopram"
123, "HBR"
...then an inner join to the reference table would be DrugNameParsed = R.DrugName
FYI -- leveraging a NDC reference table isn't an option at the moment. And too bad that the US federal government stopped supplying therapeutic drug classes in their drug downloads.
Thanks in advance,
Pete

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 1

Other than full text search you won't find much better.

drew.allen SSC Guru Points: 76995 More actions · Answer 2

I would use LIKE instead. You'll probably still have to do an Index Scan (assuming that column is indexed on your claim table), but that will probably save you time over the Table Scan that you're currently doing.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 3

Scan is a scan here. The index choice won't change and the '%whatever%' won't allow for a seek in any way.

As I said before the only possible way out is FTI.

Jack Corbett SSC Guru Points: 184394 More actions · Answer 4

I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 5

Jack Corbett (10/19/2011)
...Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE.... Jeff's delimited split would be really fast for that.

Jack -- you read my mind -- Indeed, I was considering Jeff Moden's split function for the task.

Cadavre SSC-Forever Points: 41690 More actions · Answer 6

No promises that this would be any better, but could you try something like this?

SELECT C.DrugName, DrugGroup = Isnull(R.DrugGroup, 'UNK')

FROM DrugClaims C

LEFT JOIN (SELECT Substring(DrugName, start, length - start - fromEnd) AS DrugName, --This is what we want to join on

DrugName AS OriginalDrugName, DrugGroup

FROM (SELECT MAX(PATINDEX(good, DrugName)) AS start,

MAX(PATINDEX(reverseGood, reverse(DrugName + ' '))) - 1 AS fromEnd,

LEN(DrugName + ' ') AS [length],DrugName,DrugGroup

FROM (--This would be a look-up table containing two columns, drug group and the reversed drug group

VALUES ('%Citalopram%','%marpolatiC%'),('%Zoloft%','%tfoloZ%'),('%Paxil%','%lixaP%')) AS f(good, reverseGood)

CROSS JOIN (--This would be your table of drugs and drug groups to be cleaned

VALUES ('Zoloft','Antidepressant'),('Citalopram','Antidepressant'),

('CitalopramRandomOtherWord','Antidepressant'),('Paxil','Antidepressant')) AS g(DrugName, DrugGroup)

GROUP BY DrugName,DrugGroup

HAVING MAX(PATINDEX(good, reverseGood)) > 0) workQuery) R ON R.DrugName = C.DrugName

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Phil Parkin SSC Guru Points: 246971 More actions · Answer 7

Surely another possibility is building a mapping table, if this is likely to be a repeated process?

Something like this:

select 'CITALOPRAM TABLETS' Pseudonym,

'Citalopram' GenericName

union

select 'Tab, Citalopram',

'Citalopram'

union

select 'Oral Solution Citalopram HBR',

'Citalopram'

where 'Pseudonym' is the PK.

Then joining on that to get to the Generic name (apologies - pharmaceutical nomenclature is not my strong point)

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 8

Jack Corbett (10/19/2011)
I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.

My guess here is that you have a ray of hope to get a seek going with FTI, which you can't have with any functions.

I agree there are 1000 ways to go about this one. I just don't have a ready made go fast switch for this one :-).

Grant Fritchey SSC Guru Points: 398898 More actions · Answer 9

What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Phil Parkin SSC Guru Points: 246971 More actions · Answer 10

Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.

That's pretty much what I suggested a couple of posts back from yours ...

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

Grant Fritchey SSC Guru Points: 398898 More actions · Answer 11

Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.
That's pretty much what I suggested a couple of posts back from yours ...

Oops. Sorry, missed that one. Great idea!

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Jack Corbett SSC Guru Points: 184394 More actions · Answer 12

Grant Fritchey (10/20/2011)
Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.
That's pretty much what I suggested a couple of posts back from yours ...
Oops. Sorry, missed that one. Great idea!

Yeah, if this is a recurring process a mapping table is probably the best way to go although it will likely take some manual intervention on a regular basis as you get misspellings and typos.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 13

Plz explain to me what FTS does!

I know, Potato, Pôtato :hehe:

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 14

Grant Fritchey (10/20/2011)
Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.
That's pretty much what I suggested a couple of posts back from yours ...
Oops. Sorry, missed that one. Great idea!

Makes sense to create a mapping table that can be leveraged for performance when doing reporting, but... doesn't the act of building such a mapping/crosswalk table return us back to my initial inquiry (query) of how to join the tables to get the correct mapping in the first place without performance totally dragging even for adding to the mapping table? At the very least, limiting the rows to evaluate being based on the latest insert_date would trim down the pool of data to be considered for appends to the mapping table, but the join itself would still be "flimsy" if charindex() or LIKE% is employed.

For example, if the mapping table has three variations for "Citalopram", and a nightly run identifies a fourth variation of this drug, what technique/join is employed to identify this new drug description in the first place? It's simple enough to use an outer join to identify the new drug names, but somehow those new drug names need to map back to a key/primary drug description.

Lastly, before I forget to say it, thanks to all of your input and participation on this topic.