October 19, 2011 at 10:20 am
I'm working with healthcare data, specifically with drug data. I've been supplied a reference table that lists common drug names and their drug group.
For example:
DrugName DrugGroup
======================
Zoloft Antidepressant
Citalopram Antidepressant
Paxil Antidepressant
The drugname in the healthcare claims data, however, is often a variation of descriptions.
So, identifying "Citalopram" may require digging through various drugnames in the claims data:
"CITALOPRAM TABLETS"
"Tab, Citalopram"
"Oral Solution Citalopram HBR"
In order to identify the DrugGroup, I've ended up having to do a join as follows:
SELECTC.DrugName
,DrugGroup = Isnull(R.DrugGroup,'UNK')
FromDrugClaims C
left Join rft_DrugXrefDrugClass R
ON charindex(R.DrugName,C.DrugName)>0
Awful, isn't it?
It gets the job done, but takes several minutes. I'd be interested in hearing suggestions for better methods.
I've pondered parsing the drugname field in the claims data into separate rows, and then doing an inner join between the tables on the actual parsed/extracted string value.
For example:
if ClaimID 123 has a drugname ="Oral Solution Citalopram HBR" , I could parse the info as
ClaimID, DrugNameParsed
123, "Oral"
123, "Solution"
123, "Citalopram"
123, "HBR"
...then an inner join to the reference table would be DrugNameParsed = R.DrugName
FYI -- leveraging a NDC reference table isn't an option at the moment. And too bad that the US federal government stopped supplying therapeutic drug classes in their drug downloads.
Thanks in advance,
Pete
October 19, 2011 at 10:22 am
Other than full text search you won't find much better.
October 19, 2011 at 11:46 am
I would use LIKE instead. You'll probably still have to do an Index Scan (assuming that column is indexed on your claim table), but that will probably save you time over the Table Scan that you're currently doing.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 19, 2011 at 12:01 pm
Scan is a scan here. The index choice won't change and the '%whatever%' won't allow for a seek in any way.
As I said before the only possible way out is FTI.
October 19, 2011 at 12:11 pm
I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
October 19, 2011 at 12:32 pm
Jack Corbett (10/19/2011)
...Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE.... Jeff's delimited split would be really fast for that.
Jack -- you read my mind -- Indeed, I was considering Jeff Moden's split function for the task.
October 20, 2011 at 2:58 am
No promises that this would be any better, but could you try something like this?
SELECT C.DrugName, DrugGroup = Isnull(R.DrugGroup, 'UNK')
FROM DrugClaims C
LEFT JOIN (SELECT Substring(DrugName, start, length - start - fromEnd) AS DrugName, --This is what we want to join on
DrugName AS OriginalDrugName, DrugGroup
FROM (SELECT MAX(PATINDEX(good, DrugName)) AS start,
MAX(PATINDEX(reverseGood, reverse(DrugName + ' '))) - 1 AS fromEnd,
LEN(DrugName + ' ') AS [length],DrugName,DrugGroup
FROM (--This would be a look-up table containing two columns, drug group and the reversed drug group
VALUES ('%Citalopram%','%marpolatiC%'),('%Zoloft%','%tfoloZ%'),('%Paxil%','%lixaP%')) AS f(good, reverseGood)
CROSS JOIN (--This would be your table of drugs and drug groups to be cleaned
VALUES ('Zoloft','Antidepressant'),('Citalopram','Antidepressant'),
('CitalopramRandomOtherWord','Antidepressant'),('Paxil','Antidepressant')) AS g(DrugName, DrugGroup)
GROUP BY DrugName,DrugGroup
HAVING MAX(PATINDEX(good, reverseGood)) > 0) workQuery) R ON R.DrugName = C.DrugName
October 20, 2011 at 3:24 am
Surely another possibility is building a mapping table, if this is likely to be a repeated process?
Something like this:
select 'CITALOPRAM TABLETS' Pseudonym,
'Citalopram' GenericName
union
select 'Tab, Citalopram',
'Citalopram'
union
select 'Oral Solution Citalopram HBR',
'Citalopram'
where 'Pseudonym' is the PK.
Then joining on that to get to the Generic name (apologies - pharmaceutical nomenclature is not my strong point)
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
October 20, 2011 at 5:06 am
Jack Corbett (10/19/2011)
I have to disagree Remi. Full Text would probably work, but doing the parse Peter mentioned at the end of his original post would probably be faster than the CHARINDEX or LIKE. You could do a Split on the SPACE character. You might have to do a REPLACE to get rid of the punctuation. Jeff's delimited split would be really fast for that.
My guess here is that you have a ray of hope to get a seek going with FTI, which you can't have with any functions.
I agree there are 1000 ways to go about this one. I just don't have a ready made go fast switch for this one :-).
October 20, 2011 at 5:17 am
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 20, 2011 at 5:21 am
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.
That's pretty much what I suggested a couple of posts back from yours ...
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
October 20, 2011 at 5:57 am
Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.That's pretty much what I suggested a couple of posts back from yours ...
Oops. Sorry, missed that one. Great idea!
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 20, 2011 at 6:17 am
Grant Fritchey (10/20/2011)
Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.That's pretty much what I suggested a couple of posts back from yours ...
Oops. Sorry, missed that one. Great idea!
Yeah, if this is a recurring process a mapping table is probably the best way to go although it will likely take some manual intervention on a regular basis as you get misspellings and typos.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
October 20, 2011 at 6:20 am
Plz explain to me what FTS does!
I know, Potato, Pôtato :hehe:
October 20, 2011 at 8:06 am
Grant Fritchey (10/20/2011)
Phil Parkin (10/20/2011)
Grant Fritchey (10/20/2011)
What about building a drug name mapping table. You run through the data once, nightly, after a load, something, and map the names supplied to the drugs available into a new table, then use joins to pull the data back out. You'll be able to index it and get good performance without hopping through hoops.That's pretty much what I suggested a couple of posts back from yours ...
Oops. Sorry, missed that one. Great idea!
Makes sense to create a mapping table that can be leveraged for performance when doing reporting, but... doesn't the act of building such a mapping/crosswalk table return us back to my initial inquiry (query) of how to join the tables to get the correct mapping in the first place without performance totally dragging even for adding to the mapping table? At the very least, limiting the rows to evaluate being based on the latest insert_date would trim down the pool of data to be considered for appends to the mapping table, but the join itself would still be "flimsy" if charindex() or LIKE% is employed.
For example, if the mapping table has three variations for "Citalopram", and a nightly run identifies a fourth variation of this drug, what technique/join is employed to identify this new drug description in the first place? It's simple enough to use an outer join to identify the new drug names, but somehow those new drug names need to map back to a key/primary drug description.
Lastly, before I forget to say it, thanks to all of your input and participation on this topic.
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply