July 25, 2010 at 1:38 pm
Hello - I have a report used to compare records looking for duplicates. The sql creates two temp tables: one is the "base" table and the other contains the "base" information plus the probable duplicate record, and this is the data appearing in the report. I am using substrings to compare the data. I'd like to make the "length" portion of the substring a parameter for the first name, last name and address fields. I'd also like the user to be able to put in any value for the length.
If I hard-code the substrings, all this works great. If I put a parameter into the length of the substring function and have a parameter definition of integer, it will not work. I'm looking for answers to the following questions (and I've looked at BOL and Googled, etc., and had no joy): Can I do what I'm looking to do? If so, then how do I set up the parameter so that the user can enter a value to be passed? Any good further reading/examples? Thank you very much for your time and any assistance! ๐
Below is a sample of the code, showing all the things I have going on with this which work (obviously, this is a representation):
create table
#dup_universe (
constituentid varchar(20)
,firstname varchar(50)
,lastname varchar(100)
,prefaddress varchar(150)
,prefzip varchar(12)
)
CREATE NONCLUSTERED INDEX [ix_dup_universe] ON #dup_universe
(
constituentid,
firstname,
lastname,
prefaddress,
prefzip
)
------------------------------------------------------------------------------------------
insert into #dup_universe
SELECT distinct
DATA
from database
CREATE TABLE -- need a table to measure the quality of dup detection
#dup_rating ( --columns from a and from b
dupid varchar(20)
,dupfirst varchar(50)
,duplast varchar(100)
,dupprefaddr varchar(150) NULL
,dupprefzip varchar(12) NULL --20
,matchid varchar(20) --dupid
,matchfirst varchar(50) --dupfirst
,matchlast varchar(100) --duplast
,matchaddress varchar(150) NULL --dupprefaddr
,matchprefzip varchar(12) NULL --dupprefzip --20
)
CREATE NONCLUSTERED INDEX [ix_dup_rating] ON #dup_rating
(
dupid,
dupfirst,
duplast,
dupprefaddr,
matchid,
matchfirst,
matchlast,
matchaddress
)
insert into #dup_rating
select distinct
a.constituentid dupid
,a.firstname dupfirst
,a.lastname duplast
,a.prefaddress dupprefaddr
,a.prefzip dupzip
,b.constituentid matchid
,b.firstname matchfirst
,b.lastname matchlast
,b.prefaddress dupprefaddr
,b.prefzip dupzip
from #dup_universe a
JOIN #dup_universe b
on substring(a.firstname,1,10) = substring(b.firstname,1,10) --@firstlen
and substring(a.lastname,1,10) = substring(b.lastname,1,10) --@lastlen
and substring(a.prefaddress,1,15) = substring(b.prefaddress,1,15) --@addrlen
and a.constituentid <> b.constituentid
where a.constituentid not in (select r.dupid from #dup_rating r)
order by a.lastname
SELECT * from #dup_rating
DROP TABLE #DUP_UNIVERSE
DROP TABLE #DUP_RATING
July 26, 2010 at 12:43 am
Y u r using Substr, if u want to compare firstname and lastname.. u can compare by removing leading and trailing spaces in the string... ltrim()... rtrim() can be used for the same...
For complete matching of the prefaddress... u can use like...
and substring(a.prefaddress,1,len(a.prefaddress)) = substring(b.prefaddress,1,len(b.prefaddress)) --@addrlen
Let me know if that doesn't complete ur requirement....
July 26, 2010 at 5:23 am
Hi, Sumit - that's not actually what I'm looking for - I want to use a substring because when we do the name or address comparison, I always want to start at the beginning and then allow the user to select how many characters from there she wants to match on.
UPDATE: I (of course) finally just set up the variables and then kicked off the code in Studio. With the substring lengths hard-coded, it takes 3-5 minutes to run. A long time, but not bad considering what it's doing. With the parameter variables, it was still running 2.5 hours later when I finally killed it. This is a huge decrease in performance. Any ideas as to why this is? Is it just not possible to do what I want to do with any expectation of reasonable performance?
Thank you for your input --
Donna
July 26, 2010 at 7:27 am
Donna
Try comparing rows on zipcode, surname, LEFT(forename,1) and the first word of the first address line. This works well in the UK because postcode plus the first word of the first address line, i.e. house number, is pretty much unique, not sure how it will work for US zipcodes. Speed is good if postcode and surname are indexed, and false positives are low.
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply