July 20, 2011 at 3:53 pm
I have a bunch of duplicates within my system and I know there is an easy way to do what I need. I have a record with the name of "cat 414". I want to search a column and rank the results. I do not want to specifically look for Cat or cat 414 I want to base the results on how many of the character are used. The more similar the characters the higher the rank. Kind of a fuzzy search algorithm.
thanks for the help
July 20, 2011 at 4:07 pm
Sounds like an interesting problem, but we need a little more help before we can offer a solution.
Could you give us a little more information?
What would really help is an example showing several rows with different values in that column and how you want to see them ranked based on those values.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 21, 2011 at 9:51 am
I understand that I needed some detail. In the real process I will have an outer control loop that will feed in values like "BCD 414". I want to find the best match. This is how my feeble mind would process something like this. This will only be run once on a non production server so performance is really not a big issue. I am curious if there is a better way to do this and if there something like this that is more comprehensive that I can use in the future.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[GetMatchingData]') AND type in (N'U'))
DROP TABLE dbo.GetMatchingData
DROP TABLE dbo.TheRank
GO
Create Table GetMatchingData (
Id int identity(1,1),
TheName varchar(10))
CREATE TABLE TheRank (id int, TheName varchar(10), TheRank int)
INSERT INTO GetMatchingData (TheName)
SELECT 'bcd 414' UNION ALL
SELECT 'bcd 414' UNION ALL
SELECT 'ABDE' UNION ALL
SELECT 'Blah' UNION ALL
SELECT 'xyz414' UNION ALL
SELECT 'abcd 414' UNION ALL
SELECT 'test' UNION ALL
SELECT 'test 414' UNION ALL
SELECT 'bc44' UNION ALL
SELECT 'test1' UNION ALL
SELECT 'test2' UNION ALL
SELECT 'test4' UNION ALL
SELECT 'BCD414'
DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000),@TrimmedValue varchar(1)
SELECT @TheValue = 'bcd 414', @TheMidMin = 0
WHILE @TheMidMin<=10
BEGIN
SET @TheMidMin = @TheMidMin+1
SET @TrimmedValue = SUBSTRING(@TheValue,@TheMidMin, 1)
IF @TheMidMin = 1
BEGIN
SET @sql = '
INSERT INTO TheRank
SELECT ID, TheName,10 FROM GetMatchingData WHERE TheName LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''
EXECUTE (@SQL)
END
SET @sql = '
UPDATE TheRank
SET TheRank = TheRank-1
WHERE TheName NOT LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''
EXECUTE (@SQL)
END
Select * FROM TheRank
July 21, 2011 at 11:04 am
I don't really know of anything more "comprehensive". Ranking is based on a criteria of some sort which must be defined. In your example for instance, row 6 is given a rank of 10 even though it isn't an exact match for your string (although it contains an exact match). You are the only person who can define these details.
I will bow out now, and maybe someone else will have some thoughts to add.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 21, 2011 at 11:34 am
You are correct and the problem I am working with is around 3K in records. So I am hoping for more of a robust solution besides the one I put together. I wish I could search this but I just keep coming up with gibberish
July 21, 2011 at 1:00 pm
I am about to go out of town for several days, so this will be my last post, but you seem to be on the right track. The three things that could affect rank are (1) presence of matching characters, (2) absence of matching characters, (3) sequence of matching characters. In your situation, just generate a LOT of test combinations (including every edge case I could think of), rank them and then study the situations where a row ranked higher or lower than you thought it should. Gradually clarify your scoring rules. If you get away from integers you might subtract half a point for extraneous character(s) at the front or back, so that row 6 gets a 9.5, instead of a 10.
Good luck.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 21, 2011 at 3:33 pm
This looks like it might be a problem easily solved by a Tally Table. If I get the chance tonight, I'll see what I can do. Of course, don't let that stop anyone else from giving it a whirl, please.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2011 at 4:02 pm
I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.
K
July 21, 2011 at 4:34 pm
Jeff, just what I was thinking...here is a starter for 10...
--= Ranked search - requires table GetMatchingData to exist =--
--= create some test searches to look for in GetMatchingData =--
IF OBJECT_ID('tempdb..#searches') IS NOT NULL
DROP TABLE #searches;
CREATE TABLE #searches (skid INT IDENTITY(1,1) PRIMARY KEY NOT NULL,search_key VARCHAR(10) NOT NULL, search_key_length AS LEN(search_key) PERSISTED);
INSERT INTO #searches ( search_key )
SELECT 'BCD 414' UNION ALL
SELECT 'ABC 123' UNION ALL
SELECT 'bcd 14' UNION ALL
SELECT 'Aft 234' UNION ALL
SELECT '??? 224' UNION ALL
SELECT 'thiswas' UNION ALL
SELECT 'notgood' UNION ALL
SELECT 'really!'
--= end of test data =--
--= Create a Tally table if required =--
IF OBJECT_ID('dbo.Tally') IS NULL
BEGIN
CREATE TABLE dbo.Tally(N INT);
CREATE UNIQUE CLUSTERED INDEX ix_cls ON dbo.Tally(N) WITH FILLFACTOR=100;
INSERT dbo.Tally(N)
SELECT TOP(11000) ROW_NUMBER() OVER(ORDER BY @@SPID) - 1
FROM sys.all_columns AS c1,sys.all_columns AS c2
END
--= Now perform the search =--
SELECT
sch.skid,
sch.search_key,
gmd.id,
SUM(calc.value) as value
FROM
(#searches as sch
CROSS JOIN
GetMatchingData AS gmd)
CROSS APPLY
(
SELECT SUBSTRING(sch.search_key,N,1) AS C,N
FROM dbo.Tally
WHERE N>0 AND N<=sch.search_key_length
) lkup
CROSS APPLY
(
SELECT CASE
WHEN CHARINDEX(lkup.C,gmd.TheName)>0 THEN
-- if the first character is matched, score 10
CASE WHEN lkup.N=1 THEN 10
-- you get nothing for a normal match on other characters
ELSE 0
END
-- You lose a point for not having one of the characters
ELSE -1
END AS value
) calc
GROUP BY
sch.skid, sch.search_key, gmd.id
HAVING
MAX(calc.value)=10 -- this ensures we only see rows that match the first character of the search key
ORDER BY
sch.skid, gmd.id
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
July 21, 2011 at 4:35 pm
K Smith Thanks that is pretty slick and gets the results to where abcd 414 drops in rank. So you go through every Ascii symbol between 33 and 126 and if it matches in the value (present) and it matches the table value it updates +1. Then for every character that is in the ascii and is not in the value (present) rank goes -1. I think I can work around this.
Create Table #GetMatchingData (
Id int identity(1,1),
TheName varchar(10),
Rank int default 0)
INSERT INTO #GetMatchingData (TheName)
SELECT 'bcd 414' UNION ALL
SELECT 'bcd 414' UNION ALL
SELECT 'ABDE' UNION ALL
SELECT 'Blah' UNION ALL
SELECT 'xyz414' UNION ALL
SELECT 'abcd 414' UNION ALL
SELECT 'test' UNION ALL
SELECT 'test 414' UNION ALL
SELECT 'bc44' UNION ALL
SELECT 'test1' UNION ALL
SELECT 'test2' UNION ALL
SELECT 'test4' UNION ALL
SELECT 'BCD414'
DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v-2 int
select @v-2 = 33
select @TrimmedValue = char(@v)
SELECT @TheValue = 'bcd 414', @TheMidMin = 0
WHILE @TheMidMin<=10
BEGIN
SET @TheMidMin = @TheMidMin+1
while @v-2 <= 126
begin
SET @TrimmedValue = char(@v)
Update #GetMatchingData
SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0
Update #GetMatchingData
SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0
end
END
Select * FROM #GetMatchingData
order by rank desc
July 21, 2011 at 4:43 pm
M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I get
search_keyidthenamevalue
BCD 4141bcd 41410
BCD 4142bcd 41410
BCD 4146abcd 41410
When we want abcd 414 to be 9
As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.
July 21, 2011 at 4:46 pm
JKSQL (7/21/2011)
M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I getsearch_keyidthenamevalue
BCD 4141bcd 41410
BCD 4142bcd 41410
BCD 4146abcd 41410
When we want abcd 414 to be 9
As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.
The results are the same because I did not see you ask for them to be different and I tend to try and produce the requested result 😉
To get better help, you need to define the rules better, because I based the logic in that query on your logic in the loop. If that logic is incorrect, please try to explain all the rules...
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
July 21, 2011 at 4:54 pm
Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.
Thanks again everyone for the effort
July 21, 2011 at 5:50 pm
Kath Smith (7/21/2011)
I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.K
It's normally something like an "injection blocker" on the user side when than happens. A lot of companies have them (as does the one I work for).
The best way to get around it is to attach the code. That way others can see and benefit from what you've done.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2011 at 6:25 pm
JKSQL (7/21/2011)
Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.Thanks again everyone for the effort
Whoa! Back the similarity engine up here! 😀 Using the following test data and the code you posted, ask yourself how much the following data actually resembles the search criteria. 😉
drop table #getmatchingdata
Create Table #GetMatchingData (
Id int identity(1,1),
TheName varchar(10),
Rank int default 0)
INSERT INTO #GetMatchingData (TheName)
SELECT TOP 3000
LEFT(NEWID(),10)
FROM sys.all_columns
set nocount on
DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v-2 int
select @v-2 = 33
select @TrimmedValue = char(@v)
SELECT @TheValue = 'bcd 414', @TheMidMin = 0
WHILE @TheMidMin<=10
BEGIN
SET @TheMidMin = @TheMidMin+1
while @v-2 <= 126
begin
SET @TrimmedValue = char(@v)
Update #GetMatchingData
SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0
Update #GetMatchingData
SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0
end
END
Select * FROM #GetMatchingData
order by rank desc
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply