A search with a Rank

Question

Post reply

A search with a Rank

JKSQL

SSCrazy Eights

Points: 8481
More actions
July 20, 2011 at 3:53 pm

#240193

I have a bunch of duplicates within my system and I know there is an easy way to do what I need. I have a record with the name of "cat 414". I want to search a column and rank the results. I do not want to specifically look for Cat or cat 414 I want to base the results on how many of the character are used. The more similar the characters the higher the rank. Kind of a fuzzy search algorithm.
thanks for the help
Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 1

Sounds like an interesting problem, but we need a little more help before we can offer a solution.

Could you give us a little more information?

What would really help is an example showing several rows with different values in that column and how you want to see them ranked based on those values.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

JKSQL SSCrazy Eights Points: 8481 More actions · Answer 2

I understand that I needed some detail. In the real process I will have an outer control loop that will feed in values like "BCD 414". I want to find the best match. This is how my feeble mind would process something like this. This will only be run once on a non production server so performance is really not a big issue. I am curious if there is a better way to do this and if there something like this that is more comprehensive that I can use in the future.

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[GetMatchingData]') AND type in (N'U'))

DROP TABLE dbo.GetMatchingData

DROP TABLE dbo.TheRank

GO

Create Table GetMatchingData (

Id int identity(1,1),

TheName varchar(10))

CREATE TABLE TheRank (id int, TheName varchar(10), TheRank int)

INSERT INTO GetMatchingData (TheName)

SELECT 'bcd 414' UNION ALL

SELECT 'ABDE' UNION ALL

SELECT 'Blah' UNION ALL

SELECT 'xyz414' UNION ALL

SELECT 'abcd 414' UNION ALL

SELECT 'test' UNION ALL

SELECT 'test 414' UNION ALL

SELECT 'bc44' UNION ALL

SELECT 'test1' UNION ALL

SELECT 'test2' UNION ALL

SELECT 'test4' UNION ALL

SELECT 'BCD414'

DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000),@TrimmedValue varchar(1)

SELECT @TheValue = 'bcd 414', @TheMidMin = 0

WHILE @TheMidMin<=10

BEGIN

SET @TheMidMin = @TheMidMin+1

SET @TrimmedValue = SUBSTRING(@TheValue,@TheMidMin, 1)

IF @TheMidMin = 1

BEGIN

SET @SQL = '

INSERT INTO TheRank

SELECT ID, TheName,10 FROM GetMatchingData WHERE TheName LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''

EXECUTE (@SQL)

END

SET @SQL = '

UPDATE TheRank

SET TheRank = TheRank-1

WHERE TheName NOT LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''

EXECUTE (@SQL)

END

Select * FROM TheRank

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 3

I don't really know of anything more "comprehensive". Ranking is based on a criteria of some sort which must be defined. In your example for instance, row 6 is given a rank of 10 even though it isn't an exact match for your string (although it contains an exact match). You are the only person who can define these details.

I will bow out now, and maybe someone else will have some thoughts to add.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

JKSQL SSCrazy Eights Points: 8481 More actions · Answer 4

You are correct and the problem I am working with is around 3K in records. So I am hoping for more of a robust solution besides the one I put together. I wish I could search this but I just keep coming up with gibberish

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 5

I am about to go out of town for several days, so this will be my last post, but you seem to be on the right track. The three things that could affect rank are (1) presence of matching characters, (2) absence of matching characters, (3) sequence of matching characters. In your situation, just generate a LOT of test combinations (including every edge case I could think of), rank them and then study the situations where a row ranked higher or lower than you thought it should. Gradually clarify your scoring rules. If you get away from integers you might subtract half a point for extraneous character(s) at the front or back, so that row 6 gets a 9.5, instead of a 10.

Good luck.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 6

This looks like it might be a problem easily solved by a Tally Table. If I get the chance tonight, I'll see what I can do. Of course, don't let that stop anyone else from giving it a whirl, please.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Kath Smith Old Hand Points: 344 More actions · Answer 7

I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.

K

mister.magoo SSC-Forever Points: 47174 More actions · Answer 8

Jeff, just what I was thinking...here is a starter for 10...

--= Ranked search - requires table GetMatchingData to exist =--

--= create some test searches to look for in GetMatchingData =--

IF OBJECT_ID('tempdb..#searches') IS NOT NULL

DROP TABLE #searches;

CREATE TABLE #searches (skid INT IDENTITY(1,1) PRIMARY KEY NOT NULL,search_key VARCHAR(10) NOT NULL, search_key_length AS LEN(search_key) PERSISTED);

INSERT INTO #searches ( search_key )

SELECT 'BCD 414' UNION ALL

SELECT 'ABC 123' UNION ALL

SELECT 'bcd 14' UNION ALL

SELECT 'Aft 234' UNION ALL

SELECT '??? 224' UNION ALL

SELECT 'thiswas' UNION ALL

SELECT 'notgood' UNION ALL

SELECT 'really!'

--= end of test data =--

--= Create a Tally table if required =--

IF OBJECT_ID('dbo.Tally') IS NULL

BEGIN

CREATE TABLE dbo.Tally(N INT);

CREATE UNIQUE CLUSTERED INDEX ix_cls ON dbo.Tally(N) WITH FILLFACTOR=100;

INSERT dbo.Tally(N)

SELECT TOP(11000) ROW_NUMBER() OVER(ORDER BY @@SPID) - 1

FROM sys.all_columns AS c1,sys.all_columns AS c2

END

--= Now perform the search =--

SELECT

sch.skid,

sch.search_key,

gmd.id,

SUM(calc.value) as value

FROM

(#searches as sch

CROSS JOIN

GetMatchingData AS gmd)

CROSS APPLY

(

SELECT SUBSTRING(sch.search_key,N,1) AS C,N

FROM dbo.Tally

WHERE N>0 AND N<=sch.search_key_length

) lkup

CROSS APPLY

(

SELECT CASE

WHEN CHARINDEX(lkup.C,gmd.TheName)>0 THEN

-- if the first character is matched, score 10

CASE WHEN lkup.N=1 THEN 10

-- you get nothing for a normal match on other characters

ELSE 0

END

-- You lose a point for not having one of the characters

ELSE -1

END AS value

) calc

GROUP BY

sch.skid, sch.search_key, gmd.id

HAVING

MAX(calc.value)=10 -- this ensures we only see rows that match the first character of the search key

ORDER BY

sch.skid, gmd.id

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

JKSQL SSCrazy Eights Points: 8481 More actions · Answer 9

K Smith Thanks that is pretty slick and gets the results to where abcd 414 drops in rank. So you go through every Ascii symbol between 33 and 126 and if it matches in the value (present) and it matches the table value it updates +1. Then for every character that is in the ascii and is not in the value (present) rank goes -1. I think I can work around this.

Create Table #GetMatchingData (

Id int identity(1,1),

TheName varchar(10),

Rank int default 0)

INSERT INTO #GetMatchingData (TheName)

SELECT 'bcd 414' UNION ALL

SELECT 'ABDE' UNION ALL

SELECT 'Blah' UNION ALL

SELECT 'xyz414' UNION ALL

SELECT 'abcd 414' UNION ALL

SELECT 'test' UNION ALL

SELECT 'test 414' UNION ALL

SELECT 'bc44' UNION ALL

SELECT 'test1' UNION ALL

SELECT 'test2' UNION ALL

SELECT 'test4' UNION ALL

SELECT 'BCD414'

DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v int

select @v = 33

select @TrimmedValue = char(@v)

SELECT @TheValue = 'bcd 414', @TheMidMin = 0

WHILE @TheMidMin<=10

BEGIN

SET @TheMidMin = @TheMidMin+1

while @v <= 126

begin

SET @TrimmedValue = char(@v)

Update #GetMatchingData

SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0

Update #GetMatchingData

SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0

select @v = @v + 1

end

END

Select * FROM #GetMatchingData

order by rank desc

JKSQL SSCrazy Eights Points: 8481 More actions · Answer 10

M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I get

search_keyidthenamevalue

BCD 4141bcd 41410

BCD 4142bcd 41410

BCD 4146abcd 41410

When we want abcd 414 to be 9

As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.

mister.magoo SSC-Forever Points: 47174 More actions · Answer 11

JKSQL (7/21/2011)
M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I get
search_keyidthenamevalue
BCD 4141bcd 41410
BCD 4142bcd 41410
BCD 4146abcd 41410
When we want abcd 414 to be 9
As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.

The results are the same because I did not see you ask for them to be different and I tend to try and produce the requested result 😉

To get better help, you need to define the rules better, because I based the logic in that query on your logic in the loop. If that logic is incorrect, please try to explain all the rules...

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

JKSQL SSCrazy Eights Points: 8481 More actions · Answer 12

Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.

Thanks again everyone for the effort

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 13

Kath Smith (7/21/2011)
I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.
K

It's normally something like an "injection blocker" on the user side when than happens. A lot of companies have them (as does the one I work for).

The best way to get around it is to attach the code. That way others can see and benefit from what you've done.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 14

JKSQL (7/21/2011)
Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.
Thanks again everyone for the effort

Whoa! Back the similarity engine up here! 😀 Using the following test data and the code you posted, ask yourself how much the following data actually resembles the search criteria. 😉

drop table #getmatchingdata

Create Table #GetMatchingData (

Id int identity(1,1),

TheName varchar(10),

Rank int default 0)

INSERT INTO #GetMatchingData (TheName)

SELECT TOP 3000

LEFT(NEWID(),10)

FROM sys.all_columns

set nocount on

DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v int

select @v = 33

select @TrimmedValue = char(@v)

SELECT @TheValue = 'bcd 414', @TheMidMin = 0

WHILE @TheMidMin<=10

BEGIN

SET @TheMidMin = @TheMidMin+1

while @v <= 126

begin

SET @TrimmedValue = char(@v)

Update #GetMatchingData

SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0

Update #GetMatchingData

SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0

select @v = @v + 1

end

END

Select * FROM #GetMatchingData

order by rank desc

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)