November 8, 2011 at 3:54 am
Hi,
I'd be very grateful if someone could help me with this 🙂 I'll try to provide as much information as possible to avoid any confusion about what I'm asking.
Client requirement: The client has a table that holds articles in one column; think of them as digital newspaper articles. They have a search webpage to find articles that match the user's keywords. Along with this search functionality, the user can specify how close the keywords must be to each other to qualify, measured in words (not characters).
Search Example: If the user specifies 5 words apart and enters the search term as 'search word', then the query should return all rows where an article contains the words 'search' and 'word' and they are no more than 5 words apart.
Result example
True because only 2 words apart from each other: 'This is a search that has word in it'
False because 7 words apart from each other: 'The search will not come out true because the word is more than 5 words apart.'
Added complexity 1: This seems reasonable enough; however, there can be more than two search keywords, in which case they must all be within the specified word proximity.
Added complexity 2: The whole article must be searched, not just the first occurrence of a keyword.
i.e. if we use the example:
Search keyword: 'search word' within 5 words...
Article: 'The search will not come out true because the word is more than 5 words apart. But this is a search that has word in it.'
The above is true because the second sentence contains the keywords within 5 words proximity even if the first sentence did not.
I know that Full-Text Search has the NEAR operator; however, in 2008 you cannot set a limit on how far apart the words may be. I know that 2012 supports this, but because I am running on 2008 I need to find a way to do it programmatically.
Please oh please can someone help me. Right now I am stuck and do not know how to do this.
Many thanks in advance, 🙂
Lewis
November 8, 2011 at 4:37 am
I think I would look to write a full-text query to get results that might match (i.e. the text has all the words specified, perhaps using NEAR, perhaps not) and then pass that through a function (probably a streaming CLR one) to further filter the results to those that have all the words within x words of each other. Something like that.
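As a rough illustration of that second-stage filter, here is a minimal sketch of the logic only. It is written in Python to keep it short; a CLR function would need the same steps ported to C#. The names `tokenize` and `within_proximity` are made up for this example, and it assumes that "N words apart" means at most N other words sit between the first and last keyword of a group (which matches the examples in the original question).

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-cased words in reading order; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())

def within_proximity(text, keywords, max_gap):
    """True if some stretch of the text contains every keyword with at most
    max_gap other words between the first and last keyword in that stretch."""
    wanted = {k.lower() for k in keywords}
    # Keep only (position, word) pairs for keyword hits, in text order.
    hits = [(i, w) for i, w in enumerate(tokenize(text)) if w in wanted]
    counts = Counter()
    left = 0
    for right_pos, right_word in hits:
        counts[right_word] += 1
        # Shrink from the left while the window still holds every keyword.
        while len(counts) == len(wanted):
            left_pos, left_word = hits[left]
            if right_pos - left_pos - 1 <= max_gap:
                return True
            counts[left_word] -= 1
            if counts[left_word] == 0:
                del counts[left_word]
            left += 1
    return False
```

Because the window slides over every keyword hit, a later occurrence can satisfy the condition even when the first one does not, and any number of keywords is handled the same way.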
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
November 8, 2011 at 4:41 am
Hi,
Thanks Paul.
Without sounding like an SOB, that's the problem I am having: finding all those that have words within a set distance from each other. The pre-filter is easily done; it's that second stage I cannot get my head around.
Lewis
November 8, 2011 at 4:56 am
The only way I can think of to do this is to use the filtered list as suggested; then, in your CLR function or application, you would need to loop through each string and count the spaces between the keywords to determine the number of words separating them.
November 8, 2011 at 5:10 am
Regular expressions! They are perfect for things like this. You can't use them directly in SQL of course, but they work fine in the application or CLR. For example: http://www.regular-expressions.info/near.html
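To make the idea concrete, here is a small sketch of a "near" pattern for two words, written in Python purely for illustration (the same pattern syntax works in .NET). The helper name `near_pattern` is hypothetical. It assumes the words may appear in either order and that zero to `max_gap` other words may sit between them.

```python
import re

def near_pattern(word1, word2, max_gap):
    """Compile a regex matching word1 and word2, in either order,
    with at most max_gap other words between them."""
    # Escape the keywords so characters like '.' or '+' are matched literally.
    a, b = re.escape(word1), re.escape(word2)
    # Separator, then up to max_gap "word + separator" groups.
    gap = r"\W+(?:\w+\W+){0,%d}?" % max_gap
    return re.compile(
        r"\b(?:%s%s%s|%s%s%s)\b" % (a, gap, b, b, gap, a),
        re.IGNORECASE,
    )

pattern = near_pattern("search", "word", 5)
print(bool(pattern.search("This is a search that has word in it")))
```

Note that a regex grows awkward quickly with three or more keywords, since every ordering of the words needs its own alternative.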
November 8, 2011 at 5:39 am
lewisdow123 (11/8/2011)
Without sounding like an SOB, that's the problem I am having: finding all those that have words within a set distance from each other. The pre-filter is easily done; it's that second stage I cannot get my head around.
Hi Lewis,
Well the first part of the process is relatively easy:
DECLARE @strings TABLE
(
id INTEGER IDENTITY PRIMARY KEY,
string VARCHAR(4000) NOT NULL
)
INSERT @strings (string)
VALUES
('This is a search that has word in it'),
('The search will not come out true because the word is more than 5 words apart.')
DECLARE @terms TABLE
(
word VARCHAR(50) NOT NULL
)
INSERT @terms (word)
VALUES ('search'), ('word'), ('is')
DECLARE @words TABLE
(
string_id INTEGER NOT NULL,
position INTEGER NOT NULL,
word VARCHAR(50) NOT NULL,
PRIMARY KEY (string_id, position)
)
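-- Split input strings into words (one row per word, numbered in reading order)
-- Note: in SQL Server 2008, spt_values type 'P' only covers 0 to 2047, so characters beyond that are not split
INSERT @words (string_id, position, word)
SELECT
s2.id,
ROW_NUMBER() OVER (PARTITION BY s2.id ORDER BY s.number),
f2.word
FROM @strings AS s2
JOIN master.dbo.spt_values AS s ON s.number BETWEEN 1 AND DATALENGTH(s2.string)
-- Wrap the string in spaces so every word has a space on both sides
CROSS APPLY (SELECT SPACE(1) + s2.string + SPACE(1)) AS f (wrapped)
-- A word starts after each space and runs up to (not including) the next space
CROSS APPLY (SELECT SUBSTRING(f.wrapped, s.number + 1, CHARINDEX(SPACE(1), f.wrapped, s.number + 1) - s.number - 1)) AS f2 (word)
WHERE
s.[type] = N'P'
AND SUBSTRING(f.wrapped, s.number, 1) = SPACE(1)
ORDER BY
s.number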
-- Remove words we're not interested in
DELETE @words
WHERE
NOT EXISTS
(SELECT 1 FROM @terms AS t WHERE t.word = [@words].word)
-- Results
SELECT
*
FROM @words AS w
Working out a robust and efficient algorithm to check whether the 'all words within x' condition is met, where more than one instance of each word occurs, is less trivial. Perhaps start with a brute-force iterative method (per string, construct all permutations, select those that match the conditions).
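A minimal sketch of that brute-force idea, in Python rather than T-SQL just to show the logic: take one position for each search term, try every combination, and accept the string if any combination fits. The function name `brute_force_match` is invented for this example, and it assumes "within x" means at most `max_gap` other words between the outermost chosen positions.

```python
import re
from itertools import product

def brute_force_match(text, terms, max_gap):
    """True if one occurrence of every term can be chosen so that at most
    max_gap other words lie between the first and last chosen word."""
    if not terms:
        return False
    words = re.findall(r"\w+", text.lower())
    # One list of word positions per search term.
    positions = [
        [i for i, w in enumerate(words) if w == term.lower()]
        for term in terms
    ]
    # product() yields one position per term; if any term is missing,
    # its list is empty and no combinations are produced at all.
    for combo in product(*positions):
        if max(combo) - min(combo) - 1 <= max_gap:
            return True
    return False
```

The number of combinations is the product of the occurrence counts, so this is only reasonable for short strings or rare terms.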
Paul White
November 8, 2011 at 7:05 am
Thanks for your input folks...
I'm heading down the regex path first, using the link provided and http://www.sqlteam.com/article/regular-expressions-in-t-sql
If this fails then I'll start with Paul's suggestion.
Again, thank you guys 🙂
November 8, 2011 at 7:19 am
lewisdow123 (11/8/2011)
Yes that's the CLR route I would try too. The T-SQL code I posted was more to illustrate a concept, as I'm sure you guessed anyway...:-)
Paul White
November 8, 2011 at 7:35 am
The complication is in allowing more than two words.
The following checks for two words that are five or fewer words apart, with the first word appearing before the second.
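using System;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

namespace searchFeatures
{
    public class UserDefinedFunctions
    {
        // Returns txtIn when word1 is followed by word2 with at most
        // five other words between them; otherwise returns NULL.
        [SqlFunction]
        public static SqlString SearchFeatures(SqlString txtIn, SqlString word1, SqlString word2)
        {
            // Reading .Value on a NULL SqlString throws, so check first.
            if (txtIn.IsNull || word1.IsNull || word2.IsNull)
            {
                return SqlString.Null;
            }

            // Regex.Escape makes any regex metacharacters in the keywords literal.
            // {0,5} allows zero to five words in between; {1,5} would miss adjacent words.
            // RegexOptions.Compiled is omitted because the pattern is rebuilt on every call.
            var regex = new Regex(
                @"\b" + Regex.Escape(word1.Value) + @"\W+(?:\w+\W+){0,5}?" + Regex.Escape(word2.Value) + @"\b");
            return regex.IsMatch(txtIn.Value) ? txtIn : SqlString.Null;
        }
    }
}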
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SearchFeatures]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[SearchFeatures]
GO
IF EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'searchFeatures' and is_user_defined = 1)
DROP ASSEMBLY [searchFeatures]
GO
CREATE ASSEMBLY [searchFeatures]
AUTHORIZATION [dbo]
FROM 
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION [dbo].[SearchFeatures](@txtIn [nvarchar](4000), @word1 [nvarchar](4000), @word2 [nvarchar](4000))
RETURNS [nvarchar](4000) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [searchFeatures].[searchFeatures.UserDefinedFunctions].[SearchFeatures]
GO
DECLARE @strings TABLE (
id INTEGER IDENTITY PRIMARY KEY
,string VARCHAR(4000) NOT NULL
)
INSERT @strings (string)
VALUES ('This is a search that has word in it')
,('The search will not come out true because the word is more than 5 words apart.')
SELECT id, string
FROM @strings
WHERE dbo.SearchFeatures(string,'search','word') IS NOT NULL
November 8, 2011 at 7:43 am
Wow very nice!!!!
Top answer by far. I've done the same thing but without the .NET; instead I used the first function in this link:
http://www.sqlteam.com/article/regular-expressions-in-t-sql
Thanks to all who have helped me out, you're all very kind