July 19, 2004 at 4:44 pm
Hi all,
My first post on SQLServerCentral, and it's a good 'un!
What I'm trying to do is find a clever way of writing a matching routine which refines the match until it gets one and only one match.
Test on rule one: if that finds one and only one match, insert a record into output table for that id. If it finds more than one match, add the next rule. If it finds no match, return no match.
Test on rule two: if that finds one and only one match based on rule one and rule two together, insert a record into the output table. If it finds more than one match using the two rules, add the next rule. If it find no match using these two rules, remove the second rule and add the third.
In other words, you can't simply keep adding criteria to the matching routine, as there is a need to omit a rule if it kills the matching routine - the idea is that the rules get added in the best combination (but only in a particular order) until it gets only one match.
Any thoughts, anyone? I'm presuming a cursor will be required (so this isn't going to be quick on a table of 160,000 records), but rather than code every possible combination of tests explicitly, I'm sure there's a clever way of doing this.
See, I told you it was a good one!
July 20, 2004 at 3:35 am
Why use cursor if they're not necessary.
If i understand what you need is similar to this
set @count = select count(1) from table where row1 = value1
Case
When @count = 0 Then
return 'found nothing'
When @count = 1 Then
insert into table2(rows)
select rows
from table
where row1 = value1
When @count > 1 Then
Begin
set @count =
select count(1)
from table
where row1 = value1
and row2 = value2
End
and so on, probably you can use some kind of recursivity for your SP
July 20, 2004 at 8:56 am
Created on the fly no testing
Probable poor performance
Assumes complete sql (including rules/test) less than 4000 bytes
CREATE TABLE #tests (rowid int IDENTITY(1,1),test nvarchar(100))
INSERT INTO #tests (test) values ('where rule1')
INSERT INTO #tests (test) values ('or rule2')
INSERT INTO #tests (test) values ('or rule3')
INSERT INTO #tests (test) values ('or rule4')
DECLARE @startsql nvarchar(36), @sql nvarchar(4000), @test-2 nvarchar(4000), @count int, @maxct int, @result int
SET @startsql = 'SELECT @result=COUNT(*) FROM
'
SET @count = 1
SET @test-2 = =''
SELECT @test-2 = @test-2 + ' ' + test
FROM #tests
WHERE rowid = @count
ORDER BY rowid
SET @sql = @startsql + @test-2
EXEC sp_executesql @sql,N'@result int output',@result output
SELECT @maxct = MAX(rowid) FROM #tests
IF @result > 1
BEGIN
WHILE (@count < @maxct) AND (@result <> 1)
BEGIN
SET @test-2 = ''
SET @count = @count + 1
SELECT @test-2 = @test-2 + ' ' + test
FROM #tests
WHERE rowid <= @count
ORDER BY rowid
SET @sql = @startsql + @test-2
EXEC sp_executesql @sql,N'@result int output',@result output
IF @result = 0 THEN
BEGIN
DELETE FROM #tests WHERE rowid = @count
SET @result = 99
END
END
END
-- @result = 0 No Match
-- @result = 1 One Match and @test-2 = where clause
-- @result > 1 no test(s) resulted in a single match
Far away is close at hand in the images of elsewhere.
Anon.
July 20, 2004 at 12:25 pm
Thanks for the replies. The solution I had in mind was more like David's interpretation than Marco's, but both were gratefully received. My initial post wasn't all that straightforward to understand, I know...
I got this far in an attempt (rather clumsily) like the second before picking it up (excuse big code fragment at end of post). The complexity is that there are 11 tests, which I'm storing in a table so that the order in which they are applied can be changed by a user. However, I'm getting bogged down in string handling and caught in my nested loops as I'm trying to avoid cursors.
If anyone can make anything of the following, I'd be very happy to discuss further
-----------------------------------------------------------------------------------------------------------
declare @newdataid as int
declare @ruleid as int
declare @criteria as varchar (4000)
declare @sql as nvarchar (4000)
declare @updatesql as nvarchar(4000)
declare @result as int
declare @match as varchar(20)
declare @capcode as varchar(20)
declare @valuedate as varchar(20)
declare @currentcriteria as varchar(200)
declare @manufacturer as varchar(2)
declare @newprice as int
declare @fueltype as varchar(1)
declare @body as varchar(1)
declare @doors as int
declare @engine as int
declare @transmission as varchar(1)
declare @fueldelivery as varchar(1)
declare @trim as varchar(3)
set @newdataid = 1
set @ruleid = 2
set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'
while @newdataid < (select max([id]) from newdata)+1
begin
set @capcode = (select capcode from newdata where [id] = @newdataid)
set @valuedate = (select valuedate from newdata where [id] = @newdataid)
set @manufacturer = (select manufacturer from newdata where [id] = @newdataid)
set @newprice = (select newprice from newdata where [id] = @newdataid)
set @fueltype = (select fueltype from newdata where [id]=@newdataid)
set @body = (select body from newdata where [id]=@newdataid)
set @doors = (select doors from newdata where [id]=@newdataid)
set @engine = (select engine from newdata where [id]=@newdataid)
set @transmission = (select transmission from newdata where [id]=@newdataid)
set @fueldelivery = (select fueldelivery from newdata where [id]=@newdataid)
set @trim = (select trim from newdata where [id]=@newdataid)
while @ruleid < (select max([id]) from rules)+1
begin
set @criteria = replace(@criteria,'@manufacturer',''''+@manufacturer+'''')
set @criteria = replace(@criteria,'@valuedate',''''+@valuedate+'''')
set @currentcriteria = (select criteria from rules where [id]=@ruleid)
set @currentcriteria = replace(@currentcriteria,'@manufacturer',''''+@manufacturer+'''')
set @currentcriteria = replace(@currentcriteria,'@valuedate',''''+@valuedate+'''')
set @currentcriteria = replace(@currentcriteria,'@newprice',@newprice)
set @currentcriteria = replace(@currentcriteria,'@fueltype',''''+@fueltype+'''')
set @currentcriteria = replace(@currentcriteria,'@body',''''+@body+'''')
set @currentcriteria = replace(@currentcriteria,'@doors',@doors)
set @currentcriteria = replace(@currentcriteria,'@engine',@engine)
set @currentcriteria = replace(@currentcriteria,'@transmission',''''+@transmission+'''')
set @currentcriteria = replace(@currentcriteria,'@fueldelivery',''''+@fueldelivery+'''')
set @currentcriteria = replace(@currentcriteria,'@trim',''''+@trim+'''')
set @criteria = @criteria + ' and ' + @currentcriteria
set @sql = N'insert into outputtable select count(*) from newdata where ' + @criteria
set @updatesql = N'insert into outputcode select capcode from newdata where ' + @criteria
select @sql
truncate table outputtable
exec sp_executesql @sql
set @result = (select [output] from outputtable)
select 'result is ',@result
if @result = 1
begin
--select 'inside the result 1 loop',@result
truncate table outputcode
exec sp_executesql @updatesql
set @match = (select [outputcode] from outputcode)
--select @match
update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate
set @ruleid = 2
BREAK
end
if @result = 0
begin
--select 'inside the result 0 loop',@result
set @sql = substring(@sql,1,len(@sql)-(len(@currentcriteria)+5))
end
set @ruleid = @ruleid+1
--select 'onto id',@ruleid
end
set @ruleid = 2
set @newdataid = @newdataid + 1
end
----------------------------------------------------------------------------------------------------------
July 21, 2004 at 2:38 am
I think you're going the wrong direction....i think an easier solution would be to start at the other end....test for the presence of 1 record "given all conditions applied"....and if no records found...remove 1 criteria and start again...eventually you will arrive at the required matching record.....if you end up matching a record because 5 conditions are true....even though the 1st 3 conditions would have been sufficient to get a match....it won't influence the result...you will still find the same matching record.
July 21, 2004 at 6:13 am
Firstly I would remove the subqueries from the while statements and do them once at the beginning, ie
declare @maxdataid int, @maxruleid int
select @maxdataid = isnull(max([id]),0) from newdata
select @maxruleid = isnull(max([id]),0) from rules
then
while @newdataid <= @maxdataid
...
while @ruleid <= @maxruleid
...
Secondly replace the following
set @ruleid = 2
BREAK
with
set @ruleid = @maxruleid + 1
set @newdataid = @maxdataid + 1
to end the loops
Far away is close at hand in the images of elsewhere.
Anon.
July 23, 2004 at 9:08 am
Thanks for everyones input - I got it working by my original method, posted here if anyone's interested. It took a long time to run (6 hours on a decent server, against a large dataset) but got there in the end.
declare @newdataid as int
declare @ruleid as int
declare @criteria as varchar (4000)
declare @sql as nvarchar (4000)
declare @updatesql as nvarchar(4000)
declare @result as int
declare @match as varchar(20)
declare @capcode as varchar(20)
declare @valuedate as varchar(20)
declare @currentcriteria as varchar(200)
declare @manufacturer as varchar(2)
declare @newprice as int
declare @fueltype as varchar(1)
declare @body as varchar(1)
declare @doors as int
declare @engine as int
declare @transmission as varchar(1)
declare @fueldelivery as varchar(1)
declare @trim as varchar(3)
declare @rule12sql as nvarchar(4000)
declare @min-2 as int
declare @max-2 as int
set @newdataid = 1
set @ruleid = 2
set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'
while @newdataid <(select max([id]) from newdatatomatch)+1
begin
set @capcode = (select capcode from newdatatomatch where [id] = @newdataid)
set @valuedate = (select valuedate from newdatatomatch where [id] = @newdataid)
set @manufacturer = (select manufacturer from newdatatomatch where [id] = @newdataid)
set @newprice = (select newprice from newdatatomatch where [id] = @newdataid)
set @fueltype = (select fueltype from newdatatomatch where [id]=@newdataid)
set @body = (select body from newdatatomatch where [id]=@newdataid)
set @doors = (select doors from newdatatomatch where [id]=@newdataid)
set @engine = (select engine from newdatatomatch where [id]=@newdataid)
set @transmission = (select transmission from newdatatomatch where [id]=@newdataid)
set @fueldelivery = (select fueldelivery from newdatatomatch where [id]=@newdataid)
set @trim = (select trim from newdatatomatch where [id]=@newdataid)
while @ruleid < (select max([id]) from rules)+1
begin
set @criteria = replace(@criteria,'@manufacturer',''''+@manufacturer+'''')
set @criteria = replace(@criteria,'@valuedate',''''+@valuedate+'''')
set @currentcriteria = (select criteria from rules where [id]=@ruleid)
set @currentcriteria = replace(@currentcriteria,'@manufacturer',''''+@manufacturer+'''')
set @currentcriteria = replace(@currentcriteria,'@valuedate',''''+@valuedate+'''')
set @currentcriteria = replace(@currentcriteria,'@newprice',@newprice)
set @currentcriteria = replace(@currentcriteria,'@fueltype',''''+@fueltype+'''')
set @currentcriteria = replace(@currentcriteria,'@body',''''+@body+'''')
set @currentcriteria = replace(@currentcriteria,'@doors',@doors)
set @currentcriteria = replace(@currentcriteria,'@engine',@engine)
set @currentcriteria = replace(@currentcriteria,'@transmission',''''+@transmission+'''')
set @currentcriteria = replace(@currentcriteria,'@fueldelivery',''''+@fueldelivery+'''')
set @currentcriteria = replace(@currentcriteria,'@trim',''''+@trim+'''')
set @criteria = @criteria + ' and ' + @currentcriteria
set @sql = N'insert into outputtable select count(*) from newdatatomatch where ' + @criteria
set @updatesql = N'insert into outputcode select capcode from newdatatomatch where ' + @criteria
set @rule12sql = N'insert into rule12 select capcode, newprice from newdatatomatch where ' + @criteria
--select @capcode,right(@sql,100)
truncate table outputtable
exec sp_executesql @sql
set @result = (select [output] from outputtable)
--select 'result is ',@result
if @result = 0
begin
--select 'inside the result 0 loop',@result
set @criteria = substring(@criteria,1,len(@criteria)-(len(@currentcriteria)+5))
end
else
if @result = 1
begin
--select 'inside the result 1 loop',@result
truncate table outputcode
exec sp_executesql @updatesql
set @match = (select [outputcode] from outputcode)
--select @match
update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate
--select @capcode, 'matches with',@match, 'on',@valuedate
break
continue
end
set @ruleid = @ruleid+1
if @ruleid = 12
begin
--select 'inside ruleid12 loop',@result
truncate table rule12
exec sp_executesql @rule12sql
set @min-2 = (select min(newprice) from rule12 where newprice > @newprice)
set @max-2 = (select max(newprice) from rule12 where newprice < @newprice)
set @match = (select r1.capcode from rule12 r1 where r1.newprice = (select case when @newprice-@min < @max-@newprice then @max-2 else @min-2 end))
update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate
--select @capcode, 'matches with',@match, 'on',@valuedate
break
continue
end
--select 'onto id',@ruleid
end
set @ruleid = 2
set @newdataid = @newdataid + 1
set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'
--select 'onto dataid', @newdataid
end
March 23, 2005 at 10:13 am
[BUMP]
My solution as described above, even after refinement and indexing, is still too slow - has anyone any different thoughts on a way to do this? One possibility I'm thinking of is scoring the match per field per record as a binary and finding the highest value of this binary:
Record to search for: "a,b,c,d,e"
Table of potentials
"a,b,c,x,x" = 11100 = 26
"a,x,x,d,e" = 10011 = 19
"x,b,x,d,x" = 01010 = 12
therefore "a,b,c,x,y" is the best match
...anyone?
March 24, 2005 at 2:36 am
mmm...interesting approach.
can you put the calculation into a derived column...which may be indexed?
you may have to run an update statement to evaluate every row w.r.t the 'a,b,c,d,e' tests and then check for the highest cumulative score.
you seem to be suggesting that the 'rules' are independant of each other....but that also that the rules seem to have a precedence...with a better score for certain combinations?
however....looking back at your code...I see that you can 'rationalise' some of it down...which may help.
select @capcode=capcode, @valuedate=valuedate, @manufacturer=manufacturer, etc.....from newdatatomatch where [id] = @newdataid)
this should be somewhat better.
Also....can you post the execution plans?
March 27, 2005 at 7:16 am
Thanks for the reply.
Derived column - yes, that sounds like a plan.
The number of updates would be an issue for speed, I think - which I why I initially favoured an iterative set-based approach as it would take out lots of results at once thereby limiting the number of rows per iteration after that. However, easier said than done...
The rules are independent of each other but do indeed have an order to them - the order is user-set by a front-end app, so that they can choose whether to match on one criteria is more important than another as this will affect the results.
One thing which is bugging me on this - I've had some comp sci and maths training, and to me this sounds like the sort of problem that there's a "best" logical solution to, unfortunately for the life of me I can't figure it out, and haven't manage to google it either
Any clever buggrs out there, please feel free to put me out of my misery!
March 28, 2005 at 3:03 pm
G'Day All,
Ian: Thank you for a most interesting problem.
This response is part one of two. See the post immediately prior
for an explanation of how I arrived at this point.
I worked on two approaches in my spare time over the last few days. The first solution proceeded down the path of derived columns and bitmapped values that can be readily indexed. By assigning a power of two to each combination of attribute/value, I was able to construct a simple looping stored procedure that excluded rules that did not have any matching values.
While this solution worked, there is a finite number of attribute value
pairs that can be addressed by using a BIGINT. Additionally, I am a firm believer that *almost* any procedural implementation can be written as a set based solution if one is sufficiently clever.
The second solution is significantly more elegant, and relies on a
strategy that borders on heresy, at least from a DB theory perspective. Stated simply, turn columns into rows and use simple joins to eliminate rules that do not apply. This approach is sometimes referred to as a keyword/value table. From the resulting subset of data, select the single distinct row that matches has the "lowest" set of rules. In the case of multiple matches, arbitrarily select the "top 1".
There are well known problems with a character based, keyword/value implementation. I acknowledge all of the issues. However, one definition of an expert suggests that an expert knows the rules, and more importantly knows when and how to break the rules effectively. I believe that this may be one of those times.
The next modification to the algorithm was to extract all of the text for the Keyword/Value pairs out to a lookup table, and assign a unique numberic identifier to each combination using the identity column property. By indexing on this table, and the appropriate columns on the actual data table, we should obtain reasonably good performance that scales well.
The TestData and TestRule tables currently have columns supporting both approaches. A reasonable next step would be to remove the VARCHAR attribute and value columns from these two tables. They remain in place for now only to support testing and debugging where character data is a bit easier to understand than a bunch of integers.
I am currently testing with large volumns of both attributes and values to see what happens to performance. The actual code will appear in the next post.
Wayne
March 28, 2005 at 3:05 pm
G'Day All,
Ian: Thank you again for a most interesting problem.
This response is part two of two. See the post immediately prior
for an explanation of how I arrived at this point.
Wayne
CREATE TABLE TestAttributeValue (
AVRowID INT IDENTITY(1,1),
AVAttribute VARCHAR(20),
AVValue VARCHAR(3)
)
GO
INSERT INTO TestAttributeValue (AVAttribute,AVValue)
SELECT 'manufacturer', 'FD' UNION
SELECT 'manufacturer', 'GM' UNION
SELECT 'manufacturer', 'CH' UNION
SELECT 'fueltype', 'G' UNION
SELECT 'fueltype', 'D' UNION
SELECT 'fueltype', 'E' UNION
SELECT 'fueltype', 'H' UNION
SELECT 'bodystyle', 'S' UNION
SELECT 'bodystyle', 'W' UNION
SELECT 'bodystyle', 'T' UNION
SELECT 'bodystyle', 'B' UNION
SELECT 'doors', '2' UNION
SELECT 'doors', '4' UNION
SELECT 'engine', '4' UNION
SELECT 'engine', '6' UNION
SELECT 'engine', '8' UNION
SELECT 'transmission', 'A' UNION
SELECT 'transmission', 'M' UNION
SELECT 'trim', 'RED' UNION
SELECT 'trim', 'ORG' UNION
SELECT 'trim', 'YLW' UNION
SELECT 'trim', 'GRN' UNION
SELECT 'trim', 'BLU' UNION
SELECT 'trim', 'IND' UNION
SELECT 'trim', 'VLT' UNION
SELECT 'trim', 'BLK' UNION
SELECT 'trim', 'WHT'
GO
CREATE UNIQUE CLUSTERED INDEX idx_TestAttributeValue
ON TestAttributeValue (AVAttribute,AVValue)
GO
CREATE TABLE TestData (
DataRowID INT IDENTITY(1,1),
DataExternalID VARCHAR(15),
DataAttribute VARCHAR(20),
DataValue VARCHAR(3),
DataAVRowID BIGINT
)
GO
INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)
SELECT 'ABC123', 'manufacturer' , 'FD' UNION
SELECT 'ABC123', 'fueltype' , 'G' UNION
SELECT 'ABC123', 'bodystyle' , 'S' UNION
SELECT 'ABC123', 'doors' , '2' UNION
SELECT 'ABC123', 'engine' , '4' UNION
SELECT 'ABC123', 'transmission' , 'A' UNION
SELECT 'ABC123', 'trim' , 'RED'
INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)
SELECT 'ABC456', 'manufacturer' , 'FD' UNION
SELECT 'ABC456', 'fueltype' , 'D' UNION
SELECT 'ABC456', 'bodystyle' , 'W' UNION
SELECT 'ABC456', 'doors' , '4' UNION
SELECT 'ABC456', 'engine' , '8' UNION
SELECT 'ABC456', 'transmission' , 'M' UNION
SELECT 'ABC456', 'trim' , 'BLK'
INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)
SELECT 'ABC789', 'manufacturer' , 'GM' UNION
SELECT 'ABC789', 'fueltype' , 'H' UNION
SELECT 'ABC789', 'bodystyle' , 'B' UNION
SELECT 'ABC789', 'doors' , '4' UNION
SELECT 'ABC789', 'engine' , '8' UNION
SELECT 'ABC789', 'transmission' , 'A' UNION
SELECT 'ABC789', 'trim' , 'YLW'
GO
UPDATE TestData SET DataAVRowID = AVRowID
FROM TestAttributeValue INNER JOIN TestData ON AVAttribute = DataAttribute AND AVValue = DataValue
GO
CREATE UNIQUE CLUSTERED INDEX idx_TestData
ON TestData (DataExternalID,DataAVRowID)
GO
CREATE TABLE TestRule (
Ruleid INT IDENTITY(1,1) PRIMARY KEY,
RuleAttribute VARCHAR(20),
RuleValue VARCHAR(3),
RuleAVRowID BIGINT,
RuleRank AS POWER(2,RuleID)
)
GO
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'manufacturer', 'FD'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'fueltype', 'D'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'bodystyle', 'T'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'doors', '2'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'engine', '6'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'transmission', 'M'
INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'trim', 'GRN'
GO
UPDATE TestRule SET RuleAVRowID = AVRowID
FROM TestAttributeValue INNER JOIN TestRule ON AVAttribute = RuleAttribute AND AVValue = RuleValue
GO
CREATE PROCEDURE TestProc
AS
SELECT TOP 1 DataExternalID
FROM (
SELECT DataExternalID, SUM(RuleRank) AS SumRuleRank
FROM (
SELECT *
FROM TestData TD
RIGHT OUTER JOIN TestRule TR ON TR.RuleAVRowID = TD.DataAVRowID
-- RIGHT OUTER JOIN TestRule TR ON TR.RuleAttribute = TD.DataAttribute AND TR.RuleValue = TD.DataValue
WHERE DataExternalID IS NOT NULL
) X
GROUP BY DataExternalID
) Y
ORDER BY SumRuleRank ASC
GO
EXEC TestProc
DROP TABLE TestAttributeValue
DROP TABLE TestData
DROP TABLE TestRule
GO
March 28, 2005 at 3:20 pm
Couple more thoughts as I checked the post to make sure it wasn't truncated...
With this implementation, the rows must be inserted into the TestRule table in sequence of evaluation. A simple modification would be to add a column called "EvalSequence". The RuleRank would then need to be based off of EvalSequence instead of RuleID.
TestRule implemented as a table in the prior post works for proof of concept purposes. However, a more realistic solution is to use a table variable inside of TestProc to hold the rules so that the solution is thread safe for multiple users. Rules and values can then readily be passed to the proc as either a list of comma separated keyword/value pairs; or more preferably, as a comma separated list of AVRowIDs in the proper sequence. The proc need not have any concept that attribute/values or rules even exist.
Finally, the current implementation does not show which rules were used to satisfy the query. Providing a flag on the TestRule table indicating rules that match the result would be a straightforward task.
I suppose that is enough for now. Thoughts? Comments? Have a good day!
Wayne
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply