Fuzzy logic matching routine problem

  • Hi all,

    My first post on SQLServerCentral, and it's a good 'un!

    What I'm trying to do is find a clever way of writing a matching routine which refines the match until it gets one and only one match.

    Test on rule one: if that finds one and only one match, insert a record into output table for that id.  If it finds more than one match, add the next rule.  If it finds no match, return no match.

    Test on rule two: if that finds one and only one match based on rule one and rule two together, insert a record into the output table.  If it finds more than one match using the two rules, add the next rule.  If it find no match using these two rules, remove the second rule and add the third.

    In other words, you can't simply keep adding criteria to the matching routine, as there is a need to omit a rule if it kills the matching routine - the idea is that the rules get added in the best combination (but only in a particular order) until it gets only one match.

    Any thoughts, anyone?  I'm presuming a cursor will be required (so this isn't going to be quick on a table of 160,000 records), but rather than code every possible combination of tests explicitly, I'm sure there's a clever way of doing this.

    See, I told you it was a good one!

  • Why use cursor if they're not necessary.

    If i understand what you need is similar to this

    set @count = select count(1) from table where row1 = value1

    Case

      When @count = 0 Then

        return 'found nothing'

      When @count = 1 Then

        insert into table2(rows)

        select rows

        from table

        where row1 = value1

      When @count > 1 Then

      Begin

        set @count =

          select count(1)

          from table

          where row1 = value1

          and row2 = value2

      End

    and so on, probably you can use some kind of recursivity for your SP

  • Created on the fly no testing

    Probable poor performance

    Assumes complete sql (including rules/test) less than 4000 bytes

    CREATE TABLE #tests (rowid int IDENTITY(1,1),test nvarchar(100))

    INSERT INTO #tests (test) values ('where rule1')

    INSERT INTO #tests (test) values ('or rule2')

    INSERT INTO #tests (test) values ('or rule3')

    INSERT INTO #tests (test) values ('or rule4')

    DECLARE @startsql nvarchar(36), @sql nvarchar(4000), @test-2 nvarchar(4000), @count int, @maxct int, @result int

    SET @startsql = 'SELECT @result=COUNT(*) FROM

    '

    SET @count = 1

    SET @test-2 = =''

    SELECT @test-2 = @test-2 + ' ' + test

    FROM #tests

    WHERE rowid = @count

    ORDER BY rowid

    SET @sql = @startsql + @test-2

    EXEC sp_executesql @sql,N'@result int output',@result output

    SELECT @maxct = MAX(rowid) FROM #tests

    IF @result > 1

    BEGIN

      WHILE (@count < @maxct) AND (@result <> 1)

      BEGIN

      SET @test-2 = ''

      SET @count = @count + 1

      SELECT @test-2 = @test-2 + ' ' + test

      FROM #tests

      WHERE rowid <= @count

      ORDER BY rowid

      SET @sql = @startsql + @test-2

      EXEC sp_executesql @sql,N'@result int output',@result output

      IF @result = 0 THEN

        BEGIN

        DELETE FROM #tests WHERE rowid = @count

        SET @result = 99

        END

      END

    END

    -- @result = 0 No Match

    -- @result = 1 One Match and @test-2 = where clause

    -- @result > 1 no test(s) resulted in a single match

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks for the replies.  The solution I had in mind was more like David's interpretation than Marco's, but both were gratefully received.  My initial post wasn't all that straightforward to understand, I know...

    I got this far in an attempt (rather clumsily) like the second before picking it up (excuse big code fragment at end of post).  The complexity is that there are 11 tests, which I'm storing in a table so that the order in which they are applied can be changed by a user.  However, I'm getting bogged down in string handling and caught in my nested loops as I'm trying to avoid cursors.

    If anyone can make anything of the following, I'd be very happy to discuss further

    -----------------------------------------------------------------------------------------------------------

    declare @newdataid  as int

    declare @ruleid as int

    declare @criteria as varchar (4000)

    declare @sql as nvarchar (4000)

    declare @updatesql as nvarchar(4000)

    declare @result as int

    declare @match as varchar(20)

    declare @capcode as varchar(20)

    declare @valuedate as varchar(20)

    declare @currentcriteria as varchar(200)

    declare @manufacturer as varchar(2)

    declare @newprice as int

    declare @fueltype as varchar(1)

    declare @body as varchar(1)

    declare @doors as int

    declare @engine as int

    declare @transmission as varchar(1)

    declare @fueldelivery as varchar(1)

    declare @trim as varchar(3)

    set @newdataid = 1

    set @ruleid = 2

    set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'

    while @newdataid < (select max([id]) from newdata)+1

    begin

     set @capcode = (select capcode from newdata where [id] = @newdataid)

     set @valuedate = (select valuedate from newdata where [id] = @newdataid)

     set @manufacturer = (select manufacturer from newdata where [id] = @newdataid)

     set @newprice = (select newprice from newdata where [id] = @newdataid)

     set @fueltype = (select fueltype from newdata where [id]=@newdataid)

     set @body = (select body from newdata where [id]=@newdataid)

     set @doors = (select doors from newdata where [id]=@newdataid)

     set @engine = (select engine from newdata where [id]=@newdataid)

     set @transmission = (select transmission from newdata where [id]=@newdataid)

     set @fueldelivery = (select fueldelivery from newdata where [id]=@newdataid)

     set @trim = (select trim from newdata where [id]=@newdataid)

      while @ruleid < (select max([id]) from rules)+1

      begin

       set @criteria = replace(@criteria,'@manufacturer',''''+@manufacturer+'''')

       set @criteria = replace(@criteria,'@valuedate',''''+@valuedate+'''')

       set @currentcriteria = (select criteria from rules where [id]=@ruleid)

       set @currentcriteria = replace(@currentcriteria,'@manufacturer',''''+@manufacturer+'''')

       set @currentcriteria = replace(@currentcriteria,'@valuedate',''''+@valuedate+'''')

       set @currentcriteria = replace(@currentcriteria,'@newprice',@newprice)

       set @currentcriteria = replace(@currentcriteria,'@fueltype',''''+@fueltype+'''') 

       set @currentcriteria = replace(@currentcriteria,'@body',''''+@body+'''')

       set @currentcriteria = replace(@currentcriteria,'@doors',@doors)

       set @currentcriteria = replace(@currentcriteria,'@engine',@engine)

       set @currentcriteria = replace(@currentcriteria,'@transmission',''''+@transmission+'''')

       set @currentcriteria = replace(@currentcriteria,'@fueldelivery',''''+@fueldelivery+'''')

       set @currentcriteria = replace(@currentcriteria,'@trim',''''+@trim+'''')

       set @criteria = @criteria + ' and ' + @currentcriteria

       set @sql = N'insert into outputtable select count(*) from newdata where ' + @criteria

       set @updatesql = N'insert into outputcode select capcode from newdata where ' + @criteria

       select @sql

       truncate table outputtable

       exec sp_executesql @sql

       set @result = (select [output] from outputtable)

       select 'result is ',@result

       if @result = 1

        begin

         --select 'inside the result 1 loop',@result

         truncate table outputcode

         exec sp_executesql @updatesql

         set @match = (select [outputcode] from outputcode)

         --select @match

         update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate

        set @ruleid = 2

        BREAK 

        end

       if @result = 0

        begin

         --select 'inside the result 0 loop',@result

         set @sql = substring(@sql,1,len(@sql)-(len(@currentcriteria)+5))

        end

       set @ruleid = @ruleid+1

      --select 'onto id',@ruleid

      end

    set @ruleid = 2

    set @newdataid = @newdataid + 1

    end

    ----------------------------------------------------------------------------------------------------------

  • I think you're going the wrong direction....i think an easier solution would be to start at the other end....test for the presence of 1 record "given all conditions applied"....and if no records found...remove 1 criteria and start again...eventually you will arrive at the required matching record.....if you end up matching a record because 5 conditions are true....even though the 1st 3 conditions would have been sufficient to get a match....it won't influence the result...you will still find the same matching record.

  • Firstly I would remove the subqueries from the while statements and do them once at the beginning, ie

    declare @maxdataid int, @maxruleid int

    select @maxdataid = isnull(max([id]),0) from newdata

    select @maxruleid = isnull(max([id]),0) from rules

    then

    while @newdataid <= @maxdataid

    ...

    while @ruleid <= @maxruleid

    ...

    Secondly replace the following

    set @ruleid = 2

        BREAK

    with

    set @ruleid = @maxruleid + 1

    set @newdataid = @maxdataid + 1

    to end the loops

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks for everyones input - I got it working by my original method, posted here if anyone's interested.  It took a long time to run (6 hours on a decent server, against a large dataset) but got there in the end.

     

    declare @newdataid  as int

    declare @ruleid as int

    declare @criteria as varchar (4000)

    declare @sql as nvarchar (4000)

    declare @updatesql as nvarchar(4000)

    declare @result as int

    declare @match as varchar(20)

    declare @capcode as varchar(20)

    declare @valuedate as varchar(20)

    declare @currentcriteria as varchar(200)

    declare @manufacturer as varchar(2)

    declare @newprice as int

    declare @fueltype as varchar(1)

    declare @body as varchar(1)

    declare @doors as int

    declare @engine as int

    declare @transmission as varchar(1)

    declare @fueldelivery as varchar(1)

    declare @trim as varchar(3)

    declare @rule12sql as nvarchar(4000)

    declare @min-2 as int

    declare @max-2 as int

    set @newdataid = 1

    set @ruleid = 2

    set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'

    while @newdataid <(select max([id]) from newdatatomatch)+1

    begin

     set @capcode = (select capcode from newdatatomatch where [id] = @newdataid)

     set @valuedate = (select valuedate from newdatatomatch where [id] = @newdataid)

     set @manufacturer = (select manufacturer from newdatatomatch where [id] = @newdataid)

     set @newprice = (select newprice from newdatatomatch where [id] = @newdataid)

     set @fueltype = (select fueltype from newdatatomatch where [id]=@newdataid)

     set @body = (select body from newdatatomatch where [id]=@newdataid)

     set @doors = (select doors from newdatatomatch where [id]=@newdataid)

     set @engine = (select engine from newdatatomatch where [id]=@newdataid)

     set @transmission = (select transmission from newdatatomatch where [id]=@newdataid)

     set @fueldelivery = (select fueldelivery from newdatatomatch where [id]=@newdataid)

     set @trim = (select trim from newdatatomatch where [id]=@newdataid)

      while @ruleid < (select max([id]) from rules)+1

      begin

       set @criteria = replace(@criteria,'@manufacturer',''''+@manufacturer+'''')

       set @criteria = replace(@criteria,'@valuedate',''''+@valuedate+'''')

       set @currentcriteria = (select criteria from rules where [id]=@ruleid)

       set @currentcriteria = replace(@currentcriteria,'@manufacturer',''''+@manufacturer+'''')

       set @currentcriteria = replace(@currentcriteria,'@valuedate',''''+@valuedate+'''')

       set @currentcriteria = replace(@currentcriteria,'@newprice',@newprice)

       set @currentcriteria = replace(@currentcriteria,'@fueltype',''''+@fueltype+'''') 

       set @currentcriteria = replace(@currentcriteria,'@body',''''+@body+'''')

       set @currentcriteria = replace(@currentcriteria,'@doors',@doors)

       set @currentcriteria = replace(@currentcriteria,'@engine',@engine)

       set @currentcriteria = replace(@currentcriteria,'@transmission',''''+@transmission+'''')

       set @currentcriteria = replace(@currentcriteria,'@fueldelivery',''''+@fueldelivery+'''')

       set @currentcriteria = replace(@currentcriteria,'@trim',''''+@trim+'''')

       set @criteria = @criteria + ' and ' + @currentcriteria

       set @sql = N'insert into outputtable select count(*) from newdatatomatch where ' + @criteria

       set @updatesql = N'insert into outputcode select capcode from newdatatomatch where ' + @criteria

       set @rule12sql = N'insert into rule12 select capcode, newprice from newdatatomatch where ' + @criteria

       --select @capcode,right(@sql,100)

       truncate table outputtable

       exec sp_executesql @sql

       set @result = (select [output] from outputtable)

       --select 'result is ',@result

       if @result = 0

        begin

         --select 'inside the result 0 loop',@result

         set @criteria = substring(@criteria,1,len(@criteria)-(len(@currentcriteria)+5))

        end    

       else   

       if @result = 1

        begin

         --select 'inside the result 1 loop',@result

         truncate table outputcode

         exec sp_executesql @updatesql

         set @match = (select [outputcode] from outputcode)

         --select @match

         update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate

         --select @capcode, 'matches with',@match, 'on',@valuedate

        break

        continue

        end

       set @ruleid = @ruleid+1

       if @ruleid = 12

        begin

         --select 'inside ruleid12 loop',@result

         truncate table rule12

         exec sp_executesql @rule12sql

          set @min-2 = (select min(newprice) from rule12 where newprice > @newprice)

          set @max-2 = (select max(newprice) from rule12 where newprice < @newprice)

          set @match = (select r1.capcode from rule12 r1 where r1.newprice = (select case when @newprice-@min < @max-@newprice then @max-2 else @min-2 end))

          update predecessorresults set match = @match where capcode = @capcode and valuedate = @valuedate

          --select @capcode, 'matches with',@match, 'on',@valuedate

        break

        continue

        end

         

       --select 'onto id',@ruleid

      end

    set @ruleid = 2

    set @newdataid = @newdataid + 1

    set @criteria = 'valuedate = @valuedate and manufacturer = @manufacturer'

    --select 'onto dataid', @newdataid

    end

  • [BUMP]

    My solution as described above, even after refinement and indexing, is still too slow - has anyone any different thoughts on a way to do this?  One possibility I'm thinking of is scoring the match per field per record as a binary and finding the highest value of this binary:

    Record to search for: "a,b,c,d,e"

    Table of potentials

    "a,b,c,x,x" = 11100 = 26

    "a,x,x,d,e" = 10011 = 19

    "x,b,x,d,x" = 01010 = 12

    therefore "a,b,c,x,y" is the best match

    ...anyone?

     

  • mmm...interesting approach.

    can you put the calculation into a derived column...which may be indexed?

    you may have to run an update statement to evaluate every row w.r.t the 'a,b,c,d,e' tests and then check for the highest cumulative score.

    you seem to be suggesting that the 'rules' are independant of each other....but that also that the rules seem to have a precedence...with a better score for certain combinations?

     

    however....looking back at your code...I see that you can 'rationalise' some of it down...which may help.

    select @capcode=capcode,  @valuedate=valuedate, @manufacturer=manufacturer, etc.....from newdatatomatch where [id] = @newdataid)

    this should be somewhat better.

    Also....can you post the execution plans?

  • Thanks for the reply.

    Derived column - yes, that sounds like a plan.

    The number of updates would be an issue for speed, I think - which I why I initially favoured an iterative set-based approach as it would take out lots of results at once thereby limiting the number of rows per iteration after that.  However, easier said than done...

    The rules are independent of each other but do indeed have an order to them - the order is user-set by a front-end app, so that they can choose whether to match on one criteria is more important than another as this will affect the results.

    One thing which is bugging me on this - I've had some comp sci and maths training, and to me this sounds like the sort of problem that there's a "best" logical solution to, unfortunately for the life of me I can't figure it out, and haven't manage to google it either 

    Any clever buggrs out there, please feel free to put me out of my misery!

  • G'Day All,

    Ian: Thank you for a most interesting problem.

    This response is part one of two.  See the post immediately prior

    for an explanation of how I arrived at this point.

    I worked on two approaches in my spare time over the last few days.  The first solution proceeded down the path of derived columns and bitmapped values that can be readily indexed.  By assigning a power of two to each combination of attribute/value, I was able to construct a simple looping stored procedure that excluded rules that did not have any matching values.

    While this solution worked, there is a finite number of attribute value

    pairs that can be addressed by using a BIGINT.  Additionally, I am a firm believer that *almost* any procedural implementation can be written as a set based solution if one is sufficiently clever.

    The second solution is significantly more elegant, and relies on a

    strategy that borders on heresy, at least from a DB theory perspective.  Stated simply, turn columns into rows and use simple joins to eliminate rules that do not apply.  This approach is sometimes referred to as a keyword/value table.  From the resulting subset of data, select the single distinct row that matches has the "lowest" set of rules.  In the case of multiple matches, arbitrarily select the "top 1".

    There are well known problems with a character based, keyword/value implementation.  I acknowledge all of the issues.  However, one definition of an expert suggests that an expert knows the rules, and more importantly knows when and how to break the rules effectively.  I believe that this may be one of those times.

    The next modification to the algorithm was to extract all of the text for the Keyword/Value pairs out to a lookup table, and assign a unique numberic identifier to each combination using the identity column property.  By indexing on this table, and the appropriate columns on the actual data table, we should obtain reasonably good performance that scales well.

    The TestData and TestRule tables currently have columns supporting both approaches.  A reasonable next step would be to remove the VARCHAR attribute and value columns from these two tables.  They remain in place for now only to support testing and debugging where character data is a bit easier to understand than a bunch of integers.

    I am currently testing with large volumns of both attributes and values to see what happens to performance.  The actual code will appear in the next post.

    Wayne

  • G'Day All,

    Ian: Thank you again for a most interesting problem.

    This response is part two of two.  See the post immediately prior

    for an explanation of how I arrived at this point.

    Wayne

    CREATE TABLE TestAttributeValue (

        AVRowID       INT IDENTITY(1,1),

        AVAttribute   VARCHAR(20),

        AVValue       VARCHAR(3)

    )

    GO

    INSERT INTO TestAttributeValue (AVAttribute,AVValue)

         SELECT 'manufacturer', 'FD' UNION

         SELECT 'manufacturer', 'GM' UNION

         SELECT 'manufacturer', 'CH' UNION

         SELECT 'fueltype', 'G' UNION

         SELECT 'fueltype', 'D' UNION

         SELECT 'fueltype', 'E' UNION

         SELECT 'fueltype', 'H' UNION

         SELECT 'bodystyle', 'S' UNION

         SELECT 'bodystyle', 'W' UNION

         SELECT 'bodystyle', 'T' UNION

         SELECT 'bodystyle', 'B' UNION

         SELECT 'doors', '2' UNION

         SELECT 'doors', '4' UNION

         SELECT 'engine', '4' UNION

         SELECT 'engine', '6' UNION

         SELECT 'engine', '8' UNION

         SELECT 'transmission', 'A' UNION

         SELECT 'transmission', 'M' UNION

         SELECT 'trim', 'RED' UNION

         SELECT 'trim', 'ORG' UNION

         SELECT 'trim', 'YLW' UNION

         SELECT 'trim', 'GRN' UNION

         SELECT 'trim', 'BLU' UNION

         SELECT 'trim', 'IND' UNION

         SELECT 'trim', 'VLT' UNION

         SELECT 'trim', 'BLK' UNION

         SELECT 'trim', 'WHT'

    GO

    CREATE UNIQUE CLUSTERED INDEX idx_TestAttributeValue

       ON TestAttributeValue (AVAttribute,AVValue)

    GO

    CREATE TABLE TestData (

        DataRowID       INT IDENTITY(1,1),

        DataExternalID  VARCHAR(15),

        DataAttribute   VARCHAR(20),

        DataValue       VARCHAR(3),

        DataAVRowID     BIGINT

    )

    GO

    INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)

                   SELECT 'ABC123', 'manufacturer' , 'FD' UNION

                   SELECT 'ABC123', 'fueltype'     , 'G' UNION

                   SELECT 'ABC123', 'bodystyle'    , 'S' UNION

                   SELECT 'ABC123', 'doors'        , '2' UNION

                   SELECT 'ABC123', 'engine'       , '4' UNION

                   SELECT 'ABC123', 'transmission' , 'A' UNION

                   SELECT 'ABC123', 'trim'         , 'RED'

    INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)

                   SELECT 'ABC456', 'manufacturer' , 'FD' UNION

                   SELECT 'ABC456', 'fueltype'     , 'D' UNION

                   SELECT 'ABC456', 'bodystyle'    , 'W' UNION

                   SELECT 'ABC456', 'doors'        , '4' UNION

                   SELECT 'ABC456', 'engine'       , '8' UNION

                   SELECT 'ABC456', 'transmission' , 'M' UNION

                   SELECT 'ABC456', 'trim'         , 'BLK'

    INSERT INTO TestData (DataExternalID, DataAttribute, DataValue)

                   SELECT 'ABC789', 'manufacturer' , 'GM' UNION

                   SELECT 'ABC789', 'fueltype'     , 'H' UNION

                   SELECT 'ABC789', 'bodystyle'    , 'B' UNION

                   SELECT 'ABC789', 'doors'        , '4' UNION

                   SELECT 'ABC789', 'engine'       , '8' UNION

                   SELECT 'ABC789', 'transmission' , 'A' UNION

                   SELECT 'ABC789', 'trim'         , 'YLW'

    GO

    UPDATE TestData SET DataAVRowID = AVRowID

     FROM TestAttributeValue INNER JOIN TestData ON AVAttribute = DataAttribute  AND AVValue = DataValue

    GO

    CREATE UNIQUE CLUSTERED INDEX idx_TestData

       ON TestData (DataExternalID,DataAVRowID)

    GO

    CREATE TABLE TestRule (

            Ruleid          INT IDENTITY(1,1) PRIMARY KEY,

            RuleAttribute   VARCHAR(20),

            RuleValue       VARCHAR(3),

            RuleAVRowID     BIGINT,

            RuleRank        AS POWER(2,RuleID)

    )

    GO

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'manufacturer', 'FD'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'fueltype',     'D'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'bodystyle',    'T'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'doors',        '2'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'engine',       '6'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'transmission', 'M'

    INSERT INTO TestRule (RuleAttribute, RuleValue) SELECT 'trim',         'GRN'

    GO

    UPDATE TestRule SET RuleAVRowID = AVRowID

     FROM TestAttributeValue INNER JOIN TestRule ON AVAttribute = RuleAttribute  AND AVValue = RuleValue

    GO

    CREATE PROCEDURE TestProc

    AS

    SELECT TOP 1 DataExternalID

    FROM (

            SELECT DataExternalID, SUM(RuleRank) AS SumRuleRank

            FROM (

                  SELECT *

                    FROM TestData TD

                   RIGHT OUTER JOIN TestRule TR ON TR.RuleAVRowID = TD.DataAVRowID

    --               RIGHT OUTER JOIN TestRule TR ON TR.RuleAttribute = TD.DataAttribute AND TR.RuleValue = TD.DataValue

                   WHERE DataExternalID IS NOT NULL

                 ) X

            GROUP BY DataExternalID

    ) Y

    ORDER BY SumRuleRank ASC

    GO

    EXEC TestProc

    DROP TABLE TestAttributeValue

    DROP TABLE TestData

    DROP TABLE TestRule

    GO

  • Couple more thoughts as I checked the post to make sure it wasn't truncated...

    With this implementation, the rows must be inserted into the TestRule table in sequence of evaluation.  A simple modification would be to add a column called "EvalSequence".  The RuleRank would then need to be based off of EvalSequence instead of RuleID.

    TestRule implemented as a table in the prior post works for proof of concept purposes.  However, a more realistic solution is to use a table variable inside of TestProc to hold the rules so that the solution is thread safe for multiple users.  Rules and values can then readily be passed to the proc as either a list of comma separated keyword/value pairs; or more preferably, as a comma separated list of AVRowIDs in the proper sequence.  The proc need not have any concept that attribute/values or rules even exist.

    Finally, the current implementation does not show which rules were used to satisfy the query.  Providing a flag on the TestRule table indicating rules that match the result would be a straightforward task.

    I suppose that is enough for now.  Thoughts?  Comments?  Have a good day!

    Wayne

     

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply