November 30, 2010 at 1:17 pm
I have two tables; Left and Right. Each table contains up to 20 values (in separate columns). The Left table has about 350,000 rows, and the Right table has about 300,000 rows. For each Left row, we need to find the Right row that most closely matches it. Column order is important. Ideally, the matching Right row would be the one with the fewest changes, even if no columns match initially (see Row 2 in Right table). But as a first step, I'm OK with only counting the matches where the sequence and value match.
One easy way of doing this is to cross join LeftTable with RightTable, and use a case statement to add 1 if the value in Left.Value1 = Right.Value1, and add a 0 if they didn’t. Then, you just choose the one that has the highest value. It may be easier to understand written in code, so let me explain it that way.
select leftID, max(right('000' + convert(varchar,Matches),3) + '_' + RightID)
from (
select Left.ID LeftID,
Right.ID RightID,
case when Left.Col1 = Right.Col1 then 1 else 0 end +
case when Left.Col2 = Right.Col2 then 1 else 0 end +
case when Left.Col3= Right.Col3then 1 else 0 end +
…
case when Left.Col20= Right.Col20 then 1 else 0 end Matches
from Left
cross join Right
) a
group by
leftID
That’s a little crude, but hopefully you get the point. There of course could be multiple Right rows that match the Left row on the same # of columns. That’s ok, I’ll just pick the 1st one or whatever. The problem with this method is if you cross join 350k rows with 300k rows, you get…well, a lot of rows. I've tried limiting the left table to 10,100,1000, etc records at a time and looping thru but the time to complete everything is over 40hrs. I've also tried flipping the columns into rows and doing a join on Sequence and value, but that too takes forever.
Ideally, it would be great if there was a way to assign a "score" using some statistical/mathematical functions to each row in the left and right tables. This could then be used as a join condition to filter the number of results returned. So instead of looking at 300K for each Left row, you may only have to look at 1000. But so far, I haven't been able to come up with any formula that provides anything but random results.
Sample table data:
Left Table
1234567891011121314151617181920
Right Table
1516345678910111213141217181920Matches 16 perfectly
2345678910111213141516171819201Matches 20 with two changes
1234568791011121314152117181920Matches 17 perfectly
As far as hardware, this is running on a 16 processor box with 128GB RAM hitting a SAN with 70+ dedicated drives.
Any ideas are greatly appreciated. Been working on this for a couple of weeks and running out of things to try.
Thanks,
Greg
November 30, 2010 at 1:44 pm
Would it be possible to add a persisted computed column to each table based on col1*2^(1/8) + col2*2^(1/7)+...col16*2^7 or any other calculation that'd take the ordinal position into account and query based on min(abs(left.calculatedCol-right.calculatedCol)) ?
I'd try to get a value like this per column to do the comparsion. But I'm not sure if the math I provided would meet your requirements...
November 30, 2010 at 1:50 pm
SSIS Fuzzy Lookup might help. You can specify a similarity threshold
November 30, 2010 at 2:11 pm
did you experiment with CheckSum peristed ( and indexed ) computed columns to detect perfect matches quickly ?
, RwCheckSum as checksum(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20) PERSISTED
I do still have to test if checksum produces equal values in different tables, but it's worth a try.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
November 30, 2010 at 2:18 pm
ALZDBA (11/30/2010)
did you experiment with CheckSum peristed ( and indexed ) computed columns to detect perfect matches quickly ?
, RwCheckSum as checksum(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20) PERSISTED
I do still have to test if checksum produces equal values in different tables, but it's worth a try.
No, I haven't tried the checksum, but I am able to find the exact matches fairly quickly by joining the two tables on col1=col1 and col2=col2 etc. The issue is more with find ones that match on the most columns, but not all of them.
November 30, 2010 at 2:25 pm
ALZDBA (11/30/2010)
did you experiment with CheckSum computed columns to detect perfect matches quickly ?
, RwCheckSum as checksum(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20)
I do still have to test if checksum produces equal values in different tables, but it's worth a try.
@alzdba: seems like we're on the same track here. The checksum() does pretty much the same like I described above, just with a base of 16 instead of 2 (and in "reverse order").
Since it seems like we're dealing with 16 columns, we'd need to include 16 values in the checksum() calculation. This will bring it to it's limitation fairly easy:
@a=checksum(1,2,....,16) , @b-2=checksum(10,2,3,4,....,16) ,@c=checksum(14,2,3,4,....,16)
Which one is closer to @a? @b-2 or @C?
I'm getting an arithmetic overflow when I try (@a-@b) or (@a-@c)...
Therefore I decided to go with a Base2 calculation...
November 30, 2010 at 2:46 pm
You're getting the arithmatic overflow, because checksum generates:
declare @a int, @b-2 int, @C int
select @a=checksum(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
, @b-2=checksum(10,2,3,4,5,6,7,8,9,1,11,12,13,14,15,16)
,@c=checksum(14,2,3,4,5,6,7,8,9,10,11,12,13,1,15,16)
-- Which one is closer to @a? @b-2 or @C?
-- I'm getting an arithmetic overflow when I try (@a-@b) or (@a-@c)...
a b c
-20043180568645858802022213528
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
November 30, 2010 at 3:06 pm
ALZDBA (11/30/2010)
You're getting the arithmatic overflow, because checksum generates:
declare @a int, @b-2 int, @C int
select @a=checksum(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
, @b-2=checksum(10,2,3,4,5,6,7,8,9,1,11,12,13,14,15,16)
,@c=checksum(14,2,3,4,5,6,7,8,9,10,11,12,13,1,15,16)
-- Which one is closer to @a? @b-2 or @C?
-- I'm getting an arithmetic overflow when I try (@a-@b) or (@a-@c)...
a b c
-20043180568645858802022213528
@alzdba - I've been looking at the check sum option as well with similar code. Based on your examples, @a is 2 columns different for both @b-2 and @C. Yet if you look at the check sum values they vary widely. I'm not how I will be able to use this to compare. Am I missing something?
@LutzM - I've been looking at your sample and am not sure I have it correct. Here's the sample I'm working with. @d matches @a the closest and based on the values that holds true, but what concerns me is @C. It doesn't match @a on any of the columns but it's value is very close to @d. Is this what you had in mind?
declare @a bigint,
@b-2 bigint,
@C bigint,
@d bigint
--Left Table
--@a = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
--Right Table
--@b = 15,16,3,4,5,6,7,8,9,10,11,12,13,14,1,2,17,18,19,20
--@c = 20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1
--@d = 1,2,3,4,5,6,8,7,9,10,11,12,13,14,15,21,17,18,19,20
--Left Table
select @a = 1*2^(1/10) +
2*2^(1/9) +
3*2^(1/8) +
4*2^(1/7) +
5*2^(1/6) +
6*2^(1/5) +
7*2^(1/4) +
8*2^(1/3) +
9*2^(1/2) +
10*2^(1) +
11*2^(2) +
12*2^(3) +
13*2^(4) +
14*2^(5) +
15*2^(6) +
16*2^(7) +
17*2^(8) +
18*2^(9) +
19*2^(10) +
20*2^(11)
select @a
--Right Table
select @b-2 = 15*2^(1/10) +
16*2^(1/9) +
3*2^(1/8) +
4*2^(1/7) +
5*2^(1/6) +
6*2^(1/5) +
7*2^(1/4) +
8*2^(1/3) +
9*2^(1/2) +
10*2^(1) +
11*2^(2) +
12*2^(3) +
13*2^(4) +
14*2^(5) +
1*2^(6) +
2*2^(7) +
17*2^(8) +
18*2^(9) +
19*2^(10) +
20*2^(11)
select @b-2
select @C = 20*2^(1/10) +
19*2^(1/9) +
18*2^(1/8) +
17*2^(1/7) +
16*2^(1/6) +
15*2^(1/5) +
14*2^(1/4) +
13*2^(1/3) +
12*2^(1/2) +
11*2^(1) +
10*2^(2) +
9*2^(3) +
8*2^(4) +
7*2^(5) +
6*2^(6) +
5*2^(7) +
4*2^(8) +
3*2^(9) +
2*2^(10) +
1*2^(11)
select @C
select @d = 1*2^(1/10) +
2*2^(1/9) +
3*2^(1/8) +
4*2^(1/7) +
5*2^(1/6) +
6*2^(1/5) +
7*2^(1/4) +
8*2^(1/3) +
9*2^(1/2) +
10*2^(1) +
11*2^(2) +
12*2^(3) +
13*2^(4) +
14*2^(5) +
15*2^(6) +
21*2^(7) +
17*2^(8) +
18*2^(9) +
19*2^(10) +
20*2^(11)
select @d
@a = 428
@b-2 = 388
@C = 416
@d = 418
November 30, 2010 at 3:10 pm
ALZDBA (11/30/2010)
You're getting the arithmatic overflow, because checksum generates:
declare @a int, @b-2 int, @C int
select @a=checksum(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
, @b-2=checksum(10,2,3,4,5,6,7,8,9,1,11,12,13,14,15,16)
,@c=checksum(14,2,3,4,5,6,7,8,9,10,11,12,13,1,15,16)
-- Which one is closer to @a? @b-2 or @C?
-- I'm getting an arithmetic overflow when I try (@a-@b) or (@a-@c)...
a b c
-20043180568645858802022213528
I know. 😉 Therefore I decided not to use checksum() but went with the Base2 version instead. My question was more towards your proposal: How would you deal with that scenario? The solution I posted wouldn't fail with an error...
November 30, 2010 at 3:21 pm
The checksum values themselves are not comparable, with regards to order, >, < !
It's just a hash value, so can only be used in an equation.
But it will be noticeable faster to just compare on that value than the 16 col join.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
December 1, 2010 at 4:59 am
are you able to use a CLR?
i believe this sort of processing is best done within the .net framework
--
Thiago Dantas
@DantHimself
December 1, 2010 at 6:07 am
dant12 (12/1/2010)
are you able to use a CLR?i believe this sort of processing is best done within the .net framework
Yes, I am able to use .Net. I've tried a command line app which pulls both tables into data tables and a couple of nested for loops, but performance on that wasn't any faster that SQL. Do you have a better way to compare them in .Net?
December 1, 2010 at 12:29 pm
have you tried a Parallel.ForEach or some sort of parallel processing?
with your server specs it might reduce the execution time.
i'll post some code soon if you want to try
--
Thiago Dantas
@DantHimself
December 1, 2010 at 1:14 pm
No, I haven't tried parallel processing. If you have some sample code, I'd like to try it.
December 1, 2010 at 1:30 pm
ill post on what i been testing
create the 2 tables
IF OBJECT_ID('TEMP_1') IS NOT NULL
DROP TABLE TEMP_1
IF OBJECT_ID('TEMP_2') IS NOT NULL
DROP TABLE TEMP_2
GO
--CREATE TABLES
DECLARE @a VARCHAR(1000),@B INT
SELECT @a = '',@B = 0
SELECT TOP 20 @b-2 = @b-2+1,@A= @a + 'COL'+CAST(@B AS VARCHAR(3))+' INT,'
FROM SYSCOLUMNS
SET @a = LEFT(@A,LEN(@A)-1)
EXEC('CREATE TABLE TEMP_1 ('+@A+')')
EXEC('CREATE TABLE TEMP_2 ('+@A+')')
GO
insert test data (locally i been using lesser rows as my dev machine sucks balls)
--GERENERATE TEST DATA
INSERT INTO TEMP_1
SELECT TOP 350000
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30
FROM MASTER.SYS.all_columns A, MASTER.SYS.all_columns B
GO
INSERT INTO TEMP_2
SELECT TOP 300000
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30,
RAND(CHECKSUM(NEWID())) * 30
FROM MASTER.SYS.all_columns A, MASTER.SYS.all_columns B
GO
create a table to hold identity values for each best-match as well as the score
CREATE TABLE BEST_MATCHES (LEFT_TABLE INT, RIGHT_TABLE INT, SCORE INT)
add identity columns to left and right tables (needed to insert into the best_matches table)
ALTER TABLE TEMP_1 ADD IDCOL INT IDENTITY(1,1)
ALTER TABLE TEMP_2 ADD IDCOL INT IDENTITY(1,1)
and finally the console application code
i XXXed the login pass and database name, you can change those by yourself.
Imports System.Data.SqlClient
Imports System.Threading
Imports System.Threading.Tasks
Module Module1
Public Sub Main()
Dim dtbLeft As New DataTable
Dim dtbRight As New DataTable
Dim Summary As New Generic.List(Of Object)
'fill 2 datatables with the left and right tables
Using conn As New SqlConnection("server='(local)'; user id='XXX'; password='XXX'; Database='XXX'; Connection Lifetime=240;Pooling=false;Connect Timeout=1200")
conn.Open()
Using cmd As New SqlCommand("SELECT * FROM TEMP_1", conn) 'change table names as required
Using DA As New SqlDataAdapter(cmd)
DA.Fill(dtbLeft)
End Using
End Using
Using cmd As New SqlCommand("SELECT * FROM TEMP_2", conn)
Using DA As New SqlDataAdapter(cmd)
DA.Fill(dtbRight)
End Using
End Using
conn.Close()
End Using
'start the parallel foreach that will process try matching the left with right and add best results to the Summary collection
Parallel.ForEach(dtbLeft.Select(), _
Sub(LeftLine)
Dim CurrentScore As Integer
Dim obj() As Object = {0, LeftLine.Item(20), 0}
For x As Integer = 0 To dtbRight.Rows.Count - 1
CurrentScore = 0
CurrentScore += IIf(LeftLine.Item(0).ToString = dtbRight.Rows(x).Item(0).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(1).ToString = dtbRight.Rows(x).Item(1).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(2).ToString = dtbRight.Rows(x).Item(2).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(3).ToString = dtbRight.Rows(x).Item(3).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(4).ToString = dtbRight.Rows(x).Item(4).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(5).ToString = dtbRight.Rows(x).Item(5).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(6).ToString = dtbRight.Rows(x).Item(6).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(7).ToString = dtbRight.Rows(x).Item(7).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(8).ToString = dtbRight.Rows(x).Item(8).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(9).ToString = dtbRight.Rows(x).Item(9).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(10).ToString = dtbRight.Rows(x).Item(10).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(11).ToString = dtbRight.Rows(x).Item(11).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(12).ToString = dtbRight.Rows(x).Item(12).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(13).ToString = dtbRight.Rows(x).Item(13).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(14).ToString = dtbRight.Rows(x).Item(14).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(15).ToString = dtbRight.Rows(x).Item(15).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(16).ToString = dtbRight.Rows(x).Item(16).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(17).ToString = dtbRight.Rows(x).Item(17).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(18).ToString = dtbRight.Rows(x).Item(18).ToString, 1, 0)
CurrentScore += IIf(LeftLine.Item(19).ToString = dtbRight.Rows(x).Item(19).ToString, 1, 0)
If CurrentScore > obj(0) Then
obj(0) = CurrentScore
obj(2) = dtbRight.Rows(x).Item(20)
End If
Next
Summary.Add(obj)
End Sub)
' insert results back into table
Using conn As New SqlConnection("server='(local)'; user id='XXX'; password='XXX'; Database='XXX'; Connection Lifetime=240;Pooling=false;Connect Timeout=1200")
conn.Open()
Dim temp As New System.Text.StringBuilder
'creates a big delimited string to kill database roundtrips
For n As Integer = 0 To Summary.Count - 1
temp.Append(Summary(n)(1).ToString & "," & Summary(n)(2).ToString & "," & Summary(n)(0).ToString & ";")
Next
temp.Remove(temp.Length - 2, 1)
Dim cmd As New SqlCommand
cmd.Connection = conn
'query to parse the big string back into 3 columns
Dim Query As New System.Text.StringBuilder
Query.Append("WITH TALLY AS ")
Query.Append("( ")
Query.Append("SELECT TOP (@TOP) ROW_NUMBER() OVER (ORDER BY a.COLUMN_ID) N ")
Query.Append("FROM master.sys.all_columns a,master.sys.all_columns b ")
Query.Append("), ROW AS ")
Query.Append("( ")
Query.Append("SELECT SUBSTRING(';'+@TEXT+';',N+1,CHARINDEX(';',';'+@TEXT+';',N+1)-N-1) ROW ")
Query.Append(" ,ROW_NUMBER() OVER (ORDER BY N) ID ")
Query.Append("FROM TALLY ")
Query.Append("WHERE N < LEN(';'+@TEXT)+1 ")
Query.Append("AND SUBSTRING(';'+@TEXT+';',N,1) = ';' ")
Query.Append("), COLS AS ")
Query.Append("( ")
Query.Append("SELECT SUBSTRING(','+ROW+',',N+1,CHARINDEX(',',','+ROW+',',N+1)-N-1) COL, ID, ROW_NUMBER() OVER (ORDER BY N) ID2 ")
Query.Append("FROM ROW,TALLY ")
Query.Append("WHERE N < LEN(','+ROW)+1 ")
Query.Append("AND SUBSTRING(','+ROW+',',N,1) = ',' ")
Query.Append(") ")
Query.Append("INSERT INTO BEST_MATCHES ")
Query.Append("SELECT A.COL, B.COL, C.COL ")
Query.Append("FROM COLS A ")
Query.Append("INNER JOIN COLS B ON A.ID = B.ID AND A.ID2 < B.ID2 ")
Query.Append("INNER JOIN COLS C ON A.ID = C.ID AND B.ID2 < C.ID2 ")
cmd.CommandText = Query.ToString
cmd.Parameters.Add("@TEXT", SqlDbType.VarChar, -1).Value = temp.ToString
cmd.Parameters.Add("@TOP", SqlDbType.Int).Value = temp.Length
cmd.ExecuteNonQuery()
conn.Close()
End Using
Console.WriteLine("Done")
End Sub
End Module
as you can see it's nothing really fancy and could be improved by someone who actually write's code in vb/c# (unlike me)
not really sure it will make it super fast, or if it will improve anything at all, but i think it's worth a shot
you will also need the .net framework 4.0 for this to work(i think you also need visual studio 2010, not sure)
--
Thiago Dantas
@DantHimself
Viewing 15 posts - 1 through 15 (of 25 total)
You must be logged in to reply to this topic. Login to reply