I need a strange join, please help

Question

Post reply

I need a strange join, please help

salardx

Old Hand

Points: 316
More actions
April 9, 2016 at 5:04 am

#306268

Hi guys
assume we have two tables with the following design:
their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values
I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:
select * from table1
inner join table2 on
(
table1.callAnum = table2.callAnum and
table1.callBnum = table2.callBnum and
table1.callGDatetime minimum distance to table2.callGDatetime
)
Is there any sufficient way? there are millions of records! so the faster way, the better way
Thank you in advance

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 1

can you please post create table / insert data scripts and expected results based on your sample data

https://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 2

i'll throw this out here

make of it what you will and see how you get on.....no claims on performance or accuracy, need you to review to see if it what you are looking for

-- create a sample set of data

USE tempdb

GO

IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL

DROP TABLE TABLE1;

IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL

DROP TABLE TABLE2;

IF OBJECT_ID('tempdb..#jls','U') IS NOT NULL

DROP TABLE #jls

SELECT TOP (1000000)

callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),

callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),

callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),

duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)

INTO TABLE1

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

CREATE CLUSTERED INDEX [T1_CIX] ON [dbo].[TABLE1]

([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)

SELECT TOP (1000000)

callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),

callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),

callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),

duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)

INTO TABLE2

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

CREATE CLUSTERED INDEX [T2_CIX] ON [dbo].[TABLE2]

([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)

--- is the the type of results you are expecting??

SET STATISTICS TIME, IO ON;

WITH FWD_CTE AS (

SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,

(

SELECT TOP 1 T2.callGDateTime

FROM TABLE2 T2

WHERE T1.callGDateTime < T2.callGDateTime

AND T1.callAnum = T2.callAnum

AND T1.callBnum = T2.callBnum

ORDER BY T2.callGDateTime

) fwd_dt

FROM TABLE1 T1

)

,

PRIOR_CTE AS (

SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,

(

SELECT TOP 1 T2.callGDateTime

FROM TABLE2 T2

WHERE T2.callGDateTime < T1.callGDateTime

AND T1.callAnum = T2.callAnum

AND T1.callBnum = T2.callBnum

ORDER BY T2.callGDateTime DESC

) prior_dt

FROM

TABLE1 T1

)

SELECT FWD_CTE.callAnum,

FWD_CTE.callBnum,

FWD_CTE.callGDateTime,

FWD_CTE.fwd_dt,

PRIOR_CTE.prior_dt,

datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) fwddatediff,

datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) priordatediff,

CASE

WHEN FWD_CTE.fwd_dt IS NULL then PRIOR_CTE.prior_dt

WHEN PRIOR_CTE.prior_dt IS NULL then FWD_CTE.fwd_dt

WHEN datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) <= datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) then FWD_CTE.fwd_dt

ELSE PRIOR_CTE.prior_dt END

as selected_dt

INTO #jls

FROM FWD_CTE

INNER JOIN PRIOR_CTE ON FWD_CTE.callAnum = PRIOR_CTE.callAnum

AND FWD_CTE.callBnum = PRIOR_CTE.callBnum

AND FWD_CTE.callGDateTime = PRIOR_CTE.callGDateTime

ORDER BY

FWD_CTE.callAnum,

FWD_CTE.callBnum,

FWD_CTE.callGDateTime

SET STATISTICS TIME, IO OFF;

-- CHECK RESULTS

SELECT callAnum, callBnum, callGDateTime , T1, T2

FROM

(

SELECT callAnum, callBnum, callGDateTime, 'T1' as T1, NULL as T2

FROM TABLE1

WHERE (callAnum = 1) AND (callBnum = 1)

UNION ALL

SELECT callAnum, callBnum, callGDateTime, NULL as T1, 'T2' as T2

FROM TABLE2

WHERE (callAnum = 1) AND (callBnum = 1)

)x

ORDER BY callGDateTime

SELECT *

FROM #jls

WHERE (callAnum = 1) AND (callBnum = 1)

ORDER BY callAnum ASC,callBnum ASC,[callGDateTime] ASC

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jeff Moden SSC Guru Points: 1004504 More actions · Answer 3

salardx (4/9/2016)
Hi guys
assume we have two tables with the following design:
their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values
I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:
select * from table1
inner join table2 on
(
table1.callAnum = table2.callAnum and
table1.callBnum = table2.callBnum and
table1.callGDatetime minimum distance to table2.callGDatetime
)
Is there any sufficient way? there are millions of records! so the faster way, the better way
Thank you in advance

This is a terribly unreliable way to do a join... and what if there's a tie between the closest "less than" and the closest "greater than"? What do you want done then?

Are there no columns that uniquely identify each row?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Hugo Kornelis SSC Guru Points: 64790 More actions · Answer 4

Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:

SELECT List, the, columns

FROM Table1 AS t1

CROSS APPLY

(SELECT TOP(1) *

FROM Table2 AS t2

WHERE t2.callAnum = t1.callAnum

AND t2.callBnum = t1.callBnum

ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

) AS t2;

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Jeff Moden SSC Guru Points: 1004504 More actions · Answer 5

Hugo Kornelis (4/10/2016)
Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:
SELECT List, the, columns
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;

As nice as that is, it's still unreliable because of the data, or rather, what's not in the data.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

GSquared SSC Guru Points: 260824 More actions · Answer 6

Hugo Kornelis (4/10/2016)
Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:
SELECT List, the, columns
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;

You might also need the ABS(DATEDIFF())) piece inside the WHERE clause if you want to limit it to a certain span.

Otherwise, this is probably best-effort for the whole thing.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 7

It would be appreciated if the OP can post some sample data etc.

if there are millions of rows as suggested then it would be good to see some performance testing.

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jeff Moden SSC Guru Points: 1004504 More actions · Answer 8

J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.
if there are millions of rows as suggested then it would be good to see some performance testing.

Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.

There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 9

Jeff Moden (4/11/2016)
J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.
if there are millions of rows as suggested then it would be good to see some performance testing.
Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.
There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

agreed...

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

GSquared SSC Guru Points: 260824 More actions · Answer 10

Jeff Moden (4/11/2016)
J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.
if there are millions of rows as suggested then it would be good to see some performance testing.
Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.
There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

Or it's possible there aren't.

I've had to work with connecting disparate data stores together based on "best pattern" instead of "exact match" numerous times.

If the business defines "closest in time" as what they want to match, then it's not wrong.

We only have part of the story. There might be more relevant data, or there might be more to the use-case than we know.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Sergiy SSC Guru Points: 110208 More actions · Answer 11

It looks line those 2 tables contain records about the same phone cals recorded by 2 different systems on opposite ends of the calls.

Clocks on those systems might be out os sync, and one of them may record when the call was initiated (calling side) and another one when the call was accepted (receiving side). Therefore time stamps and durations might be slightly different.

The correspondence can be established with higher level of certainty if to include into consideration additional logical points:

- there must be a way to establish approximate time difference between 2 servers, corresponding correction must be applied to one of the sides;

- time at the source of a call cannot be later than time at the receiving end;

- time at the initiating end may not be earlier than 1 minute (or whatever is call time-out duration) before the corresponding time at the receiving end;

- duration at receiving end must be longer than the duration at initiating end, but not longer than that + 1 minute.

If you incorporate this logic to a query the matching must become pretty certain.

Records must be matched only once.

Established matches must be recorded into a table in form of pairs of CallID's, so no other query would need to repeat this complicated matching process.

_____________
Code for TallyGenerator

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 12

here is the original post set up for SQL

USE tempdb

GO

IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL

DROP TABLE TABLE1;

IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL

DROP TABLE TABLE2;

CREATE TABLE Table1(

callID_T1 INT IDENTITY(1, 1) NOT NULL

,callAnum BIGINT NOT NULL

,callBnum INTEGER NOT NULL

,callGDateTime DATETIME NOT NULL

,duration INTEGER NOT NULL

);

INSERT INTO Table1(callAnum,callBnum,callGDateTime,duration) VALUES

(4436240307,96345,'2016-01-22 12:46:13.000',18)

,(4436240307,96345,'2016-01-23 13:21:11.000',9)

,(4436240307,96345,'2016-01-23 13:23:06.000',16)

,(4436240307,96345,'2016-01-23 13:25:29.000',5)

,(4436240307,96345,'2016-01-23 18:32:12.000',25)

,(4436240307,96345,'2016-01-24 17:23:19.000',7)

,(4436240307,96345,'2016-01-24 17:24:03.000',20)

,(4436240307,96345,'2016-01-25 13:32:34.000',16)

,(4436240307,96345,'2016-01-28 12:39:13.000',22)

,(4436240307,96345,'2016-01-28 13:12:57.000',5)

,(4436240307,96345,'2016-02-03 17:35:25.000',28)

,(4436240307,96345,'2016-02-14 19:24:08.000',19);

CREATE TABLE Table2(

callID_T2 INT IDENTITY(1, 1) NOT NULL

,callAnum BIGINT NOT NULL

,callBnum INTEGER NOT NULL

,callGDateTime DATETIME NOT NULL

,duration INTEGER NOT NULL

);

INSERT INTO Table2(callAnum,callBnum,callGDateTime,duration) VALUES

(4436240307,96345,'2016-01-22 12:43:57.000',19)

,(4436240307,96345,'2016-01-23 13:19:03.000',9)

,(4436240307,96345,'2016-01-23 13:20:52.000',17)

,(4436240307,96345,'2016-01-23 13:23:25.000',6)

,(4436240307,96345,'2016-01-23 18:29:48.000',25)

,(4436240307,96345,'2016-01-24 17:21:14.000',7)

,(4436240307,96345,'2016-01-24 17:21:45.000',21)

,(4436240307,96345,'2016-01-25 13:30:19.000',17)

,(4436240307,96345,'2016-01-28 12:36:52.000',23)

,(4436240307,96345,'2016-01-28 13:10:53.000',6)

,(4436240307,96345,'2016-02-03 17:32:57.000',29)

,(4436240307,96345,'2016-02-14 19:21:49.000',20);

If we then use Hugo's response

SELECT *

FROM Table1 AS t1

CROSS APPLY

(SELECT TOP(1) *

FROM Table2 AS t2

WHERE t2.callAnum = t1.callAnum

AND t2.callBnum = t1.callBnum

ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

) AS t2;

this delivers the following result....note that rows 4 & 7 are each returned twice from Table 2.

Is this what you want?

+-------------------------------------------------------------------------------------------------------------------------------------------------+

¦ callID_T1 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦ callID_T2 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦

¦-----------+------------+----------+-------------------------+----------+-----------+------------+----------+-------------------------+----------¦

¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:46:13.000 ¦ 18 ¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:43:57.000 ¦ 19 ¦

¦ 2 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:21:11.000 ¦ 9 ¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:20:52.000 ¦ 17 ¦

¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:06.000 ¦ 16 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦

¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:25:29.000 ¦ 5 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦

¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:32:12.000 ¦ 25 ¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:29:48.000 ¦ 25 ¦

¦ 6 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:23:19.000 ¦ 7 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦

¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:24:03.000 ¦ 20 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦

¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:32:34.000 ¦ 16 ¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:30:19.000 ¦ 17 ¦

¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:39:13.000 ¦ 22 ¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:36:52.000 ¦ 23 ¦

¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:12:57.000 ¦ 5 ¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:10:53.000 ¦ 6 ¦

¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:35:25.000 ¦ 28 ¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:32:57.000 ¦ 29 ¦

¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:24:08.000 ¦ 19 ¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:21:49.000 ¦ 20 ¦

+-------------------------------------------------------------------------------------------------------------------------------------------------+

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jeff Moden SSC Guru Points: 1004504 More actions · Answer 13

GSquared (4/11/2016)

If the business defines "closest in time" as what they want to match, then it's not wrong.

I have to disagree with that in a lot of cases. I consider it a part of my job to protect the business from bad requirements.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 14

not sure what the OP wants....keeps peeping out from behind the sofa each day and logs on......but never replies 🙂

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day