April 9, 2016 at 5:04 am
Hi guys
assume we have two tables with the following design:
their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values
I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:
select * from table1
inner join table2 on
(
table1.callAnum = table2.callAnum and
table1.callBnum = table2.callBnum and
table1.callGDatetime minimum distance to table2.callGDatetime
)
Is there any sufficient way? there are millions of records! so the faster way, the better way
Thank you in advance
April 9, 2016 at 5:09 am
can you please post create table / insert data scripts and expected results based on your sample data
https://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
April 9, 2016 at 12:02 pm
i'll throw this out here
make of it what you will and see how you get on.....no claims on performance or accuracy, need you to review to see if it what you are looking for
-- create a sample set of data
USE tempdb
GO
IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL
DROP TABLE TABLE1;
IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL
DROP TABLE TABLE2;
IF OBJECT_ID('tempdb..#jls','U') IS NOT NULL
DROP TABLE #jls
SELECT TOP (1000000)
callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),
callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),
callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),
duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)
INTO TABLE1
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
CREATE CLUSTERED INDEX [T1_CIX] ON [dbo].[TABLE1]
([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)
SELECT TOP (1000000)
callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),
callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),
callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),
duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)
INTO TABLE2
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
CREATE CLUSTERED INDEX [T2_CIX] ON [dbo].[TABLE2]
([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)
--- is the the type of results you are expecting??
SET STATISTICS TIME, IO ON;
WITH FWD_CTE AS (
SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,
(
SELECT TOP 1 T2.callGDateTime
FROM TABLE2 T2
WHERE T1.callGDateTime < T2.callGDateTime
AND T1.callAnum = T2.callAnum
AND T1.callBnum = T2.callBnum
ORDER BY T2.callGDateTime
) fwd_dt
FROM TABLE1 T1
)
,
PRIOR_CTE AS (
SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,
(
SELECT TOP 1 T2.callGDateTime
FROM TABLE2 T2
WHERE T2.callGDateTime < T1.callGDateTime
AND T1.callAnum = T2.callAnum
AND T1.callBnum = T2.callBnum
ORDER BY T2.callGDateTime DESC
) prior_dt
FROM
TABLE1 T1
)
SELECT FWD_CTE.callAnum,
FWD_CTE.callBnum,
FWD_CTE.callGDateTime,
FWD_CTE.fwd_dt,
PRIOR_CTE.prior_dt,
datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) fwddatediff,
datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) priordatediff,
CASE
WHEN FWD_CTE.fwd_dt IS NULL then PRIOR_CTE.prior_dt
WHEN PRIOR_CTE.prior_dt IS NULL then FWD_CTE.fwd_dt
WHEN datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) <= datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) then FWD_CTE.fwd_dt
ELSE PRIOR_CTE.prior_dt END
as selected_dt
INTO #jls
FROM FWD_CTE
INNER JOIN PRIOR_CTE ON FWD_CTE.callAnum = PRIOR_CTE.callAnum
AND FWD_CTE.callBnum = PRIOR_CTE.callBnum
AND FWD_CTE.callGDateTime = PRIOR_CTE.callGDateTime
ORDER BY
FWD_CTE.callAnum,
FWD_CTE.callBnum,
FWD_CTE.callGDateTime
SET STATISTICS TIME, IO OFF;
-- CHECK RESULTS
SELECT callAnum, callBnum, callGDateTime , T1, T2
FROM
(
SELECT callAnum, callBnum, callGDateTime, 'T1' as T1, NULL as T2
FROM TABLE1
WHERE (callAnum = 1) AND (callBnum = 1)
UNION ALL
SELECT callAnum, callBnum, callGDateTime, NULL as T1, 'T2' as T2
FROM TABLE2
WHERE (callAnum = 1) AND (callBnum = 1)
)x
ORDER BY callGDateTime
SELECT *
FROM #jls
WHERE (callAnum = 1) AND (callBnum = 1)
ORDER BY callAnum ASC,callBnum ASC,[callGDateTime] ASC
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
April 9, 2016 at 12:13 pm
salardx (4/9/2016)
Hi guysassume we have two tables with the following design:
their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values
I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:
select * from table1
inner join table2 on
(
table1.callAnum = table2.callAnum and
table1.callBnum = table2.callBnum and
table1.callGDatetime minimum distance to table2.callGDatetime
)
Is there any sufficient way? there are millions of records! so the faster way, the better way
Thank you in advance
This is a terribly unreliable way to do a join... and what if there's a tie between the closest "less than" and the closest "greater than"? What do you want done then?
Are there no columns that uniquely identify each row?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 10, 2016 at 4:59 am
Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:
SELECT List, the, columns
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;
April 11, 2016 at 6:07 am
Hugo Kornelis (4/10/2016)
Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:
SELECT List, the, columns
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;
As nice as that is, it's still unreliable because of the data, or rather, what's not in the data.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 11, 2016 at 1:42 pm
Hugo Kornelis (4/10/2016)
Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:
SELECT List, the, columns
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;
You might also need the ABS(DATEDIFF())) piece inside the WHERE clause if you want to limit it to a certain span.
Otherwise, this is probably best-effort for the whole thing.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 11, 2016 at 3:38 pm
It would be appreciated if the OP can post some sample data etc.
if there are millions of rows as suggested then it would be good to see some performance testing.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
April 11, 2016 at 4:12 pm
J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.if there are millions of rows as suggested then it would be good to see some performance testing.
Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.
There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 11, 2016 at 4:19 pm
Jeff Moden (4/11/2016)
J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.if there are millions of rows as suggested then it would be good to see some performance testing.
Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.
There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.
agreed...
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
April 11, 2016 at 4:19 pm
Jeff Moden (4/11/2016)
J Livingston SQL (4/11/2016)
It would be appreciated if the OP can post some sample data etc.if there are millions of rows as suggested then it would be good to see some performance testing.
Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.
There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.
Or it's possible there aren't.
I've had to work with connecting disparate data stores together based on "best pattern" instead of "exact match" numerous times.
If the business defines "closest in time" as what they want to match, then it's not wrong.
We only have part of the story. There might be more relevant data, or there might be more to the use-case than we know.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 11, 2016 at 10:25 pm
It looks line those 2 tables contain records about the same phone cals recorded by 2 different systems on opposite ends of the calls.
Clocks on those systems might be out os sync, and one of them may record when the call was initiated (calling side) and another one when the call was accepted (receiving side). Therefore time stamps and durations might be slightly different.
The correspondence can be established with higher level of certainty if to include into consideration additional logical points:
- there must be a way to establish approximate time difference between 2 servers, corresponding correction must be applied to one of the sides;
- time at the source of a call cannot be later than time at the receiving end;
- time at the initiating end may not be earlier than 1 minute (or whatever is call time-out duration) before the corresponding time at the receiving end;
- duration at receiving end must be longer than the duration at initiating end, but not longer than that + 1 minute.
If you incorporate this logic to a query the matching must become pretty certain.
Records must be matched only once.
Established matches must be recorded into a table in form of pairs of CallID's, so no other query would need to repeat this complicated matching process.
_____________
Code for TallyGenerator
April 12, 2016 at 12:44 pm
here is the original post set up for SQL
USE tempdb
GO
IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL
DROP TABLE TABLE1;
IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL
DROP TABLE TABLE2;
CREATE TABLE Table1(
callID_T1 INT IDENTITY(1, 1) NOT NULL
,callAnum BIGINT NOT NULL
,callBnum INTEGER NOT NULL
,callGDateTime DATETIME NOT NULL
,duration INTEGER NOT NULL
);
INSERT INTO Table1(callAnum,callBnum,callGDateTime,duration) VALUES
(4436240307,96345,'2016-01-22 12:46:13.000',18)
,(4436240307,96345,'2016-01-23 13:21:11.000',9)
,(4436240307,96345,'2016-01-23 13:23:06.000',16)
,(4436240307,96345,'2016-01-23 13:25:29.000',5)
,(4436240307,96345,'2016-01-23 18:32:12.000',25)
,(4436240307,96345,'2016-01-24 17:23:19.000',7)
,(4436240307,96345,'2016-01-24 17:24:03.000',20)
,(4436240307,96345,'2016-01-25 13:32:34.000',16)
,(4436240307,96345,'2016-01-28 12:39:13.000',22)
,(4436240307,96345,'2016-01-28 13:12:57.000',5)
,(4436240307,96345,'2016-02-03 17:35:25.000',28)
,(4436240307,96345,'2016-02-14 19:24:08.000',19);
CREATE TABLE Table2(
callID_T2 INT IDENTITY(1, 1) NOT NULL
,callAnum BIGINT NOT NULL
,callBnum INTEGER NOT NULL
,callGDateTime DATETIME NOT NULL
,duration INTEGER NOT NULL
);
INSERT INTO Table2(callAnum,callBnum,callGDateTime,duration) VALUES
(4436240307,96345,'2016-01-22 12:43:57.000',19)
,(4436240307,96345,'2016-01-23 13:19:03.000',9)
,(4436240307,96345,'2016-01-23 13:20:52.000',17)
,(4436240307,96345,'2016-01-23 13:23:25.000',6)
,(4436240307,96345,'2016-01-23 18:29:48.000',25)
,(4436240307,96345,'2016-01-24 17:21:14.000',7)
,(4436240307,96345,'2016-01-24 17:21:45.000',21)
,(4436240307,96345,'2016-01-25 13:30:19.000',17)
,(4436240307,96345,'2016-01-28 12:36:52.000',23)
,(4436240307,96345,'2016-01-28 13:10:53.000',6)
,(4436240307,96345,'2016-02-03 17:32:57.000',29)
,(4436240307,96345,'2016-02-14 19:21:49.000',20);
If we then use Hugo's response
SELECT *
FROM Table1 AS t1
CROSS APPLY
(SELECT TOP(1) *
FROM Table2 AS t2
WHERE t2.callAnum = t1.callAnum
AND t2.callBnum = t1.callBnum
ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))
) AS t2;
this delivers the following result....note that rows 4 & 7 are each returned twice from Table 2.
Is this what you want?
+-------------------------------------------------------------------------------------------------------------------------------------------------+
¦ callID_T1 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦ callID_T2 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦
¦-----------+------------+----------+-------------------------+----------+-----------+------------+----------+-------------------------+----------¦
¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:46:13.000 ¦ 18 ¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:43:57.000 ¦ 19 ¦
¦ 2 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:21:11.000 ¦ 9 ¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:20:52.000 ¦ 17 ¦
¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:06.000 ¦ 16 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦
¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:25:29.000 ¦ 5 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦
¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:32:12.000 ¦ 25 ¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:29:48.000 ¦ 25 ¦
¦ 6 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:23:19.000 ¦ 7 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦
¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:24:03.000 ¦ 20 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦
¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:32:34.000 ¦ 16 ¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:30:19.000 ¦ 17 ¦
¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:39:13.000 ¦ 22 ¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:36:52.000 ¦ 23 ¦
¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:12:57.000 ¦ 5 ¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:10:53.000 ¦ 6 ¦
¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:35:25.000 ¦ 28 ¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:32:57.000 ¦ 29 ¦
¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:24:08.000 ¦ 19 ¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:21:49.000 ¦ 20 ¦
+-------------------------------------------------------------------------------------------------------------------------------------------------+
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
April 12, 2016 at 9:09 pm
GSquared (4/11/2016)
If the business defines "closest in time" as what they want to match, then it's not wrong.
I have to disagree with that in a lot of cases. I consider it a part of my job to protect the business from bad requirements.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 13, 2016 at 8:51 am
not sure what the OP wants....keeps peeping out from behind the sofa each day and logs on......but never replies 🙂
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply