I need a strange join, please help

  • Hi guys

    assume we have two tables with the following design:

    their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values

    I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:

    select * from table1

    inner join table2 on

    (

    table1.callAnum = table2.callAnum and

    table1.callBnum = table2.callBnum and

    table1.callGDatetime minimum distance to table2.callGDatetime

    )

    Is there any sufficient way? there are millions of records! so the faster way, the better way

    Thank you in advance

  • can you please post create table / insert data scripts and expected results based on your sample data

    https://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • i'll throw this out here

    make of it what you will and see how you get on.....no claims on performance or accuracy, need you to review to see if it what you are looking for

    -- create a sample set of data

    USE tempdb

    GO

    IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL

    DROP TABLE TABLE1;

    IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL

    DROP TABLE TABLE2;

    IF OBJECT_ID('tempdb..#jls','U') IS NOT NULL

    DROP TABLE #jls

    SELECT TOP (1000000)

    callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),

    callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),

    callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),

    duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)

    INTO TABLE1

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CREATE CLUSTERED INDEX [T1_CIX] ON [dbo].[TABLE1]

    ([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)

    SELECT TOP (1000000)

    callAnum = 1 + CAST(Abs(Checksum(Newid()) % 1000 ) AS INT),

    callBnum = 1 + CAST(Abs(Checksum(Newid()) % 3) AS INT),

    callGDateTime = RAND(CHECKSUM(NEWID())) * 7 + CAST('20160401' as datetime),

    duration = 1 + CAST(Abs(Checksum(Newid()) % 50) AS INT)

    INTO TABLE2

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CREATE CLUSTERED INDEX [T2_CIX] ON [dbo].[TABLE2]

    ([callAnum] ASC,[callBnum] ASC,[callGDateTime] ASC)

    --- is the the type of results you are expecting??

    SET STATISTICS TIME, IO ON;

    WITH FWD_CTE AS (

    SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,

    (

    SELECT TOP 1 T2.callGDateTime

    FROM TABLE2 T2

    WHERE T1.callGDateTime < T2.callGDateTime

    AND T1.callAnum = T2.callAnum

    AND T1.callBnum = T2.callBnum

    ORDER BY T2.callGDateTime

    ) fwd_dt

    FROM TABLE1 T1

    )

    ,

    PRIOR_CTE AS (

    SELECT T1.callAnum, T1.callBnum,T1.callGDateTime,

    (

    SELECT TOP 1 T2.callGDateTime

    FROM TABLE2 T2

    WHERE T2.callGDateTime < T1.callGDateTime

    AND T1.callAnum = T2.callAnum

    AND T1.callBnum = T2.callBnum

    ORDER BY T2.callGDateTime DESC

    ) prior_dt

    FROM

    TABLE1 T1

    )

    SELECT FWD_CTE.callAnum,

    FWD_CTE.callBnum,

    FWD_CTE.callGDateTime,

    FWD_CTE.fwd_dt,

    PRIOR_CTE.prior_dt,

    datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) fwddatediff,

    datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) priordatediff,

    CASE

    WHEN FWD_CTE.fwd_dt IS NULL then PRIOR_CTE.prior_dt

    WHEN PRIOR_CTE.prior_dt IS NULL then FWD_CTE.fwd_dt

    WHEN datediff(s,FWD_CTE.callGDateTime,FWD_CTE.fwd_dt) <= datediff(s,PRIOR_CTE.prior_dt , FWD_CTE.callGDateTime) then FWD_CTE.fwd_dt

    ELSE PRIOR_CTE.prior_dt END

    as selected_dt

    INTO #jls

    FROM FWD_CTE

    INNER JOIN PRIOR_CTE ON FWD_CTE.callAnum = PRIOR_CTE.callAnum

    AND FWD_CTE.callBnum = PRIOR_CTE.callBnum

    AND FWD_CTE.callGDateTime = PRIOR_CTE.callGDateTime

    ORDER BY

    FWD_CTE.callAnum,

    FWD_CTE.callBnum,

    FWD_CTE.callGDateTime

    SET STATISTICS TIME, IO OFF;

    -- CHECK RESULTS

    SELECT callAnum, callBnum, callGDateTime , T1, T2

    FROM

    (

    SELECT callAnum, callBnum, callGDateTime, 'T1' as T1, NULL as T2

    FROM TABLE1

    WHERE (callAnum = 1) AND (callBnum = 1)

    UNION ALL

    SELECT callAnum, callBnum, callGDateTime, NULL as T1, 'T2' as T2

    FROM TABLE2

    WHERE (callAnum = 1) AND (callBnum = 1)

    )x

    ORDER BY callGDateTime

    SELECT *

    FROM #jls

    WHERE (callAnum = 1) AND (callBnum = 1)

    ORDER BY callAnum ASC,callBnum ASC,[callGDateTime] ASC

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • salardx (4/9/2016)


    Hi guys

    assume we have two tables with the following design:

    their callAnum and callBnum are easily joinable, however, their callGDatetime shows different values

    I need to join these 2 tables based on callAnum, callBnum, callGDatetime in a way that their callGDatetime will be very close to each other! the minimum distance! something like this:

    select * from table1

    inner join table2 on

    (

    table1.callAnum = table2.callAnum and

    table1.callBnum = table2.callBnum and

    table1.callGDatetime minimum distance to table2.callGDatetime

    )

    Is there any sufficient way? there are millions of records! so the faster way, the better way

    Thank you in advance

    This is a terribly unreliable way to do a join... and what if there's a tie between the closest "less than" and the closest "greater than"? What do you want done then?

    Are there no columns that uniquely identify each row?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:

    SELECT List, the, columns

    FROM Table1 AS t1

    CROSS APPLY

    (SELECT TOP(1) *

    FROM Table2 AS t2

    WHERE t2.callAnum = t1.callAnum

    AND t2.callBnum = t1.callBnum

    ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

    ) AS t2;


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (4/10/2016)


    Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:

    SELECT List, the, columns

    FROM Table1 AS t1

    CROSS APPLY

    (SELECT TOP(1) *

    FROM Table2 AS t2

    WHERE t2.callAnum = t1.callAnum

    AND t2.callBnum = t1.callBnum

    ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

    ) AS t2;

    As nice as that is, it's still unreliable because of the data, or rather, what's not in the data.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hugo Kornelis (4/10/2016)


    Without CREATE TABLE and INSERT statments I cannot test anything, but my gut feeling is that you will end up with a pattern somewhat like the below:

    SELECT List, the, columns

    FROM Table1 AS t1

    CROSS APPLY

    (SELECT TOP(1) *

    FROM Table2 AS t2

    WHERE t2.callAnum = t1.callAnum

    AND t2.callBnum = t1.callBnum

    ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

    ) AS t2;

    You might also need the ABS(DATEDIFF())) piece inside the WHERE clause if you want to limit it to a certain span.

    Otherwise, this is probably best-effort for the whole thing.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • It would be appreciated if the OP can post some sample data etc.

    if there are millions of rows as suggested then it would be good to see some performance testing.

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (4/11/2016)


    It would be appreciated if the OP can post some sample data etc.

    if there are millions of rows as suggested then it would be good to see some performance testing.

    Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.

    There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/11/2016)


    J Livingston SQL (4/11/2016)


    It would be appreciated if the OP can post some sample data etc.

    if there are millions of rows as suggested then it would be good to see some performance testing.

    Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.

    There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

    agreed...

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Jeff Moden (4/11/2016)


    J Livingston SQL (4/11/2016)


    It would be appreciated if the OP can post some sample data etc.

    if there are millions of rows as suggested then it would be good to see some performance testing.

    Heh... performance test how fast we can come up with the wrong answers? 😛 Might as well just do a ROW_NUMBER() in temporal order partitioned by CallAnum and CallBnum and join on that.

    There have to be some columns missing that would tell the rest of the story. If "Call" is representative of "Telephone Call", there's a whole lot more information available that would help with correct matching. Otherwise, it's a temporal crapshoot. Even if it has nothing to do with telephony, it's difficult to believe that the columns the OP presented are the only columns available.

    Or it's possible there aren't.

    I've had to work with connecting disparate data stores together based on "best pattern" instead of "exact match" numerous times.

    If the business defines "closest in time" as what they want to match, then it's not wrong.

    We only have part of the story. There might be more relevant data, or there might be more to the use-case than we know.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • It looks line those 2 tables contain records about the same phone cals recorded by 2 different systems on opposite ends of the calls.

    Clocks on those systems might be out os sync, and one of them may record when the call was initiated (calling side) and another one when the call was accepted (receiving side). Therefore time stamps and durations might be slightly different.

    The correspondence can be established with higher level of certainty if to include into consideration additional logical points:

    - there must be a way to establish approximate time difference between 2 servers, corresponding correction must be applied to one of the sides;

    - time at the source of a call cannot be later than time at the receiving end;

    - time at the initiating end may not be earlier than 1 minute (or whatever is call time-out duration) before the corresponding time at the receiving end;

    - duration at receiving end must be longer than the duration at initiating end, but not longer than that + 1 minute.

    If you incorporate this logic to a query the matching must become pretty certain.

    Records must be matched only once.

    Established matches must be recorded into a table in form of pairs of CallID's, so no other query would need to repeat this complicated matching process.

    _____________
    Code for TallyGenerator

  • here is the original post set up for SQL

    USE tempdb

    GO

    IF OBJECT_ID('tempdb..TABLE1','U') IS NOT NULL

    DROP TABLE TABLE1;

    IF OBJECT_ID('tempdb..TABLE2','U') IS NOT NULL

    DROP TABLE TABLE2;

    CREATE TABLE Table1(

    callID_T1 INT IDENTITY(1, 1) NOT NULL

    ,callAnum BIGINT NOT NULL

    ,callBnum INTEGER NOT NULL

    ,callGDateTime DATETIME NOT NULL

    ,duration INTEGER NOT NULL

    );

    INSERT INTO Table1(callAnum,callBnum,callGDateTime,duration) VALUES

    (4436240307,96345,'2016-01-22 12:46:13.000',18)

    ,(4436240307,96345,'2016-01-23 13:21:11.000',9)

    ,(4436240307,96345,'2016-01-23 13:23:06.000',16)

    ,(4436240307,96345,'2016-01-23 13:25:29.000',5)

    ,(4436240307,96345,'2016-01-23 18:32:12.000',25)

    ,(4436240307,96345,'2016-01-24 17:23:19.000',7)

    ,(4436240307,96345,'2016-01-24 17:24:03.000',20)

    ,(4436240307,96345,'2016-01-25 13:32:34.000',16)

    ,(4436240307,96345,'2016-01-28 12:39:13.000',22)

    ,(4436240307,96345,'2016-01-28 13:12:57.000',5)

    ,(4436240307,96345,'2016-02-03 17:35:25.000',28)

    ,(4436240307,96345,'2016-02-14 19:24:08.000',19);

    CREATE TABLE Table2(

    callID_T2 INT IDENTITY(1, 1) NOT NULL

    ,callAnum BIGINT NOT NULL

    ,callBnum INTEGER NOT NULL

    ,callGDateTime DATETIME NOT NULL

    ,duration INTEGER NOT NULL

    );

    INSERT INTO Table2(callAnum,callBnum,callGDateTime,duration) VALUES

    (4436240307,96345,'2016-01-22 12:43:57.000',19)

    ,(4436240307,96345,'2016-01-23 13:19:03.000',9)

    ,(4436240307,96345,'2016-01-23 13:20:52.000',17)

    ,(4436240307,96345,'2016-01-23 13:23:25.000',6)

    ,(4436240307,96345,'2016-01-23 18:29:48.000',25)

    ,(4436240307,96345,'2016-01-24 17:21:14.000',7)

    ,(4436240307,96345,'2016-01-24 17:21:45.000',21)

    ,(4436240307,96345,'2016-01-25 13:30:19.000',17)

    ,(4436240307,96345,'2016-01-28 12:36:52.000',23)

    ,(4436240307,96345,'2016-01-28 13:10:53.000',6)

    ,(4436240307,96345,'2016-02-03 17:32:57.000',29)

    ,(4436240307,96345,'2016-02-14 19:21:49.000',20);

    If we then use Hugo's response

    SELECT *

    FROM Table1 AS t1

    CROSS APPLY

    (SELECT TOP(1) *

    FROM Table2 AS t2

    WHERE t2.callAnum = t1.callAnum

    AND t2.callBnum = t1.callBnum

    ORDER BY ABS(DATEDIFF(ms, t2.callGDAtetime, t1.callGDatetime))

    ) AS t2;

    this delivers the following result....note that rows 4 & 7 are each returned twice from Table 2.

    Is this what you want?

    +-------------------------------------------------------------------------------------------------------------------------------------------------+

    ¦ callID_T1 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦ callID_T2 ¦ callAnum ¦ callBnum ¦ callGDateTime ¦ duration ¦

    ¦-----------+------------+----------+-------------------------+----------+-----------+------------+----------+-------------------------+----------¦

    ¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:46:13.000 ¦ 18 ¦ 1 ¦ 4436240307 ¦ 96345 ¦ 2016-01-22 12:43:57.000 ¦ 19 ¦

    ¦ 2 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:21:11.000 ¦ 9 ¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:20:52.000 ¦ 17 ¦

    ¦ 3 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:06.000 ¦ 16 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦

    ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:25:29.000 ¦ 5 ¦ 4 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 13:23:25.000 ¦ 6 ¦

    ¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:32:12.000 ¦ 25 ¦ 5 ¦ 4436240307 ¦ 96345 ¦ 2016-01-23 18:29:48.000 ¦ 25 ¦

    ¦ 6 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:23:19.000 ¦ 7 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦

    ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:24:03.000 ¦ 20 ¦ 7 ¦ 4436240307 ¦ 96345 ¦ 2016-01-24 17:21:45.000 ¦ 21 ¦

    ¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:32:34.000 ¦ 16 ¦ 8 ¦ 4436240307 ¦ 96345 ¦ 2016-01-25 13:30:19.000 ¦ 17 ¦

    ¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:39:13.000 ¦ 22 ¦ 9 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 12:36:52.000 ¦ 23 ¦

    ¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:12:57.000 ¦ 5 ¦ 10 ¦ 4436240307 ¦ 96345 ¦ 2016-01-28 13:10:53.000 ¦ 6 ¦

    ¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:35:25.000 ¦ 28 ¦ 11 ¦ 4436240307 ¦ 96345 ¦ 2016-02-03 17:32:57.000 ¦ 29 ¦

    ¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:24:08.000 ¦ 19 ¦ 12 ¦ 4436240307 ¦ 96345 ¦ 2016-02-14 19:21:49.000 ¦ 20 ¦

    +-------------------------------------------------------------------------------------------------------------------------------------------------+

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • GSquared (4/11/2016)


    If the business defines "closest in time" as what they want to match, then it's not wrong.

    I have to disagree with that in a lot of cases. I consider it a part of my job to protect the business from bad requirements.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • not sure what the OP wants....keeps peeping out from behind the sofa each day and logs on......but never replies 🙂

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply