Syntax help required

  • No mike, it's the same issue, I'm not doing a good job of explaining what I need, I'm going to have another go at it, so drop exapmlelog if you haven’t already.

    CREATE TABLE [dbo].[ExampleLog](

    [ID] [uniqueidentifier] NOT NULL,

    [EVENTNUMBER] [int] NULL,

    [DISPLAYSERIAL] [int] NULL,

    [DISPLAYID] [varchar](50) NULL,

    [OBJECTSERIAL] [int] NULL,

    [OBJECTRSSILEVEL] [int] NULL,

    [OBJECTID] [varchar](50) NULL,

    [TIMESTAMP] [datetime] NULL)

    INSERT INTO [dbo].[ExampleLog]

    VALUES

    (NewID(), 206, 6897913, 'HV 644', 6885819, 66, 'WAP 8', '2012-01-01 12:12:00'),

    (NewID(), 206, 6897913, 'HV 644', 6885819, 66, 'WAP 8', '2012-01-01 12:13:00'),

    (NewID(), 206, 6897913, 'HV 644', 6877807, 54, 'MV 20', '2012-01-01 12:16:00'),

    (NewID(), 103, 6897913, 'HV 644', 6898052, 118, 'HV 62', '2012-01-01 12:20:00'),

    (NewID(), 206, 6897913, 'HV 644', 6886130, 122, 'HV 62', '2012-01-01 12:21:00'),

    (NewID(), 206, 6897913, 'HV 644', 6886130, 106, 'HV 62', '2012-01-01 12:25:00'),

    (NewID(), 204, 6897913, 'HV 644', 6886130, 106, 'HV 62', '2012-01-01 12:31:00'),

    (NewID(), 204, 6897913, 'HV 644', 6886130, 106, 'HV 62', '2012-01-01 12:33:00'),

    (NewID(), 104, 6897913, 'HV 644', 6886130, 106, 'HV 62', '2012-01-01 12:39:00')

    I need to find a way of grouping rows that occur within 5 minutes of the latest timestamp in the group and the displayid and object id are the same, so using a common table expression on our sample data the expected results would look like

    RowIDEVENTNUMBERDISPLAYSERIALDISPLAYIDOBJECTSERIALOBJECTRSSILEVELOBJECTIDTIMESTAMPtmp_rank

    1A320ECEB-42A0-4A10-B97D-D5751895C0BE2066897913HV 644688581966WAP 82012-01-01 12:12:00.0001

    2CD89928B-80C5-4CAA-BEA5-D4135BE68D652066897913HV 644688581966WAP 82012-01-01 12:13:00.0002

    348873522-918E-46A8-B804-78CDE6487A582066897913HV 644687780754MV 202012-01-01 12:16:00.0001

    433EBDD97-51BB-40A5-9F78-265AF618835A1036897913HV 6446898052118HV 622012-01-01 12:20:00.0001

    516AFF3FA-39F7-456B-8A4C-275A08094C2B2066897913HV 6446886130122HV 622012-01-01 12:21:00.0002

    6504AF293-DE97-4010-A217-3F4225E1D1362066897913HV 6446886130106HV 622012-01-01 12:25:00.0003

    749545FA8-4100-47BF-8437-5D221F9E4A1D2046897913HV 6446886130106HV 622012-01-01 12:31:00.0001

    8955F4BEB-ADAD-489F-8913-4CF6824DBE6D2046897913HV 6446886130106HV 622012-01-01 12:33:00.0002

    991174B80-08E0-49A5-8B39-635B38859BD91046897913HV 6446886130106HV 622012-01-01 12:39:00.0001

    Rows 1 and 2 are grouped because the objectid and displayid are the same and row 2 occurs within 5 minutes of row 1, there are no other rows in the dataset that contain the same displayid and objectid and a timestamp within 5 minutes of the group’s latest timestamp (row 2) so..

    Row 3 is grouped by itself because there are no rows in the dataset where the displayid and objectid match within a 5 minute of that groups latest timestamp (row 3).

    Row 4, 5 and 6 are grouped because the displayid's match, the objectid's match and the rows timestamps are within 5 minutes of the last row, row 5's timestamp is within 5 minutes of row 4's, and row 6 timestamp is within 5 minutes of row 5's.

    Row 7 and 8 are grouped because the objectid's and displayid's are the same and row 8's timestamp is within 5 minutes of that groups latest timestamp (row 7).

    Row 9 is grouped by itself because there are no rows in the dataset where the displayid and objectid match within a 5 minute of that groups latest timestamp

  • WITH

    cte_remove AS (

    SELECT b.*

    FROM dbo.ExampleLog a

    INNER JOIN dbo.ExampleLog b ON

    a.ID != b.ID AND

    a.DISPLAYID = b.DISPLAYID AND

    a.OBJECTID = b.OBJECTID AND

    b.TIMESTAMP BETWEEN a.TIMESTAMP AND DATEADD(SECOND, 300, a.TIMESTAMP)

    ),

    cte_results AS (

    SELECT * FROM dbo.ExampleLog EXCEPT

    SELECT * FROM cte_remove

    )

    SELECT * FROM cte_results ORDER BY [timestamp]

    GO

    ??

    Substitute BETWEEN for >= AND < depending on your boundary semantics.

    And order the WHERE predicates to suit your indexing (I expect the a.ID != b.ID clause might explode the data across a large table like a log, unless you've already limited by DISPLAYID, OBJECTID and TIMESTAMP).

  • Thanks Jimbob, I ran this over a few sample datasets in dev this morning, and this query is returning exactly the results I am after, much appreciated

  • You're welcome. Make sure you test this on a sample dataset with a large number of rows, though, just to make sure we haven't introduced a performance time-bomb!

    Using a random test set based on the below, I was getting non-trivial query times from as little as 100,000 rows, even with a covering index on DISPLAYID, OBJECTID and TIMESTAMP and a primary key on ID:

    WITH

    E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), -- 10^1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 10^2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 10^4 or 10,000 rows

    E8(N) AS (SELECT 1 FROM E4 a, E4 b), -- 10^8 or 100,000,000 rows

    cte_rand AS (

    SELECT

    n

    ,a = NEWID()

    ,b = NEWID()

    ,c = NEWID()

    ,d = NEWID()

    FROM E8

    )

    INSERT INTO [dbo].[ExampleLog]

    SELECT TOP 100000-- vary this to alter the sample size

    id = a

    ,eventnumber =200 + (CONVERT(INT, CONVERT(VARBINARY(4), a)) % 10)

    ,displayserial =6897900 + ABS(CONVERT(INT, CONVERT(VARBINARY(4), b)) % 3)

    ,displayid =CASE ABS(CONVERT(INT, CONVERT(VARBINARY(4), b)) % 3)

    WHEN 0 THEN 'AB 123'

    WHEN 1 THEN 'CD 456'

    WHEN 2 THEN 'EF 789'

    END

    ,objectserial =6886800 + (8 * ABS(CONVERT(INT, CONVERT(VARBINARY(4), c)) % 3))

    ,objectrssilevel =100 + (8 * ABS(CONVERT(INT, CONVERT(VARBINARY(4), c)) % 3))

    ,objectid =CASE ABS(CONVERT(INT, CONVERT(VARBINARY(4), c)) % 3)

    WHEN 0 THEN 'ZZZ 1'

    WHEN 1 THEN 'YY 99'

    WHEN 2 THEN 'XX 88'

    END

    ,timestamp =DATEADD(SECOND, ABS(CONVERT(BIGINT, CONVERT(VARBINARY(4), d)) % (60 * 60 * 24 * 7)), CONVERT(DATETIME, '2012-01-01T00:00:00'))

    FROM cte_rand

    GO

    Maybe now, with my query defining the expected results set, one of the more performance-minded contributors can derive a more performant approach...

Viewing 4 posts - 16 through 18 (of 18 total)

You must be logged in to reply to this topic. Login to reply