Better way of getting this result?

  • Pretty self-explanatory query. Basically, given a table that has transactions with a date, get a count of all the transactions which have four subsequent transactions from the same card within a 24-hour period.

    I solved this problem using a subquery, but against a table which has hundreds of millions of rows, this is pretty slow. In my particular case this was just for a one-shot result, so its not really terribly important, but I was wondering if anyone knew of a better way of getting these results.

    Note - I have no control over the actual table in question. The CardID field and the TranDate field both have indexes, but I can't do any more manipulation of them than that.

    CREATE TABLE #Transactions

    (

    ID INT IDENTITY PRIMARY KEY,

    CardID INT,

    TranDate DATETIME

    )

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-06-01 05:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-07-01 05:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (2, '2012-06-01 12:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (4, '2012-06-01 14:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-08-01 12:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (5, '2012-04-01 13:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (3, '2012-05-01 05:23:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-06-01 05:05:15')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (2, '2012-07-01 05:12:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (3, '2012-06-02 05:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-06-01 05:12:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (4, '2012-06-02 05:05:12')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (3, '2012-06-01 12:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (5, '2012-06-01 14:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (1, '2012-06-01 15:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (3, '2012-06-01 12:12:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (4, '2012-06-01 13:05:05')

    INSERT INTO #Transactions (CardID, TranDate)

    VALUES (2, '2012-06-01 05:12:05')

    SELECT

    CardID,

    TranDate,

    (

    SELECT COUNT(*)

    FROM #Transactions t2

    WHEREt2.CardID = t1.CardID

    AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

    )

    FROM #Transactions t1

  • You could use APPLY but on a small dataset it isn't likely to make much (if any) difference. The execution plan is only slightly different than your subquery approach.

    SELECT

    CardID,

    TranDate, x.MyRowCount

    FROM #Transactions t1

    cross apply

    (

    SELECT COUNT(*) as MyRowCount

    FROM #Transactions t2

    WHEREt2.CardID = t1.CardID

    AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

    ) x

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean beat me to it, but here is a setup that also creates a 1,000,000 row data set to test.

    CREATE TABLE #Transactions

    (

    ID INT IDENTITY PRIMARY KEY,

    CardID INT,

    TranDate DATETIME

    );

    create index CardID_TranDate on #Transactions (

    CardID, TranDate);

    go

    with e1(n) as (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    e2(n) as (select 1 from e1 a cross join e1 b),

    e4(n) as (select 1 from e2 a cross join e2 b),

    e6(n) as (select row_number() over (order by (select null)) from e4 a cross join e2 b)

    insert into #Transactions(CardID, TranDate)

    select cast(floor(rand(checksum(newid())) * 10000) + 1 as int), dateadd(ss, floor(rand(checksum(newid())) * 31536000), '2012-01-01') from e6;

    go

    SELECT

    CardID,

    TranDate,

    (

    SELECT COUNT(*)

    FROM #Transactions t2

    WHEREt2.CardID = t1.CardID

    AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

    )

    FROM #Transactions t1

    ORDER BY CardID, TranDate;

    go

    select

    t1.CardID,

    t1.TranDate,

    dt.CardCnt

    from

    #Transactions t1

    cross apply (select t2.CardID, count(t2.TranDate) CardCnt

    from #Transactions t2

    where t2.CardID = t1.CardID and t2.TranDate between t1.TranDate and dateadd(dd, 1, t1.TranDate)

    group by t2.CardID

    having count(t2.TranDate) >= 4) dt

    order by

    t1.CardID,

    t1.TranDate

    ;

    go

    --drop table #Transactions;

    go

    When I ran it on my virtual system (2 processors, 8 GB RAM) including returning the data set, 6 seconds.

  • NOTE, the 6 seconds I talk about above is not for all the code, just the code returning records with a count of 4 or more in a 24 hour period.

  • Hmm. Hadn't thought of using APPLY. Thanks for the heads up, I'll give it a shot.

  • Cool - so I was messing around, and found that the best way of being able to get my actual results was to run an SSIS package that would export the data from my Oracle source into a SQL Server table. The output that I got was exactly the same as the one I presented above - a table with two columns, one being the card number, and the second being the transaction date.

    I then put a primary key identity field on the table, and created two indexes - one on card number and tran date, and the other on tran date and card number.

    I ran my original query, and it took 29 seconds to complete. I then tried your version Lynn, and it took 8seconds. However, just on a hunch, I tried a slightly modified version of my query, where I wrapped the query I posted with a CTE, then ran a SELECT against the CTE where the count was >= 4:

    WITH cte AS

    (

    SELECT

    CardID,

    TranDate,

    (

    SELECT COUNT(*)

    FROM #Transactions t2

    WHEREt2.CardID = t1.CardID

    AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

    ) AS CardCnt

    FROM #Transactions t1

    )

    SELECT * FROM cte WHERE CardCnt >= 4

    That version took only 2 seconds to run! I'm not really sure why it performs so much better? My data set has ~850000 rows, so is it just because it takes the server an extra 26 seconds just to show those results, even though everything is indexed, and the processor essentially needs to do all the same work behind the scenes anyway?

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply