Better way of getting this result?

Question

Better way of getting this result?

kramaswamy

SSCoach

Points: 18181
More actions
July 20, 2012 at 2:11 pm

#278263

Pretty self-explanatory query. Basically, given a table that has transactions with a date, get a count of all the transactions which have four subsequent transactions from the same card within a 24-hour period.
I solved this problem using a subquery, but against a table which has hundreds of millions of rows, this is pretty slow. In my particular case this was just for a one-shot result, so its not really terribly important, but I was wondering if anyone knew of a better way of getting these results.
Note - I have no control over the actual table in question. The CardID field and the TranDate field both have indexes, but I can't do any more manipulation of them than that.
CREATE TABLE #Transactions
(
ID INT IDENTITY PRIMARY KEY,
CardID INT,
TranDate DATETIME
)
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-07-01 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-06-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-01 14:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-08-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (5, '2012-04-01 13:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-05-01 05:23:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:05:15')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-07-01 05:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-02 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-02 05:05:12')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (5, '2012-06-01 14:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 15:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-01 12:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-01 13:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-06-01 05:12:05')
SELECT
CardID,
TranDate,
(
SELECT COUNT(*)
FROM #Transactions t2
WHEREt2.CardID = t1.CardID
AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1
)
FROM #Transactions t1

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286590 More actions · Answer 1

You could use APPLY but on a small dataset it isn't likely to make much (if any) difference. The execution plan is only slightly different than your subquery approach.

SELECT

CardID,

TranDate, x.MyRowCount

FROM #Transactions t1

cross apply

(

SELECT COUNT(*) as MyRowCount

FROM #Transactions t2

WHEREt2.CardID = t1.CardID

AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

) x

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

Sean beat me to it, but here is a setup that also creates a 1,000,000 row data set to test.

CREATE TABLE #Transactions

(

ID INT IDENTITY PRIMARY KEY,

CardID INT,

TranDate DATETIME

);

create index CardID_TranDate on #Transactions (

CardID, TranDate);

go

with e1(n) as (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

e2(n) as (select 1 from e1 a cross join e1 b),

e4(n) as (select 1 from e2 a cross join e2 b),

e6(n) as (select row_number() over (order by (select null)) from e4 a cross join e2 b)

insert into #Transactions(CardID, TranDate)

select cast(floor(rand(checksum(newid())) * 10000) + 1 as int), dateadd(ss, floor(rand(checksum(newid())) * 31536000), '2012-01-01') from e6;

go

SELECT

CardID,

TranDate,

(

SELECT COUNT(*)

FROM #Transactions t2

WHEREt2.CardID = t1.CardID

AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

)

FROM #Transactions t1

ORDER BY CardID, TranDate;

go

select

t1.CardID,

t1.TranDate,

dt.CardCnt

from

#Transactions t1

cross apply (select t2.CardID, count(t2.TranDate) CardCnt

from #Transactions t2

where t2.CardID = t1.CardID and t2.TranDate between t1.TranDate and dateadd(dd, 1, t1.TranDate)

group by t2.CardID

having count(t2.TranDate) >= 4) dt

order by

t1.CardID,

t1.TranDate

;

go

--drop table #Transactions;

go

When I ran it on my virtual system (2 processors, 8 GB RAM) including returning the data set, 6 seconds.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 3

NOTE, the 6 seconds I talk about above is not for all the code, just the code returning records with a count of 4 or more in a 24 hour period.

kramaswamy SSCoach Points: 18181 More actions · Answer 4

Hmm. Hadn't thought of using APPLY. Thanks for the heads up, I'll give it a shot.

kramaswamy SSCoach Points: 18181 More actions · Answer 5

Cool - so I was messing around, and found that the best way of being able to get my actual results was to run an SSIS package that would export the data from my Oracle source into a SQL Server table. The output that I got was exactly the same as the one I presented above - a table with two columns, one being the card number, and the second being the transaction date.

I then put a primary key identity field on the table, and created two indexes - one on card number and tran date, and the other on tran date and card number.

I ran my original query, and it took 29 seconds to complete. I then tried your version Lynn, and it took 8seconds. However, just on a hunch, I tried a slightly modified version of my query, where I wrapped the query I posted with a CTE, then ran a SELECT against the CTE where the count was >= 4:

WITH cte AS

(

SELECT

CardID,

TranDate,

(

SELECT COUNT(*)

FROM #Transactions t2

WHEREt2.CardID = t1.CardID

AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1

) AS CardCnt

FROM #Transactions t1

)

SELECT * FROM cte WHERE CardCnt >= 4

That version took only 2 seconds to run! I'm not really sure why it performs so much better? My data set has ~850000 rows, so is it just because it takes the server an extra 26 seconds just to show those results, even though everything is indexed, and the processor essentially needs to do all the same work behind the scenes anyway?