July 20, 2012 at 2:11 pm
Pretty self-explanatory query. Basically, given a table that has transactions with a date, get a count of all the transactions which have four subsequent transactions from the same card within a 24-hour period.
I solved this problem using a subquery, but against a table which has hundreds of millions of rows, this is pretty slow. In my particular case this was just for a one-shot result, so its not really terribly important, but I was wondering if anyone knew of a better way of getting these results.
Note - I have no control over the actual table in question. The CardID field and the TranDate field both have indexes, but I can't do any more manipulation of them than that.
CREATE TABLE #Transactions
(
ID INT IDENTITY PRIMARY KEY,
CardID INT,
TranDate DATETIME
)
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-07-01 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-06-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-01 14:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-08-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (5, '2012-04-01 13:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-05-01 05:23:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:05:15')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-07-01 05:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-02 05:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 05:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-02 05:05:12')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-01 12:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (5, '2012-06-01 14:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (1, '2012-06-01 15:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (3, '2012-06-01 12:12:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (4, '2012-06-01 13:05:05')
INSERT INTO #Transactions (CardID, TranDate)
VALUES (2, '2012-06-01 05:12:05')
SELECT
CardID,
TranDate,
(
SELECT COUNT(*)
FROM #Transactions t2
WHEREt2.CardID = t1.CardID
AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1
)
FROM #Transactions t1
July 20, 2012 at 2:55 pm
You could use APPLY but on a small dataset it isn't likely to make much (if any) difference. The execution plan is only slightly different than your subquery approach.
SELECT
CardID,
TranDate, x.MyRowCount
FROM #Transactions t1
cross apply
(
SELECT COUNT(*) as MyRowCount
FROM #Transactions t2
WHEREt2.CardID = t1.CardID
AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1
) x
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
July 20, 2012 at 3:18 pm
Sean beat me to it, but here is a setup that also creates a 1,000,000 row data set to test.
CREATE TABLE #Transactions
(
ID INT IDENTITY PRIMARY KEY,
CardID INT,
TranDate DATETIME
);
create index CardID_TranDate on #Transactions (
CardID, TranDate);
go
with e1(n) as (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
e2(n) as (select 1 from e1 a cross join e1 b),
e4(n) as (select 1 from e2 a cross join e2 b),
e6(n) as (select row_number() over (order by (select null)) from e4 a cross join e2 b)
insert into #Transactions(CardID, TranDate)
select cast(floor(rand(checksum(newid())) * 10000) + 1 as int), dateadd(ss, floor(rand(checksum(newid())) * 31536000), '2012-01-01') from e6;
go
SELECT
CardID,
TranDate,
(
SELECT COUNT(*)
FROM #Transactions t2
WHEREt2.CardID = t1.CardID
AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1
)
FROM #Transactions t1
ORDER BY CardID, TranDate;
go
select
t1.CardID,
t1.TranDate,
dt.CardCnt
from
#Transactions t1
cross apply (select t2.CardID, count(t2.TranDate) CardCnt
from #Transactions t2
where t2.CardID = t1.CardID and t2.TranDate between t1.TranDate and dateadd(dd, 1, t1.TranDate)
group by t2.CardID
having count(t2.TranDate) >= 4) dt
order by
t1.CardID,
t1.TranDate
;
go
--drop table #Transactions;
go
When I ran it on my virtual system (2 processors, 8 GB RAM) including returning the data set, 6 seconds.
July 20, 2012 at 3:53 pm
NOTE, the 6 seconds I talk about above is not for all the code, just the code returning records with a count of 4 or more in a 24 hour period.
July 23, 2012 at 7:35 am
Hmm. Hadn't thought of using APPLY. Thanks for the heads up, I'll give it a shot.
July 23, 2012 at 9:12 am
Cool - so I was messing around, and found that the best way of being able to get my actual results was to run an SSIS package that would export the data from my Oracle source into a SQL Server table. The output that I got was exactly the same as the one I presented above - a table with two columns, one being the card number, and the second being the transaction date.
I then put a primary key identity field on the table, and created two indexes - one on card number and tran date, and the other on tran date and card number.
I ran my original query, and it took 29 seconds to complete. I then tried your version Lynn, and it took 8seconds. However, just on a hunch, I tried a slightly modified version of my query, where I wrapped the query I posted with a CTE, then ran a SELECT against the CTE where the count was >= 4:
WITH cte AS
(
SELECT
CardID,
TranDate,
(
SELECT COUNT(*)
FROM #Transactions t2
WHEREt2.CardID = t1.CardID
AND t2.TranDate BETWEEN t1.TranDate AND t1.TranDate + 1
) AS CardCnt
FROM #Transactions t1
)
SELECT * FROM cte WHERE CardCnt >= 4
That version took only 2 seconds to run! I'm not really sure why it performs so much better? My data set has ~850000 rows, so is it just because it takes the server an extra 26 seconds just to show those results, even though everything is indexed, and the processor essentially needs to do all the same work behind the scenes anyway?
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply