May 30, 2011 at 4:29 pm
Hi
This may be a version of a gap and island type of problem...
I would like to merge records for a each user based on start and end dates and sum a column (amount) for that user.
The rules for merging rows are:
1) Merge where the start and end dates of a row are between the start and end dates of another row
2) Merge where the start date of one row is the next day of another row
ie the end date for the first row for user 1111 is the 2010-08-07 and the following row is the 2010-08-08
Here is the test data
IF OBJECT_ID('tempdb..#user') IS NOT NULL
DROP TABLE #user;
CREATE TABLE #user
(
user_id INT
, start_date DATETIME
, end_date DATETIME
, amount DECIMAL(10, 2)
);
--TEST DATA
INSERT INTO #user SELECT 1111 ,'2010-08-06', '2010-08-07', 0.84;
INSERT INTO #user SELECT 1111 ,'2010-08-08', '2010-08-08', 4.73;
INSERT INTO #user SELECT 1111 ,'2010-08-09', '2010-08-09', 0.93;
INSERT INTO #user SELECT 1111 ,'2010-08-27', '2010-08-27', 2.40;
INSERT INTO #user SELECT 2222 ,'2010-08-09', '2010-08-09', 0.33;
INSERT INTO #user SELECT 2222 ,'2010-08-29', '2010-08-30', 2.10;
INSERT INTO #user SELECT 3333 ,'2010-08-08', '2010-08-08', 1.13;
INSERT INTO #user SELECT 3333 ,'2010-08-09', '2010-08-09', 0.73;
Here is the desired output from the dataset
--user_id start_date end_date amount
--1111 2010-08-06 2010-08-09 6.50
--1111 2010-08-27 2010-08-27 2.40
--2222 2010-08-09 2010-08-09 0.33
--2222 2010-08-29 2010-08-30 2.10
--3333 2010-08-08 2010-08-09 1.86
I would appreciate any assistance in terms of code and or links...
Cheers
June 1, 2011 at 7:16 pm
OK....
I managed to find a solution... however it doesn't seem very elegant to me....
I would welcome any input to make the query more concise or more efficient.
WITH cteContiguous
AS
(
--define the date range for contiguous canidates
SELECT
t.user_id,
MIN(t.start_date) AS start_date,
MAX(t2.end_date) AS end_date
FROM #user t
JOIN #user t2
ON t.user_id = t2.user_id
AND t.end_date + 1 = t2.start_date
WHERE t2.user_id IS NOT NULL
GROUP BY t.user_id
), cteBase
AS
(
SELECT
t.user_id,
t.start_date,
t.end_date
FROM #user t
--exclude contiguous rows returned in cte
WHERE NOT EXISTS
(
SELECT
user_id,
start_date,
end_date
FROM cteContiguous
WHERE cteContiguous.user_id = t.user_id
AND t.start_date BETWEEN cteContiguous.start_date AND cteContiguous.end_date
)
)
SELECT
cteContiguous.user_id,
cteContiguous.start_date,
cteContiguous.end_date,
SUM(t.amount) AS amount
FROM cteContiguous
JOIN #user t
ON t.user_id = cteContiguous.user_id
AND t.start_date BETWEEN cteContiguous.start_date AND cteContiguous.end_date
GROUP BY
cteContiguous.user_id,
cteContiguous.start_date,
cteContiguous.end_date
UNION ALL
SELECT
cteBase.user_id,
cteBase.start_date,
cteBase.end_date,
SUM(t.amount) AS amount
FROM cteBase
JOIN #user t
ON t.user_id = cteBase.user_id
AND t.start_date BETWEEN cteBase.start_date AND cteBase.end_date
GROUP BY
cteBase.user_id,
cteBase.start_date,
cteBase.end_date
ORDER BY
user_id,
start_date,
end_date
June 2, 2011 at 2:17 am
You might want to have a look at this blog. It will demonstrate an alternative solution that most probably will perform better than your current approach.
June 2, 2011 at 6:45 am
Go here and drop down the Publish Date list and get a copy of the March SolidQ Journal. Read Itzik Ben-Gan's Packing Intervals article. STUNNINGLY efficient solution for what you need. He was able to collapse 5M rows of hot data in just THREE SECONDS!
http://www.solidq.com/sqj/Pages/Home.aspx
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 3, 2011 at 12:10 am
Thanks Lutz and Kevin...
This is a double WIN for me... not only did you point me to the exact solution I was looking for
but you also introduced me to SolidQ. Their Journal is going to be another invaluable SQL Server resource
and provide me with alot of interesting reading for this weekend.
Cheers
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply