October 2, 2016 at 10:07 pm
Hi, I was hoping I could get some help for the below that I'm trying to return. I'm a SQL Server newbie!
i have a table that can be recreated with the below:
create table stat
(
gameID int
,playID int
,name nVarchar(10)
,eventType nVarChar(2)
)
insert into stat (gameId, playId, name, eventType) values
(1 , 1, 'Jim', 'a'),
(1 , 2, 'Frank', 'g'),
(1 , 3, 'Bill', 'f'),
(1 , 4, 'Peter', 'g'),
(1 , 5, 'Bob', 'v'),
(1 , 6, 'Tony', 'w'),
(1 , 7, 'Michael', 'g'),
(1 , 8, 'Simon', 's'),
(1 , 9, 'Gary', 'v'),
(2 , 1, 'Steve', 'g'),
(2 , 2, 'Philip', 'v'),
(2 , 3, 'Mark', 'e'),
(2 , 4, 'Bob', 'g'),
(2 , 5, 'Mark', 'x'),
(2 , 6, 'Tony', 'g'),
(2 , 7, 'Michael', 'n'),
(2 , 8, 'Steve', 'u'),
(2 , 9, 'Grant', 'v')
What im trying to extract is......
Whenever a "V" event occurs, I want to return the first occurrence of "Name" in a row before the "V" event where the Name has a "G" event.
For example......game id 1, play id 5 has an event "V".........i want to return Peter in the result as Peter has the first "G" event before "V".
Frank at game id 1, play id 2 shouldn't be returned as Peter had the first "G" event before the "V" event
I would like to see the following returned, the Name and how many times it has occurred for that name
Name Count
Peter 1
Michael 1
Steve 1
Tony 1
October 2, 2016 at 11:00 pm
Quick example, should be enough to get you started.
😎
USE TEEST;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.stat') IS NOT NULL DROP TABLE dbo.stat;
create table dbo.stat
(
gameID int
,playID int
,name nVarchar(10)
,eventType nVarChar(2)
);
insert into stat (gameId, playId, name, eventType) values
(1 , 1, 'Jim', 'a'),
(1 , 2, 'Frank', 'g'),
(1 , 3, 'Bill', 'f'),
(1 , 4, 'Peter', 'g'),
(1 , 5, 'Bob', 'v'),
(1 , 6, 'Tony', 'w'),
(1 , 7, 'Michael', 'g'),
(1 , 8, 'Simon', 's'),
(1 , 9, 'Gary', 'v'),
(2 , 1, 'Steve', 'g'),
(2 , 2, 'Philip', 'v'),
(2 , 3, 'Mark', 'e'),
(2 , 4, 'Bob', 'g'),
(2 , 5, 'Mark', 'x'),
(2 , 6, 'Tony', 'g'),
(2 , 7, 'Michael', 'n'),
(2 , 8, 'Steve', 'u'),
(2 , 9, 'Grant', 'v');
;WITH BASE_DATA AS
(
SELECT
ST.gameId
,ST.playId
,ST.name
,ST.eventType
,SUM(CASE
WHEN ST.eventType = 'v' THEN 1
ELSE 0
END) OVER
(
ORDER BY ST.gameId
,ST.playId
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS ST_GRP
,ROW_NUMBER() OVER
(
ORDER BY ST.gameId
,ST.playId
) AS ST_RID
FROM dbo.stat ST
WHERE ST.eventType IN ('v','g')
)
,SORTED_SET AS
(
SELECT
BD.gameId
,BD.playId
,BD.name
,BD.eventType
,BD.ST_GRP
,BD.ST_RID
,ROW_NUMBER() OVER
(
PARTITION BY BD.ST_GRP
ORDER BY BD.ST_RID DESC
) AS BD_GRP_RID
FROM BASE_DATA BD
)
SELECT
SS.playID
,SS.[name]
,COUNT(SS.playID) AS [Count]
FROM SORTED_SET SS
WHERE SS.BD_GRP_RID = 1
AND SS.eventType = 'g'
GROUP BY SS.playID
,SS.[name];
Output
playID name Count
----------- ---------- -----------
7 Michael 1
4 Peter 1
1 Steve 1
6 Tony 1
October 2, 2016 at 11:26 pm
That is incredible. I really appreciate your help, you have done an outstanding job with it, as a newbie there was no way I could have worked that out.
October 2, 2016 at 11:54 pm
watto84 (10/2/2016)
That is incredible. I really appreciate your help, you have done an outstanding job with it, as a newbie there was no way I could have worked that out.
You are very welcome.
😎
Now please study the code and come back if you have any questions.
October 3, 2016 at 8:34 am
I would think that you would want to partition by the game ID. Presumably, the games are treated separately, and partitioning by the game ID would accomplish this.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 3, 2016 at 7:33 pm
This sure does seem a LOT simpler, and easier for a newbie (and this old brain too!) to understand:
SELECT t2.name, COUNT(*)
FROM dbo.#stat t1
--get the G events for the V event that are earlier
INNER JOIN dbo.#stat t2 ON t1.gameid = t2.gameid AND t2.playID < t1.playID AND t2.eventtype = 'g'
--now ensure they are the most recent earlier G event
AND t2.playid = (SELECT MAX(playid) FROM #stat t3 WHERE t3.gameid = t1.gameid AND t3.playid < t1.playid AND t3.eventtype = 'g')
--get the V events
WHERE t1.eventtype = 'v'
GROUP BY t2.name
NOTE: if you cannot ALWAYS GUARANTEE at least one G preceeds every V then you need slightly different logic depending on how you want to handle that case
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 3, 2016 at 8:59 pm
That is very similiar to the code I tried putting together initially, so it makes me feel better that there is 1 solution that was close to what I was thinking.
I wasn't aware (and couldn't find when i searched) the SELECT MAX() syntax to get the most recent row for "G". Thank you for providing this.
I ran the code over the simple table that I provided and it worked great, however when applied to a larger table 500,000 rows it wasn't able to handle such a large data set.
Is there anyway to improve performance with the simple code provided, partition by perhaps?
And yes i can answer that there is always a G before a V.
October 3, 2016 at 10:52 pm
Problem solved on the large data set, i worked out I just needed to apply a clustered index to my table 🙂 Sorry.....Newbie 🙂
Thanks for your help though, easy to understand code that i can now apply to other similiar queries.
October 4, 2016 at 5:25 am
watto84 (10/3/2016)
Problem solved on the large data set, i worked out I just needed to apply a clustered index to my table 🙂 Sorry.....Newbie 🙂Thanks for your help though, easy to understand code that i can now apply to other similiar queries.
Some form of indexing would be necessary for good performance in any scenario. Table scans on larger numbers of rows can hurt - as you found out. 😀
I presume you picked the gameid, playid as clustered index, likely the primary key?? Then it serves several useful functions at once.
Kudos for being close to the solution yourself! The way to get there is to stop thinking about "looping" or "what do I want to do with this row, then the next one, etc" and start thinking "what to I want to do with the DATA" as a whole. Then just break down the wants/needs into logic you can code to. Easy-peasy, lemon-squeezy! 😎
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 4, 2016 at 8:03 am
TheSQLGuru (10/4/2016)
watto84 (10/3/2016)
Problem solved on the large data set, i worked out I just needed to apply a clustered index to my table 🙂 Sorry.....Newbie 🙂Thanks for your help though, easy to understand code that i can now apply to other similiar queries.
Some form of indexing would be necessary for good performance in any scenario. Table scans on larger numbers of rows can hurt - as you found out. 😀
I presume you picked the gameid, playid as clustered index, likely the primary key?? Then it serves several useful functions at once.
Kudos for being close to the solution yourself! The way to get there is to stop thinking about "looping" or "what do I want to do with this row, then the next one, etc" and start thinking "what to I want to do with the DATA" as a whole. Then just break down the wants/needs into logic you can code to. Easy-peasy, lemon-squeezy! 😎
Agreed. In fact, I keep such a notion in my signature line below... the one about stop thinking about what you want to do to a row...
--Jeff Moden
Change is inevitable... Change for the better is not.
October 4, 2016 at 9:06 am
I found another version. It's slower on this small dataset, but I think it will run faster on a larger dataset, because it contains one fewer sort.
WITH play_events AS (
SELECT *, MAX(CASE WHEN s.eventType = 'g' THEN CAST(s.playID AS VARBINARY(4)) + CAST(name AS VARBINARY(20)) END) OVER(PARTITION BY s.gameID ORDER BY s.playID ROWS UNBOUNDED PRECEDING) AS last_g_name
FROM #stat s
WHERE s.eventType IN ('g', 'v')
)
SELECT CAST(SUBSTRING(pe.last_g_name, 5, 20) AS NVARCHAR(10)) name, COUNT(*)
FROM play_events pe
WHERE pe.eventType = 'v'
GROUP BY CAST(SUBSTRING(pe.last_g_name, 5, 20) AS NVARCHAR(10))
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 4, 2016 at 9:17 am
TheSQLGuru (10/3/2016)
This sure does seem a LOT simpler, and easier for a newbie (and this old brain too!) to understand:
SELECT t2.name, COUNT(*)
FROM dbo.#stat t1
--get the G events for the V event that are earlier
INNER JOIN dbo.#stat t2 ON t1.gameid = t2.gameid AND t2.playID < t1.playID AND t2.eventtype = 'g'
--now ensure they are the most recent earlier G event
AND t2.playid = (SELECT MAX(playid) FROM #stat t3 WHERE t3.gameid = t1.gameid AND t3.playid < t1.playid AND t3.eventtype = 'g')
--get the V events
WHERE t1.eventtype = 'v'
GROUP BY t2.name
NOTE: if you cannot ALWAYS GUARANTEE at least one G preceeds every V then you need slightly different logic depending on how you want to handle that case
Unfortunately, this approach probably doesn't meet the criteria for such an approach to be very efficient. To be efficient requires that the number of records in the outer query be very small and the number in the subquery to be very large. From the data, it appears that the two numbers will be roughly in the same range.
Here are the stats from just this small set of data.
-- Eirikur's solution
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#stat_00000000000F'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- Drew's solution
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#stat_00000000000F'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- The SQL Guru's solution
Table '#stat_00000000000F'. Scan count 14, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Drew
PS: I did shorten the table name just to get the relevant data to fit better.
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 4, 2016 at 9:30 am
drew.allen (10/4/2016)
TheSQLGuru (10/3/2016)
This sure does seem a LOT simpler, and easier for a newbie (and this old brain too!) to understand:
SELECT t2.name, COUNT(*)
FROM dbo.#stat t1
--get the G events for the V event that are earlier
INNER JOIN dbo.#stat t2 ON t1.gameid = t2.gameid AND t2.playID < t1.playID AND t2.eventtype = 'g'
--now ensure they are the most recent earlier G event
AND t2.playid = (SELECT MAX(playid) FROM #stat t3 WHERE t3.gameid = t1.gameid AND t3.playid < t1.playid AND t3.eventtype = 'g')
--get the V events
WHERE t1.eventtype = 'v'
GROUP BY t2.name
NOTE: if you cannot ALWAYS GUARANTEE at least one G preceeds every V then you need slightly different logic depending on how you want to handle that case
Unfortunately, this approach probably doesn't meet the criteria for such an approach to be very efficient. To be efficient requires that the number of records in the outer query be very small and the number in the subquery to be very large. From the data, it appears that the two numbers will be roughly in the same range.
Here are the stats from just this small set of data.
-- Eirikur's solution
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#stat_00000000000F'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- Drew's solution
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#stat_00000000000F'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- The SQL Guru's solution
Table '#stat_00000000000F'. Scan count 14, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Drew
PS: I did shorten the table name just to get the relevant data to fit better.
I absolutely agree Drew. And this is one of the few times you will hear me say this but in this case performance simply must take a back seat. Most seasoned TSQL talent I know would have difficulty developing either of the other two solutions. Heck, a very large fraction would have difficulty explaining how they work, at least in detail. I presented a very logical example that teaches as well as gives the correct results.
The good news is that OP has some fast alternatives to use while he/she works to improve their data processing chops. 😎
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 4, 2016 at 10:27 am
The only time that performance takes a back seat for me is if accuracy is not otherwise able to be achieved.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 4, 2016 at 10:53 am
Jeff Moden (10/4/2016)
The only time that performance takes a back seat for me is if accuracy is not otherwise able to be achieved.
And you, sir, would be one of those "seasoned TSQL talents" that are an exception to my statement. 🙂
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply