October 4, 2024 at 5:00 pm
I have data that contains poll results, with the poll_id consistuting a single row of data for each candidate in said poll, so a poll_id will have a minimum of two different rows (not my design, parsed from ABC polling results). Sometimes there are more than 2 rows for each distinct poll_id as the survey may be graded multiple times. Whereas If the poll_id shows up more than twice, I want to take the average of these , the average of the Pct field, to list as the result. I'm wanting to do a derived column for this as truePct , but have a problem with HAVING filter, as it's not looking at the same poll_id, but all the poll_ids in the query (see the Select avg(pct) statement for truePct). ??? Thanks in advance.
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pct) from [president_polls] where state = 'nevada' having count(poll_id) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls
WHERE created_at >= '2024-9-1'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
DDL =
CREATE TABLE [president_polls](
[poll_id] [int] IDENTITY(1,1) NOT NULL,
[state] [nvarchar](50) NULL,
[start_date] [date] NULL,
[end_date] [date] NULL,
[sample_size] [smallint] NULL,
[created_at] [datetime2](7) NULL,
[candidate_name] [nvarchar](50) NULL,
[pct] [float] NULL
)
October 4, 2024 at 9:42 pm
CORRECTION:
poll_id is not a primary key or unique value, sits in a local dev table
October 6, 2024 at 10:45 am
You have over 1000 points so should know how to post a question by now.
How to post a T-SQL question on a public forum | spaghettidba
October 7, 2024 at 7:08 am
Maybe this?:
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pp2.pct) from [president_polls] pp2
where pp2.state = pp.state and pp2.poll_id = pp.poll_id
having count(*) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 8, 2024 at 3:11 pm
Maybe this?:
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pp2.pct) from [president_polls] pp2
where pp2.state = pp.state and pp2.poll_id = pp.poll_id
having count(*) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
You've been around long enough that you should know that a correlated subquery is never going to perform as well as a windowed function.
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id) AS truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 8, 2024 at 5:15 pm
My primary goal is to meet the stated business requirements for the code, which I don't believe your code does. You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 8, 2024 at 5:30 pm
I'm not sure how we'd be able to determine whether more than 2 rows are present with that style of query, in order to meet the stated requirements. Then again, I'll admit I haven't spent much time thinking about that, since I'm very busy with other things. Hence, I tend to stick to a straight-line resolution to the problem presented, sometimes overly so.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 8, 2024 at 6:04 pm
My primary goal is to meet the stated business requirements for the code, which I don't believe your code does. You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.
I did miss that, but it's easy to fix. This is part of the reason that we ask for sample data and expected results. We have nothing to compare our solutions to, so it's easy to miss criteria.
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
CASE WHEN COUNT(*) OVER(PARTITION BY pp.[State], pp.poll_id) > 1
THEN AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id)
ELSE NULL
END AS truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 8, 2024 at 6:25 pm
It's "> 2" :-).
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 8, 2024 at 6:35 pm
As to performance:
I figure SQL would have to partition the data twice to get the COUNT() and the AVG() from different queries, but I'd need to look at the query plan to be sure, of course.
Overall, though, given the limited number of rows (*), I sincerely doubt you'd see much of a performance difference between methods. Particularly so if the table is properly clustered by ( created_at ) first (assuming all queries generally follow the pattern above). Admittedly that clustering may not be the case, and a nonclustered index may be used instead, which could end up damaging performance regardless of method.
(*) There are only 50 (or so) states and only so many polls you can conduct in a limited time window.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply