Need to group the same valued ID fields to run an aggregate average over them

  • I have data that contains poll results, with the poll_id consistuting a single row of data for each candidate in said poll, so a poll_id will have a minimum of two different rows (not my design, parsed from ABC polling results). Sometimes there are more than 2 rows for each distinct poll_id as the survey may be graded multiple times. Whereas If the poll_id shows up more than twice, I want to take the average of these , the average of the Pct field, to list as the result. I'm wanting to do a derived column for this as truePct , but have a problem with HAVING filter, as it's not looking at the same poll_id, but all the poll_ids in the query (see the Select avg(pct) statement for truePct). ??? Thanks in advance.

    SELECT  poll_id ,start_date,end_date,

    candidate_name, sample_size,

    (select avg(pct) from [president_polls] where state = 'nevada' having count(poll_id) > 2) as truePct,

    round([pct], 2) as Points

    FROM president_polls

    WHERE created_at >= '2024-9-1'

    AND state = 'nevada'

    AND candidate_name in ('Donald Trump','Kamala Harris')

    DDL =

    CREATE TABLE [president_polls](

    [poll_id] [int]  IDENTITY(1,1) NOT NULL,

    [state] [nvarchar](50) NULL,

    [start_date] [date] NULL,

    [end_date] [date] NULL,

    [sample_size] [smallint] NULL,

    [created_at] [datetime2](7) NULL,

    [candidate_name] [nvarchar](50) NULL,

    [pct] [float] NULL

    )

  • CORRECTION:

    poll_id is not a primary key or unique value, sits in a local dev table

  • You have over 1000 points so should know how to post a question by now.

    How to post a T-SQL question on a public forum | spaghettidba

  • Maybe this?:

    SELECT  poll_id ,start_date,end_date,
    candidate_name, sample_size,
    (select avg(pp2.pct) from [president_polls] pp2
    where pp2.state = pp.state and pp2.poll_id = pp.poll_id
    having count(*) > 2) as truePct,
    round([pct], 2) as Points
    FROM president_polls pp
    WHERE created_at >= '20240901'
    AND state = 'nevada'
    AND candidate_name in ('Donald Trump','Kamala Harris')

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher wrote:

    Maybe this?:

    SELECT  poll_id ,start_date,end_date,
    candidate_name, sample_size,
    (select avg(pp2.pct) from [president_polls] pp2
    where pp2.state = pp.state and pp2.poll_id = pp.poll_id
    having count(*) > 2) as truePct,
    round([pct], 2) as Points
    FROM president_polls pp
    WHERE created_at >= '20240901'
    AND state = 'nevada'
    AND candidate_name in ('Donald Trump','Kamala Harris')

    You've been around long enough that you should know that a correlated subquery is never going to perform as well as a windowed function.

    SELECT  poll_id ,start_date,end_date,
    candidate_name, sample_size,
    AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id) AS truePct,
    round([pct], 2) as Points
    FROM president_polls pp
    WHERE created_at >= '20240901'
    AND state = 'nevada'
    AND candidate_name in ('Donald Trump','Kamala Harris')

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • My primary goal is to meet the stated business requirements for the code, which I don't believe your code does.  You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I'm not sure how we'd be able to determine whether more than 2 rows are present with that style of query, in order to meet the stated requirements.  Then again, I'll admit I haven't spent much time thinking about that, since I'm very busy with other things.  Hence, I tend to stick to a straight-line resolution to the problem presented, sometimes overly so.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher wrote:

    My primary goal is to meet the stated business requirements for the code, which I don't believe your code does.  You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.

    I did miss that, but it's easy to fix.  This is part of the reason that we ask for sample data and expected results.  We have nothing to compare our solutions to, so it's easy to miss criteria.

    SELECT  poll_id ,start_date,end_date,
    candidate_name, sample_size,
    CASE WHEN COUNT(*) OVER(PARTITION BY pp.[State], pp.poll_id) > 1
    THEN AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id)
    ELSE NULL
    END AS truePct,
    round([pct], 2) as Points
    FROM president_polls pp
    WHERE created_at >= '20240901'
    AND state = 'nevada'
    AND candidate_name in ('Donald Trump','Kamala Harris')

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • It's "> 2" :-).

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • As to performance:

    I figure SQL would have to partition the data twice to get the COUNT() and the AVG() from different queries, but I'd need to look at the query plan to be sure, of course.

    Overall, though, given the limited number of rows (*), I sincerely doubt you'd see much of a performance difference between methods.  Particularly so if the table is properly clustered by ( created_at ) first (assuming all queries generally follow the pattern above).  Admittedly that clustering may not be the case, and a nonclustered index may be used instead, which could end up damaging performance regardless of method.

    (*) There are only 50 (or so) states and only so many polls you can conduct in a limited time window.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply