Insuring all values from a parent subquery/CTE are present in a child query

  • I need a random sampling of 1% of a population of last names and spoken languages in a county. The lastname does not have to correspond to the person's language, immaterial. My problem is sometimes all the last names do not appear in the results. I need a way of making sure (even if it slightly exceeds 1%) that all the names are returned. I consistently get 167 names but sometimes rarer names like 'Poindexter' fall out of the results. If I run the query again 'Poindexter' may be present but 'Hermann' drops out. Any ideas? The end data needs to look something like:

    ID       language      lastname

    1515      Spanish        Heath

    1620     English         Edwards

    1452      Korean          Smith

    1350       Japanese       Smith

    with cte1 -- to build set of languages - all five need to appear
    as (select distinct language from county.reporting.dbo.2022),

    cte2 -- to build set of lastnames - all 35 must appear, some can be multiples
    as (select distinct lastname from county.reporting.dbo.2022),

    cte3 -- build final dataset prior to sampling
    as (
    select * from county.reporting.dbo.2022
    where exists (select * from cte1)

    union

    select * from county.reporting.dbo.2022
    where exists (select * from cte2)
    )

    select top 1 PERCENT * from cte3 order by newID();

    • This topic was modified 2 years, 3 months ago by  DaveBriCam.
  • First problem:

    This cte will return all rows of county.reporting.dbo.2022 whenever there are results in cte1 ! ( no where clause in exists query )

    Same for the second union part

    deduplicated by the fact that you are using UNION in stead of UNION ALL.

    cte3 -- build final dataset prior to sampling
    as (
    select * from county.reporting.dbo.2022
    where exists (select * from cte1)

    union

    select * from county.reporting.dbo.2022
    where exists (select * from cte2)
    )

    That's the logic of 1% of a set, if all languages must be present and you determine the 1% sorting a random uniqueidentifier ?

    ( btw since it is a linked server reference, guess how the engine will handle such query ???!!! )

    • This reply was modified 2 years, 3 months ago by  Johan Bijnens.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply