Unwanted duplicates in random sampling

Question

Unwanted duplicates in random sampling

SSCarpal Tunnel

Points: 4298

September 9, 2022 at 7:31 pm

I'm trying to determine why I'm getting several (about 20%) duplicate values in my 'pt.pat_fullname' field. I would think the DISTINCT would work. Any ideas?

drop table if exists #allclaims;
select distinct cl.clm_hdr_id
     , cl.clm_dos
     , cl.pat_id
into #allclaims
from dbo.vw_claims cl
where cl.clm_dos
between '2022-08-01' and '2022-08-31'
group by cl.clm_hdr_id
       , cl.clm_dos
       , cl.pat_id
order by cl.pat_id;

select top 500
       *
from
(
    select distinct
           pt.pat_fullname
         , cl.clm_hdr_id
         , cl.clm_dos as 'Service Date'
    from #allclaims                cl
        inner join dbo.tb_patients pt
            on cl.pat_id = pt.pat_id
) as a
order by newid();

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

ratbak Ten Centuries Points: 1382 More actions · Answer 1

What is your definition of duplicates in this scenario? Please provide an actual example of multiple rows that comprise a duplicate (obfuscated if necessary).

There should be only one row for any given set/combination of pat_fullname, clm_hdr_id, & "Service Date" from that query.

But it is perfectly valid for it to return the same pat_fullname for various combinations of clm_hdr_id & "Service Date". Such cases would not be duplicates.

DaveBriCam SSCarpal Tunnel Points: 4298 More actions · Answer 2

I need for the person's name to be unique. So one of the records for 'Reece, Genais' needs to be eliminated. It doesn't matter which one:

Jeffrey Williams SSC Guru Points: 90305 More actions · Answer 3

I see you have both distinct and group by in the same query - use one or the other, and since you are not using any aggregate functions then you would only need the distinct. You also don't need to include an ORDER BY - that is only valid when you have an IDENTITY column and you need the identity to be assigned based on that order.

Ideally, you would have another column in vw_claims that could be used to identify a single row. If so - then it would be better to use that column instead of using distinct. Given that you are looking for each distinct date of service for a claim - then you can and will get multiple rows per claim. It all depends on how that claim is billed out and the dates of service for each procedure performed.

With that said - there is no reason to insert into a temp table - and then select the top 500. It can all be done in a single query - and selecting from a derived table is just extra typing.

If the desired result is a distinct list of claims per patient - then you need to either remove the DOS or get the min/max date of service. If the desired result is all dates of service by claims per patient - then just use a group by.

select top (500)
       pt.pat_fullname
     , cl.clm_hdr_id
     , max(cl.clm_dos) as 'Service Date'
from dbo.vw_claims cl
inner join dbo.tb_patients pt
            on cl.pat_id = pt.pat_id
where cl.clm_dos
between '2022-08-01' and '2022-08-31'
group by cl.clm_hdr_id
       , pt.pat_fullname
order by newid();

select top (500)
       pt.pat_fullname
     , cl.clm_hdr_id
     , cl.clm_dos as 'Service Date'
from dbo.vw_claims cl
inner join dbo.tb_patients pt
            on cl.pat_id = pt.pat_id
where cl.clm_dos
between '2022-08-01' and '2022-08-31'
group by cl.clm_hdr_id
       , pt.pat_fullname
       , cl.clm_dos
order by newid();

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

Jeff Moden SSC Guru Points: 1004412 More actions · Answer 4

It's funny how people rail against duplicates in a randomly selected set... it's a part of what "random" is.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeffrey Williams SSC Guru Points: 90305 More actions · Answer 5

DaveBriCam wrote:

I need for the person's name to be unique. So one of the records for 'Reece, Genais' needs to be eliminated. It doesn't matter which one:

Saw this after I posted above - those are not DISTINCT rows. The patient has multiple claims in the selected period - and stating it doesn't matter which one you gets means neither the DOS or claim header ID are important.

select top (500)
       pt.pat_fullname
     , clm_hdr = max(cl.clm_hdr_id)
     , service_date = max(cl.clm_dos)
from dbo.vw_claims cl
inner join dbo.tb_patients pt
            on cl.pat_id = pt.pat_id
where cl.clm_dos
between '2022-08-01' and '2022-08-31'
group by pt.pat_fullname
order by newid();

Of course, this won't work if you have 2 patients with the same name as one of them would be eliminated. If that is the case, change the group by to group by pt.pat_id.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

Jeffrey Williams SSC Guru Points: 90305 More actions · Answer 6

Jeff Moden wrote:

It's funny how people rail against duplicates in a randomly selected set... it's a part of what "random" is.

It's also funny what some people consider duplicates - even when all but one column is different.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs