problem removing duplicates

Question

Post reply

problem removing duplicates

winston Smith

SSCoach

Points: 19500
More actions
May 30, 2006 at 2:57 am

#113144

i had a load of duplicate rows in a table.
i have selected them all out to a temp table.
i then deleted them all from the main table.
now i want to delete the one of each of the duplicates in the temp table, leaving only one of each row to be
inserted back into the main table.
i have tried this:
select      InvoiceGuid
    ,CreditNoteID
           ,InvoiceHeaderID
           ,InvoiceBatchID
           ,BillToAccountName
           ,InvoiceNumber
into #test2
from #test
where invoiceNumber in (select Distinct invoiceNumber from #test)
but this simply places all the rows from the first temp table into the second temp table, and not just distinct rows
as i had hoped. has anyone any ideas as to why this isnt working?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Ian Yates SSCoach Points: 19738 More actions · Answer 1

Try

select distinct InvoiceGuid

,CreditNoteID

,InvoiceHeaderID

,InvoiceBatchID

,BillToAccountName

,InvoiceNumber

into #test2

from #test

Note that "select ... into ..." is fairly nasty syntax. You' be better off (from a purist point of view only I suppose) doing

insert into #test(....)

select distinct .... from #test2

Also - try to have a primary key on your table always If you wanted to get rid of duplicates without dumping the contents of the table, you could add an identity column so that the previously duplicate rows are identical in all ways except for the identity column... But the dump out of table and insert back into table is also fine.

Cheers!

brij_rauthan SSCommitted Points: 1521 More actions · Answer 2

is your InvoiceGuid unique as name suggests or it is also duplicated? IF it is then

select InvoiceGuid

,max(CreditNoteID)

,max(InvoiceHeaderID)

,max(InvoiceBatchID)

,max(BillToAccountName)

,max(InvoiceNumber)

into #test2

from #test

Group by InvoiceGuid

#test2 will contain only unique values...

Brij

Vladan SSC-Insane Points: 22072 More actions · Answer 3

Actually, I don't think this will work, Brij. Unfortunately I'm not sure what you mean by "If it is, then"... but if you're referring to the first part of the question (i.e. if Guid is unique), you will get all rows back. You need to do it the other way round - group by duplicated columns and MAX (or MIN) from the unique one.

select MAX(InvoiceGuid)

,CreditNoteID

,InvoiceHeaderID

,InvoiceBatchID

,BillToAccountName

,InvoiceNumber

into #test2

from #test

Group by CreditNoteID, InvoiceHeaderID, InvoiceBatchID, BillToAccountName, InvoiceNumber

If GUID is also duplicated, i.e. the rows were duplicated fully and duplicate rows have the same value in each of the columns, simple SELECT DISTINCT will do the job.

Ajit-297150 Mr or Mrs. 500 Points: 593 More actions · Answer 4

Try This,

Step-1

----------

select

IDENTITY(int,1,1) as IdentityVal,

InvoiceGuid

,CreditNoteID

,InvoiceHeaderID

,InvoiceBatchID

,BillToAccountName

,InvoiceNumber

into #test2

from #test

where invoiceNumber in (select Distinct invoiceNumber from #test)

Select InvoiceGuid

,CreditNoteID

,InvoiceHeaderID

,InvoiceBatchID

,BillToAccountName

,InvoiceNumber from #test2

where IdentityVal in (select Max(IdentityVal) from #test2

Group by InvoiceGuid

,CreditNoteID

,InvoiceHeaderID

,InvoiceBatchID

,BillToAccountName

,InvoiceNumber

winston Smith SSCoach Points: 19500 More actions · Answer 5

here is the problem simplified:

i have a table, with 2 cols in it

invoiceID , invoiceNumber

invoiceID is unique. there are duplicate invoiceNumbers. how do i delete the duplicates

( note the table has many columns, but these are the only columns of consequence)

ive tried

select invoiceID, MAX(invoiceNumber)

from table

group by invoiceID

this returns everything. the same happens when i use distinct. how would you do this?

brij_rauthan SSCommitted Points: 1521 More actions · Answer 6

use

select max(invoiceID), invoiceNumber

from table

group by invoiceNumber

Brij

winston Smith SSCoach Points: 19500 More actions · Answer 7

worked a treat, thanks so much. can someone explain precicely the logic behind this, as i feel that i have some misconceptions about how MAX is working in this suituation.

thanks all

stax68 SSChampion Points: 11711 More actions · Answer 8

I'll have a go: when you use a GROUP BY clause, you specify that you only want one row returned for each distinct combination of values in the fields specified in the GROUP BY. In this case, you wanted exactly one row of data per distinct InvoiceNumber - you wanted data at InvoiceNumber granularity, we might say - so there was only one field, InvoiceNumber, specified in your GROUP BY clause.

Once you have specified you grouping fields, any other field value you want to view in your query results now has to be transformed by an aggregate function, e.g. SUM, COUNT, MAX, AVG. This is because those columns now contain too many values (or there could be too many, so logically speaking there are too many). You need only one value per InvoiceNumber but you have more than that: the data has a higher granularity than you need. So the aggregate function is a way of 'folding up' the extra data to fit into the reduced number of rows.

Of course, you haven't just added these extra ungrouped columns to give yourself extra problems. You want to see what's in them. So all the aggregate functions available do useful things like summing, counting, etc. There is no 'ANY()' aggregate, as in Access, which just picks one of the available values arbitrarily.

In this case, you have used min(InvoiceID), which makes sense if your invoice IDs are assigned in ascending order over time - you will then be picking the earliest ID - or if they are in some other kind of meaningful order*. If there is no significance to the order of IDs, then you are effectively using MIN() as an Access-style ANY aggregate. You then need to ask - if I don't want to see any particular Inv ID, do I want to see an Inv ID at all?

Hope this helps - let us know if you are unsure about anything.

(*You wouldn't ideally rely on this kind of ordering, but instead MIN() on a proper date column, but to return a corresponding ID would require a subquery or messing around with hash fields which we won't go into here...)

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant