May 30, 2006 at 2:57 am
i had a load of duplicate rows in a table.
i have selected them all out to a temp table.
i then deleted them all from the main table.
now i want to delete the one of each of the duplicates in the temp table, leaving only one of each row to be
inserted back into the main table.
i have tried this:
select InvoiceGuid
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber
into #test2
from #test
where invoiceNumber in (select Distinct invoiceNumber from #test)
but this simply places all the rows from the first temp table into the second temp table, and not just distinct rows
as i had hoped. has anyone any ideas as to why this isnt working?
May 30, 2006 at 3:30 am
Try
select distinct InvoiceGuid
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber
into #test2
from #test
Note that "select ... into ..." is fairly nasty syntax. You' be better off (from a purist point of view only I suppose) doing
insert into #test(....)
select distinct .... from #test2
Also - try to have a primary key on your table always If you wanted to get rid of duplicates without dumping the contents of the table, you could add an identity column so that the previously duplicate rows are identical in all ways except for the identity column... But the dump out of table and insert back into table is also fine.
Cheers!
May 30, 2006 at 4:05 am
is your InvoiceGuid unique as name suggests or it is also duplicated? IF it is then
select InvoiceGuid
,max(CreditNoteID)
,max(InvoiceHeaderID)
,max(InvoiceBatchID)
,max(BillToAccountName)
,max(InvoiceNumber)
into #test2
from #test
Group by InvoiceGuid
#test2 will contain only unique values...
Brij
May 30, 2006 at 5:28 am
Actually, I don't think this will work, Brij. Unfortunately I'm not sure what you mean by "If it is, then"... but if you're referring to the first part of the question (i.e. if Guid is unique), you will get all rows back. You need to do it the other way round - group by duplicated columns and MAX (or MIN) from the unique one.
select MAX(InvoiceGuid)
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber
into #test2
from #test
Group by CreditNoteID, InvoiceHeaderID, InvoiceBatchID, BillToAccountName, InvoiceNumber
If GUID is also duplicated, i.e. the rows were duplicated fully and duplicate rows have the same value in each of the columns, simple SELECT DISTINCT will do the job.
May 30, 2006 at 6:30 am
Try This,
Step-1
----------
select
IDENTITY(int,1,1) as IdentityVal,
InvoiceGuid
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber
into #test2
from #test
where invoiceNumber in (select Distinct invoiceNumber from #test)
Select InvoiceGuid
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber from #test2
where IdentityVal in (select Max(IdentityVal) from #test2
Group by InvoiceGuid
,CreditNoteID
,InvoiceHeaderID
,InvoiceBatchID
,BillToAccountName
,InvoiceNumber
May 30, 2006 at 7:53 am
here is the problem simplified:
i have a table, with 2 cols in it
invoiceID , invoiceNumber
invoiceID is unique. there are duplicate invoiceNumbers. how do i delete the duplicates
( note the table has many columns, but these are the only columns of consequence)
ive tried
select invoiceID, MAX(invoiceNumber)
from table
group by invoiceID
this returns everything. the same happens when i use distinct. how would you do this?
May 30, 2006 at 7:58 am
use
select max(invoiceID), invoiceNumber
from table
group by invoiceNumber
Brij
May 30, 2006 at 10:42 am
worked a treat, thanks so much. can someone explain precicely the logic behind this, as i feel that i have some misconceptions about how MAX is working in this suituation.
thanks all
May 31, 2006 at 3:11 pm
I'll have a go: when you use a GROUP BY clause, you specify that you only want one row returned for each distinct combination of values in the fields specified in the GROUP BY. In this case, you wanted exactly one row of data per distinct InvoiceNumber - you wanted data at InvoiceNumber granularity, we might say - so there was only one field, InvoiceNumber, specified in your GROUP BY clause.
Once you have specified you grouping fields, any other field value you want to view in your query results now has to be transformed by an aggregate function, e.g. SUM, COUNT, MAX, AVG. This is because those columns now contain too many values (or there could be too many, so logically speaking there are too many). You need only one value per InvoiceNumber but you have more than that: the data has a higher granularity than you need. So the aggregate function is a way of 'folding up' the extra data to fit into the reduced number of rows.
Of course, you haven't just added these extra ungrouped columns to give yourself extra problems. You want to see what's in them. So all the aggregate functions available do useful things like summing, counting, etc. There is no 'ANY()' aggregate, as in Access, which just picks one of the available values arbitrarily.
In this case, you have used min(InvoiceID), which makes sense if your invoice IDs are assigned in ascending order over time - you will then be picking the earliest ID - or if they are in some other kind of meaningful order*. If there is no significance to the order of IDs, then you are effectively using MIN() as an Access-style ANY aggregate. You then need to ask - if I don't want to see any particular Inv ID, do I want to see an Inv ID at all?
Hope this helps - let us know if you are unsure about anything.
(*You wouldn't ideally rely on this kind of ordering, but instead MIN() on a proper date column, but to return a corresponding ID would require a subquery or messing around with hash fields which we won't go into here...)
Tim Wilkinson
"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply