problem removing duplicates

  • i had a load of duplicate rows in a table.

    i have selected them all out to a temp table.

    i then deleted them all from the main table.

    now i want to delete the one of each of the duplicates in the temp table, leaving only one of each row to be

    inserted back into the main table.

    i have tried this:

    select      InvoiceGuid     

        ,CreditNoteID

               ,InvoiceHeaderID

               ,InvoiceBatchID

               ,BillToAccountName

               ,InvoiceNumber

    into #test2

    from #test

    where invoiceNumber in (select Distinct invoiceNumber from #test)

    but this simply places all the rows from the first temp table into the second temp table, and not just distinct rows

    as i had hoped. has anyone any ideas as to why this isnt working?

  • Try

    select  distinct    InvoiceGuid     

        ,CreditNoteID

               ,InvoiceHeaderID

               ,InvoiceBatchID

               ,BillToAccountName

               ,InvoiceNumber

    into #test2

    from #test

    Note that "select ... into ..." is fairly nasty syntax.  You' be better off (from a purist point of view only I suppose) doing

    insert into #test(....)

    select distinct .... from #test2

    Also - try to have a primary key on your table always   If you wanted to get rid of duplicates without dumping the contents of the table, you could add an identity column so that the previously duplicate rows are identical in all ways except for the identity column... But the dump out of table and insert back into table is also fine.

    Cheers!

  • is your InvoiceGuid unique as name suggests or it is also duplicated? IF it is then

    select InvoiceGuid

    ,max(CreditNoteID)

    ,max(InvoiceHeaderID)

    ,max(InvoiceBatchID)

    ,max(BillToAccountName)

    ,max(InvoiceNumber)

    into #test2

    from #test

    Group by InvoiceGuid

    #test2 will contain only unique values...

    Brij

  • Actually, I don't think this will work, Brij. Unfortunately I'm not sure what you mean by "If it is, then"... but if you're referring to the first part of the question (i.e. if Guid is unique), you will get all rows back. You need to do it the other way round - group by duplicated columns and MAX (or MIN) from the unique one.

    select MAX(InvoiceGuid)

    ,CreditNoteID

    ,InvoiceHeaderID

    ,InvoiceBatchID

    ,BillToAccountName

    ,InvoiceNumber

    into #test2

    from #test

    Group by CreditNoteID, InvoiceHeaderID, InvoiceBatchID, BillToAccountName, InvoiceNumber

    If GUID is also duplicated, i.e. the rows were duplicated fully and duplicate rows have the same value in each of the columns, simple SELECT DISTINCT will do the job.

  • Try This,

    Step-1

    ----------

    select     

    IDENTITY(int,1,1) as IdentityVal,

    InvoiceGuid     

        ,CreditNoteID

               ,InvoiceHeaderID

               ,InvoiceBatchID

               ,BillToAccountName

               ,InvoiceNumber

    into #test2

    from #test

    where invoiceNumber in (select Distinct invoiceNumber from #test)

    Select InvoiceGuid     

        ,CreditNoteID

               ,InvoiceHeaderID

               ,InvoiceBatchID

               ,BillToAccountName

               ,InvoiceNumber from #test2

    where IdentityVal in (select Max(IdentityVal)  from #test2

    Group by InvoiceGuid     

        ,CreditNoteID

               ,InvoiceHeaderID

               ,InvoiceBatchID

               ,BillToAccountName

               ,InvoiceNumber

     

  • here is the problem simplified:

     

    i have a table, with 2 cols in it

    invoiceID , invoiceNumber

    invoiceID is unique. there are duplicate invoiceNumbers. how do i delete the duplicates

    ( note the table has many columns, but these are the only columns of consequence)

    ive tried

    select invoiceID, MAX(invoiceNumber)

    from table

    group by invoiceID

    this returns everything. the same happens when i use distinct. how would you do this?

  • use

    select max(invoiceID), invoiceNumber

    from table

    group by invoiceNumber

    Brij

  • worked a treat, thanks so much. can someone explain precicely the logic behind this, as i feel that i have some misconceptions about how MAX is working in this suituation.

     

    thanks all

  • I'll have a go: when you use a GROUP BY clause, you specify that you only want one row returned for each distinct combination of values in the fields specified in the GROUP BY. In this case, you wanted exactly one row of data per distinct InvoiceNumber - you wanted data at InvoiceNumber granularity, we might say - so there was only one field, InvoiceNumber, specified in your GROUP BY clause.

    Once you have specified you grouping fields, any other field value you want to view in your query results now has to be transformed by an aggregate function, e.g. SUM, COUNT, MAX, AVG. This is because those columns now contain too many values (or there could be too many, so logically speaking there are too many). You need only one value per InvoiceNumber but you have more than that: the data has a higher granularity than you need. So the aggregate function is a way of 'folding up' the extra data to fit into the reduced number of rows.

    Of course, you haven't just added these extra ungrouped columns to give yourself extra problems. You want to see what's in them. So all the aggregate functions available do useful things like summing, counting, etc. There is no 'ANY()' aggregate, as in Access, which just picks one of the available values arbitrarily.

    In this case, you have used min(InvoiceID), which makes sense if your invoice IDs are assigned in ascending order over time - you will then be picking the earliest ID - or if they are in some other kind of meaningful order*. If there is no significance to the order of IDs, then you are effectively using MIN() as an Access-style ANY aggregate. You then need to ask - if I don't want to see any particular Inv ID, do I want to see an Inv ID at all?

    Hope this helps - let us know if you are unsure about anything.

    (*You wouldn't ideally rely on this kind of ordering, but instead MIN() on a proper date column, but to return a corresponding ID would require a subquery or messing around with hash fields which we won't go into here...)

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply