How to delete duplicates updateing related tables?

  • I have two tables

    tbl_tracks

    TrackID (identity)

    TitleID (int)

    ArtistID (int)

    tbl_disc_tracks

    DiscTrackID (identity)

    TrackID (int)

    DiscId (int)

    I want to look in tbl_tracks for rows where titleID and ArtistID are the same for each row. I then want to get the min TrackID and update tbl_disc_tracks before deleting the duplicate rows.

    with my update I update all the none min values but im unsure as to what I should do for my TrackID

    
    
    UPDATE tbl_disc_tracks SET TrackID = ?????
    WHERE (TrackID IN
    (SELECT tbl_Temp.TrackID
    FROM tbl_Tracks AS tbl_temp
    WHERE EXISTS
    (SELECT ArtistID, TitleID, COUNT(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_Temp.ArtistID AND tbl_Tracks.TitleID = tbl_Temp.TitleID
    GROUP BY tbl_Tracks.ArtistID, tbl_Tracks.TitleID
    HAVING COUNT(tbl_Tracks.TrackID) > 1))) AND (TrackID NOT IN
    (SELECT MIN(TrackID)
    FROM tbl_Tracks AS tbl_temp
    WHERE EXISTS
    (SELECT ArtistID, TitleID, COUNT(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_Temp.ArtistID AND tbl_Tracks.TitleID = tbl_Temp.TitleID
    GROUP BY tbl_Tracks.ArtistID, tbl_Tracks.TitleID
    HAVING COUNT(tbl_Tracks.TrackID) > 1)
    GROUP BY ArtistID, TitleID))
    ORDER BY ArtistID

    then to delete delete all none min values Ive got ......

    
    
    DELETE FROM
    FROM tbl_tracks
    WHERE (TrackID IN
    (SELECT tbl_Temp.TrackID
    FROM tbl_Tracks AS tbl_temp
    WHERE EXISTS
    (SELECT ArtistID, TitleID, COUNT(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_Temp.ArtistID AND tbl_Tracks.TitleID = tbl_Temp.TitleID
    GROUP BY tbl_Tracks.ArtistID, tbl_Tracks.TitleID
    HAVING COUNT(tbl_Tracks.TrackID) > 1))) AND (TrackID NOT IN
    (SELECT MIN(TrackID)
    FROM tbl_Tracks AS tbl_temp
    WHERE EXISTS
    (SELECT ArtistID, TitleID, COUNT(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_Temp.ArtistID AND tbl_Tracks.TitleID = tbl_Temp.TitleID
    GROUP BY tbl_Tracks.ArtistID, tbl_Tracks.TitleID
    HAVING COUNT(tbl_Tracks.TrackID) > 1)
    GROUP BY ArtistID, TitleID))
    ORDER BY ArtistID
  • Not sure what you mean about doing the update.

    If there are duplicate title and artists, are you trying to get the trackid values in the child table set to the min(trackid)?

    Won't this give you duplicates? I would think you want to remove the duplicates in the child first, then remove the parent. IS this not the case?

    Steve Jones

    sjones@sqlservercentral.com

    http://www.sqlservercentral.com/columnists/sjones

  • I think you're making this overly complex for yourself. Try this:

    
    
    UPDATE tbl_disc_tracks
    SET TrackID = (SELECT MIN(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_disc_tracks.ArtistID
    AND tbl_Tracks.TitleID = tbl_disc_tracks.TitleID)

    Obviously, this will update ALL rows, but aside from performance concerns, this won't give you any errant data. If performance is a major concern, i.e. if tbl_disc_tracks is a very large table, you could add the following WHERE clause to the update statement:

    
    
    WHERE EXISTS (SELECT tbl_Tracks.TrackID
    FROM tbl_Tracks
    WHERE tbl_Tracks.ArtistID = tbl_disc_tracks.ArtistID
    AND tbl_Tracks.TitleID = tbl_disc_tracks.TitleID
    GROUP BY tbl_Tracks.TrackID
    HAVING COUNT(*) > 1)

    (I believe I got that right)

    Then if all you're trying to do is delete

    rows in tbl_Tracks where the TrackID is not the minimum trackID, you don't even need to join to tbl_disc_tracks:

    
    
    DELETE tbl_Tracks
    WHERE TrackID > (SELECT MIN(TrackID)
    FROM tbl_Tracks AS A
    WHERE tbl_Tracks.ArtistID = A.ArtistID
    AND tbl_Tracks.TitleID = A.TitleID)

    I hope this helps.

    Matthew Burr

  • it is the tblTracks that have duplicates.

    e.g.

    TrackID, ArtistID, TitleID

    1, 1, 1

    2, 1, 2

    3, 2, 3

    4, 2, 3

    5, 2, 1

    In this example I would want to delete the row trackID = 4 and keep TrackID = 3.

    But because I relate to it in my tbl_Disc_Tracks I need to replace TrackID's in

    tbl_Disc_Tracks where the TrackID is 4 with 3.

    would i be better with a cursor and doing a loop ?

    e.g.

    UPDATE tbl_disc_tracks SET TrackID = 3 WHERE TrackID = 4?

    I was trying to avoid this.

    I wanted to try and do the update in one query then the delete in another.

    My original DELETE query creates two lists. one of all the duplicates then one of all the duplicate minimum ID's. it says delete the ones IN the duplicates but and NOT IN the minimum. this bit works, im lost with my UPDATE.

    hope that clears it up.

  • Look a little more closely at the response. Let's look at your first problem: you need to update tbl_disc_tracks so that the value for the TrackId is the value of the track that you are keeping in tbl_Tracks, and the value that you are keeping in tbl_Tracks is the minimum value of a set of duplicates; therefore, the value that you need for TrackID is the the minimum TrackID from the set of duplicates in tbl_tracks. So, first of all: how do you define duplicates? In your case, they are rows that have the same TitleID and the same ArtistID. Therefore, by grouping on these two columns, you will "merge" those duplicates into one row. And how do you get the minimum value for TitleID? By using the MIN aggregate function. Thus, the query:

    
    
    SELECT TitleID, ArtistID, MIN(TrackID)
    FROM tbl_Tracks
    GROUP BY TitleID, ArtistID

    This returns the minimum trackid for every TitleID, ArtistID combination in your table.

    Now, you want to take that value and update tbl_Disc_Tracks so that that value is in the TrackID column, because you are going to delete the other duplicate rows out of tbl_Tracks, and in order to maintain a relationship between the two tables, tbl_Disc_Tracks.TrackID must match what exists in tbl_Tracks for a given TitleID, ArtistID combination. So, first the basic update statement:

    
    
    UPDATE tbl_Disc_Tracks
    SET TrackID = (SELECT tbl_Tracks.TitleID, tbl_Tracks.ArtistID, MIN(TrackID)
    FROM tbl_Tracks
    GROUP BY tbl_Tracks.TitleID, ArtistID)

    *Note: the above code won't work; this is just conceptual.

    We've used our select statement that gives us the TrackID for the duplicate record that we'll be keeping, and we're going to update tbl_Disc_Tracks so that the TrackID in that table matches the TrackID of the record we're keeping; but notice that right now, there's no correlation between our select statement and our update statement. We must alter our statement thusly:

    
    
    UPDATE tbl_Disc_Tracks
    SET TrackID = (SELECT MIN(TrackID)
    FROM tbl_Tracks
    WHERE tbl_Disc_Tracks.TitleID = tbl_Tracks.TitleID
    AND tbl_Disc_Tracks.ArtistID = tbl_Tracks.ArtistID)

    Notice that in this case, I removed the GROUP BY. Think of what this query does. First, it SELECTS all of the data from tbl_Disc_Tracks. Next, starting with the first row, it finds all of the rows in tbl_Tracks that have the corresponding ArtistID and TitleID. Now, working with that set of matching rows from tbl_Tracks, it finds the smallest TitleID among them, and it returns that value. Finally, it takes that value and it updates Title_ID in tbl_Disc_Tracks so that it has that value. Later, you'll be deleting all of the duplicate rows in tbl_Tracks, leaving only the row that has this smallest TrackID; so, this is the value you want in tbl_Disc_Tracks.TitleID. Thus, we've met your first requirement; we've updated tbl_Disc_Tracks with the minimum trackID before deleting the duplicates from tbl_Tracks.

    Now, as for deleting the duplicates. You have decided that you want to delete the rows that do not have the minimum trackID. So we'll first find the minimum trackID, as we did before:

    
    
    SELECT TitleID, ArtistID, MIN(TrackID)
    FROM tbl_Tracks
    GROUP BY TitleID, ArtistID

    Next, our basic delete statement:

    
    
    DELETE tbl_Tracks
    WHERE TitleID ????

    And how do we finish that WHERE clause? We're looking to delete rows with TitleID's that are greater than the minimum titleID, so:

    
    
    WHERE TitleID > (SELECT MIN(TrackID)
    FROM tbl_Tracks AS A
    WHERE tbl_Tracks.TitleID = A.TitleID
    AND tbl_Tracks.ArtistID = A.ArtistID)

    See once again how we've established a correlation between our query that selects the minimum value and our delete query? This meets your second requirement: it deletes all of the duplicates, saving only the rows that have the minimum TrackID.

    So, give it a shot. Copy your data to a test environment and ensure that it does what you're trying to do. I'm sure you'll see that it does.

    Thanks,

    Matthew Burr

    Edited by - mdburr on 10/27/2002 12:13:39 AM

  • quote:


    WHERE tbl_Disc_Tracks.TitleID = tbl_Tracks.TitleID AND tbl_Disc_Tracks.ArtistID = tbl_Tracks.ArtistID)


    these two tables are only related by trackID. in tbl_Disc_Tracks there is no artist or title ID only a trackID.

  • Okay. I don't know how I missed that. Anyway, let's start by filtering down the set of rows you want to UPDATE:

    
    
    UPDATE tbl_Disc_Tracks
    SET TrackID = ????
    WHERE EXISTS (SELECT *
    FROM tbl_Tracks AS A
    JOIN tbl_Tracks AS B
    ON A.TitleID = B.TitleID
    AND A.ArtistID = B.ArtistID
    AND A.TrackID <> B.TrackID
    WHERE A.TrackID = tbl_Disc_Tracks.TrackID)

    I've used a somewhat different approach. I'm identifying duplicates as situations where rows exist that have the same TitleID and ArtistID but different TrackIDs when I do a self-join on tbl_Tracks. I'm correlating this back to the tbl_Disc_Tracks that we're updating.

    Now, the real part of your question: what do we fill-in for the SET?

    How about this:

    
    
    SET TrackID = (SELECT MIN_TRACK_ID
    FROM tbl_Tracks AS A
    JOIN (SELECT TitleID, ArtistID, MIN(TrackID)
    FROM tbl_Tracks
    GROUP BY TitleID, ArtistID) AS B (TitleID, ArtistID, MIN_TRACK_ID)
    ON A.TitleID = B.TitleID
    AND A.ArtistID = B.ArtistID
    WHERE A.TrackID = tbl_Disc_Tracks.TrackID)

    So, we're finding the row in tbl_Tracks that has the same TrackID as the given row in tbl_Disc_Tracks. We're then using the TitleID and ArtistID row to join back to a subquery on tbl_Tracks that groups all the records by TitleID and ArtistID and provides their minimum TrackID. We're feeding this minimum trackID back to the SET clause.

    The DELETE statement I provided should still be fine.

    Sorry for the confusion; I hope this solution works for you.

    Matthew Burr

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply