October 6, 2006 at 6:43 pm
For instance, the PK is defined as follows:
KeyID = UniqueIdentifier (GUID)
FirstName = char(n)
LastName = char(n)
and there are other fields which may or may not be unique, for instance:
MoreInfo = char(n)
The data looks as follows:
{F73E32B9-D0DF-427E-A0EC-48EA51658A52} David Russell something
{7DEEBD8A-86D0-47A2-9217-B98255AF57E2} David Russell whatever
{0C277B36-6423-4C2A-9459-C49097EC28A9} David Russell whatever
{5C9C9DE1-6283-4F7E-997E-F1886973FAF9} John Smith something
I want to get rid of duplicate rows based on FirstName||LastName ONLY. I do not care about any of the additional information. But based on PK these rows are unique.
We got the job done; but it was messy, and we use GUIDs all over the place, so this is bound to be needed again. How about a three or four line solution? Anybody?
Thank-you,
David Russell
Any Cloud, Any Database, Oracle since 1982
October 6, 2006 at 8:03 pm
Firstly, KeyID looks to be always unique, thus could you not remove FirstName and LastName from the primary key and create a unique constraint on these two columns?
As for automatically removing them my concerns about that would be if the guid was used elsewhere referencing the one in this table you wouldn't be able to remove the row. Also of the duplicates how do you decide which one is the correct row of the two?
My solution to this problem however was to join the table to itself to find all the duplicates. Guids can be compared to each other in sql server for less than and greater than, by using this it is possible to exclude the highest/lowest guid from a set of results so that the server will not select one of the duplicates. However, the higher/lower seems to be quite arbitry when viewed against the string representation of the guid, but as long as you don't care which duplicate row is decided to be the unique row to be left it doesn't matter.
My solution
CREATE TABLE #People ( KeyID uniqueidentifier, FirstName varchar(20), LastName varchar(20) ) INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'David', 'Russell') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'David', 'Russell') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'David', 'Russell') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'David', 'Russell') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'John', 'Smith') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'John', 'Smith') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'John', 'Smith') INSERT INTO #People (KeyID, FirstName, LastName) VALUES (newid(), 'Joe', 'Doe') SELECT KeyID, FirstName, LastName FROM #People ORDER BY FirstName, LastName, KeyID --This is the select statement that does the finding of all but the 'first' duplicate SELECT DISTINCT b.KeyID, b.FirstName, b.LastName FROM #People AS a INNER JOIN #People AS b ON a.FirstName = b.FirstName AND a.LastName = b.LastName AND a.KeyID < b.KeyID ORDER BY FirstName, LastName, KeyID --the order isn't needed for when used for deleting, --just makes the results order the same as the above select DELETE FROM #People WHERE KeyID IN ( SELECT DISTINCT b.KeyID FROM #People AS a INNER JOIN #People AS b ON a.FirstName = b.FirstName AND a.LastName = b.LastName AND a.KeyID < b.KeyID ) SELECT KeyID, FirstName, LastName FROM #People ORDER BY FirstName, LastName, KeyID DROP TABLE #People
October 10, 2006 at 6:48 pm
Good job! thank you very much. It will go in my script library for future use. For anyone else who tries it, you will have to add the "b." to clarify the order by columns; but you get a very clear error where it is required.
This is not a trivial problem. Thanks again for your help.
David
Thank-you,
David Russell
Any Cloud, Any Database, Oracle since 1982
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply