Cleaning up duplicate data in one field??

Question

Cleaning up duplicate data in one field??

skailey

Mr or Mrs. 500

Points: 584
More actions
January 19, 2010 at 8:58 am

#156589

Hello...
Once upon a time, we had done a bulk insert into a table and somehow repeated the same data in the mailing address field without catching it. I am trying to clean these up and I am having a difficult time finding a clean way to look for all the records where this has happened.
For example:
MailingAddress varchar(40)
1016 ROCKY CREEK DRIVE 1016 ROCKY CREEK
1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL
In these examples, you can see where it was somehow inserted twice up to the point where the data was maxed in the field. I need to keep the correct instance and clean up the extra characters. Any ideas?
Thank you!

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

matt6288 SSCertifiable Points: 5058 More actions · Answer 1

This solution assumes that the first "word" (the left characters up to the first space in the address) do not repeat in the actual address. See if this helps.

CREATE TABLE #temp (mailingAddress VARCHAR(40))

INSERT INTO #temp

SELECT '1016 ROCKY CREEK DRIVE 1016 ROCKY CREEK '

UNION ALL

SELECT '1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL'

SELECT RTRIM(LEFT(mailingAddress,CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress))))-1)), * FROM #temp T

DROP TABLE #temp

skailey Mr or Mrs. 500 Points: 584 More actions · Answer 2

That does seem to clean up my two examples, but what would be the best way to track down all the records in my db that are like this so I can insert them into the temp table for update?

matt6288 SSCertifiable Points: 5058 More actions · Answer 3

I would try this on a copy of your table first as I haven't tested this fully with all variations of addresses but couldn't you just update your table in place?

CREATE TABLE #temp (mailingAddress VARCHAR(40))

INSERT INTO #temp

SELECT '1016 ROCKY CREEK DRIVE 1016 ROCKY CREE'

UNION ALL

SELECT '1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL'

UNION ALL

SELECT 'THIS ADDRESS SHOULD NOT BE CHANGED'

UPDATE #temp

SET mailingAddress = RTRIM(LEFT(mailingAddress,CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress))))-1))

WHERE CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)))) > 1

SELECT * FROM #temp

DROP TABLE #temp

Where #temp is your table name?

skailey Mr or Mrs. 500 Points: 584 More actions · Answer 4

Thanks Matt... This is going to work.

I did find other records that were duplicated a bit different. I am assuming I will need to tweak your code to change those.

example

1105 1105 Brookshire Ln

2450 2450 Middleset Trail

In these.. I would need to delete one of the numbers.

Craig@Work SSCrazy Points: 2346 More actions · Answer 5

... for each row

....... what is the index of the last character

........... of the largest substring

............... repeating no more than once

Is the above statement a basis for a solution to the problem?

i.e. Having found the above for each row, take the left LEFT(<address>, <lastindex>) and use this in the UPDATE? (if it it a large table of course, it may be better to INSERT into a new one, check the results, and then replace the old).

With the SUBSTRING function and an Auxiliary Table of Numbers, a scan can be created to calculate a substring combinations (starting from the begining of the address) for one, and then working backwards for the comparator (Would the article 'Optimising “Ends With” searches with REVERSE' published on this site help here?).

The first substring length should be >= to the second (which may be zero-length).

If previous replies did not solve the issue, would the above help as the basis for a solution (if I understood the problem correctly?)?.

matt6288 SSCertifiable Points: 5058 More actions · Answer 6

Yeah, the update statement above won't work and would actually change the addresses you just gave to be 1105 and 2450 respectively. If you are sure those are the only 2 you could just ignore them in a where clause but if you need to code for this case then you would have to change the current where clause.