March 29, 2018 at 10:06 am
Our database has a lot of duplicate records (one person with more than one record). I'd like to write a script so that it pulls non-dupe records and only one of the dupe records for people that have more than one record based on the last time it was modified. I have a table (data from an external source) that is only one person and one record and I'd like to compare the email addresses from that table to the email address field in a table that our in house app contains (the table with people that have more than one record).
Here's some arbitrary data of what I'm referring to....
ID EMAIL LAST_UPDATED_DT
10 jane@abc.com 2011-01-05
13 jane@abc.com 2012-06-15
17 john@bcd.com 2011-01-10
22 jill@cde.com 2015-06-01
23 jill@cde.com 2011-12-01
30 joe@def.com 2012-02-01
32 pat@efg.com 2016-03-02
39 pat@efg.com 2017-09-12
For the sake of example, let's assume every single unique email address above is in the other table (data from an external source) that's providing the list of unique email addresses. That actually won't be the case but I'd rather keep this message on the shorter side.
This is what I'd like the result of the query to look like...
ID EMAIL
13 jane@abc.com
17 john@bcd.com
22 jill@cde.com
30 joe@def.com
39 pat@efg.com
While I've spent time Googling, I can't find a script that does this or examples of scripts to help me get the desired result. Basically, when the script encounters a group of 2 or more records based on email address being the same, I want it to sort and use the record that was last updated with the most recent date.
Even if you don't have time to give a sample script, I'd love to learn how to put this together on my own so if I could get some key words on what to search for, I'd appreciate it. I do have a script that finds dupes and gives a count, I just don't know how to pick a single record out of a grouping. Also, the script I have only finds dupes and that's not what I want to do in this case.
TIA,
John
March 29, 2018 at 10:10 am
If you know ID is a sequential int that should be pretty straight forward.
SELECT MAX(ID), EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAIL
March 29, 2018 at 10:10 am
Your going to want to use GROUP BY in a CTE to group by the e-mail address, and select that along with the MAX(LAST_UPDATED_DT). Then you can join the CTE name to the original table name by e-mail and by LAST_UPDATED_DT = whatever column name you gave the MAX value. Does that help you get it going?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
March 29, 2018 at 10:15 am
Sorry...I should've posted this in the T-SQL section.
Anyway, while the ID is a number (stored as a varchar(15) data type), it's not always going to be the largest ID number. Sometimes it's going to be a smaller ID number. Sorry, I should've reflected that in my example. I'll go ahead and give this a try then...
SELECT ID, EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAIL
Thanks,
John
March 29, 2018 at 10:16 am
sgmunson - Thursday, March 29, 2018 10:10 AMYour going to want to use GROUP BY in a CTE to group by the e-mail address, and select that along with the MAX(LAST_UPDATED_DT). Then you can join the CTE name to the original table name by e-mail and by LAST_UPDATED_DT = whatever column name you gave the MAX value. Does that help you get it going?
Yes, this does help. I'll give it a try and I'll report back.
March 29, 2018 at 10:22 am
J M-314995 - Thursday, March 29, 2018 10:15 AMSorry...I should've posted this in the T-SQL section.Anyway, while the ID is a number (stored as a varchar(15) data type), it's not always going to be the largest ID number. Sometimes it's going to be a smaller ID number. Sorry, I should've reflected that in my example. I'll go ahead and give this a try then...
SELECT ID, EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAILThanks,
John
Don't include ID in the GROUP BY query. That's going to return all the different ID values and you need to get down to just one e-mail address and it's maximum date. You can do this:WITH GROUPED_DATA AS (
SELECT EMAIL, MAX(LAST_UPDATED_DT) AS MAX_UPDATE
FROM YOUR_TABLE_NAME_HERE
GROUP BY EMAIL
)
SELECT T.ID, G.EMAIL, G.MAX_UPDATE
FROM GROUPED_DATA AS G
INNER JOIN YOUR_TABLE_NAME_HERE AS T
ON G.EMAIL = T.EMAIL;
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
March 29, 2018 at 10:26 am
SELECT ID, EMAIL
FROM (
SELECT ID, EMAIL, MostRecent = ROW_NUMBER() OVER (PARTITION BY EMAIL ORDER BY LAST_UPDATED_DT DESC)
FROM whatever
) m
WHERE MostRecent = 1
March 29, 2018 at 10:35 am
Scott Coleman - Thursday, March 29, 2018 10:26 AMSELECT ID, EMAIL
FROM (
SELECT ID, EMAIL, MostRecent = ROW_NUMBER() OVER (PARTITION BY EMAIL ORDER BY LAST_UPDATED_DT DESC)
FROM whatever
) m
WHERE MostRecent = 1
Awesome. Thank you. I love learning how to do things in multiple ways.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply