November 5, 2010 at 10:38 am
I have a query that I run daily against a database that is dumped to me nightly. I want to be able to populate a new database called Newclients and avoid duplicates. This is the query that I'm using:
insert into msbtotal.dbo.newclients
SELECT tcms_members.dbo.memberdata2.* FROM tcms_members.dbo.memberdata2 left outer join
msbtotal.dbo.memberdata on tcms_members.dbo.memberdata2.id =
msbtotal.dbo.memberdata.id where msbtotal.dbo.memberdata.id is
null
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
Here is some of my sample data:
6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL
6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL
6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL
6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL
6009837 Some Person Some Person, MD, MPH P UT SOUTHWESTERN PSYCHIATRY RES (512) 555-5555 (512) 555-5555 Jun 26 1975 12:00AM 3501 Residence Ave. Austin TX 78731 2010-09-29 11:53:30.000 2006-06-09 00:00:00.000 2007-11-28 00:00:00.000
6009837 Some Person Some Person, MD, MPH P UT SOUTHWESTERN PSYCHIATRY RES (512) 555-5555 (512) 555-5555 Jun 26 1975 12:00AM 3501 Residence Ave. Austin TX 78731 2010-09-29 11:53:30.000 2006-06-09 00:00:00.000 2007-11-28 00:00:00.000
As you can see, there is duplicate data, and I need only one copy of each "new" record in my database. Can someone please assist me on this.
Doug
November 5, 2010 at 11:09 am
several options
1) change the data source to provide clean data to begin with (I know, most probably unlikely to happen, but that's the ideal scenario 😉 )
2) SELECT DISTINCT tcms_members.dbo.memberdata2.* instead of SELECT tcms_members.dbo.memberdata2.*
3) use GROUP BY your_list_of_columns
I'd go with option 2 ...
As a side note: instead of using INSERT INTO target SELECT * you should reference the target and source column names. This is a little more work at the beginning but it'll pay off later on (e.g. if either the source or the target will have additional columns or even a different order of columns)...
November 5, 2010 at 11:11 am
maybe
SELECT distinct tcms_members.dbo.memberdata2.* FROM tcms_members.dbo.memberdata2 left outer join
msbtotal.dbo.memberdata on tcms_members.dbo.memberdata2.id =
msbtotal.dbo.memberdata.id where msbtotal.dbo.memberdata.id is
null
or maybe if they aren't full duplicates something like
select ... from
(
SELECT tcms_members.dbo.memberdata2.*, seq = rownumber() over(partition by tcms_members.dbo.memberdata2.id order by tcms_members.dbo.memberdata2.id )
FROM tcms_members.dbo.memberdata2
left outer join msbtotal.dbo.memberdata
on tcms_members.dbo.memberdata2.id = msbtotal.dbo.memberdata.id
where msbtotal.dbo.memberdata.id is null
) where seq = 1
Cursors never.
DTS - only when needed and never to control.
November 5, 2010 at 11:56 am
Thank you both for the suggestions. I am going to go with the SELECT DISTINCT option. Now the last question I have is how do I delete all the duplicates that I already have in my table?
November 5, 2010 at 12:19 pm
Do you have any column in your table to identify a single row? (e.g. an identity column?)
If so, you could use a subquery to get either the min or max id per group and delete all other rows.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply