December 6, 2012 at 10:41 am
I made a stupid mistake and I need help figuring out how to get rid of duplicate records. Our schizophrenic database is split into two parts: one for data entry, the other a master database. We use Access as a front end for data entry and analysis. At the end of every year's data collection, we enter that year's data into the data entry database, and when we're done we merge it (using an append query in Access) into the master database. The thinking (before I got here) was to simplify data entry and to protect the master database from seasonal hourlies (and full-time employees who don't have a clue), but it's caused several problems.
I mistakenly ran the append query twice for a number of tables in one year and I'm trying to figure out a good way to clear out the resulting duplicates. Here's an example: in CTAPA (data entry) I've got table dbo_ForestTrees and in CTAPM (Master) I've got table dbo_ForestTreesM. Both tables have the exact same structure and primary key, an autonumber field called sort_id, but when running the append queries we don't include the primary key fields. I need to find a script that would identify the duplicates and then get rid of them, leaving only one set of records.
December 6, 2012 at 11:10 am
Need more information about each schema to help you write the code directly, but here's the short form:
SELECT
-- all fields except identity
FROM
-- duped table
GROUP BY
-- all fields except identity
HAVING
COUNT(*) > 1
This is your list of dupes.
Next, using that as a subquery, you'd then remove whatever dupe had the higher identity. However, you must be careful. If you don't have a datetime stamp on these records to ensure actual duplication, you could very well blow away accurate data. Even with the datetime stamp you've got a problem if you can add the same line item to, say, an invoice twice. Once with a coupon and once without.
Really, since your data is time-based, I'd personally recommend blowing away everything since the previous year's load (i.e., restore from backup) and re-appending. It's the only truly safe option here besides a manual review of EVERYTHING.
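To make the skeleton above concrete, here's roughly what I mean. Treat it as a sketch, not a drop-in: dbo.ForestTreesM, col1 through col3, and sort_id as the identity are stand-ins for your actual table and columns.

```sql
-- Hypothetical names for illustration only; substitute your real table/columns.
-- Step 1: list the duplicate groups.
SELECT col1, col2, col3, COUNT(*) AS copies
FROM dbo.ForestTreesM
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1

-- Step 2: remove every row whose identity is higher than the lowest
-- identity in its duplicate group. Caveat: plain equality won't match
-- NULLs, so NULLable columns need ISNULL() wrappers or explicit handling.
DELETE t
FROM dbo.ForestTreesM AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM dbo.ForestTreesM AS t2
       WHERE t2.col1 = t.col1
         AND t2.col2 = t.col2
         AND t2.col3 = t.col3)
```

The correlated subquery keeps this compatible with older versions of SQL Server. But again: only after you're sure the duplicate definition is right.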
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 6, 2012 at 11:34 am
Thanks for the reply! Each record has a datetime stamp, but for some reason they're not consistent: some duplicated records share the same datetime stamp, but some non-duplicated records also share the same datetime stamp. This is only (only!) about 3,000 rows of data, but it's pretty annoying nonetheless. So I need to approach this from two directions: 1) removing records that are duplicated, and 2) keeping non-duplicated records that happen to have the same datetime stamp. These are records from about 30 different sites, and I was hoping the problems would be consistent within each site, but that's not the case: there's a mix of duplicated and non-duplicated records within the same site.
December 6, 2012 at 2:02 pm
ryetimothy (12/6/2012)
Thanks for the reply! Each record has a datetime stamp, but for some reason they're not consistent, meaning that some duplicated records have the same datetime stamp, but some non-duplicated records also have the same datetime stamp.
You're hosed. Sorry, that's highly non-technical, but you've lost the ability to properly detect your duplicates because you've got other duplicates that are legitimate, at least if I'm reading that correctly. If it's just the timestamp, the other data should distinguish duplicate from non-duplicate for you, since you'd use all the fields when finding your duplicates.
I'm still thinking going to backup and starting over may be the safest bet here from an automation standpoint.
The only other method would be either a linked-server comparison between the source information and your warehouse, using that as your determiner for what's legitimate, or a manual side-by-side review.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 6, 2012 at 2:12 pm
Yeah, that's what I was afraid of. I'm completely hosed because the data in the data entry database is usually deleted after it's merged into the master database. The only thing I can think of is to create a joined table so that I can separate the duplicates that are legitimate (i.e., same species, same time stamp, same location) from those that aren't (i.e., different species, same time stamp, maybe the same location). It's possible I might be able to sort the records by the primary key, assuming the data were added to the database in batches, and clean things up that way. At this point I know there are 1069 duplicate time stamps, but there's a total of 3034 rows of data (in just one table; I've got several other tables to look at). Egads, I've got my work cut out for me...
December 6, 2012 at 5:59 pm
Now wait just a minute. Are you saying that, with the exception of the date/time column, you'd need every column to determine whether a row is a duplicate?
I'm thinking that this isn't a difficult problem to delete the dupes but I need more info. Please post the CREATE TABLE code (including indexes, key, and constraints) and some readily consumable data and we can show you how. What I mean by "readily consumable data" can be found in the article at the first link in my signature line below.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2012 at 6:42 pm
I think you can still dedupe your data.
here's an article on deduping to get you started
http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 10:09 am
Here's the CREATE TABLE script for the full table; the consumable data is below. I had to exclude some of the variables due to length restrictions for QUOTENAME (if I included every variable it kept spitting out null values), and I included data for only one site for brevity. You can see that the rows for some species are duplicated, but that different species can also share the same DataEntryDate. I noticed there's no primary key in the original table (although sort_id is the primary key in the Access table), just a foreign key for PSID (which I realize is not numeric as it should be).
USE [CTAPM]
GO
/****** Object: Table [dbo].[WetlandTreeM] Script Date: 12/07/2012 09:35:52 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[WetlandTreeM](
[SiteID] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[NumOfVisit] [int] NULL,
[PYear] [int] NULL,
[Organism] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Recorder] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Botanist] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Dataentr] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PlotSize] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PWidth] [float] NULL,
[Plength] [float] NULL,
[Squmetr] [float] NULL,
[PlotHA] [float] NULL,
[PSID] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Genus] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Species] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[S_V_F] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Taxon] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PMonth] [int] NULL,
[PDay] [int] NULL,
[Szclass] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[SzclassM] [float] NULL,
[PCount] [float] NULL,
[Notes] [ntext] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DataEntryDate] [datetime] NULL,
[WetlandTreeSites_ID] [int] NULL,
[sort_id] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_WetlandTreeM] PRIMARY KEY CLUSTERED
(
[sort_id] ASC
) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Here's the code for the consumable data:
--===== If the test table already exists, drop it
IF OBJECT_ID('dbo.WetlandTemp','U') IS NOT NULL
DROP TABLE WetlandTemp
--===== Create the test table
CREATE TABLE WetlandTemp
(
[SiteID] [nvarchar](25),
[NumOfVisit] [int] NULL,
[PYear] [int] NULL,
[PSID] [nvarchar](255),
[Szclass] [nvarchar](25),
[PCount] [float] NULL,
[DataEntryDate] [datetime] NULL,
[sort_id] [int] IDENTITY(1,1) NOT NULL
)
--===== All Inserts into the IDENTITY column
SET IDENTITY_INSERT WetlandTemp ON
--===== Insert the test data into the test table
INSERT INTO WetlandTemp
(SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate, sort_id)
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1136'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1137'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1138'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1139'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1140'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1147'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1148'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1149'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1150'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1151'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1152'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1153'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1154'UNION ALL
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1260'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1261'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1262'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1263'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1264'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1265'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1266'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1267'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1268'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1269'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1270'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1271'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1272'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1273'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1274'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1275'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1276'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1277'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1278'
--===== Set the identity insert back to normal
SET IDENTITY_INSERT WetlandTemp OFF
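For reference, here's a quick check against that sample, assuming a duplicate means every column except sort_id matches. On this sample it should surface the double-loaded groups, each with a count of 2 (the second copy's sort_id landing in the 1260-1278 range):

```sql
-- Surface groups that were loaded more than once (all columns except
-- the identity participate in the grouping).
SELECT SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate,
       COUNT(*) AS copies,
       MIN(sort_id) AS first_sort_id,
       MAX(sort_id) AS last_sort_id
FROM WetlandTemp
GROUP BY SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate
HAVING COUNT(*) > 1
```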
December 7, 2012 at 10:23 am
This should be easily deduped using that script in the link I provided - you will just need to customize it for your table.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 10:48 am
Tried running the code in your link (I just copied and pasted it) for practice before using my own data, and got the following error messages:
Msg 156, Level 15, State 1, Line 16
Incorrect syntax near the keyword 'WITH'.
Msg 195, Level 15, State 10, Line 18
'ROW_NUMBER' is not a recognized function name.
Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword 'WITH'.
Msg 195, Level 15, State 10, Line 26
'ROW_NUMBER' is not a recognized function name.
I'm running SQL Server 2005 on my machine, but the server I've been working with is still running SQL Server 2000. Not sure if that's the issue.
December 7, 2012 at 11:06 am
Nope, you can't run that code on SQL 2000. I figured it was SQL 2008 since a version wasn't noted and this is the 2008 forum.
It will take a different method - but it is still doable and will take a bit of time to work it out.
I'll post back if I get it working before somebody else.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 11:40 am
ryetimothy (12/7/2012)
Here's the CREATE TABLE script for the full table, the consumable data is below. I had to exclude some of the variables due to length restrictions for QUOTENAME (if I included every variable it kept spitting out null values). I also included the data for only one site for brevity. You can see that the rows for some of the species is duplicated, but that different species can also have the same DataEntryDate.
Ah hah, okay, thanks Jeff. You read his description differently than I did; I thought there were valid dupes as well as invalid. Apologies about that, Timothy.
Yes, this can be deduped, as the others have already pointed out. SQL 2k? Yeah, I can find a way to make this work. It will take a bit, though; I need to find a dev 2k machine, and I forgot where we stuck it... :blush:
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 7, 2012 at 11:55 am
It's my fault for using confusing syntax. I had a way of dealing with the duplicates in Access, but the presence of records with identical data entry dates was causing me most of my troubles. I was planning on creating a temporary table, making DataEntryDate the primary key, and then appending the old data to the new table, which would have kicked out the duplicates. Unfortunately, because the DataEntryDate value was the same for multiple rows (and not just those that were duplicates), it kicked out every record with the same DataEntryDate except one. On the plus side, my method would have gotten rid of the duplicates, but it would have gotten rid of other records as well.
Yeah, we're running some pretty old machines here. We just got a new server a few months ago (with an up to date version of SQL Server installed) and over the next month or two I'll be playing with migrating our data over to the new server. Hopefully that should solve some of our data and connectivity problems.
December 7, 2012 at 12:12 pm
ryetimothy (12/7/2012)
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1265'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1266'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1267'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1268'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1269'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1270'UNION ALL
...
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1274'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1275'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1276'UNION ALL
So, which rows do you want to eliminate and keep for each of those two species and why?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 7, 2012 at 12:27 pm
It looks like there are 38 rows of data in the table, but there should only be 19. Rows with sort_id's from 1136 to 1154 are one group and were the first to be appended and they're listed below:
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1136'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1137'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1138'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1139'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1140'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1147'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1148'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1149'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1150'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1151'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1152'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1153'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1154'
The other rows have a different sequence of sort_ids and were appended at a later date, so they're the duplicates and should go. Essentially, this is tree data: we record trees in different size classes (Szclass) and count the number of trees of each species in each size class (PCount). So, for example, Acer saccharinum has 6 rows of data because we found trees in 6 different size classes.
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
In size class A we found 4 individuals, there were 6 in size class B, 4 in size class C, etc. With the exception of size class F, all of these rows have the same data entry date, probably because when doing data entry, the person doing the data entry created one row of data and then copied and pasted and then changed anything that needed to be changed for each size class.
Edit: Anything with the same PSID AND Szclass AND PCount is a duplicate.
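Given that rule, a SQL 2000-friendly cleanup (no ROW_NUMBER required) can keep the lowest sort_id in each group and delete the rest. This is a sketch against the WetlandTemp sample above; I've included SiteID in the match on the assumption the rule applies within a site. Run the SELECT first to eyeball what would go:

```sql
-- Rows that would be deleted: everything except the lowest sort_id
-- in each (SiteID, PSID, Szclass, PCount) group.
SELECT t.*
FROM WetlandTemp AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM WetlandTemp AS t2
       WHERE t2.SiteID  = t.SiteID
         AND t2.PSID    = t.PSID
         AND t2.Szclass = t.Szclass
         AND t2.PCount  = t.PCount)

-- Once that looks right, swap SELECT for DELETE:
DELETE t
FROM WetlandTemp AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM WetlandTemp AS t2
       WHERE t2.SiteID  = t.SiteID
         AND t2.PSID    = t.PSID
         AND t2.Szclass = t.Szclass
         AND t2.PCount  = t.PCount)
```

The correlated MIN() subquery works on SQL 2000 and on the newer versions alike; on the real table, wrap the DELETE in a transaction and check @@ROWCOUNT before committing.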