December 6, 2012 at 10:41 am
I made a stupid mistake and I need help figuring out how to get rid of duplicate records. Our schizophrenic database is split into two parts: one for data entry, the other a master database. We use Access as a front end for data entry and analysis. At the end of every year's data collection, we enter that year's data into the data entry database, and when we're done we merge it (using an append query in Access) into the master database. The thinking (before I got here) was to simplify data entry and to protect the master database from seasonal hourlies (and full-time employees who don't have a clue), but it's caused several problems.
I mistakenly ran the append query twice for a number of tables in one year and I'm trying to figure out a good way to clear out the resulting duplicates. Here's an example: in CTAPA (data entry) I've got table dbo_ForestTrees and in CTAPM (Master) I've got table dbo_ForestTreesM. Both tables have the exact same structure and primary key, an autonumber field called sort_id, but when running the append queries we don't include the primary key fields. I need to find a script that would identify the duplicates and then get rid of them, leaving only one set of records.
December 6, 2012 at 11:10 am
Need more information about each schema to help you write the code directly, but here's the short form:
SELECT
-- all fields except identity
FROM
-- duped table
GROUP BY
-- all fields except identity
HAVING
COUNT(*) > 1
This is your list of dupes.
Next, using that as a subquery, you'd then remove whatever dupe had the higher identity. However, you must be careful. If you don't have a datetime stamp on these records to ensure actual duplication, you could very well blow away accurate data. Even with the datetime stamp you've got a problem if you can add the same line item to, say, an invoice twice. Once with a coupon and once without.
Really, since your data is time-based, I'd personally recommend blowing away everything since the previous year's load (i.e., restore from backup) and re-appending. It's the only truly safe option here besides a manual review of EVERYTHING.
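To make the skeleton above concrete, here's roughly what I mean. Treat it as a sketch, not a drop-in: dbo.ForestTreesM, col1 through col3, and sort_id as the identity are stand-ins for your actual table and columns.

```sql
-- Hypothetical names for illustration only; substitute your real table/columns.
-- Step 1: list the duplicate groups.
SELECT col1, col2, col3, COUNT(*) AS copies
FROM dbo.ForestTreesM
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1

-- Step 2: remove every row whose identity is higher than the lowest
-- identity in its duplicate group. Caveat: plain equality won't match
-- NULLs, so NULLable columns need ISNULL() wrappers or explicit handling.
DELETE t
FROM dbo.ForestTreesM AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM dbo.ForestTreesM AS t2
       WHERE t2.col1 = t.col1
         AND t2.col2 = t.col2
         AND t2.col3 = t.col3)
```

The correlated subquery keeps this compatible with older versions of SQL Server. But again: only after you're sure the duplicate definition is right.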
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 6, 2012 at 11:34 am
Thanks for the reply! Each record has a datetime stamp, but for some reason they're not consistent: some duplicated records share the same datetime stamp, but some non-duplicated records also share the same datetime stamp. This is only (only!) about 3,000 rows of data, but it's pretty annoying nonetheless. So I need to approach this from two directions: 1) removing records that are duplicated, and 2) keeping non-duplicated records that happen to have the same datetime stamp. These are records from about 30 different sites, and I was hoping the problems would be consistent within each site, but that's not the case: there's a mix of duplicated and non-duplicated records within the same site.
December 6, 2012 at 2:02 pm
ryetimothy (12/6/2012)
Thanks for the reply! Each record has a datetime stamp, but for some reason they're not consistent, meaning that some duplicated records have the same datetime stamp, but some non-duplicated records also have the same datetime stamp.
You're hosed. Sorry, that's highly non-technical, but you've lost the ability to properly detect your duplicates because you've got other duplicates that are legitimate, at least if I'm reading that correctly. If it's just the timestamp, the other data should distinguish duplicate from non-duplicate for you, since you'd use all the fields when finding your duplicates.
I'm still thinking going to backup and starting over may be the safest bet here from an automation standpoint.
The only other method would be either a linked-server comparison between the source information and your warehouse, using that as your determiner for what's legitimate, or a manual side-by-side review.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 6, 2012 at 2:12 pm
Yeah, that's what I was afraid of. I'm completely hosed because the data in the data entry database is usually deleted after it's merged into the master database. The only thing I can think of is to create a joined table so that I can separate the duplicates that are legitimate (i.e., same species, same time stamp, same location) from those that aren't (i.e., different species, same time stamp, maybe the same location). It's possible I might be able to sort the records by the primary key, assuming the data were added to the database in batches, and clean things up that way. At this point I know there are 1069 duplicate time stamps, but there's a total of 3034 rows of data (in just one table; I've got several other tables to look at). Egads, I've got my work cut out for me...
December 6, 2012 at 5:59 pm
Now wait just a minute. Are you saying that, with the exception of the date/time column, you'd need every column to determine whether a row is a duplicate?
I'm thinking that this isn't a difficult problem to delete the dupes but I need more info. Please post the CREATE TABLE code (including indexes, key, and constraints) and some readily consumable data and we can show you how. What I mean by "readily consumable data" can be found in the article at the first link in my signature line below.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2012 at 6:42 pm
I think you can still dedupe your data.
here's an article on deduping to get you started
http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 10:09 am
Here's the CREATE TABLE script for the full table; the consumable data is below. I had to exclude some of the variables due to length restrictions for QUOTENAME (if I included every variable it kept spitting out null values), and I included data for only one site for brevity. You can see that the rows for some species are duplicated, but that different species can also share the same DataEntryDate. I noticed there's no primary key in the original table (although sort_id is the primary key in the Access table), just a foreign key for PSID (which I realize is not numeric as it should be).
USE [CTAPM]
GO
/****** Object: Table [dbo].[WetlandTreeM] Script Date: 12/07/2012 09:35:52 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[WetlandTreeM](
[SiteID] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[NumOfVisit] [int] NULL,
[PYear] [int] NULL,
[Organism] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Recorder] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Botanist] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Dataentr] [nvarchar](150) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PlotSize] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PWidth] [float] NULL,
[Plength] [float] NULL,
[Squmetr] [float] NULL,
[PlotHA] [float] NULL,
[PSID] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Genus] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Species] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[S_V_F] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Taxon] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PMonth] [int] NULL,
[PDay] [int] NULL,
[Szclass] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[SzclassM] [float] NULL,
[PCount] [float] NULL,
[Notes] [ntext] COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DataEntryDate] [datetime] NULL,
[WetlandTreeSites_ID] [int] NULL,
[sort_id] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_WetlandTreeM] PRIMARY KEY CLUSTERED
(
[sort_id] ASC
) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Here's the code for the consumable data:
--===== If the test table already exists, drop it
IF OBJECT_ID('dbo.WetlandTemp','U') IS NOT NULL
DROP TABLE WetlandTemp
--===== Create the test table
CREATE TABLE WetlandTemp
(
[SiteID] [nvarchar](25),
[NumOfVisit] [int] NULL,
[PYear] [int] NULL,
[PSID] [nvarchar](255),
[Szclass] [nvarchar](25),
[PCount] [float] NULL,
[DataEntryDate] [datetime] NULL,
[sort_id] [int] IDENTITY(1,1) NOT NULL
)
--===== All Inserts into the IDENTITY column
SET IDENTITY_INSERT WetlandTemp ON
--===== Insert the test data into the test table
INSERT INTO WetlandTemp
(SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate, sort_id)
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1136'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1137'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1138'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1139'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1140'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1147'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1148'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1149'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1150'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1151'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1152'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1153'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1154'UNION ALL
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1260'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1261'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1262'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1263'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1264'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1265'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1266'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1267'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1268'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1269'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1270'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1271'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1272'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1273'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1274'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1275'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1276'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1277'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1278'
--===== Set the identity insert back to normal
SET IDENTITY_INSERT WetlandTemp OFF
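For reference, here's a quick check against that sample, assuming a duplicate means every column except sort_id matches. On this sample it should surface the double-loaded groups, each with a count of 2 (the second copy's sort_id landing in the 1260-1278 range):

```sql
-- Surface groups that were loaded more than once (all columns except
-- the identity participate in the grouping).
SELECT SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate,
       COUNT(*) AS copies,
       MIN(sort_id) AS first_sort_id,
       MAX(sort_id) AS last_sort_id
FROM WetlandTemp
GROUP BY SiteID, NumOfVisit, PYear, PSID, Szclass, PCount, DataEntryDate
HAVING COUNT(*) > 1
```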
December 7, 2012 at 10:23 am
This should be easily deduped using that script in the link I provided - you will just need to customize it for your table.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 10:48 am
Tried running the code in your link (I just copied and pasted it) for practice before using my own data, and got the following error messages:
Msg 156, Level 15, State 1, Line 16
Incorrect syntax near the keyword 'WITH'.
Msg 195, Level 15, State 10, Line 18
'ROW_NUMBER' is not a recognized function name.
Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword 'WITH'.
Msg 195, Level 15, State 10, Line 26
'ROW_NUMBER' is not a recognized function name.
I'm running SQL Server 2005 on my machine, but the server I've been working with is still running SQL Server 2000. Not sure if that's the issue.
December 7, 2012 at 11:06 am
Nope, you can't run that code on SQL 2000. I figured it was SQL 2008 since a version wasn't noted and this is the 2008 forum.
It will take a different method - but it is still doable and will take a bit of time to work it out.
I'll post back if I get it working before somebody else.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 7, 2012 at 11:40 am
ryetimothy (12/7/2012)
Here's the CREATE TABLE script for the full table, the consumable data is below. I had to exclude some of the variables due to length restrictions for QUOTENAME (if I included every variable it kept spitting out null values). I also included the data for only one site for brevity. You can see that the rows for some of the species is duplicated, but that different species can also have the same DataEntryDate.
Ah hah, okay, thanks Jeff. You read his description differently than I did; I thought there were valid dupes as well as invalid. Apologies about that, Timothy.
Yes, this can be deduped, as the others have already pointed out. SQL 2k? Yeah, I can find a way to make this work. It will take a bit, though; I need to find a dev 2k machine, and I forgot where we stuck it... :blush:
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 7, 2012 at 11:55 am
It's my fault for using confusing syntax. I had a way of dealing with the duplicates in Access, but the presence of records with identical data entry dates was causing me most of my troubles. I was planning on creating a temporary table, making DataEntryDate the primary key, and then appending the old data to the new table, which would have kicked out the duplicates. Unfortunately, because the DataEntryDate value was the same for multiple rows (and not just those that were duplicates), it kicked out every record with the same DataEntryDate except one. On the plus side, my method would have gotten rid of the duplicates, but it would have gotten rid of other records as well.
Yeah, we're running some pretty old machines here. We just got a new server a few months ago (with an up to date version of SQL Server installed) and over the next month or two I'll be playing with migrating our data over to the new server. Hopefully that should solve some of our data and connectivity problems.
December 7, 2012 at 12:12 pm
ryetimothy (12/7/2012)
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1265'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1266'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1267'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1268'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1269'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1270'UNION ALL
...
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1274'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1275'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1276'UNION ALL
So, which rows do you want to eliminate and keep for each of those two species and why?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 7, 2012 at 12:27 pm
It looks like there are 38 rows of data in the table, but there should only be 19. Rows with sort_id's from 1136 to 1154 are one group and were the first to be appended and they're listed below:
SELECT '154304W','3','2009','Diospyros virginiana','A','1','Aug 10 2009 10:06AM','1136'UNION ALL
SELECT '154304W','3','2009','Celtis occidentalis','C','1','Aug 10 2009 10:07AM','1137'UNION ALL
SELECT '154304W','3','2009','Ailanthus altissima','B','1','Aug 10 2009 10:08AM','1138'UNION ALL
SELECT '154304W','3','2009','Acer negundo','A','13','Aug 10 2009 10:08AM','1139'UNION ALL
SELECT '154304W','3','2009','Acer negundo','B','5','Aug 10 2009 10:08AM','1140'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
SELECT '154304W','3','2009','Robinia pseudoacacia','E','1','Aug 10 2009 10:13AM','1147'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','C','2','Aug 10 2009 10:15AM','1148'UNION ALL
SELECT '154304W','3','2009','Fraxinus pennsylvanica','D','1','Aug 10 2009 10:15AM','1149'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','B','2','Aug 10 2009 10:15AM','1150'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','C','3','Aug 10 2009 10:15AM','1151'UNION ALL
SELECT '154304W','3','2009','Quercus palustris','D','2','Aug 10 2009 10:15AM','1152'UNION ALL
SELECT '154304W','3','2009','Ulmus americana','A','2','Aug 10 2009 10:15AM','1153'UNION ALL
SELECT '154304W','3','2009','Quercus imbricaria','A','1','Aug 10 2009 10:15AM','1154'
The other rows have a different sequence of sort_ids and were appended at a later date, so they're the duplicates and should go. Essentially, this is tree data: we record trees in different size classes (Szclass) and count the number of trees of each species in each size class (PCount). So, for example, Acer saccharinum has 6 rows of data because we found trees in 6 different size classes.
SELECT '154304W','3','2009','Acer saccharinum','A','4','Aug 10 2009 10:08AM','1141'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','B','6','Aug 10 2009 10:08AM','1142'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','C','4','Aug 10 2009 10:08AM','1143'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','D','1','Aug 10 2009 10:08AM','1144'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','E','5','Aug 10 2009 10:08AM','1145'UNION ALL
SELECT '154304W','3','2009','Acer saccharinum','F','4','Aug 10 2009 10:12AM','1146'UNION ALL
In size class A we found 4 individuals, there were 6 in size class B, 4 in size class C, etc. With the exception of size class F, all of these rows have the same data entry date, probably because when doing data entry, the person doing the data entry created one row of data and then copied and pasted and then changed anything that needed to be changed for each size class.
Edit: Anything with the same PSID AND Szclass AND PCount is a duplicate.
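Given that rule, a SQL 2000-friendly cleanup (no ROW_NUMBER required) can keep the lowest sort_id in each group and delete the rest. This is a sketch against the WetlandTemp sample above; I've included SiteID in the match on the assumption the rule applies within a site. Run the SELECT first to eyeball what would go:

```sql
-- Rows that would be deleted: everything except the lowest sort_id
-- in each (SiteID, PSID, Szclass, PCount) group.
SELECT t.*
FROM WetlandTemp AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM WetlandTemp AS t2
       WHERE t2.SiteID  = t.SiteID
         AND t2.PSID    = t.PSID
         AND t2.Szclass = t.Szclass
         AND t2.PCount  = t.PCount)

-- Once that looks right, swap SELECT for DELETE:
DELETE t
FROM WetlandTemp AS t
WHERE t.sort_id >
      (SELECT MIN(t2.sort_id)
       FROM WetlandTemp AS t2
       WHERE t2.SiteID  = t.SiteID
         AND t2.PSID    = t.PSID
         AND t2.Szclass = t.Szclass
         AND t2.PCount  = t.PCount)
```

The correlated MIN() subquery works on SQL 2000 and on the newer versions alike; on the real table, wrap the DELETE in a transaction and check @@ROWCOUNT before committing.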