March 2, 2022 at 6:03 pm
Please advise, how do i create a primary key column when there is none(not even a column with unique values) and remove duplicates the same time?
Original sample data source is attached.
USE [TMI2]
GO
/****** Object: Table [dbo].[Mapping] Script Date: 2022/03/02 20:01:40 ******/SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Mapping](
[Link Name] [nvarchar](255) NULL,
[Feed - Asset directory ID] [nvarchar](255) NULL,
[Feed 2 - Reporting label] [float] NULL,
[Media Type] [nvarchar](255) NULL,
[Friendly Name] [nvarchar](255) NULL,
[Friendly Name File Type] [nvarchar](255) NULL,
[Friendly Name File Type - concat] [nvarchar](255) NULL,
[File source (URL) ] [nvarchar](255) NULL,
[Friendly Name - Concatenation] [nvarchar](255) NULL,
[Audience] [nvarchar](255) NULL
) ON [PRIMARY]
GO
March 2, 2022 at 7:00 pm
You don't.
First you decide on the desired PK. Then you address any dupes. Then you issue an ALTER to set the PK.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
March 2, 2022 at 7:33 pm
It currently looks like a "dupe" involves all of the columns that you posted. Is that correct?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2022 at 9:14 am
Adding a new column that's a Primary key isn't going to stop those duplicate rows anyway; to stop that you likely need to also implement a UNIQUE INDEX
. As for deleting the rows, did using ROW_NUMBER
and a CTE not work? Perhaps your PARTITION BY
clause was incorrect?
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
March 3, 2022 at 10:47 am
First i would like to able to have an unique identifier. Is the unique index the way to go?
March 3, 2022 at 11:23 am
First i would like to able to have an unique identifier. Is the unique index the way to go?
What do you hope to achieve by doing this? You cannot create a unique index on something which is not already unique. Either you resolve the 'duplicates' first, or you add one or more columns and populate them with something to create uniqueness – an IDENTITY or GUID, perhaps.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
March 3, 2022 at 11:45 am
Honestly, adding a new column to be a primary key candidate (such as an a column with the IDENTITY
property) is the easy part here. Cleaning up the duplicate rows where all the values (apart from the new duplicate key) are the same is quite simple with a CTE and ROW_NUMBER
as well. The "harder" part will be finding out why your application is inserting rows that are exactly the same.
Creating a UNIQUE INDEX
or UNIQUE CONSTRAINT
will cause such attempts to INSERT
a duplicate row fail, causing an error to be sent to the application, but that may also have ramifications down stream; maybe your application doesn't handle the error well, maybe it needs to send the exact same row for a reason, maybe it's sending many rows at the same time and as a result you lose much more data than a single row.
You really need to do some investigative work here as well to find the root cause of your duplications. Stopping the data being created at the database level is one piece of the puzzle, if the application is going to continue to send duplicated data.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
March 3, 2022 at 4:58 pm
First i would like to able to have an unique identifier. Is the unique index the way to go?
Would you answer my question about the dupes, please? And, if I'm not correct, then what columns would YOU use to identify what the dupes are?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 3, 2022 at 5:15 pm
Duplicates involves all columns. I would like to add an index column to identify each row.
March 3, 2022 at 5:27 pm
Duplicates involves all columns. I would like to add an index column to identify each row.
Here are two duplicate rows: (x,y,z), (x,y,z)
Now tell me what your 'index column' (whatever that is) would look like.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
March 3, 2022 at 6:24 pm
Duplicates involves all columns. I would like to add an index column to identify each row.
If that's the case, then perhaps create a persisted computed consisting of a formula that would create a SHA1 (20 bytes) or SHA2-256 (32 bytes) hash using the HASHBYTES() function and CONCAT() to create a hash of the pipe delimited columns. SHA1 has been deprecated as a security based function but it does a pretty good job for creating a single column for a unique constraint. That would also save your Clustered Index for more important things. It wouldn't be a good "Primary Key" but it's good as an alternate key to help prevent dupes.
It's a shame that SQL Server doesn't have true "BEFORE" triggers like Oracle has to do dupe checks before any inserts or updates take place. SQL Server does have "Instead of" triggers but those can be a royal pain.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2022 at 1:20 pm
I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.
March 4, 2022 at 1:42 pm
I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.
This wasn't a suggestion; changing columns that don't allow NULL
values to allow them won't change the data inside the table already, or fix your duplication issues.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply