Creating Primary Column(Key) & Remove Duplicates

Question

Creating Primary Column(Key) & Remove Duplicates

Hall of Fame

Points: 3475

March 2, 2022 at 6:03 pm

Please advise, how do i create a primary key column when there is none(not even a column with unique values) and remove duplicates the same time?

Original sample data source is attached.

USE [TMI2]
GO

/****** Object:  Table [dbo].[Mapping]    Script Date: 2022/03/02 20:01:40 ******/SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Mapping](
[Link Name] [nvarchar](255) NULL,
[Feed - Asset directory ID] [nvarchar](255) NULL,
[Feed 2 - Reporting label] [float] NULL,
[Media Type] [nvarchar](255) NULL,
[Friendly Name] [nvarchar](255) NULL,
[Friendly Name File Type] [nvarchar](255) NULL,
[Friendly Name File Type - concat] [nvarchar](255) NULL,
[File source (URL)  ] [nvarchar](255) NULL,
[Friendly Name - Concatenation] [nvarchar](255) NULL,
[Audience] [nvarchar](255) NULL
) ON [PRIMARY]
GO

This topic was modified 2 years, 8 months ago by yrstruly.

Attachments:

You must be logged in to view attached files.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 246962 More actions · Answer 1

You don't.

First you decide on the desired PK. Then you address any dupes. Then you issue an ALTER to set the PK.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 2

It currently looks like a "dupe" involves all of the columns that you posted. Is that correct?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Thom A SSC Guru Points: 99130 More actions · Answer 3

Adding a new column that's a Primary key isn't going to stop those duplicate rows anyway; to stop that you likely need to also implement a UNIQUE INDEX. As for deleting the rows, did using ROW_NUMBER and a CTE not work? Perhaps your PARTITION BY clause was incorrect?

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

yrstruly Hall of Fame Points: 3475 More actions · Answer 4

First i would like to able to have an unique identifier. Is the unique index the way to go?

Phil Parkin SSC Guru Points: 246962 More actions · Answer 5

yrstruly wrote:

First i would like to able to have an unique identifier. Is the unique index the way to go?

What do you hope to achieve by doing this? You cannot create a unique index on something which is not already unique. Either you resolve the 'duplicates' first, or you add one or more columns and populate them with something to create uniqueness – an IDENTITY or GUID, perhaps.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

Thom A SSC Guru Points: 99130 More actions · Answer 6

Honestly, adding a new column to be a primary key candidate (such as an a column with the IDENTITY property) is the easy part here. Cleaning up the duplicate rows where all the values (apart from the new duplicate key) are the same is quite simple with a CTE and ROW_NUMBER as well. The "harder" part will be finding out why your application is inserting rows that are exactly the same.

Creating a UNIQUE INDEX or UNIQUE CONSTRAINT will cause such attempts to INSERT a duplicate row fail, causing an error to be sent to the application, but that may also have ramifications down stream; maybe your application doesn't handle the error well, maybe it needs to send the exact same row for a reason, maybe it's sending many rows at the same time and as a result you lose much more data than a single row.

You really need to do some investigative work here as well to find the root cause of your duplications. Stopping the data being created at the database level is one piece of the puzzle, if the application is going to continue to send duplicated data.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 7

yrstruly wrote:

First i would like to able to have an unique identifier. Is the unique index the way to go?

Would you answer my question about the dupes, please? And, if I'm not correct, then what columns would YOU use to identify what the dupes are?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

yrstruly Hall of Fame Points: 3475 More actions · Answer 8

Duplicates involves all columns. I would like to add an index column to identify each row.

Phil Parkin SSC Guru Points: 246962 More actions · Answer 9

yrstruly wrote:

Duplicates involves all columns. I would like to add an index column to identify each row.

Here are two duplicate rows: (x,y,z), (x,y,z)

Now tell me what your 'index column' (whatever that is) would look like.

This reply was modified 2 years, 8 months ago by Phil Parkin.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 10

yrstruly wrote:

Duplicates involves all columns. I would like to add an index column to identify each row.

If that's the case, then perhaps create a persisted computed consisting of a formula that would create a SHA1 (20 bytes) or SHA2-256 (32 bytes) hash using the HASHBYTES() function and CONCAT() to create a hash of the pipe delimited columns. SHA1 has been deprecated as a security based function but it does a pretty good job for creating a single column for a unique constraint. That would also save your Clustered Index for more important things. It wouldn't be a good "Primary Key" but it's good as an alternate key to help prevent dupes.

It's a shame that SQL Server doesn't have true "BEFORE" triggers like Oracle has to do dupe checks before any inserts or updates take place. SQL Server does have "Instead of" triggers but those can be a royal pain.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

yrstruly Hall of Fame Points: 3475 More actions · Answer 11

I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.

Thom A SSC Guru Points: 99130 More actions · Answer 12

yrstruly wrote:

I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.

This wasn't a suggestion; changing columns that don't allow NULL values to allow them won't change the data inside the table already, or fix your duplication issues.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk