Creating Primary Column(Key) & Remove Duplicates

  • Please advise, how do i create a primary key column when there is none(not even a column with unique values) and remove duplicates the same time?

    Original sample data source is attached.

     

     

    USE [TMI2]
    GO

    /****** Object: Table [dbo].[Mapping] Script Date: 2022/03/02 20:01:40 ******/SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    CREATE TABLE [dbo].[Mapping](
    [Link Name] [nvarchar](255) NULL,
    [Feed - Asset directory ID] [nvarchar](255) NULL,
    [Feed 2 - Reporting label] [float] NULL,
    [Media Type] [nvarchar](255) NULL,
    [Friendly Name] [nvarchar](255) NULL,
    [Friendly Name File Type] [nvarchar](255) NULL,
    [Friendly Name File Type - concat] [nvarchar](255) NULL,
    [File source (URL) ] [nvarchar](255) NULL,
    [Friendly Name - Concatenation] [nvarchar](255) NULL,
    [Audience] [nvarchar](255) NULL
    ) ON [PRIMARY]
    GO


     

     

    • This topic was modified 2 years, 10 months ago by  yrstruly.
    Attachments:
    You must be logged in to view attached files.
  • You don't.

    First you decide on the desired PK. Then you address any dupes. Then you issue an ALTER to set the PK.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • It currently looks like a "dupe" involves all of the columns that you posted.  Is that correct?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Adding a new column that's a Primary key isn't going to stop those duplicate rows anyway; to stop that you likely need to also implement a UNIQUE INDEX. As for deleting the rows, did using ROW_NUMBER and a CTE not work? Perhaps your PARTITION BY clause was incorrect?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • First i would like to able to have an unique identifier. Is the unique index the way to go?

  • yrstruly wrote:

    First i would like to able to have an unique identifier. Is the unique index the way to go?

    What do you hope to achieve by doing this? You cannot create a unique index on something which is not already unique. Either you resolve the 'duplicates' first, or you add one or more columns and populate them with something to create uniqueness – an IDENTITY or GUID, perhaps.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Honestly, adding a new column to be a primary key candidate (such as an a column with the IDENTITY property) is the easy part here. Cleaning up the duplicate rows where all the values (apart from the new duplicate key) are the same is quite simple with a CTE and ROW_NUMBER as well. The "harder" part will be finding out why your application is inserting rows that are exactly the same.

    Creating a UNIQUE INDEX or UNIQUE CONSTRAINT will cause such attempts to INSERT a duplicate row fail, causing an error to be sent to the application, but that may also have ramifications down stream; maybe your application doesn't handle the error well, maybe it needs to send the exact same row for a reason, maybe it's sending many rows at the same time and as a result you lose much more data than a single row.

    You really need to do some investigative work here as well to find the root cause of your duplications. Stopping the data being created at the database level is one piece of the puzzle, if the application is going to continue to send duplicated data.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • yrstruly wrote:

    First i would like to able to have an unique identifier. Is the unique index the way to go?

    Would you answer my question about the dupes, please?  And, if I'm not correct, then what columns would YOU use to identify what the dupes are?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Duplicates involves all columns. I would like to add an index column to identify each row.

  • yrstruly wrote:

    Duplicates involves all columns. I would like to add an index column to identify each row.

    Here are two duplicate rows: (x,y,z), (x,y,z)

    Now tell me what your 'index column' (whatever that is) would look like.

    • This reply was modified 2 years, 10 months ago by  Phil Parkin.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • yrstruly wrote:

    Duplicates involves all columns. I would like to add an index column to identify each row.

    If that's the case, then perhaps create a persisted computed consisting of a formula that would create a SHA1 (20 bytes) or SHA2-256 (32 bytes) hash using the HASHBYTES() function and CONCAT() to create a hash of the pipe delimited columns.  SHA1 has been deprecated as a security based function but it does a pretty good job for creating a single column for a unique constraint.  That would also save your Clustered Index for more important things.  It wouldn't be a good "Primary Key" but it's good as an alternate key to help prevent dupes.

    It's a shame that SQL Server doesn't have true "BEFORE" triggers like Oracle has to do dupe checks before any inserts or updates take place.  SQL Server does have "Instead of" triggers but those can be a royal pain.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.

  • yrstruly wrote:

    I changed the schema to allow NOT NULL in my matching columns. Still i get the same output.

    This wasn't a suggestion; changing columns that don't allow NULL values to allow them won't change the data inside the table already, or fix your duplication issues.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply