How to bypass duplicate records ?

  • Does anyone have any code to ignore duplicate records ?

    I need to compare the current value of a field, with the next record value of the same field, and if they are the same, move to the next record, until they are not the same, all within the same table.

    Just using 'Distinct' or 'Group By' within a Select statement is not giving me what i need !

    I know duplicates are bad, but i dont want to delete my duplicate records just yet.

    Thanks

  • How do you mean 'next' is this is a cursor ?

    you could use row_number() (partition by .....) to detect the 'duplicates'



    Clear Sky SQL
    My Blog[/url]

  • What i mean is:

    record1 - email, other fileds

    record2 - email, other fields

    record3 - email, other fields

    If record2.email = record1.email NEXT

    If record3.email = record1.email NEXT

    etc ..........

    Thanks

  • Could you explain in more detail what it is that you're trying to do?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I link to an article in my signature that will show you how to post your question and get better answers faster. Please do yourself a favor and review that article and repost your question.

    Your question does not make sense, since a table in SQL Server is unordered. The only way order is defined is by using the ORDER BY clause in a query.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • I have a bunch of resumes in my database, which include duplicates.

    ResumeData Table

    resume

    email

    What i am trying to do is create a DB VIEW of the same data, without the duplicates.

    ResumeDate VIEW

    resume

    email

    eg. I have 71 Livelink records

    When i run Select * - i get 71 records

    When i run Select distinct email - i get 35 records (which obviously mean i have 35 unique email addresses)

    When i run Select distinct email, resume - i get 55 records (which is NOT what i want)

    I want my view to have 35 emails and resumes.

    Is this clearer ?

    Thanks

  • From that I'd conclude you have more then one resume for some of the emails. In those cases, which resume do you want back?

    Please post table definitions, sample data and desired output. Read this to see the best way to post this to get quick responses.

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Nope, not really clearer at all. How are you going to decide which combination of Resume/Email is the one you want? How do you know it is the right one?

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • CREATE TABLE resumein

    (

    ID3 INT IDENTITY(1,1) NOT NULL,

    OriginalResume nvarchar(max) NULL,

    Email1 nvarchar(1000) NULL,

    CONSTRAINT PK_ID3 PRIMARY KEY CLUSTERED (ID3)

    )

    GO

    MY data:

    john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

    bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee

    john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

    Since i have 2 'john smith' records with the same email address, i would like my DB VIew to look like this.

    john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

    bob smith, bsmith@gmail.com, address, city, state, zip, livelink, j2ee

    If you notice, one john smith resume has 'livelink, java' and another has 'livelink, sap', that is i want to seperate based on email alone !

    better ?

    Thanks

  • Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?

    Are those separate columns? Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?

    If separate columns, how are you going to decide which one you want? Livelink or mylink?

    Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do you want?

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('java',' jsmith@gmail.com')

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('SAP',' jsmith@gmail.com')

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('j2ee',' bsmith@gmail.com')

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jeffrey Williams (7/2/2009)


    Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?

    I will grab the first record - in this case 'livelink, java'

    Are those separate columns?

    Yes 3 columns

    ID

    resume

    email

    Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?

    same column - resume

    Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do

    you want?

    I want EACH unique email address.

    Thanks

  • GilaMonster (7/2/2009)


    Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('java',' jsmith@gmail.com')

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('SAP',' jsmith@gmail.com')

    Insert into resumein (OriginalResume, OriginalEmail)

    Values ('j2ee',' bsmith@gmail.com')

    resumerecord1.......john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

    resumerecord2 ......bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee

    resumerecord3......john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

    Imagine you are a Livelink candidate applying for a Livelink position.

    You would send me your resume, typically many pages.

    I receive your resume, via email, convert to a text file and import into my SQL DB, using SSIS while parsing out your email address.

    This gives me:

    Your Email address

    Your Resume

    When i get future Livelink positons that you may be interested in, i search my resume DB for Livelink candidates using Contains "Livelink" since it is FTI.

    Your email would pop up and i could send you the job description from the email address.

    Since the database has duplicates, i dont want to send you 2 emails and make you mad !

    Better ?

    Thanks

  • ifila (7/2/2009)


    I will grab the first record - in this case 'livelink, java'

    The one with the lowest ID?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Okay, that really doesn't help much. In your previous post you stated this:

    MY data:

    john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

    bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee

    john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

    This does not match up with your table definition that contains 3 columns. Based upon your last post, and those three columns - you can try the following:

    ;WITH cteDUPS (ID, Email, Resume, rn)

    AS (SELECT ID3, Email1, OriginalResume, row_number() OVER(PARTITION BY Email1 ORDER BY ID) AS rn FROM dbo.ResumeIN)

    SELECT * FROM cteDUPS WHERE rn = 1;

    Just a guess on what you have supplied so far.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply