How to bypass duplicate records ?

Question

Post reply

How to bypass duplicate records ?

isuckatsql

SSChampion

Points: 10381
More actions
July 2, 2009 at 8:18 am

#135861

Does anyone have any code to ignore duplicate records ?
I need to compare the current value of a field, with the next record value of the same field, and if they are the same, move to the next record, until they are not the same, all within the same table.
Just using 'Distinct' or 'Group By' within a Select statement is not giving me what i need !
I know duplicates are bad, but i dont want to delete my duplicate records just yet.
Thanks

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply

Dave Ballantyne SSC-Dedicated Points: 33667 More actions · Answer 1

How do you mean 'next' is this is a cursor ?

you could use row_number() (partition by .....) to detect the 'duplicates'

Clear Sky SQL
My Blog[/url]

isuckatsql SSChampion Points: 10381 More actions · Answer 2

What i mean is:

record1 - email, other fileds

record2 - email, other fields

record3 - email, other fields

If record2.email = record1.email NEXT

If record3.email = record1.email NEXT

etc ..........

Thanks

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

Could you explain in more detail what it is that you're trying to do?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jeffrey Williams SSC Guru Points: 90304 More actions · Answer 4

I link to an article in my signature that will show you how to post your question and get better answers faster. Please do yourself a favor and review that article and repost your question.

Your question does not make sense, since a table in SQL Server is unordered. The only way order is defined is by using the ORDER BY clause in a query.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

isuckatsql SSChampion Points: 10381 More actions · Answer 5

I have a bunch of resumes in my database, which include duplicates.

ResumeData Table

resume

email

What i am trying to do is create a DB VIEW of the same data, without the duplicates.

ResumeDate VIEW

resume

email

eg. I have 71 Livelink records

When i run Select * - i get 71 records

When i run Select distinct email - i get 35 records (which obviously mean i have 35 unique email addresses)

When i run Select distinct email, resume - i get 55 records (which is NOT what i want)

I want my view to have 35 emails and resumes.

Is this clearer ?

Thanks

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 6

From that I'd conclude you have more then one resume for some of the emails. In those cases, which resume do you want back?

Please post table definitions, sample data and desired output. Read this to see the best way to post this to get quick responses.

http://www.sqlservercentral.com/articles/Best+Practices/61537/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jeffrey Williams SSC Guru Points: 90304 More actions · Answer 7

Nope, not really clearer at all. How are you going to decide which combination of Resume/Email is the one you want? How do you know it is the right one?

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

isuckatsql SSChampion Points: 10381 More actions · Answer 8

CREATE TABLE resumein

(

ID3 INT IDENTITY(1,1) NOT NULL,

OriginalResume nvarchar(max) NULL,

Email1 nvarchar(1000) NULL,

CONSTRAINT PK_ID3 PRIMARY KEY CLUSTERED (ID3)

)

GO

MY data:

john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee

john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

Since i have 2 'john smith' records with the same email address, i would like my DB VIew to look like this.

john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

bob smith, bsmith@gmail.com, address, city, state, zip, livelink, j2ee

If you notice, one john smith resume has 'livelink, java' and another has 'livelink, sap', that is i want to seperate based on email alone !

better ?

Thanks

Jeffrey Williams SSC Guru Points: 90304 More actions · Answer 9

Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?

Are those separate columns? Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?

If separate columns, how are you going to decide which one you want? Livelink or mylink?

Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do you want?

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 10

Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?

Insert into resumein (OriginalResume, OriginalEmail)

Values ('java',' jsmith@gmail.com')

Insert into resumein (OriginalResume, OriginalEmail)

Values ('SAP',' jsmith@gmail.com')

Insert into resumein (OriginalResume, OriginalEmail)

Values ('j2ee',' bsmith@gmail.com')

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

isuckatsql SSChampion Points: 10381 More actions · Answer 11

Jeffrey Williams (7/2/2009)
Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?
I will grab the first record - in this case 'livelink, java'
Are those separate columns?
Yes 3 columns
ID
resume
email
Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?
same column - resume
Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do
you want?
I want EACH unique email address.
Thanks

isuckatsql SSChampion Points: 10381 More actions · Answer 12

GilaMonster (7/2/2009)
Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?
Insert into resumein (OriginalResume, OriginalEmail)
Values ('java',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('SAP',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('j2ee',' bsmith@gmail.com')

resumerecord1.......john smith, jsmith@gmail.com, address, city, state, zip, livelink, java

resumerecord2 ......bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee

resumerecord3......john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

Imagine you are a Livelink candidate applying for a Livelink position.

You would send me your resume, typically many pages.

I receive your resume, via email, convert to a text file and import into my SQL DB, using SSIS while parsing out your email address.

This gives me:

Your Email address

Your Resume

When i get future Livelink positons that you may be interested in, i search my resume DB for Livelink candidates using Contains "Livelink" since it is FTI.

Your email would pop up and i could send you the job description from the email address.

Since the database has duplicates, i dont want to send you 2 emails and make you mad !

Better ?

Thanks

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 13

ifila (7/2/2009)
I will grab the first record - in this case 'livelink, java'

The one with the lowest ID?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jeffrey Williams SSC Guru Points: 90304 More actions · Answer 14

Okay, that really doesn't help much. In your previous post you stated this:

MY data:
john smith, jsmith@gmail.com, address, city, state, zip, livelink, java
bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee
john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap

This does not match up with your table definition that contains 3 columns. Based upon your last post, and those three columns - you can try the following:

;WITH cteDUPS (ID, Email, Resume, rn)

AS (SELECT ID3, Email1, OriginalResume, row_number() OVER(PARTITION BY Email1 ORDER BY ID) AS rn FROM dbo.ResumeIN)

SELECT * FROM cteDUPS WHERE rn = 1;

Just a guess on what you have supplied so far.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs