July 2, 2009 at 8:18 am
Does anyone have any code to ignore duplicate records ?
I need to compare the current value of a field, with the next record value of the same field, and if they are the same, move to the next record, until they are not the same, all within the same table.
Just using 'Distinct' or 'Group By' within a Select statement is not giving me what i need !
I know duplicates are bad, but i dont want to delete my duplicate records just yet.
Thanks
July 2, 2009 at 8:23 am
How do you mean 'next' is this is a cursor ?
you could use row_number() (partition by .....) to detect the 'duplicates'
July 2, 2009 at 8:40 am
What i mean is:
record1 - email, other fileds
record2 - email, other fields
record3 - email, other fields
If record2.email = record1.email NEXT
If record3.email = record1.email NEXT
etc ..........
Thanks
July 2, 2009 at 9:05 am
Could you explain in more detail what it is that you're trying to do?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 2, 2009 at 9:20 am
I link to an article in my signature that will show you how to post your question and get better answers faster. Please do yourself a favor and review that article and repost your question.
Your question does not make sense, since a table in SQL Server is unordered. The only way order is defined is by using the ORDER BY clause in a query.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
July 2, 2009 at 9:21 am
I have a bunch of resumes in my database, which include duplicates.
ResumeData Table
resume
What i am trying to do is create a DB VIEW of the same data, without the duplicates.
ResumeDate VIEW
resume
eg. I have 71 Livelink records
When i run Select * - i get 71 records
When i run Select distinct email - i get 35 records (which obviously mean i have 35 unique email addresses)
When i run Select distinct email, resume - i get 55 records (which is NOT what i want)
I want my view to have 35 emails and resumes.
Is this clearer ?
Thanks
July 2, 2009 at 9:30 am
From that I'd conclude you have more then one resume for some of the emails. In those cases, which resume do you want back?
Please post table definitions, sample data and desired output. Read this to see the best way to post this to get quick responses.
http://www.sqlservercentral.com/articles/Best+Practices/61537/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 2, 2009 at 9:33 am
Nope, not really clearer at all. How are you going to decide which combination of Resume/Email is the one you want? How do you know it is the right one?
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
July 2, 2009 at 9:52 am
CREATE TABLE resumein
(
ID3 INT IDENTITY(1,1) NOT NULL,
OriginalResume nvarchar(max) NULL,
Email1 nvarchar(1000) NULL,
CONSTRAINT PK_ID3 PRIMARY KEY CLUSTERED (ID3)
)
GO
MY data:
john smith, jsmith@gmail.com, address, city, state, zip, livelink, java
bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee
john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap
Since i have 2 'john smith' records with the same email address, i would like my DB VIew to look like this.
john smith, jsmith@gmail.com, address, city, state, zip, livelink, java
bob smith, bsmith@gmail.com, address, city, state, zip, livelink, j2ee
If you notice, one john smith resume has 'livelink, java' and another has 'livelink, sap', that is i want to seperate based on email alone !
better ?
Thanks
July 2, 2009 at 10:06 am
Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?
Are those separate columns? Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?
If separate columns, how are you going to decide which one you want? Livelink or mylink?
Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do you want?
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
July 2, 2009 at 10:13 am
Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?
Insert into resumein (OriginalResume, OriginalEmail)
Values ('java',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('SAP',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('j2ee',' bsmith@gmail.com')
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 2, 2009 at 10:30 am
Jeffrey Williams (7/2/2009)
Better - but still not quite there yet. How do you decide which row you are going to return? How are you deciding that you want 'livelink, java' as opposed to 'livelink, sap'?I will grab the first record - in this case 'livelink, java'
Are those separate columns?
Yes 3 columns
ID
resume
Such that you could have a row with 'mylink' and 'java' - or would that be in the same column as 'mylink, java'?
same column - resume
Now, what about the email column? What happens if you have a row for John Smith with a different email address? Which one do
you want?
I want EACH unique email address.
Thanks
July 2, 2009 at 10:40 am
GilaMonster (7/2/2009)
Your sample data doesn't fit in the table specified. Do you mean this as sample data, or should the table have more columns?
Insert into resumein (OriginalResume, OriginalEmail)
Values ('java',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('SAP',' jsmith@gmail.com')
Insert into resumein (OriginalResume, OriginalEmail)
Values ('j2ee',' bsmith@gmail.com')
resumerecord1.......john smith, jsmith@gmail.com, address, city, state, zip, livelink, java
resumerecord2 ......bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee
resumerecord3......john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap
Imagine you are a Livelink candidate applying for a Livelink position.
You would send me your resume, typically many pages.
I receive your resume, via email, convert to a text file and import into my SQL DB, using SSIS while parsing out your email address.
This gives me:
Your Email address
Your Resume
When i get future Livelink positons that you may be interested in, i search my resume DB for Livelink candidates using Contains "Livelink" since it is FTI.
Your email would pop up and i could send you the job description from the email address.
Since the database has duplicates, i dont want to send you 2 emails and make you mad !
Better ?
Thanks
July 2, 2009 at 10:43 am
ifila (7/2/2009)
I will grab the first record - in this case 'livelink, java'
The one with the lowest ID?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 2, 2009 at 10:45 am
Okay, that really doesn't help much. In your previous post you stated this:
MY data:
john smith, jsmith@gmail.com, address, city, state, zip, livelink, java
bob smith, bsmith@gmal.com, address, city, state, zip, livelink, j2ee
john smith, jsmith@gmail.com, address, city, state, zip, livelink, sap
This does not match up with your table definition that contains 3 columns. Based upon your last post, and those three columns - you can try the following:
;WITH cteDUPS (ID, Email, Resume, rn)
AS (SELECT ID3, Email1, OriginalResume, row_number() OVER(PARTITION BY Email1 ORDER BY ID) AS rn FROM dbo.ResumeIN)
SELECT * FROM cteDUPS WHERE rn = 1;
Just a guess on what you have supplied so far.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
Viewing 15 posts - 1 through 15 (of 26 total)
You must be logged in to reply to this topic. Login to reply