July 17, 2008 at 12:11 pm
karthikeyan (7/17/2008)
What size column do you think NAME = '' will make?
Jeff,
Thanks for highlighting this area.
I have executed my previous code.
select distinct ID, Name =''
into #t1
from T
update #t1
set #t1.Name = T.Name
from T inner join #t1
on #t1.ID =T.ID
I got the below output:
ID Name
----------- ----
1 J
2 M
3 K
When I executed your code, I got two errors... the first one was about not being able to create a zero length column because of the Name=''.
The second one was because there was no column alias on the SET NAME = NAME statement.
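For reference, one way to get past both errors (just a sketch - the varchar(50) width is an arbitrary choice for illustration; the performance test later in this thread uses the same CONVERT trick):
--===== Give the placeholder column an explicit type and length so SELECT/INTO
     -- doesn't try to build a zero-length column
 SELECT DISTINCT ID, Name = CONVERT(VARCHAR(50),'')
   INTO #t1
   FROM T

--===== Qualify the column being updated and the source column so neither is ambiguous
 UPDATE #t1
    SET Name = t.Name
   FROM #t1
  INNER JOIN T t
     ON #t1.ID = t.ID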
--Jeff Moden
Change is inevitable... Change for the better is not.
July 18, 2008 at 4:30 am
select id,name from tablename
where id in (select id from tablename group by id having count(id)>1)
OR
select id, min(name) as name from tablename
group by id having count(id) > 1
OR (when ALL columns contain the same values across rows)
select * from tablename group by
col1,col2.....coln having count(*) > 1
Charu :)
July 18, 2008 at 7:59 am
karthikeyan (7/17/2008)
Thanks Chris! I am not trying to offend you; I am asking this question politely.
Jeff's Sample code:
--===== Right way to do a "Joined" update
UPDATE TableA
SET SomeTableAColumn = b.SomeTableBColumn
FROM TableB b
INNER JOIN TableA a
ON a.SomeOtherColumn = b.SomeOtherColumn
My query:
update #t1
set #t1.Name = T.Name
from T
inner join #t1
on #t1.ID =T.ID
Your code:
update #t1
set #t1.Name = T.Name
from #t1 inner join T
on #t1.ID =T.ID
Parse and Compile Time 0.
SQL Server cpu time: 0 ms.
Execution Time 0.
SQL Server cpu time: 0 ms. SQL Server elapsed time: 13 ms.
(3 rows affected)
I think you have changed the table order in the FROM clause.
Again I am asking politely: which query will perform better?
All 3 of these are fine... it's an inner join...
--Jeff Moden
Change is inevitable... Change for the better is not.
July 18, 2008 at 8:03 am
karthikeyan (7/17/2008)
I have executed all three methods more than 10 times, but I got the same result every time.
1) update #t1
set #t1.Name = T.Name
from T , #t1
where #t1.ID =T.ID
2) update #t1
set #t1.Name = T.Name
from T inner join #t1
on #t1.ID =T.ID
3) update #t1
set #t1.Name = T.Name
from #t1 inner join T
on #t1.ID =T.ID
ID Name
----------- ----
1 John
2 Mary
3 Kumar
How ?
They're all forms of an inner join... they make no difference to the server. The execution plan for all 3 will be identical. That's how.
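If anyone wants to verify that, one way (just a sketch, using the temp-table example from above) is to ask for the plan text and compare the three variants:
--===== Returns the estimated plan as text instead of running the statement.
     -- Run once for each of the 3 UPDATE variants and compare the plans.
    SET SHOWPLAN_TEXT ON
GO
 UPDATE #t1
    SET Name = T.Name
   FROM T, #t1
  WHERE #t1.ID = T.ID
GO
    SET SHOWPLAN_TEXT OFF
GO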
--Jeff Moden
Change is inevitable... Change for the better is not.
July 18, 2008 at 8:20 am
I've got three pennies in my pocket. I've counted them more than ten times now, facing different directions, but each time I get the same result. Three pennies.
I'm baffled.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 18, 2008 at 9:04 am
I noticed that we didn't really analyse all the solutions to see which performs best
(we just looked at the solutions that all returned the same result).
To recall from the thread [for the first part of the request]:
1) We have a solution with a temp table
2) We have a solution with a select id, max(value) from <>
..
There may have been other solutions, but I haven't yet seen a comparison between the temp table and the GROUP BY solutions.
As I mentioned earlier, mileage probably varies based on indexing, table size, and whether the cache is reset; performance testing is an art.
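For what it's worth, the basic measuring pattern is simple enough (a sketch - clearing the caches like this should only ever be done on a test box):
--===== Start each run from a cold cache so the timings are comparable (test servers only!)
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

--===== Report I/O and timings for whatever runs next
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

--===== ...query under test goes here...

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF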
July 18, 2008 at 5:38 pm
d_sysuk (7/16/2008)
Okay so the posters had part I published several times, with various answers... using min/max, a subquery, a temp-table. The perfm. analysis can be made easily by Sam/others; just run them across (expect the group by will be higher perf than temp tables, subquery join - but that's just an untested hypothesis - also depends on indexing of the subtable etc... [e.g. ID - clustered, value - non-clustered, or would a covering index be better here.. etc, I'm sure it will all affect the results])
Anyone up for programming PART II of the request from Sam?
I certainly know how to do it, want to see who comes up with a nice solution [Mine would involve two loops].
[SQL Server 2005 solution is neat and tidy using rank function].
I already posted a link to a couple of possible solutions... if you'd like to post your double loop solution, I'll give two of the other solutions that work in SQL Server 2000 a shot and post them along with performance testing of the 3 methods...
I also posted a million row data generator for this earlier hoping that Karthik would post some test code to answer his own earlier performance question. I'll use that as the basis for testing.
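For reference, one SQL Server 2000-friendly way to rank within each ID (a sketch only, not one of the solutions promised above; the sample rows are the ones used elsewhere in this thread, and with duplicate values per ID the tie handling differs slightly from RANK()) is a correlated COUNT:
--===== Sample data (same rows as the derived table used elsewhere in this thread)
 CREATE TABLE #data (id VARCHAR(10), value VARCHAR(10))
 INSERT INTO #data VALUES ('1','br')
 INSERT INTO #data VALUES ('1','er')
 INSERT INTO #data VALUES ('2','rt')
 INSERT INTO #data VALUES ('2','uj')
 INSERT INTO #data VALUES ('3','iu')
 INSERT INTO #data VALUES ('3','ol')

--===== Rank each value within its id without the 2005 ranking functions:
     -- count how many values in the same id sort at or before this one
 SELECT d.id,
        d.value,
        (SELECT COUNT(*)
           FROM #data d2
          WHERE d2.id = d.id
            AND d2.value <= d.value) AS rnk
   FROM #data d
  ORDER BY d.id, rnk

 DROP TABLE #data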
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2008 at 1:53 am
Well, I posted a SQL 2005 solution to the question (which seems to suit the poster fine, as this really is a SQL 2005 question, not SQL 7/2000 - yes, posted in the wrong forum). I cannot see the point in posting a harder second-part solution when SQL 2005 is the target platform here.. shame I saw nobody else solve Part II!
Anyway, from my earlier posts - here it is, consolidated:
PART I
select id, min(value) value
from
(select '1' id,'br' value
union
select '1','er'
union
select '2','rt'
union
select '2','uj'
union
select '3','iu'
union
select '3','ol') data
group by id
PART II:
select distinct id,value, rank() over (partition by id order by value)
from
(select '1' id,'br' value
union
select '1','er'
union
select '2','rt'
union
select '2','uj'
union
select '3','iu'
union
select '3','ol') data
Output:
id value (No column name)
1 br 1
1 er 2
2 rt 1
2 uj 2
3 iu 1
3 ol 2
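As a side note (an alternative sketch, not part of d_sysuk's post): if repeated values within an id should each still get their own number, ROW_NUMBER() avoids the need for DISTINCT:
select id, value,
       row_number() over (partition by id order by value) seq
from
(select '1' id,'br' value
union all
select '1','er'
union all
select '2','rt'
union all
select '2','uj'
union all
select '3','iu'
union all
select '3','ol') data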
July 21, 2008 at 6:05 am
d_sysuk (7/21/2008)
shame I saw nobody else solve Part II !
No need to gloat... No sense in someone else solving it... you already did it! :)
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2008 at 7:05 am
Performed a quick perf. analysis, as suspected.. here are the results:
Query 1: [K's]
select distinct ID, Name = CONVERT(varchar(50),'')
into #t1
from T1
update #t1
set #t1.Name = T1.Name
from T1 inner join #t1
on #t1.ID =T1.ID
select * from #t1
-------------------------------------------------------
Query 2:
select id, min(name) value
from
T1
group by id
--------------------------------------------------------
Query 3:
select id, max(name) value
from
T1
group by id
--------------------------------------------------------
All results are on T1 (1 million row record set; clustered index on id (asc), name (asc); max value of ID: 10,000; names unique; ratio of duplicate IDs: 100:1).
Swapping the clustered PK to id (asc), name (desc), the results for queries 2 and 3 swap - showing that the index determines which operator (MIN or MAX) should be chosen - approx. 13% difference [test results not included here].
All runs had set statistics io/ time on.
All runs had the following run before each query was tested to reset the cache (we reset the cache for each query, and ran each query individually):
DBCC DROPCLEANBUFFERS
dbcc FREEPROCCACHE
dbcc FREESYSTEMCACHE ( 'ALL' )
dbcc FREESESSIONCACHE
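For anyone who wants to reproduce the test: the exact generator isn't shown here, but a T1 with the same shape can be knocked up along these lines (a sketch only - the cross join just manufactures a million sequential numbers):
--===== Manufacture 1,000,000 sequential numbers
 SELECT TOP 1000000
        IDENTITY(INT,1,1) AS RowNum
   INTO #Stage
   FROM master.dbo.syscolumns sc1
  CROSS JOIN master.dbo.syscolumns sc2

--===== IDs 1-10,000 repeated ~100 times each; names padded so they're unique and sort cleanly
 SELECT ((RowNum - 1) % 10000) + 1 AS id,
        'Name' + RIGHT('0000000' + CAST(RowNum AS VARCHAR(10)), 7) AS name
   INTO T1
   FROM #Stage

 CREATE CLUSTERED INDEX IX_T1_id_name ON T1 (id ASC, name ASC)

 DROP TABLE #Stage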
Query 1: [Using temp table]
-- First Part: Select distinct part:
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 611 ms.
Table 'T1'. Scan count 1, logical reads 3190, physical reads 1, read-ahead reads 3185, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 297 ms, elapsed time = 1069 ms.
-- Second Part: Update part:
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 68 ms.
Table '#t1_________________________________________________________________________________________________________________0000000000FF'. Scan count 1, logical reads 27068, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'T1'. Scan count 1, logical reads 3190, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 985 ms, elapsed time = 1589 ms.
--Third Part: Display output:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#t1_________________________________________________________________________________________________________________0000000000FF'. Scan count 1, logical reads 8583, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 138 ms.
Query 2: [min version]
Table 'T1'. Scan count 1, logical reads 3190, physical reads 1, read-ahead reads 3185, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 422 ms, elapsed time = 980 ms.
Query 3: [max version]
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 163 ms.
Table 'T1'. Scan count 1, logical reads 3190, physical reads 1, read-ahead reads 3185, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 469 ms, elapsed time = 1231 ms.
July 30, 2008 at 12:21 am
d_sysuk (7/16/2008)
--------------------------------------------------------------------------------
Okay so the posters had part I published several times, with various answers... using min/max, a subquery, a temp-table.
The perfm. analysis can be made easily by Sam/others; just run them across (expect the group by will be higher perf than temp tables, subquery join - but that's just an untested hypothesis - also depends on indexing of the subtable etc... [e.g. ID - clustered, value - non-clustered, or would a covering index be better here.. etc, I'm sure it will all affect the results])
Anyone up for programming PART II of the request from Sam?
I certainly know how to do it, want to see who comes up with a nice solution [Mine would involve two loops].
[SQL Server 2005 solution is neat and tidy using rank function].
I already posted a link to a couple of possible solutions... if you'd like to post your double loop solution, I'll give two of the other solutions that work in SQL Server 2000 a shot and post them along with performance testing of the 3 methods...
I also posted a million row data generator for this earlier hoping that Karthik would post some test code to answer his own earlier performance question. I'll use that as the basis for testing.
Jeff,
Sorry! I was a little busy last week. I will come up with test code within a couple of days.
karthik
July 30, 2008 at 6:39 am
Use DISTINCT..!
July 30, 2008 at 5:37 pm
kannan (7/30/2008)
Use DISTINCT..!
Sure... let's see your code example... :)
--Jeff Moden
Change is inevitable... Change for the better is not.