October 31, 2009 at 2:44 pm
Aaron Cabrele (10/31/2009)
It says no additional identity column so that doesn't outrule the hidden one that already exists... As for adding a guid it depends on how strictly you say what is an identity column - is that simply a column with the identity property, therefor any other column like uniqueidentifier can be added?...!I love the way, once most or all possibilities are exhausted, we try and re-interpret the question, bend
the rules...
LOL. Well, the objective is to find the simplest solution within the rules given. As with any game or puzzle, a large part of the solution is knowing what the boundaries really are.
October 31, 2009 at 9:07 pm
Barry, someone sent me this link again, although I'd read it the last time around, and your description of the problem got me to reminiscing. It takes me back to the days of fixed file extents on disk. Probably the primary reason why you had to do an update in place was that the rest of the disk was spoken for. If things were really tight though, you might even have written a loop to de-dupe one set of identical rows at a time. There were so many hardware limitations to work around, and I'd all but forgotten them. Thanks for taking me back to the "bad old days".
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
November 1, 2009 at 5:20 pm
Maybe I'm missing something, but the query the original poster was looking for is simple with the appropriate GROUP BY and HAVING clauses.
SELECT Name, Age, Sex
FROM Source
GROUP BY Name, Age, Sex
HAVING COUNT(*) > 1
This ensures that the output from the query includes only a row for duplicate rows. Note no row will be output for a unique row.
November 2, 2009 at 3:08 am
This would select (or delete) everything that is duplicated. The requirement was to delete only the first occurrence of a duplicate, or the only occurrence of any unique rows.
The "first" is obviously irrelevant given there is no order and they are duplicates so I read that as "one of"
JLSSCH (11/1/2009)
Maybe I'm missing something, but the query the original poster was looking for is simple with the appropriate GROUP BY and HAVING clauses.SELECT Name, Age, Sex
FROM Source
GROUP BY Name, Age, Sex
HAVING COUNT(*) > 1
This ensures that the output from the query includes only a row for duplicate rows. Note no row will be output for a unique row.
November 2, 2009 at 7:16 am
I have not seen this solution in the above list. Not fancy, but performs the initial request to delete duplicate rows leaving the initial entry, with minimal resource impact (especialy disk space) and no cursors, additional table/column, while not modifying the data?????:
DECLARE @NM VARCHAR(50), @Age TINYINT, @Sex CHAR(1);
SELECT @NM=Name, @Age = Age, @Sex = Sex
FROM dbo.xSource
GROUP BY Name, Age, Sex
HAVING COUNT(*) > 1
WHILE @NM IS NOT NULL
BEGIN
DELETE TOP (1)
FROM dbo.xSource
WHERE Name = @NM AND Age = @Age AND Sex = @Sex
SELECT @NM = NULL
SELECT @NM=Name, @Age = Age, @Sex = Sex
FROM dbo.xSource
GROUP BY Name, Age, Sex
HAVING COUNT(*) > 1
END
November 2, 2009 at 8:35 am
Great article. I really liked the 'go back to the 70' part.
Sometimes when I see how stuff is done just because the computer can handle the load it drives me nuts.
---------------------------------------------
[font="Verdana"]Nothing is impossible.
It is just a matter of time and money.[/font]
November 2, 2009 at 9:28 am
Great one....
November 2, 2009 at 10:50 am
Thanks for your comments. The following modified solution works when you use the row_number() function to number the rows in an output query.
For the following query, the output for "row_number() over (order by name) as ident" will produce a column with the name of "ident" and a monotonically increasing integer value starting with 1. This is functionally the same as an identity column with a seed of 1!
select name, age, sex, row_number() over (order by name) as ident
from #temp
Here is the complete script:
create table #temp
(
name varchar(4),
age int,
sex varchar(1)
)
insert #temp
select 'ABC',24,'M'
union all
select 'ABC',24,'M'
union all
select 'LMN', 27, 'M'
union all
select 'LMN', 27, 'M'
union all
select 'LMN', 27, 'M'
union all
select 'PQRS', 25, 'F'
union all
select 'XYZ', 24, 'M'
union all
select 'XYZ', 25, 'M'
select a.name, a.age, a.sex, a.ident
from
(
select name, age, sex, row_number() over (order by name) as ident
from #temp
) a
inner join
(
select b.name, b.age, b.sex, min(b.ident) as ident
from
(
select name, age, sex, row_number() over (order by name) as ident
from #temp
) b
group by b.name, b.age, b.sex
having count(*) > 1
) c on a.name = c.name and a.age = c.age and a.sex = c.sex and a.ident > c.ident
drop table #temp
--***************************
The desired output that is obtained from executing this script is:
ABC24M2
LMN27M4
LMN27M5
I hope that this helps.
November 2, 2009 at 11:41 am
One final update to my solution. Using CTE's, the output query is much easier to understand.
Here is the complete script:
create table #temp
(
name varchar(4),
age int,
sex varchar(1)
)
insert #temp
select 'ABC',24,'M'
union all
select 'ABC',24,'M'
union all
select 'LMN', 27, 'M'
union all
select 'LMN', 27, 'M'
union all
select 'LMN', 27, 'M'
union all
select 'PQRS', 25, 'F'
union all
select 'XYZ', 24, 'M'
union all
select 'XYZ', 25, 'M';
with tmp
as
(
select name, age, sex, row_number() over (order by name) as ident
from #temp
)
select a.name, a.age, a.sex, a.ident
from tmp a inner join
(
select b.name, b.age, b.sex, min(b.ident) as ident
from tmp b
group by b.name, b.age, b.sex
having count(*) > 1
) c on a.name = c.name and a.age = c.age and a.sex = c.sex and a.ident > c.ident
drop table #temp
--***************************
The desired output that is obtained from executing this script is:
ABC 24 M 2
LMN 27 M 4
LMN 27 M 5
November 2, 2009 at 6:09 pm
hmi (10/30/2009)
Good brain training but are the restrictions as mentioned realistic? Which company would definitely not want to use a temp table or table variable?
Quite right, it's really just brain training. 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 2, 2009 at 6:26 pm
honza.mf (10/30/2009)
Phil Factor (8/5/2008)
Surely this is the simplest solution? (be warned of a nasty catch when creating these 'quirky updates'. the order of execution of the update is -variable assignments first then column assignments-both left to right!)Very nice solution.
I Just prefer little change, it's more readable
[font="Courier New"]--create the sample table
DECLARE @Sample TABLE (TheName VARCHAR(4),Identifier INT, sex VARCHAR(1))
INSERT INTO @Sample (TheName,Identifier,Sex)
SELECT 'ABC', 24, 'M' UNION ALL
SELECT 'ABC', 24,'M' UNION ALL
SELECT 'LMN', 27, 'M' UNION ALL
SELECT 'LMN', 27, 'M' UNION ALL
SELECT 'LMN', 27, 'M' UNION ALL
SELECT 'PQRS', 25, 'F' UNION ALL
SELECT 'XYZ', 24, 'M' UNION ALL
SELECT 'XYZ', 25, 'M'
DECLARE @hash VARCHAR(80), @sex VARCHAR(1)
UPDATE @sample
SET
@sex=CASE WHEN COALESCE(@hash,'')
<>TheName+CONVERT(VARCHAR(5),Identifier)+sex
THEN 'd' ELSE SEX END,
@hash= TheName+CONVERT(VARCHAR(5),Identifier)+sex,
Sex = @sex
DELETE FROM @sample WHERE sex='d'
SELECT * FROM @sample[/font]
The quirky update trick isn't really reliable without a Clustered index. For instance, when I execute you code, thi sis what I get:
TheName Identifier sex
------- ----------- ----
ABC 24 M
LMN 27 M
LMN 27 M
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 2, 2009 at 6:29 pm
Stephen.Richardson (10/30/2009)
...-- Step 1 Get list of only DUPLICATE rows
INSERT INTO xSource (Name, Age, Sex)
SELECT '~'+Name, Age, Sex FROM xSource GROUP BY Name, Age, Sex Having COUNT(*) > 1
The problem is that this will fail if someone's name is already using all 50 characters (not unheard of). Otherwise I would have used this approach myself. 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 2, 2009 at 6:30 pm
skjoldtc (10/30/2009)
Thanks for the memories. A time when storage and memory was rare and expensive and you had to really know how the guts of the operating system, disk system, and memory allocation worked. Some of the old techniques are still useful today.Great article and a great solution. 😎
Thanks!
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 2, 2009 at 6:35 pm
jghali (10/30/2009)
Funny how this article just came out today... I was actually on a similar thread last week and someone came up with a great solution...My personal opinion the article is interesting but ... hmmm...
How about these solutions from this thread...
http://www.sqlservercentral.com/Forums/Topic793765-145-1.aspx
There's an undocumented identity key for every row of any table that can be used... in one simple delete statement you can delete the duplicate rows...
I was amazed...
Check it out.
Thanks
I am familiar with that thread (in fact I think that I am posted there), however, 1) it came out long after this article was originally published (15 months ago) and 2) those internal IDs are a lot hard to use than you might think, esp. given the restrictions. For instance, how would you propose using them to solve this problem?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 2, 2009 at 6:40 pm
arms.dan (10/30/2009)
Your off-hand comment about naming the field "Gender" rather than "Sex" piqued my curiosity. Why? As I understand the words, sex means biological differences: chromosomes and sexual organs. Gender refers to the characteristics a society or culture delineates as masculine or feminine. There's some blurring of the definitions, but I think they generally hold. So why do you prefer Gender for the column label?
In English "Sex" can also be a verb, thus leading to the joke about filling forms where is says "Sex[_]" with "Y". The PC way to stop this is to change the field name to "Gender[_]".
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 136 through 150 (of 156 total)
You must be logged in to reply to this topic. Login to reply