March 26, 2009 at 6:38 am
let me explain you again.
e.g.
If there are 10 rows then its possible to have 4 rows with one same question and 6 other rows with another same question.
then there will be only two distinct rows but in the table there are ten rows.
I need to find out those two distinct rows and delete other 8 rows.
Thankyou very much for your kind response I will be glad to take more response from you.
Musab
http://www.sqlhelpline.com
March 26, 2009 at 7:00 am
Chris has it. You should be able to use the PK value as an ordering clause. Delete the rows where the ID is not equal to the top ID ordered by the ID descending where the two values you want to check are equal.
This should work much better than any kind of DISTINCT clause.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 26, 2009 at 7:10 am
Can you run this query and post the results, please? You may wish to change sensitive data but if you do, please explain exactly which columns have been changed.
SELECT TOP 10
[ID],
[UserID],
[VisitorID],
SUBSTRING([Text], 1, 10) AS [Text], -- [nvarchar](max) NULL
[TimeStamp],
[ViewCount],
LEFT([MetaKeywords], 10) AS [MetaKeywords],
LEFT([MetaDescription], 10) AS [MetaDescription],
LEFT([PageURL], 10) AS [PageURL],
[CategoryID],
[LocationID],
[Status],
[MisplacedCount],
[SpamCount],
[InappropriateCount],
[MarkedAsMisplaced],
[MarkedAsSpam],
[MarkedAsInappropriate],
[DiscussionCount],
[ExpertCount],
LEFT([IPAddress], 10) AS [IPAddress],
LEFT([ProxyAddress], 10) AS [ProxyAddress],
[IsFeatured],
[ExpertID],
LEFT([ExpertName], 10) AS [ExpertName],
LEFT([ExpertURL], 10) AS [ExpertURL],
[Source], --
LEFT([SourceURL], 10) AS [SourceURL],
SUBSTRING([QuestionDescription], 1, 10) AS [QuestionDescription], -- [nvarchar](max) NULL,
SUBSTRING([ListingDescription], 1, 10) AS [ListingDescription], -- [nvarchar](max) NULL,
[SuckerID],
[SuckerType],
[LastDiscussedTimeStamp],
[RowVer],
[IsUpdatedByAdmin],
[LastUpdatedTimestamp]
FROM [dbo].[Question]
ORDER BY [TimeStamp] DESC
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 7:12 am
sorry I am unable to get your point please explain again !
Musab
http://www.sqlhelpline.com
March 26, 2009 at 7:18 am
ID,UserID,VisitorID,Text,TimeStamp,ViewCount,MetaKeywords,MetaDescription,PageURL,CategoryID,LocationID,Status,MisplacedCount,SpamCount,InappropriateCount,MarkedAsMisplaced,MarkedAsSpam,MarkedAsInappropriate,DiscussionCount,ExpertCount,IPAddress,ProxyAddress,IsFeatured,ExpertID,ExpertName,ExpertURL,Source,SourceURL,QuestionDescription,ListingDescription,SuckerID,SuckerType,LastDiscussedTimeStamp,RowVer,IsUpdatedByAdmin,LastUpdatedTimestamp
19311982,NULL,135879,how 2 make,2009-02-06 01:53:01.327,4,NULL,NULL,how-2-make,NULL,1061209,1,0,0,0,0,0,0,0,2,117.96.132,117.96.132,0,NULL,NULL,NULL,NULL,NULL,on weblo, ,on weblo, ,NULL,NULL,NULL,0x00000000066827D9,0,NULL
19311981,NULL,135696,Mashrafe M,2009-02-06 01:44:24.553,3,NULL,NULL,Mashrafe-M,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,NULL,0x00000000066827AF,0,NULL
19311980,159,NULL,A strange ,2009-02-06 01:38:18.527,3,NULL,NULL,A-strange-,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Sea life -,Sea life -,NULL,NULL,NULL,0x00000000066827A6,0,NULL
19311979,NULL,135696,what is so,2009-02-06 01:32:20.053,2,NULL,NULL,what-is-so,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,NULL,0x0000000006682765,0,NULL
19311978,NULL,135696,Mashrafe M,2009-02-06 01:28:25.220,3,NULL,NULL,Mashrafe-M,NULL,1841404,1,0,0,0,0,0,0,1,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,2009-02-06 01:41:00.000,0x0000000006682799,0,NULL
19311977,159,NULL,which site,2009-02-06 01:19:25.970,11,NULL,NULL,which-site,NULL,1841404,1,0,0,0,0,0,0,0,4,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Internet -,Internet -,NULL,NULL,NULL,0x000000000668277D,1,2009-02-06 01:22:11.533
19311976,NULL,135479,Shocking b,2009-02-06 01:10:12.920,13,NULL,NULL,Shocking-b,NULL,1841404,1,0,0,0,0,0,0,3,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Shocking b,Shocking b,NULL,NULL,2009-02-06 01:24:00.000,0x0000000006682781,0,NULL
19311975,159,NULL,Great Esca,2009-02-06 01:09:09.870,9,NULL,NULL,Great-Esca,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Crime - Ja,Crime - Ja,NULL,NULL,NULL,0x0000000006682740,0,NULL
19311974,NULL,135479,ICL is bet,2009-02-06 01:08:08.373,9,NULL,NULL,ICL-is-bet,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,ICL is bet,ICL is bet,NULL,NULL,NULL,0x0000000006682786,0,NULL
19311973,159,NULL,Does Obama,2009-02-06 01:05:52.947,10,NULL,NULL,Does-Obama,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Politics -,Politics -,NULL,NULL,NULL,0x000000000668277F,0,NULL
This is a CSV Data
Musab
http://www.sqlhelpline.com
March 26, 2009 at 7:20 am
sorry I am unable to get your point please explain again !
Check the [ID] column in your table. Is it unique or does it have duplicates?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 7:20 am
The row with Text Mashrafe M is a duplicate but its like some of the exceptions as the data on both the rows in not identical but I just need to find the Text + Source Unique values and if there is any other change in any column i just need to ignore it. I only need is a unique combination.
Musab
http://www.sqlhelpline.com
March 26, 2009 at 7:24 am
Its a PK.
Its unique
Musab
http://www.sqlhelpline.com
March 26, 2009 at 8:16 am
Here's some sample data:
DROP TABLE #Question
CREATE TABLE #Question(
[ID] [bigint],
[UserID] [bigint] NULL,
[VisitorID] [bigint] NULL,
[Text] [varchar](20) NULL,
[TimeStamp] [datetime] NOT NULL,
[ViewCount] [bigint] NOT NULL,
[MetaKeywords] [nvarchar](50) NULL,
[MetaDescription] [nvarchar](50) NULL,
[PageURL] [nvarchar](50) NULL,
[CategoryID] [bigint] NULL,
[LocationID] [bigint] NULL,
[Status] [int] NOT NULL,
[MisplacedCount] [int] NOT NULL,
[SpamCount] [int] NOT NULL,
[InappropriateCount] [int],
[MarkedAsMisplaced] [bit] NOT NULL,
[MarkedAsSpam] [bit] NOT NULL,
[MarkedAsInappropriate] [bit],
[DiscussionCount] [bigint] NOT NULL,
[ExpertCount] [bigint] NOT NULL,
[IPAddress] [varchar](50) NULL,
[ProxyAddress] [varchar](50) NULL,
[IsFeatured] [bit] NULL,
[ExpertID] [bigint] NULL,
[ExpertName] [nvarchar](100) NULL,
[ExpertURL] [varchar](50) NULL,
[Source] [smallint] NULL,
[SourceURL] [varchar](50) NULL,
[QuestionDescription] [nvarchar](20) NULL,
[ListingDescription] [nvarchar](20) NULL,
[SuckerID] [bigint] NULL,
[SuckerType] [bigint] NULL,
[LastDiscussedTimeStamp] [datetime] NULL,
[RowVer] VARCHAR(18), --[timestamp] NULL,
[IsUpdatedByAdmin] [bit],
[LastUpdatedTimestamp] [datetime] NULL)
INSERT INTO #Question (
[ID],
[UserID],
[VisitorID],
[Text], -- [nvarchar](max) NULL
[TimeStamp],
[ViewCount],
[MetaKeywords],
[MetaDescription],
[PageURL],
[CategoryID],
[LocationID],
[Status],
[MisplacedCount],
[SpamCount],
[InappropriateCount],
[MarkedAsMisplaced],
[MarkedAsSpam],
[MarkedAsInappropriate],
[DiscussionCount],
[ExpertCount],
[IPAddress],
[ProxyAddress],
[IsFeatured],
[ExpertID],
[ExpertName],
[ExpertURL],
[Source], ---
[SourceURL],
[QuestionDescription], -- [nvarchar](max) NULL,
[ListingDescription], -- [nvarchar](max) NULL,
[SuckerID],
[SuckerType],
[LastDiscussedTimeStamp],
[RowVer],
[IsUpdatedByAdmin],
[LastUpdatedTimestamp])
SELECT 19311982,NULL,135879,'how 2 make','2009-02-06 01:53:01.327',4,NULL,NULL,'how-2-make',NULL,1061209,1,0,0,0,0,0,0,0,2,'117.96.132','117.96.132',0,NULL,NULL,NULL,NULL,NULL,'on weblo, ','on weblo, ',NULL,NULL,NULL,'0x00000000066827D9',0,NULL UNION ALL
SELECT 19311981,NULL,135696,'Mashrafe M','2009-02-06 01:44:24.553',3,NULL,NULL,'Mashrafe-M',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,NULL,'0x00000000066827AF',0,NULL UNION ALL
SELECT 19311980,159,NULL,'A strange ','2009-02-06 01:38:18.527',3,NULL,NULL,'A-strange-',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Sea life -','Sea life -',NULL,NULL,NULL,'0x00000000066827A6',0,NULL UNION ALL
SELECT 19311979,NULL,135696,'what is so','2009-02-06 01:32:20.053',2,NULL,NULL,'what-is-so',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,NULL,'0x0000000006682765',0,NULL UNION ALL
SELECT 19311978,NULL,135696,'Mashrafe M','2009-02-06 01:28:25.220',3,NULL,NULL,'Mashrafe-M',NULL,1841404,1,0,0,0,0,0,0,1,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,'2009-02-06 01:41:00.000','0x0000000006682799',0,NULL UNION ALL
SELECT 19311977,159,NULL,'which site','2009-02-06 01:19:25.970',11,NULL,NULL,'which-site',NULL,1841404,1,0,0,0,0,0,0,0,4,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Internet -','Internet -',NULL,NULL,NULL,'0x000000000668277D',1,'2009-02-06 01:22:11.533' UNION ALL
SELECT 19311976,NULL,135479,'Shocking b','2009-02-06 01:10:12.920',13,NULL,NULL,'Shocking-b',NULL,1841404,1,0,0,0,0,0,0,3,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Shocking b','Shocking b',NULL,NULL,'2009-02-06 01:24:00.000','0x0000000006682781',0,NULL UNION ALL
SELECT 19311975,159,NULL,'Great Esca','2009-02-06 01:09:09.870',9,NULL,NULL,'Great-Esca',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Crime - Ja','Crime - Ja',NULL,NULL,NULL,'0x0000000006682740',0,NULL UNION ALL
SELECT 19311974,NULL,135479,'ICL is bet','2009-02-06 01:08:08.373',9,NULL,NULL,'ICL-is-bet',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'ICL is bet','ICL is bet',NULL,NULL,NULL,'0x0000000006682786',0,NULL UNION ALL
SELECT 19311973,159,NULL,'Does Obama','2009-02-06 01:05:52.947',10,NULL,NULL,'Does-Obama',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Politics -','Politics -',NULL,NULL,NULL,'0x000000000668277F',0,NULL
Now that we can see your data, it's much easier to figure out a solution.
SELECT [Text], Source
FROM #Question
ORDER BY [Text]
Results:
Text Source
-------------------- ------
A strange NULL
Does Obama NULL
Great Esca NULL
how 2 make NULL
ICL is bet NULL
Mashrafe M NULL
Mashrafe M NULL
Shocking b NULL
what is so NULL
which site NULL
(10 row(s) affected)
It doesn't look like the column [Source] is of much use to you.
Now, if the two rows with value 'Mashrafe M' are duplicates, is there any other column which is duplicated? UserID & VisitorID look to me like very good candidates:
SELECT Text, UserID, VisitorID
FROM #Question
ORDER BY [Text]
Results:
Text UserID VisitorID
-------------------- -------------------- --------------------
A strange 159 NULL
Does Obama 159 NULL
Great Esca 159 NULL
how 2 make NULL 135879
ICL is bet NULL 135479
Mashrafe M NULL 135696
Mashrafe M NULL 135696
Shocking b NULL 135479
what is so NULL 135696
which site 159 NULL
(10 row(s) affected)
What do you reckon?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 12:40 pm
musab (3/26/2009)
sorry I am unable to get your point please explain again !
Sorry, you ought to be able do something like the following. There are probably better ways, but this came out the first time I tried it. I'm not writing a DELETE, but if you can SELECT the right data,you can delete the right data too.
SELECT a.ID
FROM dbo.QUESTION a
JOIN dbo.QUESTION b
ON a.[Text] = b.[Text]
AND a.Source = b.Source
AND a.ID = b.id
AND b.ID > (SELECT TOP(1) b2.ID
FROM Question b2
WHERE b2.Text = b.Text
and b2.Source = b.Source
ORDER BY b2.ID ASC)
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 2, 2009 at 2:15 am
Hi,
You might also like to consider adding two INT columns to hold CHECKSUMs of the Questions and Answers.
The CHECKSUMs can act as a hash value to allow a future query to quickly select rows which are probably duplicates.
Having identified the relatively small set of possible duplicates on an efficient INT column, you then go on to check the whole string for an exact match. Only having to compare a small fraction of the complete strings makes an amazing performance difference!
In case you are not familiar with this concept, here is a simple script to demonstrate it:
-- Create a simple table with questions and answers and two hash key columns
CREATE TABLE dbo.StringDup
(
PKINTIDENTITY(1,1) CONSTRAINT PK_SD PRIMARY KEY CLUSTERED,
QuestionNVARCHAR(MAX) NULL,
AnswerNVARCHAR(MAX) NULL,
Q_HashINT NULL,
A_HashINT NULL
);
-- Add 123,456 semi-random text rows, so there will be some duplicates, but most rows will be unique
-- (on a typical run anyway)
SET NOCOUNT ON
DECLARE@i INT
SET @i = 0
WHILE @i < 123456
BEGIN
;WITH C AS (SELECT CONVERT(INT, RAND(CHECKSUM(NEWID())) * 1000000) AS Q, CONVERT(INT, RAND(CHECKSUM(NEWID())) * 1000000) AS A)
INSERT dbo.StringDup (Question, Answer)
SELECT N'This is question #' + CAST(C.Q AS NVARCHAR(7)), N'This is answer #' + CAST(C.A AS NVARCHAR(7))
FROM C;
SET @i = @i + 1;
END;
-- Calculate the hash keys
-- This is very fast if the Questions and Answers are still in cache
-- Use a real column - not a persisted computed one, if you are tempted (pain lies that way)
UPDATEdbo.StringDup
SETQ_Hash = CHECKSUM(Question),
A_Hash = CHECKSUM(Answer);
-- Create two small indexes on the hash keys only
-- Don't be tempted to INCLUDE the long text columns in this index
-- The point is that the index is small and dead fast
CREATE INDEX IX_QHASH ON dbo.StringDup (Q_Hash);
CREATE INDEX IX_AHASH ON dbo.StringDup (A_Hash);
-- The CTE contains the hash keys which appear more than once
-- The join then retrieves rows with those hash keys where the Question is also a duplicate
;WITH Dup AS (SELECT Q_Hash FROM dbo.StringDup GROUP BY Q_Hash HAVING COUNT_BIG(*) > 1)
SELECTS.Question, COUNT_BIG(*) AS dup_cnt
FROMDup
INNER
LOOP
JOINdbo.StringDup AS S WITH(INDEX(IX_QHASH))
ON(Dup.Q_Hash = S.Q_Hash)
GROUPBY
S.Question
HAVINGCOUNT_BIG(*) > 1 -- Question string appears more than once
ORDERBY
dup_cnt DESC;
The order by can be removed if you don't need it.
The code for finding duplicate Answers is spookily similar.
The loop join and index hint are there to ensure the optimizer gets the message, and doesn't retrieve the long string column to soon.
Have fun!
/Paul
April 2, 2009 at 4:10 am
I am really thankful to you it really solved my problem. Thank you !
Musab
http://www.sqlhelpline.com
April 2, 2009 at 4:11 am
I am really thankful to you it really solved my problem. Thank you !
I will be looking for more from you thanks.
Musab
http://www.sqlhelpline.com
April 2, 2009 at 11:26 am
Hi
It would be nice if you tell us final solution/implementation.
April 2, 2009 at 11:46 pm
Paul White gave the best solutions which is implementable on Large table in a very efficient way.
if you need any kind of explanation i will be more than happy to give you it.
Musab
http://www.sqlhelpline.com
Viewing 15 posts - 16 through 30 (of 33 total)
You must be logged in to reply to this topic. Login to reply