Long running queries when using DISTINCT with Strings

  • let me explain you again.

    e.g.

    If there are 10 rows then its possible to have 4 rows with one same question and 6 other rows with another same question.

    then there will be only two distinct rows but in the table there are ten rows.

    I need to find out those two distinct rows and delete other 8 rows.

    Thankyou very much for your kind response I will be glad to take more response from you.

    Musab
    http://www.sqlhelpline.com

  • Chris has it. You should be able to use the PK value as an ordering clause. Delete the rows where the ID is not equal to the top ID ordered by the ID descending where the two values you want to check are equal.

    This should work much better than any kind of DISTINCT clause.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Can you run this query and post the results, please? You may wish to change sensitive data but if you do, please explain exactly which columns have been changed.

    SELECT TOP 10

    [ID],

    [UserID],

    [VisitorID],

    SUBSTRING([Text], 1, 10) AS [Text], -- [nvarchar](max) NULL

    [TimeStamp],

    [ViewCount],

    LEFT([MetaKeywords], 10) AS [MetaKeywords],

    LEFT([MetaDescription], 10) AS [MetaDescription],

    LEFT([PageURL], 10) AS [PageURL],

    [CategoryID],

    [LocationID],

    [Status],

    [MisplacedCount],

    [SpamCount],

    [InappropriateCount],

    [MarkedAsMisplaced],

    [MarkedAsSpam],

    [MarkedAsInappropriate],

    [DiscussionCount],

    [ExpertCount],

    LEFT([IPAddress], 10) AS [IPAddress],

    LEFT([ProxyAddress], 10) AS [ProxyAddress],

    [IsFeatured],

    [ExpertID],

    LEFT([ExpertName], 10) AS [ExpertName],

    LEFT([ExpertURL], 10) AS [ExpertURL],

    [Source], --

    LEFT([SourceURL], 10) AS [SourceURL],

    SUBSTRING([QuestionDescription], 1, 10) AS [QuestionDescription], -- [nvarchar](max) NULL,

    SUBSTRING([ListingDescription], 1, 10) AS [ListingDescription], -- [nvarchar](max) NULL,

    [SuckerID],

    [SuckerType],

    [LastDiscussedTimeStamp],

    [RowVer],

    [IsUpdatedByAdmin],

    [LastUpdatedTimestamp]

    FROM [dbo].[Question]

    ORDER BY [TimeStamp] DESC

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • sorry I am unable to get your point please explain again !

    Musab
    http://www.sqlhelpline.com

  • ID,UserID,VisitorID,Text,TimeStamp,ViewCount,MetaKeywords,MetaDescription,PageURL,CategoryID,LocationID,Status,MisplacedCount,SpamCount,InappropriateCount,MarkedAsMisplaced,MarkedAsSpam,MarkedAsInappropriate,DiscussionCount,ExpertCount,IPAddress,ProxyAddress,IsFeatured,ExpertID,ExpertName,ExpertURL,Source,SourceURL,QuestionDescription,ListingDescription,SuckerID,SuckerType,LastDiscussedTimeStamp,RowVer,IsUpdatedByAdmin,LastUpdatedTimestamp

    19311982,NULL,135879,how 2 make,2009-02-06 01:53:01.327,4,NULL,NULL,how-2-make,NULL,1061209,1,0,0,0,0,0,0,0,2,117.96.132,117.96.132,0,NULL,NULL,NULL,NULL,NULL,on weblo, ,on weblo, ,NULL,NULL,NULL,0x00000000066827D9,0,NULL

    19311981,NULL,135696,Mashrafe M,2009-02-06 01:44:24.553,3,NULL,NULL,Mashrafe-M,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,NULL,0x00000000066827AF,0,NULL

    19311980,159,NULL,A strange ,2009-02-06 01:38:18.527,3,NULL,NULL,A-strange-,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Sea life -,Sea life -,NULL,NULL,NULL,0x00000000066827A6,0,NULL

    19311979,NULL,135696,what is so,2009-02-06 01:32:20.053,2,NULL,NULL,what-is-so,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,NULL,0x0000000006682765,0,NULL

    19311978,NULL,135696,Mashrafe M,2009-02-06 01:28:25.220,3,NULL,NULL,Mashrafe-M,NULL,1841404,1,0,0,0,0,0,0,1,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Mashrafe M,Mashrafe M,NULL,NULL,2009-02-06 01:41:00.000,0x0000000006682799,0,NULL

    19311977,159,NULL,which site,2009-02-06 01:19:25.970,11,NULL,NULL,which-site,NULL,1841404,1,0,0,0,0,0,0,0,4,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Internet -,Internet -,NULL,NULL,NULL,0x000000000668277D,1,2009-02-06 01:22:11.533

    19311976,NULL,135479,Shocking b,2009-02-06 01:10:12.920,13,NULL,NULL,Shocking-b,NULL,1841404,1,0,0,0,0,0,0,3,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Shocking b,Shocking b,NULL,NULL,2009-02-06 01:24:00.000,0x0000000006682781,0,NULL

    19311975,159,NULL,Great Esca,2009-02-06 01:09:09.870,9,NULL,NULL,Great-Esca,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Crime - Ja,Crime - Ja,NULL,NULL,NULL,0x0000000006682740,0,NULL

    19311974,NULL,135479,ICL is bet,2009-02-06 01:08:08.373,9,NULL,NULL,ICL-is-bet,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,ICL is bet,ICL is bet,NULL,NULL,NULL,0x0000000006682786,0,NULL

    19311973,159,NULL,Does Obama,2009-02-06 01:05:52.947,10,NULL,NULL,Does-Obama,NULL,1841404,1,0,0,0,0,0,0,0,0,172.16.20.,124.109.39,0,NULL,NULL,NULL,NULL,NULL,Politics -,Politics -,NULL,NULL,NULL,0x000000000668277F,0,NULL

    This is a CSV Data

    Musab
    http://www.sqlhelpline.com

  • sorry I am unable to get your point please explain again !

    Check the [ID] column in your table. Is it unique or does it have duplicates?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The row with Text Mashrafe M is a duplicate but its like some of the exceptions as the data on both the rows in not identical but I just need to find the Text + Source Unique values and if there is any other change in any column i just need to ignore it. I only need is a unique combination.

    Musab
    http://www.sqlhelpline.com

  • Its a PK.

    Its unique

    Musab
    http://www.sqlhelpline.com

  • Here's some sample data:

    DROP TABLE #Question

    CREATE TABLE #Question(

    [ID] [bigint],

    [UserID] [bigint] NULL,

    [VisitorID] [bigint] NULL,

    [Text] [varchar](20) NULL,

    [TimeStamp] [datetime] NOT NULL,

    [ViewCount] [bigint] NOT NULL,

    [MetaKeywords] [nvarchar](50) NULL,

    [MetaDescription] [nvarchar](50) NULL,

    [PageURL] [nvarchar](50) NULL,

    [CategoryID] [bigint] NULL,

    [LocationID] [bigint] NULL,

    [Status] [int] NOT NULL,

    [MisplacedCount] [int] NOT NULL,

    [SpamCount] [int] NOT NULL,

    [InappropriateCount] [int],

    [MarkedAsMisplaced] [bit] NOT NULL,

    [MarkedAsSpam] [bit] NOT NULL,

    [MarkedAsInappropriate] [bit],

    [DiscussionCount] [bigint] NOT NULL,

    [ExpertCount] [bigint] NOT NULL,

    [IPAddress] [varchar](50) NULL,

    [ProxyAddress] [varchar](50) NULL,

    [IsFeatured] [bit] NULL,

    [ExpertID] [bigint] NULL,

    [ExpertName] [nvarchar](100) NULL,

    [ExpertURL] [varchar](50) NULL,

    [Source] [smallint] NULL,

    [SourceURL] [varchar](50) NULL,

    [QuestionDescription] [nvarchar](20) NULL,

    [ListingDescription] [nvarchar](20) NULL,

    [SuckerID] [bigint] NULL,

    [SuckerType] [bigint] NULL,

    [LastDiscussedTimeStamp] [datetime] NULL,

    [RowVer] VARCHAR(18), --[timestamp] NULL,

    [IsUpdatedByAdmin] [bit],

    [LastUpdatedTimestamp] [datetime] NULL)

    INSERT INTO #Question (

    [ID],

    [UserID],

    [VisitorID],

    [Text], -- [nvarchar](max) NULL

    [TimeStamp],

    [ViewCount],

    [MetaKeywords],

    [MetaDescription],

    [PageURL],

    [CategoryID],

    [LocationID],

    [Status],

    [MisplacedCount],

    [SpamCount],

    [InappropriateCount],

    [MarkedAsMisplaced],

    [MarkedAsSpam],

    [MarkedAsInappropriate],

    [DiscussionCount],

    [ExpertCount],

    [IPAddress],

    [ProxyAddress],

    [IsFeatured],

    [ExpertID],

    [ExpertName],

    [ExpertURL],

    [Source], ---

    [SourceURL],

    [QuestionDescription], -- [nvarchar](max) NULL,

    [ListingDescription], -- [nvarchar](max) NULL,

    [SuckerID],

    [SuckerType],

    [LastDiscussedTimeStamp],

    [RowVer],

    [IsUpdatedByAdmin],

    [LastUpdatedTimestamp])

    SELECT 19311982,NULL,135879,'how 2 make','2009-02-06 01:53:01.327',4,NULL,NULL,'how-2-make',NULL,1061209,1,0,0,0,0,0,0,0,2,'117.96.132','117.96.132',0,NULL,NULL,NULL,NULL,NULL,'on weblo, ','on weblo, ',NULL,NULL,NULL,'0x00000000066827D9',0,NULL UNION ALL

    SELECT 19311981,NULL,135696,'Mashrafe M','2009-02-06 01:44:24.553',3,NULL,NULL,'Mashrafe-M',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,NULL,'0x00000000066827AF',0,NULL UNION ALL

    SELECT 19311980,159,NULL,'A strange ','2009-02-06 01:38:18.527',3,NULL,NULL,'A-strange-',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Sea life -','Sea life -',NULL,NULL,NULL,'0x00000000066827A6',0,NULL UNION ALL

    SELECT 19311979,NULL,135696,'what is so','2009-02-06 01:32:20.053',2,NULL,NULL,'what-is-so',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,NULL,'0x0000000006682765',0,NULL UNION ALL

    SELECT 19311978,NULL,135696,'Mashrafe M','2009-02-06 01:28:25.220',3,NULL,NULL,'Mashrafe-M',NULL,1841404,1,0,0,0,0,0,0,1,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Mashrafe M','Mashrafe M',NULL,NULL,'2009-02-06 01:41:00.000','0x0000000006682799',0,NULL UNION ALL

    SELECT 19311977,159,NULL,'which site','2009-02-06 01:19:25.970',11,NULL,NULL,'which-site',NULL,1841404,1,0,0,0,0,0,0,0,4,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Internet -','Internet -',NULL,NULL,NULL,'0x000000000668277D',1,'2009-02-06 01:22:11.533' UNION ALL

    SELECT 19311976,NULL,135479,'Shocking b','2009-02-06 01:10:12.920',13,NULL,NULL,'Shocking-b',NULL,1841404,1,0,0,0,0,0,0,3,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Shocking b','Shocking b',NULL,NULL,'2009-02-06 01:24:00.000','0x0000000006682781',0,NULL UNION ALL

    SELECT 19311975,159,NULL,'Great Esca','2009-02-06 01:09:09.870',9,NULL,NULL,'Great-Esca',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Crime - Ja','Crime - Ja',NULL,NULL,NULL,'0x0000000006682740',0,NULL UNION ALL

    SELECT 19311974,NULL,135479,'ICL is bet','2009-02-06 01:08:08.373',9,NULL,NULL,'ICL-is-bet',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'ICL is bet','ICL is bet',NULL,NULL,NULL,'0x0000000006682786',0,NULL UNION ALL

    SELECT 19311973,159,NULL,'Does Obama','2009-02-06 01:05:52.947',10,NULL,NULL,'Does-Obama',NULL,1841404,1,0,0,0,0,0,0,0,0,'172.16.20.','124.109.39',0,NULL,NULL,NULL,NULL,NULL,'Politics -','Politics -',NULL,NULL,NULL,'0x000000000668277F',0,NULL

    Now that we can see your data, it's much easier to figure out a solution.

    SELECT [Text], Source

    FROM #Question

    ORDER BY [Text]

    Results:

    Text Source

    -------------------- ------

    A strange NULL

    Does Obama NULL

    Great Esca NULL

    how 2 make NULL

    ICL is bet NULL

    Mashrafe M NULL

    Mashrafe M NULL

    Shocking b NULL

    what is so NULL

    which site NULL

    (10 row(s) affected)

    It doesn't look like the column [Source] is of much use to you.

    Now, if the two rows with value 'Mashrafe M' are duplicates, is there any other column which is duplicated? UserID & VisitorID look to me like very good candidates:

    SELECT Text, UserID, VisitorID

    FROM #Question

    ORDER BY [Text]

    Results:

    Text UserID VisitorID

    -------------------- -------------------- --------------------

    A strange 159 NULL

    Does Obama 159 NULL

    Great Esca 159 NULL

    how 2 make NULL 135879

    ICL is bet NULL 135479

    Mashrafe M NULL 135696

    Mashrafe M NULL 135696

    Shocking b NULL 135479

    what is so NULL 135696

    which site 159 NULL

    (10 row(s) affected)

    What do you reckon?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • musab (3/26/2009)


    sorry I am unable to get your point please explain again !

    Sorry, you ought to be able do something like the following. There are probably better ways, but this came out the first time I tried it. I'm not writing a DELETE, but if you can SELECT the right data,you can delete the right data too.

    SELECT a.ID

    FROM dbo.QUESTION a

    JOIN dbo.QUESTION b

    ON a.[Text] = b.[Text]

    AND a.Source = b.Source

    AND a.ID = b.id

    AND b.ID > (SELECT TOP(1) b2.ID

    FROM Question b2

    WHERE b2.Text = b.Text

    and b2.Source = b.Source

    ORDER BY b2.ID ASC)

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi,

    You might also like to consider adding two INT columns to hold CHECKSUMs of the Questions and Answers.

    The CHECKSUMs can act as a hash value to allow a future query to quickly select rows which are probably duplicates.

    Having identified the relatively small set of possible duplicates on an efficient INT column, you then go on to check the whole string for an exact match. Only having to compare a small fraction of the complete strings makes an amazing performance difference!

    In case you are not familiar with this concept, here is a simple script to demonstrate it:

    -- Create a simple table with questions and answers and two hash key columns

    CREATE TABLE dbo.StringDup

    (

    PKINTIDENTITY(1,1) CONSTRAINT PK_SD PRIMARY KEY CLUSTERED,

    QuestionNVARCHAR(MAX) NULL,

    AnswerNVARCHAR(MAX) NULL,

    Q_HashINT NULL,

    A_HashINT NULL

    );

    -- Add 123,456 semi-random text rows, so there will be some duplicates, but most rows will be unique

    -- (on a typical run anyway)

    SET NOCOUNT ON

    DECLARE@i INT

    SET @i = 0

    WHILE @i < 123456

    BEGIN

    ;WITH C AS (SELECT CONVERT(INT, RAND(CHECKSUM(NEWID())) * 1000000) AS Q, CONVERT(INT, RAND(CHECKSUM(NEWID())) * 1000000) AS A)

    INSERT dbo.StringDup (Question, Answer)

    SELECT N'This is question #' + CAST(C.Q AS NVARCHAR(7)), N'This is answer #' + CAST(C.A AS NVARCHAR(7))

    FROM C;

    SET @i = @i + 1;

    END;

    -- Calculate the hash keys

    -- This is very fast if the Questions and Answers are still in cache

    -- Use a real column - not a persisted computed one, if you are tempted (pain lies that way)

    UPDATEdbo.StringDup

    SETQ_Hash = CHECKSUM(Question),

    A_Hash = CHECKSUM(Answer);

    -- Create two small indexes on the hash keys only

    -- Don't be tempted to INCLUDE the long text columns in this index

    -- The point is that the index is small and dead fast

    CREATE INDEX IX_QHASH ON dbo.StringDup (Q_Hash);

    CREATE INDEX IX_AHASH ON dbo.StringDup (A_Hash);

    -- The CTE contains the hash keys which appear more than once

    -- The join then retrieves rows with those hash keys where the Question is also a duplicate

    ;WITH Dup AS (SELECT Q_Hash FROM dbo.StringDup GROUP BY Q_Hash HAVING COUNT_BIG(*) > 1)

    SELECTS.Question, COUNT_BIG(*) AS dup_cnt

    FROMDup

    INNER

    LOOP

    JOINdbo.StringDup AS S WITH(INDEX(IX_QHASH))

    ON(Dup.Q_Hash = S.Q_Hash)

    GROUPBY

    S.Question

    HAVINGCOUNT_BIG(*) > 1 -- Question string appears more than once

    ORDERBY

    dup_cnt DESC;

    The order by can be removed if you don't need it.

    The code for finding duplicate Answers is spookily similar.

    The loop join and index hint are there to ensure the optimizer gets the message, and doesn't retrieve the long string column to soon.

    Have fun!

    /Paul

  • I am really thankful to you it really solved my problem. Thank you !

    Musab
    http://www.sqlhelpline.com

  • I am really thankful to you it really solved my problem. Thank you !

    I will be looking for more from you thanks.

    Musab
    http://www.sqlhelpline.com

  • Hi

    It would be nice if you tell us final solution/implementation.

  • Paul White gave the best solutions which is implementable on Large table in a very efficient way.

    if you need any kind of explanation i will be more than happy to give you it.

    Musab
    http://www.sqlhelpline.com

Viewing 15 posts - 16 through 30 (of 33 total)

You must be logged in to reply to this topic. Login to reply