Long running queries when using DISTINCT with Strings

Question

Long running queries when using DISTINCT with Strings

MusabUmair

Ten Centuries

Points: 1071
More actions
March 26, 2009 at 1:37 am

#198358

Hi !
I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab
Musab
http://www.sqlhelpline.com

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Dugi SSCoach Points: 17998 More actions · Answer 1

musab (3/26/2009)
Hi !
I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab

What kind of data is it ...how it looks can you post example ...and 13 million records are too many records to select the distinct values till you don't have any Index on that column!

:w00t:

============================================================
SELECT YOUR PROBLEM FROM SSC.com WHERE PROBLEM DESCRIPTION =
http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]

MusabUmair Ten Centuries Points: 1071 More actions · Answer 2

basically the database is of a question/answer website so i have to do some maintenance on that DB and it has long text data so I need to find some unique question and answers. so how can I make it go faster ?

and as its a nvarchar(max) i cant use it as a key column for index.

any suggestions ?

Musab
http://www.sqlhelpline.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

musab (3/26/2009)
Hi !
I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab

Set up the query as a derived table, take out the nvarchar(max) column.

Join the derived table back to the source table to collect it. Can you post your query? It's probably easier to do it for you rather than explain.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

MusabUmair Ten Centuries Points: 1071 More actions · Answer 4

I used the below mentioned query to find the DISTINCT records

SELECT DISTINCT Column1+cast(Column2 as varchar)

FROM TableName

it tooklong time

now I need to delete the duplicates and when using CTE its also taking almost two hours to remove duplicate from 10,000 records and I have a total of 13 Million

Below is the query for removing Duplicates !

SET ROWCOUNT 10000

GO

With Dups as

(

select row_number() over (partition by Column1+cast(Column2 as varchar) order by Column1+cast(Column2 as varchar))as RowNum

from TableName

)

DELETE from Dups where rownum > 1;

Musab
http://www.sqlhelpline.com

Grant Fritchey SSC Guru Points: 398882 More actions · Answer 5

You should look up full text indexing for the varchar(max) column. That's the best mechanism for use with this sort of thing.

Are there other indexes on the table? A PK?

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

MusabUmair Ten Centuries Points: 1071 More actions · Answer 6

Thankyou very much for your valuable replies.

SQL Server Full Text Search is a keyword based search and I need to find the exact duplicate string which should then be removed from the table, so I am unable to find any solution for the removal of duplicates using Full Text.

FTS always delivers the rows with matching keywords.

I have other indexes as well on this table PK and other Indexes which are of some IDs.

so any suggestions ?

Musab
http://www.sqlhelpline.com

Grant Fritchey SSC Guru Points: 398882 More actions · Answer 7

Not without some structures and execution plans. It's just too hard to guess at what might work in your circumstances.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

MusabUmair Ten Centuries Points: 1071 More actions · Answer 8

USE [MaybeNowStructure]

GO

/****** Object: Table [dbo].[Question] Script Date: 03/26/2009 16:49:16 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

SET ANSI_PADDING ON

GO

CREATE TABLE [dbo].[Question](

[ID] [bigint] IDENTITY(1,1) NOT NULL,

[UserID] [bigint] NULL,

[VisitorID] [bigint] NULL,

[Text] [nvarchar](max) NULL,

//The Text Column is the column which contains the string data

[TimeStamp] [datetime] NOT NULL CONSTRAINT [DF_Question_TimeStamp] DEFAULT (getdate()),

[ViewCount] [bigint] NOT NULL CONSTRAINT [DF_Question_ViewCount] DEFAULT ((0)),

[MetaKeywords] [nvarchar](500) NULL,

[MetaDescription] [nvarchar](500) NULL,

[PageURL] [nvarchar](500) NULL,

[CategoryID] [bigint] NULL,

[LocationID] [bigint] NULL,

[Status] [int] NOT NULL CONSTRAINT [DF_Question_Status] DEFAULT ((1)),

[MisplacedCount] [int] NOT NULL CONSTRAINT [DF_Question_MisplacedCount] DEFAULT ((0)),

[SpamCount] [int] NOT NULL CONSTRAINT [DF_Question_SpamCount] DEFAULT ((0)),

[InappropriateCount] [int] NOT NULL CONSTRAINT [DF_Question_InappropriateCount] DEFAULT ((0)),

[MarkedAsMisplaced] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsMisplaced] DEFAULT ((0)),

[MarkedAsSpam] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsSpam] DEFAULT ((0)),

[MarkedAsInappropriate] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsInappropriate] DEFAULT ((0)),

[DiscussionCount] [bigint] NOT NULL CONSTRAINT [DF_Question_DiscussionCount] DEFAULT ((0)),

[ExpertCount] [bigint] NOT NULL CONSTRAINT [DF_Question_ExpertCount] DEFAULT ((0)),

[IPAddress] [varchar](50) NULL,

[ProxyAddress] [varchar](50) NULL,

[IsFeatured] [bit] NULL CONSTRAINT [DF_Question_IsFeatured] DEFAULT ((0)),

[ExpertID] [bigint] NULL,

[ExpertName] [nvarchar](100) NULL,

[ExpertURL] [varchar](500) NULL,

[Source] [smallint] NULL,

//The source is second column which I am concatinating with Text to find the duplicates

[SourceURL] [varchar](500) NULL,

[QuestionDescription] [nvarchar](max) NULL,

[ListingDescription] [nvarchar](max) NULL,

[SuckerID] [bigint] NULL,

[SuckerType] [bigint] NULL,

[LastDiscussedTimeStamp] [datetime] NULL,

[RowVer] [timestamp] NULL,

[IsUpdatedByAdmin] [bit] NULL CONSTRAINT [DF_Question_IsUpdatedByAdmin] DEFAULT ((0)),

[LastUpdatedTimestamp] [datetime] NULL,

CONSTRAINT [PK_Question] PRIMARY KEY CLUSTERED

(

[ID] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

GO

SET ANSI_PADDING OFF

GO

ALTER TABLE [dbo].[Question] WITH CHECK ADD CONSTRAINT [FK_Question_Category] FOREIGN KEY([CategoryID])

REFERENCES [dbo].[Category] ([ID])

GO

ALTER TABLE [dbo].[Question] CHECK CONSTRAINT [FK_Question_Category]

FULL Text Search is enabled on this table.

Anything Else you need to know Please I will be more than happy to explain.

Musab
http://www.sqlhelpline.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

Can you tell us something about the distribution of values in column1? Does the column have unique values which could be omitted from your process?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

MusabUmair Ten Centuries Points: 1071 More actions · Answer 10

The Column1 i.e. Text contains questions and I need to find unique questions and there are different questions in it and duplicates as well so I just need to delete the duplicates from this.

Musab
http://www.sqlhelpline.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 11

musab (3/26/2009)
The Column1 i.e. Text contains questions and I need to find unique questions and there are different questions in it and duplicates as well so I just need to delete the duplicates from this.

What about column2?

Which columns are they in your table?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Grant Fritchey SSC Guru Points: 398882 More actions · Answer 12

How about using the last update date column as a differentiator. Then you can order by that column and join on the text column. I would think that should get a much better performance. Also you can add an index there.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

MusabUmair Ten Centuries Points: 1071 More actions · Answer 13

Column1 is the Text Column

Column2 is the Source Column

I need to concatinate both the columns and then need to Remove the duplicates

The rows are complete duplicates all the data is same. So I cant use any other column to differentiate it.

as it was a script mistake which wrongly imported the duplicate data.

Musab
http://www.sqlhelpline.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 14

musab (3/26/2009)
Column1 is the Text Column
Column2 is the Source Column
I need to concatinate both the columns and then need to Remove the duplicates
The rows are complete duplicates all the data is same. So I cant use any other column to differentiate it.
as it was a script mistake which wrongly imported the duplicate data.

Are you sure about that?

CREATE TABLE [dbo].[Question](

[ID] [bigint] IDENTITY(1,1) NOT NULL,

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden