March 26, 2009 at 1:37 am
Hi !
I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab
Musab
http://www.sqlhelpline.com
March 26, 2009 at 2:10 am
musab (3/26/2009)
Hi !I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab
What kind of data is it ...how it looks can you post example ...and 13 million records are too many records to select the distinct values till you don't have any Index on that column!
:w00t:
March 26, 2009 at 3:31 am
basically the database is of a question/answer website so i have to do some maintenance on that DB and it has long text data so I need to find some unique question and answers. so how can I make it go faster ?
and as its a nvarchar(max) i cant use it as a key column for index.
any suggestions ?
Musab
http://www.sqlhelpline.com
March 26, 2009 at 3:35 am
musab (3/26/2009)
Hi !I am using data type nvarchar(max) and have 13 million records and want to get unique records out of it based on the text.
but the query takes too long.
How can I speed it up ?
Regards
Musab
Set up the query as a derived table, take out the nvarchar(max) column.
Join the derived table back to the source table to collect it. Can you post your query? It's probably easier to do it for you rather than explain.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 5:19 am
I used the below mentioned query to find the DISTINCT records
SELECT DISTINCT Column1+cast(Column2 as varchar)
FROM TableName
it tooklong time
now I need to delete the duplicates and when using CTE its also taking almost two hours to remove duplicate from 10,000 records and I have a total of 13 Million
Below is the query for removing Duplicates !
SET ROWCOUNT 10000
GO
With Dups as
(
select row_number() over (partition by Column1+cast(Column2 as varchar) order by Column1+cast(Column2 as varchar))as RowNum
from TableName
)
DELETE from Dups where rownum > 1;
Musab
http://www.sqlhelpline.com
March 26, 2009 at 5:30 am
You should look up full text indexing for the varchar(max) column. That's the best mechanism for use with this sort of thing.
Are there other indexes on the table? A PK?
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 26, 2009 at 5:41 am
Thankyou very much for your valuable replies.
SQL Server Full Text Search is a keyword based search and I need to find the exact duplicate string which should then be removed from the table, so I am unable to find any solution for the removal of duplicates using Full Text.
FTS always delivers the rows with matching keywords.
I have other indexes as well on this table PK and other Indexes which are of some IDs.
so any suggestions ?
Musab
http://www.sqlhelpline.com
March 26, 2009 at 5:46 am
Not without some structures and execution plans. It's just too hard to guess at what might work in your circumstances.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 26, 2009 at 5:54 am
USE [MaybeNowStructure]
GO
/****** Object: Table [dbo].[Question] Script Date: 03/26/2009 16:49:16 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Question](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[UserID] [bigint] NULL,
[VisitorID] [bigint] NULL,
[Text] [nvarchar](max) NULL,
//The Text Column is the column which contains the string data
[TimeStamp] [datetime] NOT NULL CONSTRAINT [DF_Question_TimeStamp] DEFAULT (getdate()),
[ViewCount] [bigint] NOT NULL CONSTRAINT [DF_Question_ViewCount] DEFAULT ((0)),
[MetaKeywords] [nvarchar](500) NULL,
[MetaDescription] [nvarchar](500) NULL,
[PageURL] [nvarchar](500) NULL,
[CategoryID] [bigint] NULL,
[LocationID] [bigint] NULL,
[Status] [int] NOT NULL CONSTRAINT [DF_Question_Status] DEFAULT ((1)),
[MisplacedCount] [int] NOT NULL CONSTRAINT [DF_Question_MisplacedCount] DEFAULT ((0)),
[SpamCount] [int] NOT NULL CONSTRAINT [DF_Question_SpamCount] DEFAULT ((0)),
[InappropriateCount] [int] NOT NULL CONSTRAINT [DF_Question_InappropriateCount] DEFAULT ((0)),
[MarkedAsMisplaced] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsMisplaced] DEFAULT ((0)),
[MarkedAsSpam] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsSpam] DEFAULT ((0)),
[MarkedAsInappropriate] [bit] NOT NULL CONSTRAINT [DF_Question_MarkedAsInappropriate] DEFAULT ((0)),
[DiscussionCount] [bigint] NOT NULL CONSTRAINT [DF_Question_DiscussionCount] DEFAULT ((0)),
[ExpertCount] [bigint] NOT NULL CONSTRAINT [DF_Question_ExpertCount] DEFAULT ((0)),
[IPAddress] [varchar](50) NULL,
[ProxyAddress] [varchar](50) NULL,
[IsFeatured] [bit] NULL CONSTRAINT [DF_Question_IsFeatured] DEFAULT ((0)),
[ExpertID] [bigint] NULL,
[ExpertName] [nvarchar](100) NULL,
[ExpertURL] [varchar](500) NULL,
[Source] [smallint] NULL,
//The source is second column which I am concatinating with Text to find the duplicates
[SourceURL] [varchar](500) NULL,
[QuestionDescription] [nvarchar](max) NULL,
[ListingDescription] [nvarchar](max) NULL,
[SuckerID] [bigint] NULL,
[SuckerType] [bigint] NULL,
[LastDiscussedTimeStamp] [datetime] NULL,
[RowVer] [timestamp] NULL,
[IsUpdatedByAdmin] [bit] NULL CONSTRAINT [DF_Question_IsUpdatedByAdmin] DEFAULT ((0)),
[LastUpdatedTimestamp] [datetime] NULL,
CONSTRAINT [PK_Question] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [dbo].[Question] WITH CHECK ADD CONSTRAINT [FK_Question_Category] FOREIGN KEY([CategoryID])
REFERENCES [dbo].[Category] ([ID])
GO
ALTER TABLE [dbo].[Question] CHECK CONSTRAINT [FK_Question_Category]
FULL Text Search is enabled on this table.
Anything Else you need to know Please I will be more than happy to explain.
Musab
http://www.sqlhelpline.com
March 26, 2009 at 5:54 am
Can you tell us something about the distribution of values in column1? Does the column have unique values which could be omitted from your process?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 6:02 am
The Column1 i.e. Text contains questions and I need to find unique questions and there are different questions in it and duplicates as well so I just need to delete the duplicates from this.
Musab
http://www.sqlhelpline.com
March 26, 2009 at 6:07 am
musab (3/26/2009)
The Column1 i.e. Text contains questions and I need to find unique questions and there are different questions in it and duplicates as well so I just need to delete the duplicates from this.
What about column2?
Which columns are they in your table?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 26, 2009 at 6:12 am
How about using the last update date column as a differentiator. Then you can order by that column and join on the text column. I would think that should get a much better performance. Also you can add an index there.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 26, 2009 at 6:22 am
Column1 is the Text Column
Column2 is the Source Column
I need to concatinate both the columns and then need to Remove the duplicates
The rows are complete duplicates all the data is same. So I cant use any other column to differentiate it.
as it was a script mistake which wrongly imported the duplicate data.
Musab
http://www.sqlhelpline.com
March 26, 2009 at 6:32 am
musab (3/26/2009)
Column1 is the Text ColumnColumn2 is the Source Column
I need to concatinate both the columns and then need to Remove the duplicates
The rows are complete duplicates all the data is same. So I cant use any other column to differentiate it.
as it was a script mistake which wrongly imported the duplicate data.
Are you sure about that?
CREATE TABLE [dbo].[Question](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply