August 15, 2014 at 9:06 am
Hi there,
I have a table that contains Financial information in text format for each company along with each Company ID. This table gets updated every two weeks. It has around 25 million records and changes with updates. I take this table and trying to create Lucene index using C# program. Instead of pulling all records at once, I thought I will do this
1. select Count(*)/4 into firstpart from dbo.CompanyFinance
2. with Records AS(select row_number() over(order by d) as 'row', *
from dbo.CompanyFinance)
select id,FinancialInfo from records
where row<firstpart
To get the next 1/4th of the records? The way I thought of doing was
3. select Count(*)/2 into temp.
4. Get the id for the 12.5 millionth record (just like step 2) and
5. then Select id,FinancialInfo from dbo.CompanyFinance where id between firstpart+1 and id from step 4
Here's my problem is the query execution time. Step 1,2,3,4 are taking a very long time to run. Is there a better query you can suggest.
Create table dbo.CompanyFinance(id int not null, Financialinfo nvarchar(max), CONSTRAINT [pk_id] PRIMARY KEY CLUSTERED
([Id] ASC)
Insert into dbo.CompanyFinance values(1,'This is a test');
Insert into dbo.CompanyFinance values(2,'I have very large financial info in this field');
Thanks
Rash
August 15, 2014 at 9:20 am
You're attempting to query 8.25 million records out of 25 million. It's going to scan the table to do this. Indexes won't help. You can get faster or more disks to help speed this up, or you can attempt to move smaller amounts of data so that an index will be of assistance.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 15, 2014 at 10:28 am
When you say "you can attempt to move smaller amounts of data so that an index will be of assistance. " are you saying I should get select top 1000 rows at a time instead of several million.
August 15, 2014 at 10:44 am
The issue is, there's no way within code or structure to speed up moving 1/4 of a 25 million row table. You are completely dependent on hardware because that will lead to a scan of the table. So, in order for you to arrive at benefits to access speed presented by indexes, you have to deal with substantially smaller amounts of data. But, there's a balancing act. Yes, 1,000 rows will result in a seek most likely. But, multiply 1,000 row seeks often enough to move 8+ million rows, suddenly that may no longer be desirable. So you have to balance the benefits of being able to use the index on the table to help you with the amount of data you need to move.
I'd suggest questioning the need to move all the table. Is there a way to incrementally update this thing you're doing rather than a complete replace each time? Incremental data moves can then take advantage of indexes to assist your queries.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 15, 2014 at 12:23 pm
What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.
What are there nonclustering indexe(es)? (in case they are needed to determine the clustering key ranges).
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 16, 2014 at 4:19 am
ScottPletcher (8/15/2014)
What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.What are there nonclustering indexe(es)? (in case they are needed to determine the clustering key ranges).
Based on the structure posted above, it's an ID/Value table and the cluster is on the ID. With that kind of structure and the data volumes we're talking about, I'm still back to questioning the need for moving the entire table around, let alone tuning that.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 16, 2014 at 9:27 pm
Grant Fritchey (8/16/2014)
ScottPletcher (8/15/2014)
What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.What are there nonclustering indexe(es)? (in case they are needed to determine the clustering key ranges).
Based on the structure posted above, it's an ID/Value table and the cluster is on the ID. With that kind of structure and the data volumes we're talking about, I'm still back to questioning the need for moving the entire table around, let alone tuning that.
Sorry, didn't see that.
Assuming you can accept "close enough" to 1/4, rather than needing an exact 1/4, you could get the min and max, take the diff, divide by 4, and read each 1/4 range of ids. Be sure to explicitly specify id BETWEEN @id_start_of_fourth AND @id_end_of_fourth.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply