October 31, 2017 at 9:24 am
Hi there,
I am trying to figure out the best way to rank a table that has 600 million rows.
This is data about Company financial info from different companies.
Eg: ID=1 (there is a reference table for ID to CompanyName) ; iDate (date when data was received) ; FinancialInfoName can be sales, Loans or Deposits; Value is value of the FinancialInfoName;
Create table dbo.tbl_needs_ranking(ID int not null, iDate Smalldatetime null,Value Decimal(22,2),FinancialInfoName varchar(100) null, Rnk int null)
This table gets data from 2 different work tables.
dbo.table1 and dbo.table2 each have about 300 million rows.
dbo.table1 has data from 2016 and dbo.table2 from 2015.
This is how I insert
TRUNCATE TABLE dbo.tbl_needs_ranking
INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table1
INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table2
Rnk is a column that is rank by ID and Date.
;WITH CTE AS
(
SELECT Rnk
,[ID]
,[iDate]
,DENSE_RANK() over (ORDER BY id, iDate) AS ranking
FROM dbo.tbl_needs_ranking WITH (TABLOCK)
) UPDATE CTE SET Rnk=CTE.ranking
It's been 3 hours and not yet finished running.
What's the best way to rank this table?
Thank you
MR
October 31, 2017 at 9:56 am
600M records is a LOT of records. Even if your disc speed (both read and write) is great, upi have loads of RAM, etc, it is going to take a long time.
Doing an INSERT and then an UPDATE isn't going to help matter though. That means, firstly, all the data needs to be read and written (in the tables) for the initial INSERT(s), but then read again (in your new table) and then written to with the new Rank.
I would firstly ensure that you do the whole process in one go:TRUNCATE TABLE dbo.tbl_needs_ranking;
--UNION the data first
WITH CoreData AS (
SELECT ID, iDate, Value, FinancialInfoName,
FROM dbo.table1
UNION ALL
SELECT ID, iDate, Value, FinancialInfoName
FROM dbo.table2),
--Now calculate the rankings
RankedData AS (
SELECT *,
DENSE_RANK() OVER (ORDER BY id, idate) AS Ranking
FROM CoreData)
INSERT INTO dbo.tbl_needs_ranking (ID,iDate,Value,FinancialInfoName,Rnk)
SELECT ID, iDate, Value, FinancialInfoName, Ranking
FROM RankedData;
This, won't, necessarily, be that quick still; depending on what indexes you have on your tables. You're going to need as appropriate INDEX that the server can use to be able to ORDER by id and idate columns in both tables.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
October 31, 2017 at 10:01 am
What indexes do you have on table1, table2 and tbl_needs_ranking?
Have you considered something like this - doing the ranking and both inserts all in one operation? Your transaction log and/or tempdb may not thank you, but it's worth seeing how it compares on a test database with the same data.WITH Combined AS (
SELECT ID,iDate,Value,FinancialInfoName
FROM dbo.table1
UNION
SELECT ID,iDate,Value,FinancialInfoName
FROM dbo.table2
)
INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk)
SELECT ID,iDate,Value,FinancialInfoName,
DENSE_RANK() over (ORDER BY id, iDate)
FROM Combined;
John
October 31, 2017 at 11:47 am
The best way to do this is to cluster the table on ( id, iDate ); that is, the clustered index on the table should be on those columns. And, as long as you look up by id and/or ( id, iDate ) at least half the time on that table, then that's what you should do. Since I/we don't see other queries, I/we don't know whether you need nonclus index(es) or not.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
October 31, 2017 at 11:57 am
I only have non-clustered index on dbo.table1 and dbo.table2.
And after the update (it never did finish) I have clustered index on dbo.tbl_needs_ranking
October 31, 2017 at 12:38 pm
If you know that all rows in table1 come before rows in table2, you can go ahead and assign the ranking as part of the load of the table. It's faster to sort 300M rows and 300M rows than to sort 600M rows.
TRUNCATE TABLE dbo.tbl_needs_ranking;
--drop any nonclus index on dbo.tbl_needs_ranking
CREATE CLUSTERED INDEX tbl_needs_ranking__CL ON dbo.tbl_needs_ranking ( ID, iDate ) WITH ( FILLFACTOR = 99 ) ON [PRIMARY];
INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk) WITH (TABLOCK)
SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table1
INSERT INTO dbo.tbl_needs_ranking(ID,iDate,Value,FinancialInfoName,Rnk) WITH (TABLOCK)
SELECT ID,iDate,Value,FinancialInfoName,NULL FROM dbo.table2
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply