Creating indexes on a very large table takes over 5 hours

  • I have a table that is 125GB.  It only has 5 columns but has almost a billion rows.  This is a fact table that gets regenerated nightly.  I currently do this by creating a staging table, inserting the data, and then creating the indexes.  I have tried doing the opposite, creating the indexes first and then inserting the data, but it completely blows up the transaction log and I've never had enough space.  There is currently no way to incrementally populate the table.

    Is there anything else I can do to speed up the indexing process?

    If exists(select 1 from sys.tables where name='Staging') DROP TABLE  Staging


    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    INTO Staging
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd

    Union ALL
    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN Table1 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN Table3 ul on ul.programid = c.programid

    GO

    CREATE CLUSTERED INDEX IX_FactLoss_LossID
    ON Staging ([LossID])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_LossID_PrimAcctCd_UserID
    ON Staging ([LossID], [UserID], [primaryaccountcode])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_UserID_PrimAcctCd_Includes
    ON [dbo].Staging ([USERID], [primaryaccountcode])
    INCLUDE ([LossID],[LossNumber])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    CREATE NONCLUSTERED INDEX IX_FactLoss_PrimAcctCd_UserID_Includes2
    ON [dbo].Staging ([primaryaccountcode],[USERID])
    INCLUDE ([LossID],[FranID],[LossNumber],[ProgramID])
    WITH (FILLFACTOR=80, ONLINE=ON, SORT_IN_TEMPDB=OFF)
    GO

    If exists(select 1 from sys.tables where name='FactLoss_Old') Drop Table FactLoss_Old


    EXEC sp_rename 'FactLoss', 'FactLoss_Old'
    EXEC sp_rename 'Staging', 'FactLoss'
  • There is currently no way to incrementally populate the table.

    ...  a [Huge] fact table that gets regenerated nightly

    This is a design/process issue more than a technical one. I'll put my 2 cents in to bump this topic up but can't offer a solution that doesn't involve incrementally building the index.

    If incrementally repopulating the table isn't an option, then for this type of thing I generally:

    1. Create a duplicate table with the indexes in place

    2. Perform an incremental batch insert routine to the new table (a rough sketch follows this list)

    3. Drop the old one and point data requests to the new table.
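
    For step 2, a minimal sketch of what I mean, assuming the rows already exist in a staging copy (dbo.Staging) and are being moved in batches to a pre-indexed dbo.FactLoss_New.  All names, the batch size, and the (LossID, USERID) anti-join key are illustrative -- tune them to your data:

    DECLARE @BatchSize int = 1000000,
            @RowsMoved int = 1;

    WHILE @RowsMoved > 0
    BEGIN
        /* Each batch commits on its own, which keeps the transaction log small */
        INSERT INTO dbo.FactLoss_New WITH (TABLOCK)
               (primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID)
        SELECT TOP (@BatchSize)
               s.primaryaccountcode, s.USERID, s.LossID, s.FranID, s.LossNumber, s.ProgramID
        FROM dbo.Staging s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.FactLoss_New n
                          WHERE n.LossID = s.LossID AND n.USERID = s.USERID);

        SET @RowsMoved = @@ROWCOUNT;
    END;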

    This IS the solution, as far as I know. Since you're not talking about OLTP, there shouldn't be any interruption.

    The only other possible solution is to see if the index builds are happening with a parallel execution plan. That's an option in 2016 and will speed things up a ton if you're getting a serial plan for some reason.
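
    If you'd rather force it than hope, MAXDOP can be set per index build. The value 8 below is just an example -- match it to your hardware, and check the actual plan to confirm you get a parallel sort:

    CREATE CLUSTERED INDEX IX_FactLoss_LossID
    ON dbo.Staging (LossID)
    WITH (MAXDOP = 8, SORT_IN_TEMPDB = ON);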

    Again, I'd yell until someone allowed incremental updates. A big fact table repopulating daily is no way to live.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I'd recommend trying the approach below.

    Specifically, the code below:

    1. Changes the fillfactor to 99.  80 is way too small (unless you do massive updates to almost the entire table later?! And even then, not really).

    2. Creates the clustered index before loading the table.  That saves a full extra load of the table.  If at all possible, make the index unique, since unique indexes perform much better overall in SQL Server.

    3. Uses page compression (if you're using SQL 2016, this is available in Standard Edition).

    4. Reduces the number of indexes.  If you truly must have an index headed by USERID -- and test that first -- add that index back in.

    5. Sets ONLINE = OFF and SORT_IN_TEMPDB = ON /*Edit*/

    If exists(select 1 from sys.tables where object_id = OBJECT_ID('dbo.Staging'))
    DROP TABLE dbo.Staging

    /* create table structure *only* */
    select distinct top (0)
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    INTO dbo.Staging
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd

    /* create clustered index *before* initial load of table */
    CREATE /*UNIQUE*/ CLUSTERED INDEX IX_FactLoss_LossID /* make UNIQUE if at all possible */
    ON dbo.Staging ( LossID, USERID )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR=99, ONLINE=OFF, SORT_IN_TEMPDB=ON )

    /* load the table */
    INSERT INTO Staging WITH (TABLOCK)
    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    UNION ALL
    select distinct
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN Table1 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN Table3 ul on ul.programid = c.programid
    GO

    /* create nonclustered index(es) */

    CREATE NONCLUSTERED INDEX IX_FactLoss_PrimAcctCd_UserID_Includes2
    ON dbo.Staging ( primaryaccountcode, USERID ) /*whichever key first that better matches your queries*/
    INCLUDE ( FranID, LossID, LossNumber, ProgramID )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR=99, ONLINE=OFF, SORT_IN_TEMPDB=ON )
    GO

     


    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Using ONLINE=ON is most likely what is killing the index creation.

    But all the other comments are still valid.

  • pamkagel wrote:

    I have a table that is 125GB.  It only has 5 columns but has almost a billion rows.  This is a fact table that gets regenerated nightly.  I currently do this by creating a staging table, inserting the data, and then creating the indexes.  I have tried doing the opposite, creating the indexes first and then inserting the data, but it completely blows up the transaction log and I've never had enough space.  There is currently no way to incrementally populate the table.

    Is there anything else I can do to speed up the indexing process?

    I have a couple of questions...

    1. How long does the code currently take just to repopulate the truncated table WITHOUT doing anything else such as rebuilding the indexes?
    2. Is the table to be repopulated in the same database as the source data?

      1. If not, what is the Recovery Model of the target database?
      2. If so, what is the Recovery Model of the current database?
      3. If the answer to question 2 is "Full Recovery", can we move the table you're repopulating to a different database that uses the Simple Recovery Model and point synonyms at it? (See the sketch after this list.)

    3. Is there some temporal column in the database where we could "divvy up" the target table and use a partitioned view to do a more rapid "parallel load"?
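
    For question 2.3, the mechanics would look something like this (the FactStaging database name is hypothetical):

    /* The big table lives in a utility database that stays in SIMPLE recovery... */
    CREATE SYNONYM dbo.FactLoss FOR FactStaging.dbo.FactLoss;
    /* ...and existing queries against dbo.FactLoss keep working unchanged. */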

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Are you sure you need that table at all?

    Why wouldn't you try to create a view based on the same query you use to populate the table, and then query that view instead of the table?
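
    Something like the sketch below, i.e. wrapping the same UNION ALL the nightly load already uses (abbreviated here -- the full SELECT is in the first post):

    CREATE VIEW dbo.vFactLoss
    AS
    SELECT ul.primaryaccountcode, ul.USERID,
           l.loss_id_nb 'LossID', l.franid 'FranID',
           l.loss_nm 'LossNumber', l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    /* UNION ALL ... second branch of the query from the first post ... */
    ;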

     

    _____________
    Code for TallyGenerator

  • Tried creating the view.  It was not performant.

  • pamkagel wrote:

    Tried creating the view.  It was not performant.

    Then you're doing it wrong.  Please answer the questions I posted, or I won't know enough to help you.

    --Jeff Moden



  • Thank you all for your responses.  I am testing out some of your suggestions.

    @jeffModen

    1. It currently takes about two hours to load the table.
    2. Same database and Database is in Simple Recovery right now
    3. And unfortunately no there is no temporal column

     

     

  • I'm also trying to improve overall performance, not just the load time.

    -- ScottPletcher

  • pamkagel wrote:

    Tried creating the view.  It was non performant .

    I had a closer look at the query and I'm not surprised it's not performant.

     

    First - what do you need the "distinct" for? Are you trying to eliminate some duplicates, or is it "just in case"?

    Either way, you should not use it. Ever. Just forget that keyword.

    If you know of cases of data duplication - find out where they come from and eliminate the source.

    If it's not possible to eliminate them at the source - use GROUP BY, as in the example below.
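
    For example, the first branch of your original query, de-duplicated with GROUP BY instead of DISTINCT (joins kept as in the original):

    select
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table2 a on a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    GROUP BY ul.primaryaccountcode, ul.USERID, l.loss_id_nb, l.franid, l.loss_nm, l.FNOL_PROGRAM_ID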

    Second. No data from table "a" is used in the SELECT; therefore that table must be taken out of the FROM part and placed into a WHERE EXISTS.

    I modified the query to be a better performer:

    select 
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    l.FNOL_PROGRAM_ID as ProgramID
    From Table1 l
    INNER JOIN Table3 ul on ul.franid = l.franid and ul.primaryaccountcode = l.pri_acct_cd
    WHERE EXISTS (select * from Table2 a where a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd)
    UNION ALL
    select
    ul.primaryaccountcode,
    ul.USERID,
    l.loss_id_nb 'LossID',
    l.franid 'FranID',
    l.loss_nm 'LossNumber',
    c.programid
    From loss l
    INNER JOIN table2 c on c.mxlossid = l.prj_loss_id_nb and c.active = 1
    INNER JOIN table3 ul on ul.programid = c.programid
    WHERE EXISTS (select * from table1 a where a.franid = l.franid and a.pri_acct_cd = l.pri_acct_cd)

    Third. Try to run your reports from the view with the modified version of the query. Check the execution plans. See if some index tweaking on the underlying tables may improve performance.

    You may wish to post execution plans here - someone will surely be able to show you the way.

    _____________
    Code for TallyGenerator

  • Thank you Sergiy.  I'm still doing some testing, but those query changes have made a difference so far.

  • @pamkagel ,

    Ok... you said that the entire table is generated nightly.  That makes me want to assume that you're not actually doing any updates or additional inserts to the table.  With that, I have to ask why on Earth your indexes are being built with an 80% Fill Factor.  That means that you're wasting (100/80 - 1) * 100 = 25 percent of disk space and memory or, in the case of your 125GB table, 31.25GB of disk space and RAM, and that's just for the Clustered Index.

    So, if you're not doing any updates or inserts (and you should NOT be because you throw the table away every day), build the indexes either with the default (which should default to 0 and that's the same as 100 unless you made the terrible mistake of changing the default) or explicitly build them at 100.

    You also mentioned that the database is in the SIMPLE recovery model.  That's actually good for this.  It will easily allow for "Minimal Logging", which will make your original build twice as fast from the get-go IF you set up for minimal logging.  Here's how to do that...

    Don't use SELECT/INTO because that won't build your clustered index.  Instead, explicitly create the table WITH the clustered index in place on the empty table.  Then, whatever SELECT you end up with, it must be in the following form to achieve "Minimal Logging", which is nasty fast and won't require you to build the clustered index separately...

    1.  Create the table with an explicit CREATE TABLE statement that also creates the Clustered Index.  This can be in the form of a Clustered PK constraint or as an explicit index.
    2. Do NOT create any non-clustered indexes yet and don't listen to people that will rail about Trace Flag 610.  It can work sometimes but about 99% of the time it won't (personal experience talking there).
    3. Do your insert in the following form...
     INSERT INTO dbo.yourtablename WITH (TABLOCK)
    (full column list here)
    SELECT full column list here
    FROM dbo.whatever
    JOIN dbo.whatever etc
    WHERE whatever relationship is necessary to get the right data but must be SARGable
    ORDER BY list of column(s) in the exact same order as the clustered index
    OPTION (RECOMPILE) --undocumented but usually required and won't be in execution plan if you're lucky with your joins
    ;

    The WITH(TABLOCK), ORDER BY in the correct column order, and the OPTION (RECOMPILE) are absolutely essential to get the minimal logging.
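
    To make that concrete with the column names from this thread: the data types below are guesses (use whatever the source tables actually declare), and dbo.FactLoss_Source is a stand-in for the UNION ALL SELECT in the first post:

    CREATE TABLE dbo.Staging
    (
        primaryaccountcode varchar(20) NOT NULL,  -- assumed type
        USERID             int         NOT NULL,  -- assumed type
        LossID             bigint      NOT NULL,  -- assumed type
        FranID             int         NOT NULL,  -- assumed type
        LossNumber         varchar(30) NOT NULL,  -- assumed type
        ProgramID          int         NULL       -- assumed type
    );
    CREATE CLUSTERED INDEX IX_FactLoss_LossID ON dbo.Staging (LossID, USERID);
    GO
    INSERT INTO dbo.Staging WITH (TABLOCK)
          (primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID)
    SELECT primaryaccountcode, USERID, LossID, FranID, LossNumber, ProgramID
    FROM dbo.FactLoss_Source          -- stand-in for the real UNION ALL SELECT
    ORDER BY LossID, USERID           -- exact same order as the clustered index
    OPTION (RECOMPILE);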

    Now, that code will be slower than SELECT/INTO but it will be much faster than SELECT/INTO followed by you building a Clustered Index.  It will also take half the memory, half the disk space, and a remarkably small amount of transaction log file space.  In one of my tests, I built two 120 MILLION row tables with row widths of 124 bytes each (including the unseen row headers) with the clustered indexes in place in 4 minutes flat for a total of 25.153GB of input, and it only used 600MB of log file space.  Adding a non-clustered index to each table took an additional minute and caused no additional growth.  It's a small example, but the 33GB total size of the database was all created in a little over 5 minutes and each table contained 120 MILLION rows.

    Let me know if there's a column that has a range that's different in the first query than in the second.  If so, there's another trick I can show you to pull off parallel loads into two separate tables using the high-performance loading method above, and then use a "Partitioned View" to make them look like one table. (See the outline below.)
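
    In outline, that trick looks like this -- the split point and names are made up for illustration; the CHECK constraints are what let the optimizer skip the member table that can't contain the rows a query asks for:

    /* Two tables loaded in parallel, separated by a CHECK constraint on the range column */
    CREATE TABLE dbo.FactLoss_A
    (
        LossID bigint NOT NULL CHECK (LossID <  500000000)  -- hypothetical split point
        /* ...remaining columns as in the main table... */
    );
    CREATE TABLE dbo.FactLoss_B
    (
        LossID bigint NOT NULL CHECK (LossID >= 500000000)
        /* ...remaining columns... */
    );
    GO
    CREATE VIEW dbo.FactLoss
    AS
    SELECT LossID /*, ...other columns... */ FROM dbo.FactLoss_A
    UNION ALL
    SELECT LossID /*, ...other columns... */ FROM dbo.FactLoss_B;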

    Last but not least: when the table is loaded with the clustered index in place, THEN build your non-clustered indexes at a 100 FILL FACTOR.

    Of course, you have actually already been told partially how to do this by Scott in his post above.  The two things he left out are the ORDER BY, to guarantee the inserts "follow the rules" for "Minimal Logging", and the undocumented OPTION (RECOMPILE), which won't hurt if it's not actually necessary and helps guarantee that you'll get the "Minimal Logging".  In that dual 120 million row insert test I told you about above, it was the difference between a 4 minute load/600MB log file and a 20 minute load/40GB log file.

    I still question the need to rebuild a table with billions of rows every night, but I don't know your data well enough to make a suggestion there.  Both Sergiy and Scott are also correct about the performance of the SELECTs being critical, but I can't make an informed suggestion because, like I said, I don't know your data and you've not included an Actual Execution Plan in your postings.  See the article at the second link in my signature line below for one of the best ways to post performance issues.

     

    --Jeff Moden



  • Jeff Moden wrote:

    Don't use SELECT/INTO because that won't build your clustered index.  Instead, explicitly create the table WITH the clustered index in place on the empty table. ...

    The WITH(TABLOCK), ORDER BY in the correct column order, and the OPTION (RECOMPILE) are absolutely essential to get the minimal logging. ...

    Those extra steps are not required for SQL 2016, at least according to MS internals folks.  For example, your statement about needing an explicit CREATE TABLE statement?  Again, I think that applies to older SQL versions only.

     

    -- ScottPletcher

  • ScottPletcher wrote:

    Those extra steps are not required for SQL 2016, at least according to MS internals folks.  For example, your statement of needing an explicit CREATE TABLE statement?  Again, I think that is for older SQL versions only.

    Apologies... You basically have an explicit CREATE TABLE in the "rowless" SELECT/INTO statement, and that will also do.  The thing I'm trying to point out is that something needs to explicitly create the empty table and add the Clustered Index (like we've both done) prior to doing the inserts.  My CREATE TABLE isn't an "extra step" any more than your "rowless" SELECT/INTO.  Either will work.

    --Jeff Moden


