December 14, 2009 at 6:10 pm
I’ve got a database that is logging customer activity (24x7 environment), and is retaining 7 days of data. There is a delete that runs nightly, deleting the oldest day of data. When the delete job runs, it causes contention with the inserts. We’re looking at modifying the clustered index on the tables so that the data is physically in the order it is inserted (and deleted). Essentially, we’re looking at this as a first-in-first-out organization of our tables. Our thought is that this would help alleviate the contention because the inserts would be happening at the “end” of the tables and the deletes would be happening at the “beginning” of the tables.
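(For illustration, something along these lines is what we're considering; the table and column names are simplified placeholders, not our real schema.)
-- Simplified placeholder for one of our activity tables.
CREATE TABLE dbo.customer_activity
(
    activity_id   bigint IDENTITY(1,1) NOT NULL,
    activity_date datetime NOT NULL,
    customer_id   int NOT NULL,
    detail        varchar(200) NULL
);

-- Cluster on the ever-increasing insert order, so new rows land at the
-- "end" of the table and the nightly purge deletes from the "beginning".
CREATE UNIQUE CLUSTERED INDEX CIX_customer_activity
    ON dbo.customer_activity (activity_date, activity_id);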
On a clean/new table, this all seems correct. The piece I’m struggling with is what will happen after the delete job has run a few times, and data pages are reused. I believe the data will logically be in a first-in-first-out order, but physically I don’t believe that will be the case (fragmentation). But, I think this depends on how SQL decides where to insert data. I’m wondering if anyone knows of a resource that clearly explains how SQL reuses data pages (not log file pages).
I realize SQL will use the PFS, GAM/SGAM and perhaps IAM pages to find free space, but I’m not clear if there’s really any order as to how SQL will use pages. I attempted to run some tests to find out, and from what I’m seeing it’s somewhat random. The rest of this post is simply an attempt to illustrate my point, and walk through the testing I did.
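(As an aside, the undocumented DBCC IND and DBCC PAGE commands can show exactly which pages a table owns and what is sitting on them, which is handy for this kind of poking around. Against the test table I set up below, that would look something like the following; 'testdb' is just a placeholder database name, and since the commands are undocumented the output format isn't guaranteed.)
DBCC TRACEON (3604);                                -- send DBCC PAGE output to the client
DBCC IND ('testdb', 'test', 1);                     -- list the pages owned by the clustered index
DBCC PAGE ('testdb', 1, 153, 3) WITH TABLERESULTS;  -- dump one page (file 1, page 153 as an example)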
First I created a test database (so there aren’t any objects except the table I’m going to use for testing), and my test table:
CREATE TABLE [dbo].[test](
    [rowid] [int] IDENTITY(1,1) NOT NULL,
    [text_val] [char](1600) NULL,
    CONSTRAINT [PK_test] PRIMARY KEY CLUSTERED
    (
        [rowid] ASC
    )
)
I put a PK/clustered index on my auto-incrementing identity column, and I created a CHAR(1600) column so that I get 5 rows of data onto a page (each row is roughly 1,611 bytes including overhead, so five rows fit on an 8 KB page but a sixth would not). This code will insert 40 rows of data into my test table:
declare @total int, @curr int
set @total = 40
set @curr = 1

while @total >= @curr
begin
    insert into test (text_val)
    values ('test' + CAST(@curr as varchar(10)))

    set @curr = @curr + 1
end
I run this code 5 times. If I check the fragmentation (the check I use is shown right after the next query), it's very low. I then run the following query so I can see the pageID for each row in my table:
SELECT sys.fn_PhysLocFormatter (%%physloc%%) AS [Physical RID], *
FROM test;
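(For reference, the fragmentation check is just the standard DMV query, something along these lines:)
-- Fragmentation and page count for every index on the test table.
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.test'), NULL, NULL, 'LIMITED');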
I verify I have 5 rows per page, and then run the following code to return only every 5th row, so I can condense the output (I don't need to look at every single row; I just want to see which pageID the rows are landing on).
create table #temp (physloc varchar(100), rowid int, text_val varchar(2000))
insert into #temp(physloc, rowid, text_val)
SELECT sys.fn_PhysLocFormatter (%%physloc%%) AS [Physical RID], *
FROM test;
GO
select * from #temp
where rowid%5 = 0
drop table #temp
I’m not sure why a few pageIDs are skipped, but you can see my data and pageIDs are in ascending order.
Physical RID   RowID   Text_Val
(1:153:4)      5       test5
(1:156:4)      10      test10
(1:157:4)      15      test15
(1:158:4)      20      test20
(1:159:4)      25      test25
(1:168:4)      30      test30
(1:169:4)      35      test35
(1:176:4)      40      test40
(1:177:4)      45      test5
(1:178:4)      50      test10
(1:179:4)      55      test15
(1:180:4)      60      test20
(1:181:4)      65      test25
(1:182:4)      70      test30
(1:183:4)      75      test35
(1:184:4)      80      test40
(1:185:4)      85      test5
(1:186:4)      90      test10
(1:187:4)      95      test15
(1:188:4)      100     test20
(1:189:4)      105     test25
(1:190:4)      110     test30
(1:191:4)      115     test35
(1:192:4)      120     test40
(1:193:4)      125     test5
(1:194:4)      130     test10
(1:195:4)      135     test15
(1:196:4)      140     test20
(1:197:4)      145     test25
(1:198:4)      150     test30
(1:199:4)      155     test35
(1:200:4)      160     test40
(1:201:4)      165     test5
(1:202:4)      170     test10
(1:203:4)      175     test15
(1:204:4)      180     test20
(1:205:4)      185     test25
(1:206:4)      190     test30
(1:207:4)      195     test35
(1:208:4)      200     test40
Now I delete the first 40 rows. This is simulating deleting the oldest day’s worth of data.
delete from test where rowid <= 40
I then insert another 40 rows. After doing so, I rerun my fn_PhysLocFormatter query to see which pages the new rows landed on. I’ll include only the pertinent data here, since most of it is repeated from above.
Physical RID   RowID   Text_Val
(1:177:4)      45      test5
...
(1:201:4)      165     test5
(1:202:4)      170     test10
(1:203:4)      175     test15
(1:204:4)      180     test20
(1:205:4)      185     test25
(1:206:4)      190     test30
(1:207:4)      195     test35
(1:208:4)      200     test40
(1:156:4)      205     test5
(1:157:4)      210     test10
(1:158:4)      215     test15
(1:159:4)      220     test20
(1:168:4)      225     test25
(1:169:4)      230     test30
(1:209:4)      235     test35
(1:210:4)      240     test40
You can see it didn’t reuse page 153, but then reused pages 156, 157, 158, 159, 168 and 169. It then jumped to pages 209 and 210. Note that it didn’t reuse page 176.
Then I insert another 40 rows, and here’s where that data landed.
Physical RID   RowID   Text_Val
(1:208:4)      200     test40
(1:156:4)      205     test5
(1:157:4)      210     test10
(1:158:4)      215     test15
(1:159:4)      220     test20
(1:168:4)      225     test25
(1:169:4)      230     test30
(1:209:4)      235     test35
(1:210:4)      240     test40
(1:211:4)      245     test5
(1:212:4)      250     test10
(1:213:4)      255     test15
(1:214:4)      260     test20
(1:215:4)      265     test25
(1:176:4)      270     test30
(1:216:4)      275     test35
(1:217:4)      280     test40
You can see it picked up where we left off, filling pages 211 through 215, but then jumped back and reused page 176 before moving on to 216 and 217. Of course, if you check the fragmentation, the table is becoming fragmented. If you continue to do batches of inserts followed by deletes, the table slowly becomes more fragmented, and you can see the rows end up scattered across pages as SQL reuses them.
Again, my question is if anyone knows of a detailed resource to explain how SQL reuses data pages, and/or what the logic is that SQL follows to decide where a new row will be inserted.
Thanks,
Adam
December 14, 2009 at 7:45 pm
First, the temporal clustered index you mention will help a lot. There's another trick associated with this type of "rolling delete", but I have a couple of questions first.
1. How many rows do you believe you'll delete at a time?
2. What are the other indexes/keys on the table?
3. Any triggers or indexed views on the table?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 15, 2009 at 2:29 am
amoericke (12/14/2009)
I’ve got a database that is logging customer activity (24x7 environment), and is retaining 7 days of data. There is a delete that runs nightly, deleting the oldest day of data. When the delete job runs, it causes contention with the inserts.
Are you running Enterprise edition? If so, take a look at table partitioning. With that, you can delete data with a metadata operation (drop an entire partition). It's far faster and won't interfere with your inserts.
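A rough sketch of that sliding-window pattern (every name, date, and boundary below is a placeholder; the staging table must match the partitioned table's structure and live on the same filegroup):
-- One boundary per retained day; the table itself would be created
-- ON ps_activity_day(activity_date).
CREATE PARTITION FUNCTION pf_activity_day (datetime)
AS RANGE RIGHT FOR VALUES ('20091208', '20091209', '20091210',
                           '20091211', '20091212', '20091213', '20091214');

CREATE PARTITION SCHEME ps_activity_day
AS PARTITION pf_activity_day ALL TO ([PRIMARY]);

-- Nightly purge: switch the oldest day's partition (assumed to be partition 1
-- here; $PARTITION can be used to find the real number) out to an empty
-- staging table (a metadata-only operation), truncate the staging table,
-- then remove the now-empty boundary. A boundary for the next day would be
-- added with SPLIT RANGE.
ALTER TABLE dbo.customer_activity SWITCH PARTITION 1 TO dbo.customer_activity_purge;
TRUNCATE TABLE dbo.customer_activity_purge;
ALTER PARTITION FUNCTION pf_activity_day() MERGE RANGE ('20091208');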
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 15, 2009 at 6:36 am
We've been talking a lot about partitioning, but we're running Standard Edition and the business doesn't want to pay for Enterprise. We're kicking around some ideas on how we could implement our own partitioning.
December 15, 2009 at 6:39 am
1. How many rows do you believe you'll delete at a time?
--> The database is a star schema. We'd delete around 6 million rows from the main table; most of the 50+ other tables are much smaller. In total we'll delete upwards of 5 GB of data.
2. What are the other indexes/keys on the table?
--> The 50+ tables in the real database have many variations of indexes on them, but we've tried to keep indexing at a minimum.
3. Any triggers or indexed views on the table?
--> The database does not have any triggers or views.
Thanks much to everyone for taking the time to read through my long post, and offer up some suggestions!
December 15, 2009 at 7:36 am
amoericke (12/15/2009)
1. How many rows do you believe you'll delete at a time? --> The database is a star schema. We'd delete around 6 million rows from the main table.
Wow! How many total?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 15, 2009 at 7:51 am
We retain 7 days of history. Right now the main table has around 40 million rows. Keep in mind that we won't have nearly as much data logged on the weekends as we do on week days.
December 15, 2009 at 8:00 am
Partitioning would make that a piece of cake. Since it's not an option, try one table per day: each day, drop the oldest table, create a new one, and do some renaming. Then create a view for queries that need to aggregate across all the tables (keep the queries that only need one day's data hitting just that day's table).
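Very roughly, the shape of it would be something like this (table names, columns, and dates are just placeholders):
-- One table per day, each with a CHECK constraint covering just that day.
CREATE TABLE dbo.activity_20091215
(
    rowid         int IDENTITY(1,1) NOT NULL,
    activity_date datetime NOT NULL
        CONSTRAINT CK_activity_20091215
        CHECK (activity_date >= '20091215' AND activity_date < '20091216'),
    text_val      char(1600) NULL
);
GO

-- Nightly job, in outline: drop the oldest day's table, create the next
-- day's table, then recreate the view over whatever daily tables exist.
-- The CHECK constraints let the optimizer skip daily tables a query can't touch.
CREATE VIEW dbo.activity_all
AS
    SELECT rowid, activity_date, text_val FROM dbo.activity_20091215
    UNION ALL
    SELECT rowid, activity_date, text_val FROM dbo.activity_20091214;
    -- ...one SELECT per remaining day
GO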
Complex, but doable.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 15, 2009 at 8:12 am
Yup, that's the sort of "home-grown" partitioning we've been talking about.
The other idea was to change the clustered indexes to try to physically separate the inserts from the deletes (the FIFO idea). This seems like it'd help, but I agree that partitioning would be a much bigger win.
Thanks.
December 15, 2009 at 8:42 am
amoericke (12/15/2009)
The other idea was to change the clustered indexes to try to physically separate the inserts from the deletes (the FIFO idea). This seems like it'd help, but I agree that partitioning would be a much bigger win.
From personal experience, it doesn't help all that much. The problem is locking. With 6 million rows to delete, row locks take way, way too much memory, so SQL will likely escalate to a table lock, and that hurts the inserts. You can force page or row locks, but then watch the memory impact. Running out of lock memory is not a fun thing.
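(To illustrate the hint side of that, forcing page locks on the purge delete would look something like this; the table and column names are placeholders carried over from the earlier sketches:)
-- Illustration only: ask for page-level locks on the purge delete.
-- Lock escalation to a table lock can still kick in once enough locks pile up.
DELETE FROM dbo.customer_activity WITH (PAGLOCK)
WHERE activity_date < DATEADD(day, -7, GETDATE());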
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability