October 1, 2015 at 5:05 am
I'm just starting to get into this topic and looking for general advice on how to set up the partitions.
I have a transaction table with 50 million records that's very hard to query. It holds data for the last 4 years, but the application only ever looks at the last 6 months, so I believe this is an ideal candidate for partitioning.
Would it be better to
1) Create a partition for each year of data, so I'd have 2015, 2014, 2013, 2012?
2) Create monthly partitions for this year's data and yearly partitions for the rest, so I'd have Jan, Feb, March, April, May..., 2014, 2013, 2012?
For 1), would you have to perform some maintenance at the turn of each year to accommodate the next year's data? For 2), although this would give better performance, since queries mostly hit the last 6 months, wouldn't it mean more maintenance to move the month data into a year partition at the turn of the year and then create the next year's month partitions?
Maybe I'm overcomplicating things and the norm is just to create year partitions. What's the general feeling on this?
October 1, 2015 at 7:33 am
If I've understood correctly, then option 2 isn't possible. I believe MySQL allows partitions within partitions but I've never looked at it. With SQL Server you can have one partition scheme per table.
In your case I'd partition by month and just keep the last 6.
This can be a bit of work to set up for an existing table and it won't necessarily make your queries faster - just help you remove large amounts of data easily. If your table has a small amount of data added each day I'd be tempted to just run a job that deletes/archives rows older than 6 months.
An example of purging by partition switching is at http://sqlrambling.net/category/sql/partitioning/ based on three days of retention. It can be altered quite easily to 6 months.
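A rough sketch of a monthly setup, if it helps (the names and boundary dates are made up):
-- Hypothetical monthly partition function; RANGE RIGHT puts each boundary date into the partition on its right
CREATE PARTITION FUNCTION pfMonthly (datetime2)
AS RANGE RIGHT FOR VALUES ('2015-05-01', '2015-06-01', '2015-07-01', '2015-08-01', '2015-09-01', '2015-10-01');
GO
-- Map every partition to PRIMARY for simplicity; production setups often spread partitions over filegroups
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);
GO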
October 1, 2015 at 7:50 am
Thanks,
I thought I read somewhere that you can move data from one partition to another. So in my example, at the end of the year all the data in the month partitions could be moved into a newly created 2015 year partition, and then the 2016 data would go into the month partitions. This is new to me, so it's possible I've picked things up wrong (or possibly this was done by partition switching)?
October 1, 2015 at 8:02 am
ps_vbdev (10/1/2015)
Thanks, I thought I read somewhere that you can move data from one partition to another. So in my example, at the end of the year all the data in the month partitions could be moved into a newly created 2015 year partition, and then the 2016 data would go into the month partitions. This is new to me, so it's possible I've picked things up wrong (or possibly this was done by partition switching)?
Yes, you can move data from one partition to another, but the partitions have to have identical layouts. If you move a partition from one table to another, the two tables have to have identical structures - including the partition calculation.
What you could do, instead of using partition switching just to purge data, is switch the partition that's no longer required out to another table and then, instead of truncating that new table, move the data into a table used to store the year's data. Or you could move the data out into another database that holds the year's data, for reporting purposes.
Another option - we have several databases here where only a few days' worth of data is kept in the 'main' database and partition switching is used to remove older data at set periods. In the meantime replication moves data continually to other databases for reporting/analysis, so the long-term stuff is out of the way on another server.
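A rough outline of the switch-and-archive approach (all names made up; the staging table must be empty, match the source structure exactly, and sit on the same filegroup):
-- Switch the oldest partition out of the main table into an empty staging table (a metadata-only operation)
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Staging;
-- Move the rows into the yearly archive table, then empty the staging table for next time
INSERT INTO dbo.BigTable2015 SELECT * FROM dbo.BigTable_Staging;
TRUNCATE TABLE dbo.BigTable_Staging;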
October 1, 2015 at 8:37 am
ps_vbdev (10/1/2015)
I have a transaction table with 50 million records that's very hard to query. It holds data for the last 4 years, but the application only ever looks at the last 6 months, so I believe this is an ideal candidate for partitioning.
Actually, it's not. There's no reason other than improper code, improper criteria, or improper indexes for this to be slow. I'll also tell you that partitioning might actually make your performance worse. You need to fix the code even if you decide to partition later on.
Shifting gears a bit, partitioning should never be done for reasons of code performance because it frequently doesn't help at all. The best reasons for partitioning are to decrease the time for backups, enable piecemeal restores, and decrease the time it takes to do index maintenance. There are also a ton of caveats associated with both partitioned views and partitioned tables whose ramifications you really need to consider before you partition.
My general advice is that if you thought partitioning a table was easy, then you've done something wrong. You either missed out on some of the more important features, you're wasting disk space, or you've slowed down the processes that address the table.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 1, 2015 at 9:31 am
Thanks Jeff
Those are interesting points.
I wasn't partitioning the production table. I'm incrementally pulling each day's transactions over to an exact replica of the table in an archive database. My plan is to use this table for reporting, as querying the production table slows systems down. It's this table that I thought would benefit from partitioning, as most of the reports only look back over the last 6 months but would still need access to older data if required.
I do plan to start deleting records in production that have been archived, but this is a future project.
October 1, 2015 at 1:23 pm
ps_vbdev (10/1/2015)
It's this table that I thought would benefit from partitioning, as most of the reports only look back over the last 6 months but would still need access to older data if required.
Partitioning is not about performance, and with good indexes your queries will read only the data they need anyway. Partitioning is for easy load/remove of rows (switching partitions to other tables, not moving data from one partition to another) and index rebuilds at the partition level.
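A partition-level rebuild, for example, looks something like this (the index, table, and partition number are made up):
-- Rebuild only partition 3 of the index rather than the whole thing
ALTER INDEX IX_Transactions_TranDate ON dbo.Transactions REBUILD PARTITION = 3;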
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 5, 2015 at 2:37 am
Oh OK
Somehow I've picked this up wrong. I was under the impression that partitioning the data would give a benefit in the form of SQL Server not having to look at the older data. So, for my case of 50 million records, would I be better off re-indexing the table on the date field?
October 5, 2015 at 5:25 am
ps_vbdev (10/5/2015)
Oh OK, somehow I've picked this up wrong. I was under the impression that partitioning the data would give a benefit in the form of SQL Server not having to look at the older data. So, for my case of 50 million records, would I be better off re-indexing the table on the date field?
Probably but, to be sure, can you post the CREATE TABLE statement (along with indexes and keys) for the current condition of the table and the most commonly used query(ies)?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 5, 2015 at 6:55 am
ps_vbdev (10/5/2015)
Somehow I've picked this up wrong. I was under the impression that partitioning the data would give a benefit in the form of SQL Server not having to look at the older data.
Oh you *can* get partition elimination, but it's not guaranteed.
So, for my case of 50 million records, would I be better off re-indexing the table on the date field?
Probably, though I'd need to see the table definition and common queries to be sure.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 5, 2015 at 7:22 am
Current table create statement:
CREATE TABLE [dbo].[StockTransaction](
[Id] [nvarchar](128) NOT NULL,
[CreatedDate] [datetimeoffset](7) NULL,
[StockTransactionType] [nvarchar](max) NULL,
[UserId] [nvarchar](max) NULL,
[StockItem_Id] [nvarchar](128) NULL,
[WarehouseLocation_Id] [nvarchar](128) NULL,
[UserLocationId] [nvarchar](128) NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
Currently most queries will be based around CreatedDate & Id.
I was also looking at partitioned views at the weekend, and although there would be some initial setup time in chopping the big table into 4 smaller yearly tables in my SSIS package, I think this could be beneficial. Once the data is in it never changes, and I can point my queries at the latest table but can also create a partitioned view if we need to look at historical data.
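Something like this is what I had in mind (a rough sketch, all names made up):
-- Each yearly table would need a trusted CHECK constraint on the date column, e.g.
-- CHECK (CreatedDate >= '2015-01-01' AND CreatedDate < '2016-01-01'),
-- so the optimizer can skip tables a query can't touch
CREATE VIEW dbo.StockTransactionHistory
AS
SELECT Id, CreatedDate, StockTransactionType, UserId FROM dbo.StockTransaction2015
UNION ALL
SELECT Id, CreatedDate, StockTransactionType, UserId FROM dbo.StockTransaction2014
UNION ALL
SELECT Id, CreatedDate, StockTransactionType, UserId FROM dbo.StockTransaction2013
UNION ALL
SELECT Id, CreatedDate, StockTransactionType, UserId FROM dbo.StockTransaction2012;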
October 5, 2015 at 7:30 am
Please post the most common queries.
Also, why datetimeoffset? It's not a common data type choice, especially since it's not daylight-savings aware, which reduces the usefulness of storing a time zone.
Why is UserId an nvarchar(max)? I can't recall the last time I saw a user ID over 4000 characters long.
Why is StockTransactionType nvarchar(max)? Do you really have types that are several thousand characters long?
Why are there ID columns which are nvarchar?
Fixing what look like some design problems will get you much better performance gains than partitioning. The MAX data types are probably killing your query performance; they and the nvarchar IDs may well be why the table is so hard to query.
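Just as an illustration, and purely guessing at the real contents, a tightened definition might look more like this:
-- Hypothetical reworked definition; confirm actual value lengths and contents before using anything like this
CREATE TABLE dbo.StockTransaction (
    Id uniqueidentifier NOT NULL PRIMARY KEY CLUSTERED, -- assumption: the nvarchar IDs really hold GUIDs; note a clustered GUID key has its own fragmentation trade-offs
    CreatedDate datetime2(3) NOT NULL,                  -- assumption: a stored time zone isn't actually needed
    StockTransactionType varchar(50) NULL,              -- assumption: short type names, not free text
    UserId uniqueidentifier NULL,
    StockItem_Id uniqueidentifier NULL,
    WarehouseLocation_Id uniqueidentifier NULL,
    UserLocationId uniqueidentifier NULL
);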
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 5, 2015 at 7:33 am
ps_vbdev (10/5/2015)
Current table create statement:
CREATE TABLE [dbo].[StockTransaction](
[Id] [nvarchar](128) NOT NULL,
[CreatedDate] [datetimeoffset](7) NULL,
[StockTransactionType] [nvarchar](max) NULL,
[UserId] [nvarchar](max) NULL,
[StockItem_Id] [nvarchar](128) NULL,
[WarehouseLocation_Id] [nvarchar](128) NULL,
[UserLocationId] [nvarchar](128) NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
Currently most queries will be based around CreatedDate & Id.
I was also looking at partitioned views at the weekend, and although there would be some initial setup time in chopping the big table into 4 smaller yearly tables in my SSIS package, I think this could be beneficial. Once the data is in it never changes, and I can point my queries at the latest table but can also create a partitioned view if we need to look at historical data.
Looking at this table, the first thing that I would suspect would be implicit conversions occurring with almost every query.
What is contained in the ID field? Is this actually numeric data?
Why is there a need to use nvarchar(max) for the StockTransactionType and the UserID fields? This seems abnormally large for this data.
Do you have sample queries and data that you can also post?
Are there no other indexes besides the primary key?
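As a made-up illustration of the implicit conversion issue with nvarchar keys:
-- Comparing an nvarchar column to an int forces the column side to be converted (int has higher precedence): likely a scan
SELECT Id FROM dbo.StockTransaction WHERE Id = 12345;
-- Comparing to a matching Unicode literal leaves the column alone: a seek is possible
SELECT Id FROM dbo.StockTransaction WHERE Id = N'12345';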
Sorry, I must have been typing at the same time Gail clicked "Post"
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
October 5, 2015 at 8:15 am
It's another team that's in charge of the source db, so I'm not sure why these data types have been selected. I know C# GUIDs are used all over the place for IDs (UserId & StockTransactionType), but they never exceed 36 characters. Bearing in mind I'm pulling this data over into my reporting db, which I'm just building, the only query running against it just now is:
SELECT
DATENAME(MONTH, pish.[InductedDate]) + ' ' + CAST(YEAR(pish.[InductedDate]) AS CHAR(4)) [InductedMonth],
CAST(pish.[InductedDate] AS DATE) [InductedDate],
DATEPART(hh,pish.[InductedDate]) [InductedHour],
pish.[Barcode] [InductBarcode],
pish.[ItemType] [InductItemType],
pish.[ItemDefinition] [InductItemDefinition],
pish.[SKU],
pish.[SizeCode],
pish.[MovingTo]
FROM
[Picking].[InductScanHistory] pish
WHERE
DATEPART(isowk, pish.[InductedDate]) >= DATEPART(isowk, GETDATE())-1 AND pish.[InductedDate] <= DATEADD(ms, -3, DATEADD(dd, DATEDIFF(dd, 0, GetDate()), 0))
ORDER BY
pish.[InductedDate]
but anything else will most probably be filtered by InductedDate and grouped by Barcode, ItemType, and SizeCode.
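I was wondering if an index along these lines would help (just a guess on my part, based on the query above):
CREATE NONCLUSTERED INDEX IX_InductScanHistory_InductedDate
ON [Picking].[InductScanHistory] ([InductedDate])
INCLUDE ([Barcode], [ItemType], [ItemDefinition], [SKU], [SizeCode], [MovingTo]);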
October 5, 2015 at 8:35 am
WHERE
DATEPART(isowk, pish.[InductedDate]) >= DATEPART(isowk, GETDATE())-1 AND pish.[InductedDate] <= DATEADD(ms, -3, DATEADD(dd, DATEDIFF(dd, 0, GetDate()), 0))
That is killing you.
You have overcomplicated this. You are trying to get the data for the previous 7 days, correct?
Declare variables, do your calculations on them, and then do the comparison. When you do conversions/date math/etc. on the columns in your WHERE clause, the predicate is not "sargable". The optimizer cannot use an index for it, so it likely does a scan.
One possible way to do this:
DECLARE @StartDate datetime = CONVERT(datetime, CONVERT(date, DATEADD(day, -7, GETDATE()))) -- midnight 7 days ago
DECLARE @EndDate datetime = CONVERT(datetime, CONVERT(date, DATEADD(day, 1, GETDATE())))   -- midnight tomorrow, used as an exclusive upper bound
WHERE pish.[InductedDate] >= @StartDate AND pish.[InductedDate] < @EndDate
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/