Data Size Collection and Analysis
By now you should be well acquainted with this phenomenon we call TSQL Tuesday. This party is being hosted this month by Jes Schultz Borland (Twitter | Blog). She has challenged us to write something about aggregations and aggregation functions in SQL Server.
I stewed on this topic for a while trying to figure out something that would be relevant yet a little unique. I think that is the real challenge – finding some application of the topic that may be somewhat unique or at least informative for somebody.
It finally dawned on me that I already have a topic in the queue waiting to be written. It was supposed to be a follow-up to my entry for last month. (You can read that entry here.) I concluded that entry with an admission that I hurried through the article to get it done in time. Well, I had been hoping to find the time to write up the rest of my process, and now it fits quite well with this month's theme.
Recap
Quickly, let’s recap what I did in that post and then I will proceed from there to tie this month to last month (as far as TSQL Tuesday goes).
In last month's entry, I shared a script (an ugly one) that I created to cycle through all of the tables and columns of a database and gather some relevant data about the size of the data in my tables. Well, actually I took a 2% sample of that information so I could run further statistical analysis. The end goal was to use real data from an existing system and the R&D databases to choose appropriate data sizes in the new database prior to releasing it to production. With all of that data aggregated into a staging table, I was ready to begin the next phase.
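To make the recap a bit more concrete, here is a simplified sketch of the idea for a single, known column. The real script looped over every table and column with dynamic SQL; dbo.Contact and its Phone column here are purely hypothetical names, TABLESAMPLE is just one way to grab roughly 2% of the rows (the original script may have sampled differently), and DataLenStats_Alt is the staging table referenced later in this post.

-- Hypothetical, single-column illustration of the sampling idea:
-- capture the length of roughly 2% of the values into the staging table.
INSERT INTO dbo.DataLenStats_Alt (DatabaseName, TableName, ColumnName, ColLens)
SELECT  DB_NAME(), 'Contact', 'Phone', DATALENGTH(c.Phone)
FROM    dbo.Contact AS c TABLESAMPLE (2 PERCENT)
WHERE   c.Phone IS NOT NULL;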
One thing I did not mention in that prior article was the creation of another table for this aggregation process. I didn't mention it because it used much the same process (though it ran considerably faster, since it only needed the maximum length of each column rather than a sample of every value). This table had one prime objective: to record the max length of each column of each table of each database. The structure is simple:
CREATE TABLE [dbo].[DataAnalysis](
	[DatabaseName] [sysname] NOT NULL,
	[TableName] [sysname] NOT NULL,
	[ColumnName] [sysname] NOT NULL,
	[MaxLength] [INT] NULL,        -- longest value actually stored in the column
	[ColDataType] [VARCHAR](20) NULL,
	[MaxColLength] [INT] NULL      -- declared length of the column (-1 for MAX types)
) ON [PRIMARY]
I mention this table now because I will be using it in my final aggregation.
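For completeness, here is a hedged sketch of how one row of DataAnalysis could be populated for a single, known column; the actual process iterated over every database, table, and column (again with dynamic SQL), and dbo.Contact and Phone remain placeholder names. The declared length and data type come from sys.columns and sys.types, while the observed maximum comes from the data itself.

-- Hypothetical single-column illustration of populating DataAnalysis.
INSERT INTO dbo.DataAnalysis
        (DatabaseName, TableName, ColumnName, MaxLength, ColDataType, MaxColLength)
SELECT  DB_NAME(),
        'Contact',
        'Phone',
        (SELECT MAX(DATALENGTH(Phone)) FROM dbo.Contact), -- longest value actually stored
        t.name,
        c.max_length                                      -- declared length; -1 means a MAX type
FROM    sys.columns AS c
        INNER JOIN sys.types AS t
            ON t.user_type_id = c.user_type_id
WHERE   c.object_id = OBJECT_ID('dbo.Contact')
        AND c.name = 'Phone';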
Statistics and Aggregation
For my data analysis and trending, I wanted to find a simple distribution across quartiles. A quartile is one of the three values that divide a range of data into four equal parts; quartiles are used in statistical analysis and commonly appear in box plots. Another statistical value that works well alongside the quartiles is the MAX value. Since I had usable data from which to produce my quartile ranges, I used the following query to aggregate it further, taking advantage of a function that assigns the quartile.
SELECT DLA.DatabaseName, DLA.TableName, DLA.ColumnName, DA.ColDataType
	,DLA.ColLens, COUNT(DLA.ColLens) AS NumOccurence
	,CASE WHEN DA.MaxColLength = -1 THEN 2000000 ELSE DA.MaxColLength END AS MaxSupportedLen
	,DA.MaxLength AS MaxDataLen
	,NTILE(4) OVER (PARTITION BY DLA.DatabaseName, DLA.TableName, DLA.ColumnName ORDER BY DLA.ColLens) AS Quartile
INTO DataStats
FROM DataLenStats_Alt DLA
	LEFT OUTER JOIN DataAnalysis DA
		ON DLA.DatabaseName = DA.DatabaseName
		AND DLA.TableName = DA.TableName
		AND DLA.ColumnName = DA.ColumnName
GROUP BY DLA.DatabaseName, DLA.TableName, DLA.ColumnName, DLA.ColLens, DA.MaxColLength, DA.MaxLength, DA.ColDataType
ORDER BY DLA.DatabaseName, DLA.TableName, DLA.ColumnName, DLA.ColLens
From this script, you can see that I am taking data from the first two staging tables and dumping it into a third table. With this query I am taking advantage of the NTILE() function that is available in SQL Server (2005 and up). And since the accumulated data spans more than one table, more than one database, and more than one column, I needed to partition that function on those attributes.
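If NTILE() is unfamiliar, a tiny standalone example (unrelated to the staging tables) shows what it does: every row in a partition is assigned to one of N roughly equal, ordered buckets, which is exactly how each distinct length above ends up tagged with a quartile number from 1 to 4.

-- Standalone illustration of NTILE(4): eight lengths split into four ordered buckets.
-- (The VALUES row constructor requires SQL Server 2008 or later.)
SELECT  v.ColLen,
        NTILE(4) OVER (ORDER BY v.ColLen) AS Quartile
FROM    (VALUES (3),(4),(5),(6),(7),(8),(10),(35)) AS v(ColLen);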
With this data now available, I could see the trend of the data for any data field that may have been collected. For instance, if I wanted to figure out the proper size (based on current data) of the phone number in my new database, I could now query the DataStats table like this:
SELECT * FROM DataStats DS
WHERE DS.ColumnName LIKE '%phone'
I would then be able to see where that field exists and how the data is distributed across the quartiles in comparison to the max data size for that field. This makes it possible to assign data sizes more intelligently, based on existing data. I could quickly ascertain, for instance, that most of the data falls within the second quartile, that there are only a few outliers in the third and fourth quartiles, and perhaps that there is an extreme case where the max is way out of scope by comparison. At that point I could make an educated judgment call on an appropriate size based on distribution, outliers, and risk.
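To turn that distribution into a sizing decision a little faster, a quick roll-up per quartile can help. The query below is just one way to slice the DataStats table (it is not part of the original scripts): it shows the longest length and the number of sampled occurrences that landed in each quartile, alongside the overall max and the declared maximum.

-- Summarize sampled lengths per quartile for any phone-number column.
SELECT  DS.DatabaseName, DS.TableName, DS.ColumnName, DS.Quartile,
        MAX(DS.ColLens)         AS MaxLenInQuartile,
        SUM(DS.NumOccurence)    AS SampledOccurrences,
        MAX(DS.MaxDataLen)      AS MaxDataLen,
        MAX(DS.MaxSupportedLen) AS MaxSupportedLen
FROM    DataStats DS
WHERE   DS.ColumnName LIKE '%phone'
GROUP BY DS.DatabaseName, DS.TableName, DS.ColumnName, DS.Quartile
ORDER BY DS.DatabaseName, DS.TableName, DS.ColumnName, DS.Quartile;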
Conclusion
This exercise was a particularly challenging one, largely because of the desire to create quartiles for the analysis. I had wanted to break the data down by standard deviations for further analysis (and still may). This approach is highly useful during the R&D or development phase; I wouldn't run the query from the first post on a production system because it is a long-running and resource-intensive process (I still need to optimize it). This kind of script can also really help you get better acquainted with your data. I learned a lot by doing this and am looking forward to seeing how I can improve upon it.