Introduction: The Concept of Dynamic Partitioning and Dynamic Indexing
The idea of this article is to extend the use of database partitioning and indexing features so that they work dynamically. Partitions and indexes play an important role during performance tuning. The idea here is to make partitioning and indexing behave more intelligently, based on the immediate need for the data and on trend changes, which alter the requirements for viewing and processing data from time to time. This can be achieved by frequently changing the priority/ranking/grouping of the data. The dynamic partitioning and dynamic indexing concept focuses on improving application performance by catering to changes in the currently demanded set of data.
Definitions of Partitions and Indexes
Partitioning - A partition is a division of a logical database or its constituent elements into distinct, independent parts. Database partitioning is normally done for manageability, performance, or availability reasons. Partitioning can be done either by building separate smaller databases (each with its own tables, indexes, and transaction logs) or by splitting selected elements, for example just one table.
Indexes - A database index is a data structure that improves the speed of operations on a table. Indexes can be created using one or more columns, providing the basis both for rapid random lookups and for efficient ordered access to records.
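For illustration, here is a minimal example in Oracle-style SQL (the table, column, and index names are hypothetical):

```sql
-- A range-partitioned sales table: each partition is a distinct physical unit
-- that can be managed, scanned, and stored independently.
CREATE TABLE sales (
    sale_id   NUMBER,
    sale_date DATE,
    amount    NUMBER(10,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p2006 VALUES LESS THAN (DATE '2007-01-01'),
    PARTITION p2007 VALUES LESS THAN (DATE '2008-01-01')
);

-- An index on the date column: speeds up random lookups and ordered scans.
CREATE INDEX idx_sales_date ON sales (sale_date);
```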
The need to cater to constant change in the demand for data
Let us consider the online share market as a case study for illustration purposes. Data reading and processing in a share-market application vary throughout the year. Particular shares or types of shares, grouped by industry, market, market cap, or segment, or driven by political decisions, government decisions, key players' decisions, and so on, may become interesting for some reason. Suddenly, thousands upon thousands of users are eager to browse the related information and to trade those shares.
From the database point of view, we can observe three things. First, you cannot be sure which types of shares people will look for; the demand is unpredictable, so data partitioning/grouping cannot be fixed firmly on specific data ranges, and it varies from time to time. Second, there may suddenly be a huge number of hits, and the database has to serve a huge volume of read operations on a specific set of data. Third, there may suddenly be a huge number of insert/update hits, and the database has to serve data manipulation plus read operations as fast as possible on the same or similar groups/sub-groups of data.
Similarly, consider a garments shop. Garment sales vary throughout the year depending on the season: in summer it is mostly cotton clothes, shorts, and the like; in winter, woolen clothes; and in the rainy season, rain-proof and rain-resistant clothes. In this case you are sure that most people will search for a few particular product types in each season, for a period of two to three months of the year. This is predictable data demand. From the database point of view, we can observe two things. First, you are sure that in particular months the transaction hits will be concentrated on a few particular types of garments. Second, after two to three months, other product types come under heavy transaction load while the current product types go out of scope.
The need for dynamic partitioning/indexing
Keeping the above two scenarios in mind, data reading/processing requirements change from time to time. The current trend may hold good for a few days or weeks, and the data behind that trend becomes more important and demands high priority in its servicing. Since these trends are unpredictable and/or vary with time, it may not be possible to come up with a suitable fixed range partitioning on the tables. Instead, the data must be checked frequently to decide which rows are currently the most important. Here we can introduce a process that analyzes database reads and data manipulation through the database's internal logs, and that analyzes physical changes to the table data. The result of this analysis gives clarity on the most heavily transacted data sets. The process should then produce a set of recommendations pointing out which data is used most frequently and, as a next step, partition and/or index the data based on these observations, thereby logically and physically grouping the most used data in one place, like clustered data. By doing this you can be sure that, from the very next moment, the database is able to cater to the current trends in data demand. This automated process should be executed frequently, on a need basis.
Implementation of the Concept
Now let us get on to an implementation of the concept, along with the issues and points to keep in mind during implementation. We will discuss two types of data demand: predictable and unpredictable.
Please note that this is only one way of practically implementing the dynamic partitioning/indexing concept under discussion; there may be much better ways. In this article, the focus is on the concept more than on its implementation.
Predictable demand of data
Assume that you are sure about the demand for data, and that it depends on some set, range, or category. Data demand may also depend on a particular week, month, or quarter of the year, during which a particular set of data becomes very important. In this case, we create a table that assists in changing the priority/ranking of the rows of the affected table(s) according to the time period. An automated process should run periodically, check the current time period, and change the partitioning/priority/ranking of the data sets in the tables accordingly. This is illustrated below.
Let us take up the garments shop example. tblGarmentsDetails is a table that stores information about all garments in the shop in a denormalized structure. A table tblPartitionDecision is created for recording the data-range changes to be made to the table partitions: the table name, the column on which the partition is defined, the partition name, the data-range values, the effective-from date, and the effective-to date.
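A minimal sketch of this control table in Oracle-style SQL (the column names follow the demonstration data below; the data types are assumptions):

```sql
-- Control table driving the dynamic repartitioning process.
CREATE TABLE tblPartitionDecision (
    colTableName  VARCHAR2(128)  NOT NULL,  -- table whose partition is managed
    colColumnName VARCHAR2(128)  NOT NULL,  -- partitioning column in that table
    colPartName   VARCHAR2(128)  NOT NULL,  -- name of the managed partition
    colValues     VARCHAR2(4000) NOT NULL,  -- comma-separated values for the partition
    colFrom       DATE           NOT NULL,  -- start of the period this row applies to
    colTo         DATE           NOT NULL   -- end of that period
);
```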
A scheduled process should be designed to scan the tblPartitionDecision table and dynamically recreate partitions using the data ranges defined in the colValues column, on the column named in colColumnName, for the table named in colTableName. Some demonstration data is given below.
tblPartitionDecision table
colTableName | colColumnName | colPartName | colValues | colFrom | colTo
tblGarmentsDetails | colProductType | parGarment1 | 14,15,16 | 01-Apr-07 | 30-Jun-07
tblGarmentsDetails | colProductType | parGarment1 | 7,8,9,10 | 01-Jul-07 | 31-Dec-07
tblGarmentsDetails | colProductType | parGarment1 | 1,2,3,4 | 01-Jan-07 | 31-Mar-07
... | ... | ... | ... | ... | ...
A scheduled automated process should work through each row of the tblPartitionDecision table and change the priority (i.e., the partitioning of the data) of the table named in colTableName, based on colColumnName (the column in that table) and colValues (the values to be placed in the partition on that column), for the time period between the colFrom and colTo column values. It should reset the priority to general when rows move out of the specified time-period focus.
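A sketch of such a process in PL/SQL, assuming the two-partition layout described in the next paragraph (the hot partition named in colPartName plus a default partition, here called parOther); the procedure and job names are hypothetical, and the merge-then-split rebuild strategy is one choice among several:

```sql
CREATE OR REPLACE PROCEDURE prcApplyPartitionDecision AS
BEGIN
    -- Pick the decision row(s) whose validity period covers today.
    FOR r IN (SELECT colTableName, colPartName, colValues
                FROM tblPartitionDecision
               WHERE SYSDATE BETWEEN colFrom AND colTo)
    LOOP
        -- Fold the previous hot values back into the default partition
        -- (assumes the hot partition already exists from the last run).
        EXECUTE IMMEDIATE
            'ALTER TABLE ' || r.colTableName ||
            ' MERGE PARTITIONS ' || r.colPartName || ', parOther' ||
            ' INTO PARTITION parOther UPDATE INDEXES';
        -- Carve the currently hot values out into their own partition.
        EXECUTE IMMEDIATE
            'ALTER TABLE ' || r.colTableName ||
            ' SPLIT PARTITION parOther VALUES (' || r.colValues || ')' ||
            ' INTO (PARTITION ' || r.colPartName ||
            ', PARTITION parOther) UPDATE INDEXES';
    END LOOP;
END;
/

-- Schedule the procedure to run daily, e.g. at 02:00.
BEGIN
    DBMS_SCHEDULER.create_job(
        job_name        => 'jobApplyPartitionDecision',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'prcApplyPartitionDecision',
        repeat_interval => 'FREQ=DAILY;BYHOUR=2',
        enabled         => TRUE);
END;
/
```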
Create two partitions for the tblGarmentsDetails table: one small partition to store the few product types that will be in use during the period defined by the colFrom and colTo columns, and another partition to store all other product types. With this layout, data retrieval/manipulation can be much faster, because the data required in the defined time period is sliced/filtered out and grouped together in one partition.
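In Oracle-style SQL, this two-partition layout could look as follows (the column list is trimmed for illustration; the product-type values come from the demonstration data above):

```sql
-- Denormalized garments table, list-partitioned on product type:
-- the small hot partition holds the product types currently in season,
-- everything else falls into the default partition.
CREATE TABLE tblGarmentsDetails (
    colGarmentId   NUMBER,
    colProductType NUMBER,
    colDescription VARCHAR2(200)
)
PARTITION BY LIST (colProductType) (
    PARTITION parGarment1 VALUES (14, 15, 16),  -- per the 01-Apr-07..30-Jun-07 row
    PARTITION parOther    VALUES (DEFAULT)
);
```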
Materialized view / Indexed view approach
We can also consider using a materialized view (an indexed view on SQL Server) for this purpose. Let us keep one materialized view for the above example: create it by selecting all columns from the tblGarmentsDetails table, with a WHERE condition that filters on the colValues values of the colColumnName column. The optimizer can then make use of the materialized view when retrieving the commonly used data.
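A sketch in Oracle-style SQL (the view name is hypothetical; ENABLE QUERY REWRITE lets the optimizer substitute the view for matching queries against the base table):

```sql
-- Precompute the currently hot slice of the garments data.
CREATE MATERIALIZED VIEW mvHotGarments
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT *
  FROM tblGarmentsDetails
 WHERE colProductType IN (14, 15, 16);  -- the colValues for the current period
```

On SQL Server, the equivalent would be an indexed view materialized by creating a unique clustered index on it; there the view is maintained automatically as the base table changes.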
Unpredictable demand of data
Let us take up the online share market example presented above. Since the data currently in demand may cut across segments, types, industries, market caps, etc. of the share-market world, it may not be possible to identify specific column(s) to use for a database partition/index. We need an automated process that constantly checks the current data requirements, as illustrated below.
Let us assume there is a data warehousing project on the share market, and assume the following points for discussion's sake.
- There is a Shares data warehousing database that stores the daily transactions of each company in a fact table. The fact table is surrounded by a few dimensions.
- Among the dimensions is a dimShareDetails table, which contains the properties of each company, such as its sector, market cap, SIC code, year end, etc.
- It is a read-intensive database; many reads are expected from BI tools and from the stored procedures of other customized tools.
- All reads go through a single stored procedure whenever data is required from the Shares database, so that an entry can be made recording the type/category of data being queried. Let us call the table that records these query counts tblSharesReadClass. For each day, it records the number of queries against each classification of the queried data (sector, market cap, SIC code, year end, etc.). This table can be summarized to find out which classification of data is most in demand/accessed at any point in time, as sketched below.
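A sketch of the logging table and the read-funnel procedure in Oracle-style PL/SQL; the names and data types are assumptions, and the actual data retrieval that would follow the logging is omitted:

```sql
-- One counter row per day per classification value queried.
CREATE TABLE tblSharesReadClass (
    colQueryDate      DATE          NOT NULL,  -- day of the read
    colClassification VARCHAR2(64)  NOT NULL,  -- e.g. 'SECTOR', 'MARKET_CAP'
    colClassValue     VARCHAR2(128) NOT NULL,  -- e.g. 'IT Sector'
    colReadCount      NUMBER        NOT NULL
);

CREATE OR REPLACE PROCEDURE prcLogSharesRead(
    pClassification IN VARCHAR2,
    pClassValue     IN VARCHAR2) AS
BEGIN
    -- Upsert today's counter for the queried classification value.
    MERGE INTO tblSharesReadClass t
    USING (SELECT TRUNC(SYSDATE) AS d,
                  pClassification AS c,
                  pClassValue     AS v
             FROM dual) s
       ON (t.colQueryDate = s.d
           AND t.colClassification = s.c
           AND t.colClassValue = s.v)
    WHEN MATCHED THEN
        UPDATE SET t.colReadCount = t.colReadCount + 1
    WHEN NOT MATCHED THEN
        INSERT (colQueryDate, colClassification, colClassValue, colReadCount)
        VALUES (s.d, s.c, s.v, 1);
END;
/
```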
Now, based on an analysis of tblSharesReadClass, identify the top x classifications or types of data, and re-create the partition or index based on this analysis. More rules can be included for analyzing the read counts; if possible, include the database system tables/logs in the analysis. An automated process can be created to perform the above steps, and the process can have its own set of rules to determine data priority.
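For example, the "top x" analysis over the last week could be a simple aggregation (the FETCH FIRST clause requires Oracle 12c or later):

```sql
-- Most-read sector values over the last seven days:
-- candidates for the priority partition or index.
SELECT colClassValue,
       SUM(colReadCount) AS totalReads
  FROM tblSharesReadClass
 WHERE colClassification = 'SECTOR'
   AND colQueryDate >= TRUNC(SYSDATE) - 7
 GROUP BY colClassValue
 ORDER BY totalReads DESC
 FETCH FIRST 5 ROWS ONLY;
```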
Later, based on the ranking or priority of each category of data, the corresponding data partition can be supported by appropriate disk/memory configurations, or the data can simply be divided into multiple partitions for better performance, without any ranking or hardware support. For example, in recent days transactions in a few sectors such as the IT and steel sectors may be heavy compared to other sectors; the process then simply partitions this particular group of shares' data as priority-one data.
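Giving the priority partition better hardware can then be as simple as moving it to a tablespace on faster disks. A sketch in Oracle-style SQL, where the table, partition, and tablespace names are all hypothetical (UPDATE INDEXES with MOVE PARTITION requires Oracle 12c or later; on older versions, global indexes would need a rebuild afterwards):

```sql
-- Place the heavily transacted sectors' data on faster storage.
ALTER TABLE tblShareTransactions
  MOVE PARTITION parHotSectors
  TABLESPACE tbsFastSSD
  UPDATE INDEXES;
```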
The whole process can be automated, including dropping and recreating the partitions with the assistance of a few more tables, as discussed in the "Predictable demand of data" section. Alternatively, we can restrict the process to performing the analysis, and leave it to human intervention to change the partitions/indexes.
Points to consider before going ahead with this idea in live projects
Consider the database's caching and statistics mechanisms/concepts before deciding on this analytical dynamic database partitioning approach; they may already cover part of the need.
Consider the practical facts of its implementation, such as running scheduled batch jobs, and any column additions and data updates involved.
Enhancements to Database Internals
The implementation of this idea can also be considered inside the database engine itself, by developing the logic in the database internals. A parameter could be provided at the database/session level to enable or disable this feature. If the database has the internal logic to regularly assess and prioritize the in-demand data, this concept may yield even more benefits.
Conclusion
By analyzing the behavior of data requests, giving importance to trend changes over time, and introducing this intelligence within the application, we can achieve more with the existing infrastructure and improve users' productivity and satisfaction.