Sustaining the steadily increasing work load is essential for an application.
If the application uses database as a repository to store data, then that
database architecture must be flexible enough to withstand the different ways
and means of improvising the scalability. Database’s sustainability and
adaptability for the software and hardware level changes, will minimize the work
involved in changing your software application.
Scalability Vs Availability
Let’s take an example, where you have a web mart which runs on IIS using SQL
Server as a data repository. To serve your day by day increasing number of
in-coming traffic 24/7, you need to have your system up and running all the time
with an efficient load management at the middle-tier as well as at the data-tier
levels.
To achieve the scalability and availability at the middle-tier level, Windows
provides network load balancing (NLB), which is part of Windows 2000 Advanced
and Datacenter editions. Network load balancing distributes IP traffic to
multiple instances of web servers each running on a host within the cluster,
from the client perspective the cluster appears to be a single server through
the virtual IP Address. NLB can host 32 servers in a cluster to provide
scalability and if it detects any failure in one of those servers, it can
re-route the request to rest of the servers and hence NLB assures that your
middle-tier services are highly available and scalable.
To achieve the availability at the data-tier level, Windows provides
Microsoft Cluster Service (MSCS). MSCS manages group of nodes that are
responsible for hosting your SQL Server database. Microsoft cluster is a shared
nothing cluster, means at a time only one node can host a database. If one of
the SQL Server nodes goes down, MSCS can restart the SQL Server instance in
another node. MSCS does not balance the load on your SQL Server database. To
balance the load on your SQL Server, consider scaling-up and scaling-out
solutions for your database environment.
Scaling-up is increasing the resources like memory, CPU speed, etc., Scaling
out is adding additional computers and leveraging the benefits of parallel
computing. Scaling- up is vertical solution and scaling out is horizontal
solution.
Scaling-up is more at hardware level and it doesn’t require any changes in
your database design. But, if your database workload demands more processing
that you cannot achieve by keep on adding additional resources to your single
database server then consider linking additional servers.
Scale Out Solutions
Microsoft SQL Server supports two different types of scale-out solutions;
adaptability of these two depends on your environment, database design time and
the cost of maintainability.
First solution is at the database level. This involves creating similar
databases in all the servers and timely synchronizing data for the transactional
consistency. The database can be based on regions, for example your sales
database in US can serve America customers and similarly the replica of the
sales database in London can serve Europe customers. This way you can minimize
the traffic based on regions. This approach is called database replication, the
disadvantage in replication is the time involved in doing the synchronization to
achieve transactional consistency and the complexity involved in doing the
coordinated backup and recoveries between the replicated databases.
Second solution is at the table level. This involves breaking a huge table
into smaller tables and creating schematically similar smaller tables on
individual database servers and connecting them through an updatable view. The
concept of linking individual servers together to provide a consolidated view of
data is called database federation. Since data of one or more tables is accessed
through an abstracted layer (views), this solution provides transactional
consistency and with the SQL Server support for SAN-VI architecture, the
federated database solution assures the maximum scalability.
This article focuses on the second scale-out solution.
Database Federation
Database federation can be defined as a group of autonomous servers that work
together to take the work load. Each server in a group is managed independently
and they can have different maintenance procedures. This group of servers does
not have any physical private inter-connect like in MSCS. Servers in these
groups can be geographically apart and logically linked.
Database federation can connect number of heterogeneous data sources.
Microsoft OLEDB plays a vital role in linking and providing uniform access to
data. With the advent of .Net and its newly defined ADO.NET, you can even link
web service with your relational data. In simple words, database federation
helps you to consolidate heterogeneous data in a monolithic fashion without
transforming data from one form to another.
Setting up the Link
To link SQL Server B to SQL Server A, execute the sp_addlinkedserver stored
procedure on server A as shown below:
Use master Go Exec sp_addlinkedserver ‘ServerB’, N’SQL Server’
When you explicitly specify that you are linking to a SQL server through the
second parameter of the sp_addlinkedserver stored procedure, you do not need to
specify other parameters such as OLEDB provider name, data source and provider
string etc. SQL Server by default uses SQLOLEDB provider for the linkage.
Simply, linking to a server may not let you access the data in the remote
server because the remote server characteristics can be different from your
local server characteristics. These characteristics are security context,
collation etc.
To get the list of linked servers of a local server, execute sp_linkedservers
stored procedure on a local server.
Managing the Security Context
When you establish the link, by default the connection’s security context is
used to access the remote server. The user account which is used in the local
server connection may not exist in the remote server. For this map a local
connection account to a remote server account. Use sp_addlinkedsrvlogin stored
procedure to control the security context of linked servers.
The following example maps domain\localuser of server A to an account in
server B
sp_addlinkedsrvlogin ‘ServerB’, ‘false’, ‘domain\localuser’, ‘ServerB_User’,’ServerB_UserPassword’
To map all the accounts of server A to a specific account in server B use
sp_addlinkedsrvlogin as shown below:
sp_addlinkedsrvlogin ‘ServerB’, ‘false’, NULL, ‘ServerB_User’,’ServerB_UserPassword’
If the second parameter is true, then the local server security context is
used to connect to the remote server. If your local connection uses Windows
authentication, your local server must support security account delegation. To
delegate an account, you must be running Windows 2000 with Kerberos support
enabled and you must be using Active Directory services of Windows 2000. For the
delegation to work enable and disable the following options in your AD domain.
Enable these two:
- Account is trusted for delegation
- Computer is trusted for delegation
And disable the option Account is sensitive and cannot be delegated
To drop a specific local login mapping with the linked server, execute the
stored procedure sp_droplinkedsrvlogin on the local server.
Use Master
Execute sp_droplinkedsrvlogin ‘ServerB’, ‘locallogin’
To remove the default login mapping originally created by sp_addlinkedserver
execute sp_droplinkedsrvlogin as follows:
Use Master Execute sp_droplinkedsrvlogin ‘ServerB’, NULL
Linked Server Options to Improve Query Performance
When you link more than one server for scalability the performance of the
query execution is very important. Consider the following server options, to
improve the query performance.
Use master sp_serveroption ‘serverB’, ’lazy schema validation’, ‘true’
When you turn on the lazy schema validation option, the SQL Server query
engine will not check the schema until the query execution. If the change of the
schema in the remote table is detected at the execution time, the query will
fail with an error.
Similarly, turn on collation compatible option when the linked server
character set and sort order is same as local server. By turning on this option,
local server delegates the character column comparison to the OLEDB provider
used for the remote table. Because OLEDB provider caches all the remote table
data, the comparison can be performed at the provider level using internal index
mechanism of the provider. If this option is false all row data will be sent to
local server for the comparison, this will increase the amount of data transfer
over the network and will impact the performance.
Use master sp_serveroption ‘serverB’, ‘collation compatible’, ‘true’
Queries involving bit and uniqueidentifier columns are always evaluated
locally and cannot be delegated to the provider
Disconnecting the linked server
To drop the linkage use the sp_dropserver stored procedure as shown below:
Use master Go Exec sp_dropserver ‘B’, NULL
If you do not specify the second parameter, the default value of NULL will be
assumed..
To drop the linkage and associated remote logins from the local server,
execute sp_dropserver stored procedure as shown below:
Exec sp_dropserver ‘B’, ‘droplogins’
Partitioning data
After you set up the linkage, the next step is partitioning your data. This
involves, breaking a huge table into smaller tables and distributing across the
linked servers. These smaller tables will have the identical schema except the
validation criteria on columns on which the whole data partition gets devised.
These columns are called as data partition columns (DPC). So when you partition
data, note down the following:
- DPC needs to be part of your primary key constraint and hence all the
primary key column rules will apply to DPC.
- Columns in the base table cannot have default values.
- If the base table references a foreign key, keep the foreign key table’s
partition in the same server along with the base table.
Partition can be symmetric or asymmetric. Symmetric partition is an idle
partition to achieve, in symmetric partition data is partitioned uniformly
across the member servers. For example, consider a system where a customers
table is partitioned across Server A(CustomerID from 1 through 3299999) and
ServerB (CustomerID from 3300000 through 6599999). When you have more complex
data access patterns, you may not be able to achieve symmetric data partition.
In this case, asymmetric partition helps. In an asymmetric partition, the tables
which are good candidate for partitioning get partitioned, but the tables which
do not qualify for partitioning stays in the local server. This way you moved
some data to linked servers and hence minimized the work load on the local
server. In other words, asymmetric partitioning is better than having no
partition at all.
Accessing Partitioned Data
When the total number of data row increases and it crosses the maximum size
of your partition, you need to create a new table and also you need to change
your application to reference the newly created table. Whenever you create a new
partition, changing your application will incur overhead on time and cost.
To avoid this, SQL Server supports abstracting your data through partitioned
views. A partitioned view links one or more tables. If a base table of a view
comes from a remote server the partitioned view is called as distributed
partitioned view (DPV).
If you do not use partitioned view, you have to code the logic to locate a
member table.
For example, the below code fetches customer data from ServerB, if the
customerID falls in between 3299999 and 6599999, otherwise locale member table
will be accessed.
If(@customer_id < 3299999) BEGIN … Select * from ServerA.northwind.dbo.Customers … END … ELSE IF (@customer_id > 3299999 AND @customer_id < 6599999) BEGIN … Select * from ServerB.northwind.dbo.Customers … END …
If you are adding one more member table, the above code snippet needs to be
changed.
Instead, if you use partitioned view, you have to change the underlying view
definition to include the new member table and it does not require any change in
your application because your application references the partitioned views not
the member tables.
For example, the above code snippet will be split into two, one is DPV
definition and another one is actual code which calls the DPV.
Here is view definition
CREATE VIEW DPV_Customers AS SELECT * FROM northwind.dbo.Customers UNION ALL SELECT * FROM ServerB.northwind.dbo.Customers
And call the above DPV in your code as shown below:-
…
Here is the modified code:
SELECT * FROM DPV_Customers WHERE CustomerID=@customer_id
In the above code, SQL Server query processor routes the request and fetches
the data from ServerB. This routing mechanism is based on the underlying base
table’s CHECK constraint and the link definition. In your query always reference
the remote table using four part name as shown below.
<Server Name>.<Database Name>.<Owner Name>.<Object Name>
When you design a query for your federated environment, note down the following
constraints:
- Unlike SQL Server 7.0, in SQL Server 2000, partitioned views are
updateable. In order for your partition to be updateable, the underlying
table primary key needs to include the partitioned column, otherwise you
have to use INSTEAD OF trigger to do the update and remember to call BEGIN
DISTRIBUTED TRANSACTION statement in your trigger script.
- It is essential to use CHECK constraint on the base table’s partition
column otherwise the query optimizer will search all the tables in the
federation.
- SELECT statements INTO clause referring to a remote table is not allowed
- DDL statements like CREATE, ALTER and DROP are not allowed against the
linked server.
- Text manipulation functions like READTEXT, WRITETEXT and UPDATETEXT
statements are not allowed.
- GROUP BY ALL is not allowed when the WHERE clause is specified, but
GROUP BY without specifying ALL is supported.
- ORDER BY clause cannot be used if your query references BLOB columns of
a remote table.
- IS NULL and IS NOT NULL cannot reference BLOB columns of a remote table.
Indexed Views
If you have index on the partitioned views, query performance can be further
improved,. When you index a view the result set of the index is materialized at
the time the index is created. When you include indexed views in your query, SQL
Server will compare the cost of fetching data from the indexed view with the
cost of fetching data from the remote query and pick the cheaper strategy. If
you design indexed views carefully, the performance benefits are not limited to
queries that specify the indexed view in their FROM clause, the performance
benefits apply to any query that references data covered by the indexed view.
Indexed views depend on base tables for their data. Such dependency means
that if you change a base table contributing to an indexed view, the indexed
view might become invalid, so use indexed views on a table, whose data is not
going to change. For example, if one of your base table references the last
year’s transaction, use index view on that base table.
SAN-VI
In a federated environment, group of servers connected by a low latency and
high speed inter connect, this type of LAN or WAN is called System Area Network
(SAN). Virtual Interface (VI) architecture is used as a protocol for
communication in SAN environment. VI replaces TCP/IP, unlike TCP/IP; VI uses
direct memory copy technology between servers through VI communication channels.
Microsoft SQL Server 2000 supports VI architecture and also offers tools to work
with VIA.
For VIA to work, you must use the supported hardware like cLan from Giganet.
Refer to http://www.viaarch.org, http://www.giganet.com and http://www.Microsoft.com
for more information about VI architecture.
Backup and Restore of Database Federation
Manageability of your environment can become more challenging when more
servers are added to your federation. The backup and restore features that are
included with SQL Server 2000 become more complex when working in a federated
environment.
SQL Server federated database environment does not require that you coordinate
backups across member servers. Backups can be taken from each database
independently, without regard for the state of the other member databases.
Because the backups do not have to be synchronized, there is no processing
overhead for synchronization and no blockage of running tasks.
But you need to coordinate recoveries across member servers to restore databases
at the same logical point in time. SQL Server 2000 has made these restores
easier with the implementation of marked transactions. You can restore the
database exactly at the mark or before the mark. These marks are stored in MSDB
database.
Here is the nicer example from SQL Server Books online, which describes recovery
using transaction marks.
BEGIN TRANSACTION RoyaltyUpdate WITH MARK 'Update royalty values' GO USE pubs GO UPDATE roysched SET royalty = royalty * 1.10 WHERE title_id LIKE 'PC%' GO COMMIT TRANSACTION RoyaltyUpdate GO --Time passes. Regular database --and log backups are taken. --An error occurs. USE master GO RESTORE DATABASE pubs FROM Pubs1 WITH FILE = 3, NORECOVERY GO RESTORE LOG pubs FROM Pubs1 WITH FILE = 4, STOPATMARK = 'RoyaltyUpdate'
By restoring, using the transaction mark in all the member servers, you can
achieve transactional consistency for your data in the federated database
environment.
Conclusion
Setting up the scale out environment using database federation is a trade off
between manageability and scalability. By using, SQL Server tools, you can
reduce the complexity in managing your federated databases and can achieve the
effective scalable environment.
Also the growing research in the areas like Infiniband architecture will
answer your future growing demand for a scalable solution.
References
1. Microsoft SQL Server Books Online
2. http://www.dell.com/downloads/global/vectors/via_sql2000.pdf
3. http://www.research.ibm.com/journal/sj/414/haas.pdf
4. http://www.viaarch.org