Background
In our Enterprise Data Warehouse (EDW) environment, exact copies of many source systems exist. We pull over tables exactly as they exist in the upstream clinical systems. When a clinical system is upgraded, there are often numerous changes that occur in its database including new tables being created, tables being dropped, column types changing, columns being deprecated, etc. These changes are reflected in our EDW.
Our power user community writes SSRS reports against these tables within our EDW. Many reports will break in one way or another after a clinical system is upgraded.
Problem
Many SSRS reports will break in one way or another after a clinical system is upgraded and its database changes are reflected in our EDW.
Proposed Solution
A discrete list of database objects referenced by each report will be created (e.g. Report X references tables X, Y, and Z).
Since the vendor provides release notes and a copy of the new version exists in a test environment, the upcoming database changes have already been identified. Knowing the list of tables that have changed, we'd like to identify any reports that reference a table in that list.
This approach will be covered in four steps:
- Extracting SSRS reports from the report catalog on the report server.
- Extracting DataSet queries from those SSRS reports.
- Using a regular expression, extract any string that looks like a database object from those queries.
- Matching the resutls found in the previous step to actual objects in the database.
Step 1: Extract Report Catalog from Report Server
SSIS is used to extract report data from the report server's catalog table and store it on our EDW server. Of note is the Content column in the report server's dbo.catalog table. This column is typed as image and stores the RDL XML file for the report.
The destination table on our EDW server uses the XML data type for the Content column, which enables the use of XML methods. All other columns in the destination table are typed the same as the source table.
The source query for pulling report data from our Report Server is simply the following:
select c.ItemID, c.Path, c.CreationDate, c.ModifiedDate, c.Content
from dbo.Catalog c with(nolock)
where c.Type = 2 -- Report
Note: Microsoft promotes using web services to interact with the report catalog rather than hitting the database directly.
Step 2: Extract Queries from SSRS Reports
Now that a copy of the report catalog exists in our EDW complete with an XML column storing the report's RDL, SQL Server's built-in XML capabilities make it relatively easy to extract a report's T-SQL. The table that now stores the report data is called metadata.reports.
The T-SQL below shows using the XML nodes() method to return each DataSet in a report as a row of data. The XML value() method is then used to pull out pieces of each DataSet's XML notably the source query in the CommandText column.
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition') -- setup namespace
select
r.ItemID,-- report unique ID
r.Path,-- report path
DS.Item.value('@Name','varchar(100)') as DataSet,-- Name attribute of DataSet element
DS.Item.value('(./Query/CommandText)[1]', 'varchar(max)') as CommandText,-- DataSet's query
DS.Item.value('(./Query/DataSourceName)[1]', 'varchar(100)') as DataSource-- Connection
from metadata.reports r with(nolock)
cross apply r.Content.nodes('/Report/DataSets/DataSet') as DS(Item) -- then, join the datasets as rows
Now that each DataSet's query has been extracted, it's time to mine that text to pull out any string that resembles a database object.
Step 3: Extract Any String that Resembles a Database Object
A colleague of mine, Eric Just, developed an open source SSIS Data Flow Task component named RegExtractor (http://regextractor.codeplex.com/). The component "allows you to define a regular expression with captured matches (backreferences) that you can apply to a column in your data flow. The captured matches are output as new columns in the data flow."
So, the regular expression used in this case looks for any string that could resemble a database object in the form of: database_name.schema_name.object_name. This will cast a wide net and pull out many strings that are not actually database objects.
The actual regular expression used is: (?<database_nm>\[?\w+\]?\.)?(?<schema_nm>\[?\w+\]?)\.(?<object_nm>\[?\w+\]?)
RegExtractor will process a SQL statement and output three columns for any strings that match the regular expression: database_nm, schema_nm, and object_nm.
As an example, let's assume the following query has been used in a report titled Employee Pay to extract a list of employees from AdventureWorks and their current pay rate:
select e.EmployeeID, current_rate.Rate, current_rate.RateChangeDate
from HumanResources.Employee e
cross apply (
select top 1 eph2.Rate, eph2.RateChangeDate
from HumanResources.EmployeePayHistory eph2
where eph2.EmployeeID = e.EmployeeID
order by eph2.RateChangeDate desc
) current_rate
order by e.EmployeeID
The regular expression above would pull out rows similar to the following:
database_nm | schema_nm | object_nm |
null | e | EmployeeID |
null | HumanResources | Employee |
null | eph2 | Rate |
null | HumanResources | EmployeePayHistory |
A picture of the SSIS Data Flow Task using RegExtractor in our production environment is shown below. A total of 3,367 SSRS DataSet queries were processed resulting in over a 116,000 possible references to database objects.
Now that a wide net has been cast to pull out strings matching the regular expression noted above, the next step is to compare each string to actual database objects and only keep those that match.
Step 4: Resolve List of Strings to Actual Database Objects
Using RegExtractor in the previous step, a table now exists that lists possible objects found in our SSRS DataSet queries. An inner join to the system catalog views in the current database will return a list of potential objects that are matched to real objects.
For illustration purposes, the AdventureWorks query in Step 3 will be used in the sample below. The sample below sets up a table variable to hold our list of potential database objects extracted from the AdventureWorks query. That list of potential objects is matched to actual objects based on the schema name and object name as found in SQL Server's catalog views.
use adventureworks;
--
-- create a sample table that references a list of potential objects
declare @potentials as table (
report_nm varchar(100),
database_nm varchar(100),
schema_nm varchar(100),
object_nm varchar(100)
)
--
-- add a list of potential objects to this table
insert into @potentials(report_nm, database_nm, schema_nm, object_nm)
values('Employee Pay', null, 'e', 'EmployeeID'),
('Employee Pay', null, 'HumanResources', 'Employee'),
('Employee Pay', null, 'eph2', 'Rate'),
('Employee Pay', null, 'HumanResources', 'EmployeePayHistory')
--
-- resolve the list of potential objects to actual objects using object and schema names
select distinct p.report_nm, p.database_nm, p.object_nm, p.schema_nm
from @potentials p
inner join sys.objects o on p.object_nm = o.name
inner join sys.schemas s on o.schema_id = s.schema_id and s.name = p.schema_nm
Conclusion
Using SSIS, basic SQL, and an open source component, it's possible to extract a reasonably accurate list of objects from reports. This allows us to perform tasks such as:
- Identify reports that may be impacted by table changes.
- Identify the most commonly used tables referenced in reports.
- Identify tables used in the past 30 days that have not been used previously.
While this approach has worked well in our environment, there are some limitations noted below.
Limitations
- Objects are only resolved in one database. For our environment, this is acceptable. Your environment may need enhancements to this approach for resolving objects across different servers and databases.
- Object identification is not perfect. SSRS queries can become quite complex with multiple CTEs, sub queries, applies, etc... Resolving aliases is beyond the capability of a simple regex statement.
- Columns are not resolved. Initial object identification is done using regular expressions and not a proper SQL parser.
- False positives are possible. Since objects are initial identified using regular expressions, it's entirely possible that an author could reference a table in a comment. That table would be picked up in this approach.
- The regular expression provided above has some flaws and will not identify all possible object name variations permitted by SQL Server.
Despite these limitations, this approach took less than a day to implement and added a lot of value in understanding objects referenced by reports. Future enhancements will improve the accuracy of pulling objects from SSRS report queries.
Technical Environment
Report Server and EDW Databases: SQL Server 2008 R2
Microsfot Tools: SQL, SSIS, SSRS
Opensource SSIS DFT Component: RegExtractor
Acknowledgements
RegExtractor - open source SSIS DFT Component created by Eric Just. http://regextractor.codeplex.com/