A continuation of the data dictionary discussion with information about how the documentation is done.
Today I’m going to continue discussing the data dictionary tables we’ve set up at the Boston Public Schools. The SQL for creating the tables and inserting the baseline example records is in my previous data dictionary post. The data inserted uses a few tables in the AdventureWorks database, and I’ll continue with this example.
At BPS, there were well over a hundred TableInfo records and several thousand DataInfo records that needed to be updated. We started with documenting the tables – here’s a SQL script for getting the necessary update statements:
-- Script to get the update statements for all tables: SELECT ' update TableInfo set tableGroup = '''', description = ''' + description + ''' where tablename = ''' + tablename + '''' FROM TableInfo -- Output from running the above script: UPDATE TableInfo SET tableGroup = 'Location', description = 'Address' WHERE tablename = 'Address' UPDATE TableInfo SET tableGroup = 'Person', description = 'Contact' WHERE tablename = 'Contact' UPDATE TableInfo SET tableGroup = 'Location', description = 'StateProvince' WHERE tablename = 'StateProvince'
We spent a few days focusing on building this out as any data analysis effort will begin at the table level before the field level. At Boston Public Schools, we divided tables into a number of different groups, including Registration, Assignment, Transportation, Enrollment, Attendance, Assessment, Administrative, and a handful of others. The groupings will help identify what tables to focus on when we’re working on a particular functional area for integration into the new Student Information System. For the description column, we generally provided a 1-2 sentence high-level description of what the table is.
After finalizing the table dictionary, we started focusing on the individual fields in these tables. Here’s how we got the update statements for DataInfo along with some example output:
-- SQL to get update statements: SELECT ' update DataInfo set [description] = ''' + columnname + ''', tableref = '''', lutyperef = '''', comments = '''' where tableid = ' + CAST(ti.tableid AS VARCHAR(5)) + ' and columnname = ''' + columnname + '''' + ' -- Group: ' + tableGroup + ', Table: ' + tablename FROM DataInfo di INNER JOIN TableInfo ti ON ti.tableid = di.tableid WHERE di.[description] = columnname AND ti.tablename NOT IN ( -- This check prevents getting update statements for tables that have already been updated in some way. SELECT tablename FROM DataInfo d INNER JOIN TableInfo t ON d.tableid = d.tableid WHERE [columnname] <> d.[description] GROUP BY tablename HAVING COUNT(*) > 0 -- This threshhold can be set higher if desired. ) ORDER BY tableGroup, ti.tablename, columnorder -- first 2 update statements for table Address: UPDATE DataInfo SET [description] = 'AddressID', tableref = '', lutyperef = '', comments = '' WHERE tableid = 1 AND columnname = 'AddressID' -- Group: Uncategorized, Table: Address UPDATE DataInfo SET [description] = 'AddressLine1', tableref = '', lutyperef = '', comments = '' WHERE tableid = 1 AND columnname = 'AddressLine1' -- Group: Uncategorized, Table: Address -- first 2 update statements for table Contact: UPDATE DataInfo SET [description] = 'ContactID', tableref = '', lutyperef = '', comments = '' WHERE tableid = 2 AND columnname = 'ContactID' -- Group: Uncategorized, Table: Contact UPDATE DataInfo SET [description] = 'NameStyle', tableref = '', lutyperef = '', comments = '' WHERE tableid = 2 AND columnname = 'NameStyle' -- Group: Uncategorized, Table: Contact
Notice that there are 4 values being updated: description, tableref, lutyperef, and comments. Here’s some information about each:
- 1) Description is a phrase or sentence describing the field. We left this field unchanged if the field name was self-evident for what the column held.
2) TableRef is used to specify if the field is referenced elsewhere [typically but not always linked by a foreign key constraint]. For DataInfo, the TableRef column would be populated for column tableid.
3) LuTypeRef is used to specify if the field is a value we have stored in our primary lookup table. We have hundreds of types, and many fields throughout the database store codes that map to a particular type in this lookup table. For example, in our student language information table, all first language codes map to the ‘Language’ type in the lookup table.
4) Comments are used if there is some complicated or subtle information about the field that isn’t self-evident from the field name or the brief description. For example, the sequence number field in our primary enrollment table has a 4-5 sentence comment for what the field is used for and how to pull the active enrollment record based on the sequence number.
Have we finished the documentation process at Boston Public Schools? No, it is still ongoing, although at this point only lower priority tables remain. So far it has been a very interesting exercise – I’ve learned a good deal about the data and table groupings that I didn’t know previously.