The strength and holy grail of relational databases lies in the very name – relations. That’s what they traditionally excel in. To find a related record through several tables using proper indexes is not a time-consuming problem even if a database is really big, containing millions of records. Microsoft has put a good deal of intelligence and cunning into query optimizations, caching, indexing and execution plans to make the process of finding related records even smoother and faster.
This small article, however, will try to shatter the very sacred notion of relational databases, i.e. of linking the tables and also will attempt to show a slightly different mechanism of relating data.
The idea first came when I tried to write a simple interface that would via ADO imitate an object database behavior with an underlying SQL server database as data source. The task is by far not new, nor is there any shortage of programmatical solutions. Still, most (if not all) of those I came across required a very rigid database structure that had to be prepared beforehand, including lots of service tables, triggers, keys, etc.
But what happens if you just create objects (like Person, Address, Product, etc.) but do not know how they will be related? Or, if they already are, there is a new relation coming in? Or, in a case when you have one-to-many, the project manager comes and confesses it really should have been many-to-many. So, what I mostly tended to achieve was complete freedom from any constraints, indexes and foreign keys that usually result from building relations – the goal was: build singular atomic objects and devise some universal way of relating any object to any other object.
The solution is quite trivial and must have already occurred to you – a single intermediate table, like the ones used in many-to-many relationships. This table can serve all other tables in all possible kinds of relationships, including one-table hierarchies.
Another vital issue is the use of GUIDs (Globally Unique IDentifiers) that are generated by SQL server (or returned by a function NewID()) and are guaranteed to be unique throughout the database (actually, throughout the world). Every table has a clustered primary key on a GUID column - in this case every record is uniquely referenced by its GUID.
Here is the intermediate table creation script:
CREATE TABLE [dbo].[Relations] ( [GUIDA] [uniqueidentifier] NOT NULL , [GUIDB] [uniqueidentifier] NOT NULL ) ON [PRIMARY]
Of course, one would need indexes to speed up searches. Here they are:
ALTER TABLE [dbo].[Relations] ADD CONSTRAINT [PK_Relations] PRIMARY KEY CLUSTERED ([GUIDA]) ON [PRIMARY] CREATE INDEX [IX_Relations] ON [dbo].[Relations]([GUIDB]) ON [PRIMARY]
Making both fields the clustered primary key may be tempting but yields much worse results with heavy scans happening all the time. But here comes another problem. What if we need to query two tables, but do not know in what field A or B the first GUID is used? The trigger will take care of that:
CREATE TRIGGER Relations_Insert ON [dbo].[Relations] FOR INSERT AS SET NOCOUNT ON INSERT INTO Relations (GUIDA, GUIDB) SELECT GUIDB, GUIDA FROM inserted
This trigger will effectively flip any new record and insert the result as a new record – making the order of fields in relation absolutely irrelevant, but, obviously, duplicating the overall number of records. (Speaking of which, you may want to add ON DELETE trigger to clean the duplicate when a record is deleted)
So, what happens now? Suppose we have a Person object (which is really a table, but treated as an object to imitate object database) and an Address object. In a conventional database with Address having a foreign key to Person, the query will look like the following:
SELECT Person.Last_name, Address.Country FROM Person INNER JOIN Address ON Person.Person_GUID = Address.Person_GUID WHERE Address.Country = 'Belgium'
In the case with the universal intermediate table, however, it will be something like:
SELECT Person.First_name, Address.Country FROM Person INNER JOIN Relations ON Person.Person_GUID = Relations.GUIDA INNER JOIN Address ON Relations.GUIDB = Address.Address_GUID WHERE Address.Country = 'Belgium'
The execution plans look surprisingly the same – actually there had been almost no drop in performance when both queries were tested on a database with half a million records in each Person and Address tables and, consequently a million records in Relations table. Which arouses the natural question of performance.
This technique obviously is not aimed at increased performance nor can it ever be made as performant as conventional foreign-key queries (although with proper indexing and occasional INNER LOOP JOIN query hint the results surprisingly may be quite acceptable). Indeed, a query including five tables will, in this case, include nine of them, naturally adding an overhead of index seeks/scans. The matter is, this technique’s objective is not performance, but complete freedom of linking objects and creating hierarchies with no additional structure changes. In the case of programmatical/data objects this means that any object instance can be logically linked to any other object instance (or even to itself), at any given moment, in any hierarchical structure providing flexible data source for objects manipulation.