This editorial was originally published on Aug 26, 2015. It is being re-run as Steve is out of town.
The EAV (Entity-Attribute-Value) data model gets bad press in the world of the relational database, and with some justification. I'd read a lot of the evidence against it, and had even published on Simple-Talk a tale of EAV gone wrong that is not for the faint-hearted. I regarded EAV designs largely as an "anti-pattern", arising from misguided attempts to transpose the world of object orientation and loosely typed languages directly into the relational model.
If EAV design has an "avoid at all costs" reputation among many, the fact remains that sometimes you just can't know all the required attributes in advance, and in such cases there are few alternatives. Peter Larsson, in his EAV session at the recent SQL Rally event in Amsterdam, described one such case: a database application to return medical insurance documentation, where there was simply no way to predict, over time, exactly what sort of data might need to be stored.
When Peter arrived on the project, the database contained 20 million rows, and one of the most important document search algorithms took 1 minute to return its data. They estimated, with a high degree of confidence, that once the table reached 1 billion rows, which would happen quite quickly at current growth rates, the same document search would take 134 days.
Peter then explained – and proved – that after fixing some flaws in their database design, and in their inefficient search algorithm, the same document search query, on a billion rows, returned its data in milliseconds. I nearly fell off my chair. The remainder of the session was going to explain how he did it, and it's safe to say he had his audience's attention.
I wondered briefly whether by "fixing some flaws in their database design" he really meant "replacing it with a proper relational model", but no. He had applied some sensible normalization, but the model remained a hybrid, with an EAV table containing unique attribute-value pairs alongside the normalized tables. In the EAV table, each row comprises three columns: an entity, an attribute (or characteristic) of that entity, and a value for that attribute.
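As a minimal sketch (the table and column names here are mine, not Peter's actual schema), such an EAV table in SQL Server might be declared like this, with the primary key enforcing one value per entity-attribute pair:

CREATE TABLE dbo.EntityAttributeValue
(
    EntityID    INT          NOT NULL,  -- the entity, e.g. a document
    AttributeID INT          NOT NULL,  -- the attribute, or characteristic, of that entity
    [Value]     VARCHAR(100) NOT NULL,  -- the value for that attribute
    CONSTRAINT PK_EntityAttributeValue
        PRIMARY KEY CLUSTERED (EntityID, AttributeID)  -- unique attribute-value pairs per entity
);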
He explained that the key to efficient querying of such a model is a technique called relational division. Let's say you live in a dorm with a number of other students, each of whom owns a random assortment of socks of various colors. Your socks are red, green and blue, and you want to know which other students have a matching set of socks. In relational division, you query the table (the dorm) for each entity (student) whose attributes (socks) match your values (colors). The dorm is the dividend and your socks are the divisor. If student A owns yellow, black and green socks, the quotient is zero (not fulfilled). However, if student B owns yellow, red, black, blue, purple and green socks, then the quotient is 1, since student B's socks include your full set of red, green and blue.
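In SQL, the classic way to express this sort of division is to filter the dividend down to the divisor set, group by entity, and keep only the entities that cover every member of the divisor. A minimal sketch, using a hypothetical StudentSock table for the dorm example (the names are mine, purely for illustration):

-- Which students own at least red, green and blue socks?
SELECT s.Student
FROM dbo.StudentSock AS s
WHERE s.SockColor IN ('red', 'green', 'blue')   -- restrict to the divisor set
GROUP BY s.Student
HAVING COUNT(DISTINCT s.SockColor) = 3;         -- must cover all three colors

Student B qualifies because all three divisor colors appear among his rows; student A, with only green, does not.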
With the right algorithm, Peter proved, relational division can be highly efficient. Coupled with a clever approach to indexing, to facilitate "ordered index scans", and sensible statistics management, he was able to achieve formidable results – at least, I think we can call going from an estimated 134 days to a few milliseconds, on a billion rows, formidable.
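Peter's exact indexing scheme isn't something I can reproduce here, but the general idea is easy to sketch: if the index key leads on the columns the division query seeks and groups on, the engine can stream matching rows in key order, an ordered index scan, rather than sorting them. On the hypothetical EAV table above, that might look like:

-- Assumed for illustration, not Peter's actual index: this key order lets each
-- divisor (attribute, value) lookup return EntityID values already sorted,
-- so the matches can be intersected without any sort operation.
CREATE NONCLUSTERED INDEX IX_EAV_Attribute_Value_Entity
    ON dbo.EntityAttributeValue (AttributeID, [Value], EntityID);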
Clearly, EAV models are "difficult". They can, and frequently do, cause poor performance and maintenance problems as the database grows. However, I also appreciated the lesson in why one should rarely form closed opinions on 'good' or 'bad' practices in database and query design. Sometimes there is no alternative to an EAV design, and techniques do exist to make it work in a relational world.
Cheers,
Tony.
Further reading: Check out Peter's "The E, the A and the V" PDF.