A while ago I attended a course in identifying trees and bushes growing in Western Europe by making use of the summer characteristics. The students were handed out a guide containing an identification key, which leads them through a number of identification steps to the name of the plant at hand. Every new identification should start at the first step, and in many cases plants were identified at the species level (e.g. Ginkgo biloba). In other cases identification went only as far as the genus level (e.g. Salix). For the latter group supplementary identification keys existed. In this article however, we focus at the main key only.
The identification steps of the main key all treat a certain characteristic of the plant, each of them offering two alternatives, which is called a dichotomous key. Each alternative points at a next identification step or gives the name of a plant. The first few identification steps looked as follows.
1. Grass-like leaves, nerves parallel -> 2.
Leaves otherwise -> 3.
2. Leaves sparse at hollow main stem: Bamboo
Leaves very close together at massive main stem: Yucca
3. Leaves absent or reduced to thorns -> 4.
Plant with leaves, needles or scales -> 7.
Some plant names are found after a few steps, but others are reached only after a long search of 30 steps or more. I noticed that for many of the names there existed more than one path through the decision tree. This was especially the case where identification went only to the genus level, as can be expected. Most genusses contain multiple species with many characters in common, but a few may differ. However, also for a number of species different paths existed. An example is Ginkgo biloba, wich can be found through two different paths.
Path 1: 1, 3, 7, 8, 9, 12, 13, 500, 518, 529, 532, 533
Path 2: 1, 3, 7, 8, 9, 12, 13, 700, 701, 708
In this example you can see that at identification step number 13, both alternatives (pointing at 500 and 700) can lead to the same name. Obviously the character treated in article 13 is not useful for the special case of Ginkgo biloba. Botanicaly it would be interesting to know the content of this step, but here we will focus at a technical question, namely: How can SQL Server help creating a list of all possible paths along with the plant names, so that the steps where different routes branch off can be quickly identified. The desired output is shown in the Excel sheet 'Identification Paths.xlsx'.
The sequence of steps to traverse the identification key and the variable length of the paths do expect that there should be a way to create the desired list of paths using a recursive CTE. However, I couldn’t find a useful example on the web and decided to try it out myself.
To be able to query anything, we need a table with all the nessesary data. This table should contain only the structure of sequencing and the plant names found at the end of each path. The content of the statements contained in the identification steps is irrelevant, so this can be left out. The new table consists of the following columns:
- Step: the number of the identification step
- Next: the step number the alternative is pointing at
- Genus: genus name of the plant found
- Species: species name of the plant found
- EnglishName: (for future use)
To create the table use the following code:
USE TestDB; --or your favorite database GO CREATE TABLE dbo.Steps( Step INT NOT NULL, [Next] INT NULL, Genus NVARCHAR(50) NULL, Species NVARCHAR(50) NULL, EnglishName NVARCHAR(50) NULL);
For each article there should be two rows in the table, because our identification key exclusively contains steps with two alternatives. If Next contains a value, Genus must be empty and vice versa. Species can only contain a value if Genus is not empty. The appended text file, Steps.txt, contains the content for our new table. Examining the content of Steps.txt carefully reveals that there are gaps in the step numbers. This reflects the fact that the original identification key at some places jumps to a higher rounded number where a group of plants with an important shared character starts. For our query this should not be a problem.
To get the content of Steps.txt into the newly created table, place the text file on some suitable location, e.g. on D:, and fill the table with bulk insert.
BULK INSERT dbo.Steps FROM 'D:\Steps.txt' WITH ( FIELDTERMINATOR =';', FIRSTROW = 1 );
Now comes the main part: creating a CTE that strings together the steps to form all the different paths leading to the plant names. Because we always start at the first step, that one should be the anchor part of the CTE. In the recursive part we connect all subsequent article numbers into strings, forming the paths. During each new cycle the CTE goes through, a new set is built upon the previous set, as far as next steps are needed. Since we are only interested in completed paths, in the main query we filter on Next IS NULL. For presentation purpose we add an ORDER BY Name, Path.
Here is the CTE and query:
;WITH C AS ( SELECT CAST([Step] AS VARCHAR(200)) AS Path, [Next], [Genus], [Species] FROM [dbo].[Steps] WHERE Step = 1 UNION ALL SELECT CAST(C.Path + ', ' + CAST(B.[Step] AS VARCHAR(200)) AS VARCHAR(200)) AS Path, B.[Next], B.[Genus], B.[Species] FROM [dbo].[Steps] AS B INNER JOIN C ON B.[Step] = C.[Next] ) SELECT CONCAT([Genus], ' ', [Species]) AS Name, Path FROM C WHERE [Next] IS NULL ORDER BY Name, Path OPTION (MAXRECURSION 100); –-prevent an endless loop
Try it out and enjoy, just like my teachers of the identification course did! Without this solution, identifying the crucial step for diverging paths was like finding a needle in a haystack. For finding the step at wich two given plants diverge, just add the genus and species names to the WHERE clause of the main query, but make sure not to refer to the Name alias of the SELECT clause.
I didn’t use the EnglishName column. I added it for the case this solution was going to be used by people not familiar with the scientific names of the plants. In that case the query has to be adapted, but care has to be taken that English names and scientific names may not always have a 1 : 1 relation.
This query can easily be adjusted for use in other kinds of trees. Think for example at family trees. There is no need for the nodes having all just two alternatives.
The essential part I needed in my query is reusing the previous value of the recursive CTE to construct the new value. In the above case this was achieved by concattenating strings, but other ways of combining data should also be possible. That leads to thinking about calculations with mathematical series, like Eulers formula for the approximation of π: π2/6 = 1 + 1/22 + 1/32 + 1/42 + …
Let’s give it a try:
;WITH A AS( SELECT CAST(1 AS REAL) AS [PI], 2 AS Num UNION ALL SELECT [PI] + 1./(Num*Num) AS [PI], Num + 1 AS Num FROM A WHERE Num < 4000 ) SELECT SQRT([PI]*6) AS [PI] FROM A OPTION (MAXRECURSION 5000);
Well, it works, but on my computer I see irregularities (repeating values) when Num values get higher. For the last 10 Num values (between 3990 and 3999) I get
3.14138131974922
3.14138131974922
3.14138147154141
3.14138162333359
3.14138162333359
3.14138177512576
3.14138192691793
3.14138207871009
3.14138223050224
3.14138223050224
Maybe it works better on a computer with an other mathematical processor, or maybe it has to do with the limitations of the REAL data type. However, perhaps not a serious option for mathematical series calculation, but in many other cases recursive CTE calculations can be a handy way for sure.