Blog Post

Geneology with SQL Graph – I

,

I have been working a lot of SQL Graph related queries and applications of the graph data concept to the extent possible within SQL Server’s graph capabilities. Genealogy, or querying family trees is an important graph data application. A lot of us may not have work related applications that are genealogy related, necessarily. But conceptually, this can apply to many similar tree/hierarchy type structures. I was looking into some data to play with in this regard. Sometime ago – we were discussing novels by famed novelist James Michener. My friend Buck Woody made a tweet-remark that it would need a graph database to keep track of the characters and relationships in some of Michener’s novels. I am a big fan of Michener’s novels, and the most recent one I have read is ‘Hawaii’. It is based on history and evolution of the Hawaiian islands, and has a rather complex network of characters, with many ethnicities and several interwoven relationships. I decided to use the characters in Hawaii as my test data to understand how to query geneological data, stored in graph database format.

The novel spans a huge historic period, from 724 AD into around 1937. It has primarily people from 4 ethnic groups: Caucasian North Americans, Japanese, Chinese and Native Hawaiian. Out of these, the Japanese remain their own group, largely. The other ethnic groups intermarry, a lot. There is polygamy, polyandry and all kinds of weird relationships from that time. To keep things simpler, I came up with what I would like to query on from such a database , if it existed:

1 What is the family tree starting from Person A?

2 What is the mix of ethnicity this person has?

3 How are person A and person B connected, or if they are connected?

4 Who is the oldest traceable ancestor of person A ?

5 How many first/second/third cousins does this person have?

Graph data is designed strongly based on what kind of querying you want to do. If we look at these questions, my queries are based on people, and their relationships. So I need a people table, which is my ‘node’ table, and I need a relationship table, which is my ‘edge’ table.

If you were to design it in the relational world , you would need some kind of hierarchy to indicate the level the person is. You would need multiple tables for relationships such as children, partners, parents and so on. And then you’d have to write a bunch of join based queries and use recursive CTEs to get the results we want.

Now, SQL server not being a full fledged graph implementation – would still mandate the use of recursion to get our results. But, the way we store data and our queries can be a lot simpler in graph model.

I just have one table, which I call a PersonNode. In this I store all details regarding the individual – name, sex, ethnicity(blended if person is product of an interracial partnership). Next, I have a second table, which I call PersonRelatedTo. It has a from_id and a to_id from the PersonNode table. And it has a remark that tells us how these two people are related.

CREATE TABLE [dbo].[PersonNode](
    [Name] varchar NULL,
    [Sex] char NULL,
    [Born] [smallint] NULL,
    [Died] [smallint] NULL,
    [Origin] varchar NULL,
    [Remarks] varchar NULL,
    [PersonId] [int] IDENTITY(1,1) NOT NULL
)
AS NODE ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE TABLE [dbo].[PersonRelatedTo](
     [remarks] varchar NULL
 )
 AS EDGE ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
 GO

I insert into these tables the characters from Hawaii and how they are related. I kept the relationships down to ‘father of’, ‘mother of’,’husband of’,’wife of’ to begin with. Other relationships can be derived from here. We can choose to store them, to make our querying easier, or not. The novel has many partnerships that are not technically husband/wife, but I kept the dataset simpler and made it this way.

While querying this structure for constructing a family tree, I ran into following limitations:

1 A lot of characters have just one parent defined. The graph match query mandates an equi join on either side so pulling both parents in a single query when one may not exist is not possible with this clause.

2 The mother’s side and father’s side each can have its own branch since many characters partnered many times with multiple people. This meant querying each side separately and joining the results.

3 Some people are just listed as ancestors, not parents. There are links lost in the chain – these people have to be added in as a parent depending on gender.

4 No left joins are allowed in CTEs either.

My query ended up rather clunky/not-very-graceful because of these limitations, but it works. I am still working on making it more elegant. With the help of below query, I was able to pull 4 main trees that go into core family structure of the novel. In the next post we can look at how to query the tree further – to find cousins, ethnic mix and various. My results with this query, on the chinese, american and native families that form core of Hawaii novel, are below.

CREATE   PROCEDURE [dbo].[usp_getfamilytree]
  @personname varchar(200)
 AS
 DECLARE @FamilyMother TABLE(ID int, MotherID int,MotherName varchar(200),origin varchar(100))
 DECLARE @FamilyFather TABLE(ID int, FatherID int,FatherName varchar(200),origin varchar(100))
 DECLARE @personid INT
 SELECT @personid = personid FROM dbo.personnode WHERE name = @personname
 INSERT INTO @FamilyMother
 (id,motherid,mothername,origin)
 select p1.personid,p.personid,p.name,p.origin
 FROM dbo.personnode p, dbo.personrelatedto r, dbo.personnode p1
 where match(p-(r)->p1) and (r.remarks = 'mother of' OR (r.remarks = 'ancestor of' and p.sex = 'F')) 
 INSERT INTO @FamilyFather
 (id,fatherid,fathername,origin)
 select p1.personid,p.personid,p.name,p.origin
 FROM dbo.personnode p, dbo.personrelatedto r, dbo.personnode p1
 where match(p-(r)->p1) and (r.remarks = 'father of'  OR (r.remarks = 'ancestor of' and p.sex = 'M'))
 ;WITH FamilyCTE
 AS
 (
     SELECT 
       @personid as ID,
       1 AS Level,
       null as fatherid,
       null as motherid
       UNION ALL
     SELECT 
       m.ID, 
       Level+1,
       null,
       m.Motherid
       FROM @FamilyMother AS M
   INNER JOIN FamilyCTE c ON m.motherID = c.ID
     UNION ALL
     SELECT 
       f.ID, 
       Level+1,
       f.fatherid,
       null
       FROM @Familyfather AS f
   INNER JOIN FamilyCTE c ON f.fatherID = c.ID  
 )
 --select * from familycte
 SELECT distinct p.name,p.born,p.died, q.level, p.sex,p.origin,
 m.mothername as Mother, m.origin as MotherOrigin, 
 f.fathername as Father,f.origin as fatherorigin
 FROM familyCTE Q
 INNER JOIN dbo.personnode p ON q.id = p.personid
 LEFT JOIN @familyfather f ON p.personid = f.id
 LEFT JOIN @familymother m ON p.personid = m.id
 order by q.level,f.fathername
 GO;WITH FamilyCTE
AS
(
    SELECT 
      @personid as ID,
      1 AS Level,
      null as fatherid,
      null as motherid
      UNION ALL
    SELECT 
      m.ID, 
      Level+1,
      null,
      m.Motherid
      FROM @FamilyMother AS M
  INNER JOIN FamilyCTE c ON m.motherID = c.ID
    UNION ALL
    SELECT 
      f.ID, 
      Level+1,
      f.fatherid,
      null
      FROM @Familyfather AS f
  INNER JOIN FamilyCTE c ON f.fatherID = c.ID  
)
--select * from familycte
SELECT distinct p.name,p.born,p.died, q.level, p.sex,p.origin,
m.mothername as Mother, m.origin as MotherOrigin, 
f.fathername as Father,f.origin as fatherorigin
FROM familyCTE Q
INNER JOIN dbo.personnode p ON q.id = p.personid
LEFT JOIN @familyfather f ON p.personid = f.id
LEFT JOIN @familymother m ON p.personid = m.id
order by q.level,f.fathername
GO

To get the chinese branch – I go to the oldest chinese ancestor described, who has successors.

usp_getfamilytree ‘Char Ti Chong’

To get the american family (there are two but lets go with one).

usp_getfamilytree ‘Gideon Hale’

To get the native family,

usp_getfamilytree ‘Noelani Kanakoa’

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating