Family trees with ggraph

Networks are a natural fit for visualizing family trees. I explored how to do this with the {ggraph} package, using lemur family data from #TidyTuesday. In this blog post I describe step by step how to create the visuals.

Richard Vogg https://github.com/richardvogg
09-12-2021

Inspiration

I recently bought the beautiful book “Data Sketches” by Nadieh Bremer and Shirley Wu, and it has been a joy to look through their awesome projects. First of all, I really like the idea of having one project per month and a partner who pushes you and “expects the output.” Maybe I need something similar for my blog. Currently, I am trying to participate in the weekly #TidyTuesday initiative, and in one of the past weeks we looked at lemur data. I planned to show the family tree for several lemur families as a network, inspired by what Nadieh Bremer did for the royal families. Take a moment to visit the stunning visuals she created.

Loading packages and data

To create the family trees we need dplyr for data manipulation, ggraph and igraph for the networks, and graphlayouts for the manual positioning of the lemurs. The data can be loaded from the TidyTuesday repository as shown below.

library(dplyr)
library(ggraph)
library(igraph)
library(graphlayouts)

lemurs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv')

The data contains one row for each lemur. The taxon column is a short code for the lemur species; DMAD, for example, stands for the aye-aye. Where available, the data also contains the date of birth (dob) and the IDs of the mother (dam_id) and father (sire_id).

taxon  dlc_id  sex  dob         dam_id  dam_dob     sire_id  sire_dob    lemur_name
DMAD   6771    M    2000-12-15  6674    1996-04-15  6201     1985-12-18  Ardrey-A
DMAD   6772    M    2001-01-07  6454    1988-11-30  6202     1986-12-19  Tolkein
DMAD   6786    F    2001-07-30  6452    1983-12-04  6201     1985-12-18  Lucrezia
DMAD   6787    F    2001-09-05  6674    1996-04-15  6201     1985-12-18  Salem
DMAD   6788    M    2001-10-25  6453    1985-10-11  6451     1981-10-08  Ozony Avelo
DMAD   6820    F    2003-09-26  6454    1988-11-30  6202     1986-12-19  Sabrina
DMAD   6821    F    2003-10-14  6453    1985-10-11  6451     1981-10-08  Medusa
DMAD   6842    F    2004-09-10  6721    1998-01-06  6202     1986-12-19  Medea
DMAD   6851    M    2005-02-22  6454    1988-11-30  6202     1986-12-19  Hitchcock
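
The code in the next section works on a data frame called tree, which holds the individuals of one taxon. Here is a minimal sketch of how such a data frame could be derived. The helper build_tree and the toy rows are hypothetical, and I am assuming that the raw file can contain several rows per animal and stores the lemur's name in a column called name.

```r
library(dplyr)

# Hypothetical helper (not part of the original post): one row per
# individual of a single taxon. Assumes the raw name column is called
# `name`; it is renamed to `lemur_name` to match the table above.
build_tree <- function(lemurs, taxon_code) {
  lemurs %>%
    filter(taxon == taxon_code) %>%
    distinct(taxon, dlc_id, sex, dob, dam_id, dam_dob,
             sire_id, sire_dob, lemur_name = name)
}

# Made-up toy input; the first two rows are identical on purpose
toy_lemurs <- tibble(
  taxon    = c("DMAD", "DMAD", "DMAD", "XXXX"),
  dlc_id   = c("1", "1", "2", "9"),
  sex      = c("M", "M", "F", "F"),
  dob      = as.Date(c("2000-01-01", "2000-01-01", "2001-01-01", "1999-01-01")),
  dam_id   = c("10", "10", "10", "90"),
  dam_dob  = as.Date(c("1990-01-01", "1990-01-01", "1990-01-01", "1989-01-01")),
  sire_id  = c("11", "11", "11", "91"),
  sire_dob = as.Date(c("1989-01-01", "1989-01-01", "1989-01-01", "1988-01-01")),
  name     = c("A", "A", "B", "Z")
)

# Keeps only the DMAD rows and drops the repeated one
toy_tree <- build_tree(toy_lemurs, "DMAD")
```

With the real data, the call would look like tree <- build_tree(lemurs, "DMAD"), provided the column names match these assumptions.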

Build the network

First, we will build the edges. We want to have one connection from each father and each mother to their child.

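# Assumption: `tree` holds the individuals of one taxon,
# like the DMAD rows shown above
edges <- tree %>%
  distinct(from = dam_id, to = dlc_id) %>%               # mother -> child
  rbind(tree %>% distinct(from = sire_id, to = dlc_id))  # father -> child

from  to
WILD  6454
6452  6480
6453  6514
6261  6515
6454  6561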

Next, we create the vertices. We want to store some information in the vertices, namely the name of the lemur, the birthday, and the sex. Every vertex that is mentioned in the edges (i.e. every child, father and mother) has to be present in the vertices data frame. Therefore we will concatenate the rows of children, fathers and mothers. The columns need the same names in each group for the concatenation to work.

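vertices <- tree %>%
  # children: name, birthday and sex are known
  distinct(name = dlc_id, lemur_name, dob, sex) %>%
  # fathers and mothers: only ID and birthday are available in their child's row
  rbind(tree %>% distinct(name = sire_id, lemur_name = NA, dob = sire_dob, sex = NA)) %>%
  rbind(tree %>% distinct(name = dam_id, lemur_name = NA, dob = dam_dob, sex = NA))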
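
Since graph_from_data_frame() stops with an error when an ID from the edge list is missing in the vertex table, a quick check can be useful before building the graph. This is a small sketch in base R; the helper missing_vertices and the toy IDs are made up for illustration.

```r
# Hypothetical helper: IDs that are used in an edge but not listed as a vertex
missing_vertices <- function(edges, vertices) {
  setdiff(c(edges$from, edges$to), vertices$name)
}

# Made-up toy tables: two parents ("10", "11") and two children ("1", "2")
edges_toy    <- data.frame(from = c("10", "11", "10"), to = c("1", "1", "2"))
vertices_ok  <- data.frame(name = c("1", "2", "10", "11"))
vertices_bad <- data.frame(name = c("1", "2", "10"))  # father "11" is missing

missing_ok  <- missing_vertices(edges_toy, vertices_ok)   # nothing missing
missing_bad <- missing_vertices(edges_toy, vertices_bad)  # reports "11"
```

With the real data, missing_vertices(edges, vertices) should come back empty.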

We have to remove duplicate names, because an individual can appear both as a child and as a parent. Instead of the full birthday, we will keep only the birth year.

vertices <- vertices %>%
  group_by(name) %>%
  # collapse to one row per individual, ignoring missing values
  summarise(lemur_name = max(lemur_name, na.rm = TRUE),
            dob = max(dob, na.rm = TRUE),
            sex = max(sex, na.rm = TRUE)) %>%
  # keep only the birth year
  mutate(year = as.numeric(format(dob, '%Y'))) %>%
  select(-dob)

name  lemur_name      sex  year
6201  Nosferatu       M    1985
6202  Poe             M    1986
6261  SAMANTHA        F    1978
6262  ANNABEL LEE     F    1988
6451  Mephistopheles  M    1981
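
To see what this collapsing step does, here is a toy example with made-up rows. Note that it is a variation and not the code from above: instead of max() it uses a hypothetical helper first_known, which returns the first non-missing value and falls back to NA when an individual has no known value at all.

```r
library(dplyr)

# Hypothetical helper: first non-missing value, or NA if there is none
first_known <- function(x) x[!is.na(x)][1]

# Made-up rows: ID "10" appears once as a child and once as a mother
vertices_toy <- tibble(
  name       = c("10", "10", "1"),
  lemur_name = c("Mum", NA, "Kid"),
  dob        = as.Date(c("1990-05-01", "1990-05-01", "2000-03-15")),
  sex        = c("F", NA, "M")
)

vertices_dedup <- vertices_toy %>%
  group_by(name) %>%
  summarise(lemur_name = first_known(lemur_name),
            dob = first_known(dob),
            sex = first_known(sex)) %>%
  mutate(year = as.numeric(format(dob, "%Y"))) %>%
  select(-dob)
```

The result has one row per ID, and the known name and sex of "10" survive the merge.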

Now we can create the graph. simplify removes loops and multiple edges, and as.undirected removes the direction of the connections, which is important for the backbone network we will introduce later.

g <- graph_from_data_frame(edges, vertices = vertices) %>%
  simplify() %>%
  as.undirected()
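
A toy example with made-up IDs shows the effect of these steps. As a variation on the code above, this sketch builds the graph as undirected right away via directed = FALSE instead of converting it afterwards.

```r
library(igraph)

# Made-up edge list; the last row repeats the first mother-child edge
edges_toy <- data.frame(
  from = c("10", "11", "10", "11", "10"),
  to   = c("1",  "1",  "2",  "2",  "1")
)
vertices_toy <- data.frame(
  name = c("1", "2", "10", "11"),
  sex  = c("M", "F", "F", "M")
)

# Undirected graph; simplify() drops the repeated edge
g_toy <- simplify(
  graph_from_data_frame(edges_toy, directed = FALSE, vertices = vertices_toy)
)

n_nodes <- vcount(g_toy)  # number of individuals
n_edges <- ecount(g_toy)  # number of distinct parent-child links
```

The vertex attributes from the data frame, such as sex, stay attached to the graph and can be read with V(g_toy)$sex.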

Let’s take a look at the network. We color the nodes by sex and add the name of the individual close to each node. We use check_overlap = TRUE to remove labels if they overlap with others.

# vertex attributes stored in the graph (sex, lemur_name) can be used directly in aes()
ggraph(g) +
  geom_edge_link0(edge_width = 0.1, alpha = 0.2) +
  geom_node_point(aes(col = sex)) +
  geom_node_text(aes(label = lemur_name),
                 size = 5, check_overlap = TRUE, nudge_y = -0.1)

Now we have a network where each child is connected to their parents. However, we are missing the temporal component. I first tried to put the year on one axis and a random value for each individual on the other axis, but it was a mess. This is when I found this blog post by David Schoch. At the end of the post, David talks about backbone networks. The method described in this paper is used to disentangle networks with a lot of (weak) connections between all nodes. This is not our case, but it still came in handy for the problem I was trying to solve, namely adding a time component. Actually, David responded to the first version of this blog post and told me that there is an even better way for this use case than using a backbone:

The layout_with_constrained_stress method from the {graphlayouts} package gives us coordinates that we can use to plot our network manually. We will use the year of birth of each individual on the y-axis and get the corresponding x-axis value from the layout function.

# fix the y coordinate to the birth year; the layout chooses the x coordinate
bb <- layout_with_constrained_stress(g, coord = vertices$year, fixdim = "y")

ggraph(g, layout = "manual", x = bb[,1], y = bb[,2]) +
  geom_edge_link0(edge_width = 0.1, alpha = 0.2) +
  geom_node_point(aes(col = sex)) +
  geom_node_text(aes(label = lemur_name),
                 size = 5, check_overlap = TRUE, nudge_y = -0.4)
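
To get a feeling for what the layout function returns, here is a small sketch on a made-up, connected toy family. The IDs and years are invented; the point is only that the result is a two-column matrix whose y column should match the years we passed in.

```r
library(igraph)
library(graphlayouts)

# Made-up family: two parents ("10", "11") and two children ("1", "2")
edges_toy <- data.frame(
  from = c("10", "11", "10", "11"),
  to   = c("1",  "1",  "2",  "2")
)
vertices_toy <- data.frame(
  name = c("1", "2", "10", "11"),
  year = c(2000, 2001, 1990, 1989)
)
g_toy <- graph_from_data_frame(edges_toy, directed = FALSE,
                               vertices = vertices_toy)

# One row per vertex: column 1 is the computed x, column 2 the fixed y
xy <- layout_with_constrained_stress(g_toy, coord = V(g_toy)$year, fixdim = "y")
```

Reading the years from V(g_toy)$year instead of the vertex data frame makes sure they are in the same order as the vertices of the graph.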

In the end we can use themes and titles to make the plot prettier.
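
As a rough sketch of what this polishing could look like, the example below adds a title, a legend title and theme_graph() to a toy plot. The toy family, the coordinates and the title text are made up, and the styling is just one possible choice, not the one used for the final visuals.

```r
library(igraph)
library(ggraph)  # attaches ggplot2 as well

# Made-up family with hand-picked coordinates (x arbitrary, y = birth year)
edges_toy <- data.frame(
  from = c("10", "11", "10", "11"),
  to   = c("1",  "1",  "2",  "2")
)
vertices_toy <- data.frame(
  name       = c("1", "2", "10", "11"),
  lemur_name = c("Kid A", "Kid B", "Mum", "Dad"),
  sex        = c("M", "F", "F", "M"),
  year       = c(2000, 2001, 1990, 1989)
)
g_toy <- graph_from_data_frame(edges_toy, directed = FALSE,
                               vertices = vertices_toy)

p <- ggraph(g_toy, layout = "manual",
            x = c(0, 1, 0, 1), y = V(g_toy)$year) +
  geom_edge_link0(edge_width = 0.1, edge_alpha = 0.2) +
  geom_node_point(aes(colour = sex)) +
  geom_node_text(aes(label = lemur_name), nudge_y = -0.4) +
  labs(title = "A toy lemur family", colour = "Sex") +
  # base_family = "sans" avoids depending on a specific installed font
  theme_graph(base_family = "sans")
```

Printing p draws the plot; any other ggplot2 theme or scale can be added in the same way.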

The next step is to put this procedure into a function so that it can be repeated easily for other families. If you want to see the whole code, take a look at my GitHub repository.

Closing comments