The {ggraph} package lets you visualize networks and hierarchical data in beautiful ways. In this post I would like to show which format your data needs to be in so that ggraph does what you want it to do.
I recently tried out {ggraph} by Thomas Lin Pedersen and think it is a great tool to add to one’s data visualization toolbox. The package lets you create networks and all kinds of cool plots from hierarchical data.
While I am quite familiar with ggplot (I still have to google a lot, but I know what I have to do to get from data to a desired output), it took me some time to understand the logic behind ggraph. The good news is: it is similar to ggplot, so the plot is built with a layer-like grammar that converts the raw data into one of these beautiful visualizations.
More information at the package’s website.
We will need the following packages.
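The package list itself does not appear at this point. Judging from the functions used in the rest of this post, a plausible setup would look like the sketch below. This is an assumption, not necessarily the exact list used originally; {stringr}, {harrypotter}, {patchwork} and {widyr} only show up further down.

```r
# Assumed setup, inferred from the functions used later in the post.
library(dplyr)   # %>%, filter(), inner_join(), distinct(), mutate(), select()
library(igraph)  # graph_from_data_frame()
library(ggraph)  # ggraph(), geom_edge_link(), geom_node_text(); attaches ggplot2
```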
The data for ggplot graphs is a dataframe or a tibble. For ggraph, we are working with networks and therefore need two components: nodes (also called vertices) and edges.
The edges define the connections between the nodes. If we do not pass along any additional information about the nodes, it is enough to define a dataframe with the edges.
Let’s take a look at a mini example:
edges <- data.frame(
  from = c("father", "father", "father", "mother", "mother", "mother"),
  to = c("me", "sister1", "sister2", "me", "sister1", "sister2")
)
We loaded the {igraph} package at the beginning because it contains `graph_from_data_frame()`, the function that converts this dataframe into a graph.
g <- graph_from_data_frame(edges)
And this graph is used to visualize this small example:
ggraph(g) +
  geom_edge_link() +
  geom_node_text(aes(label = name))
This is a very small example. The next step is to add information to the nodes. So far the nodes have been created from the edges, using the names that appear in the columns `from` and `to` (by the way: you can name them as you like and even add further columns; the first two columns always indicate from which node to which node a line is drawn).
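To see that only the column order matters, here is a small sketch with made-up column names (`parent`, `child`) and an extra column (`role`). The names and values are purely illustrative.

```r
library(igraph)

# Hypothetical column names: the first two columns define the edges,
# whatever they are called.
relations <- data.frame(
  parent = c("father", "mother"),
  child  = c("me", "me"),
  role   = c("dad", "mum")  # any further column becomes an edge attribute
)

g2 <- graph_from_data_frame(relations)

V(g2)$name  # node names collected from the first two columns
E(g2)$role  # the extra column is available on the edges
```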
We can also define the nodes manually:
vertices <- data.frame(name = c("mother", "father", "me", "sister1", "sister2"),
                       letters = c(7, 4, 7, 4, 7))
g <- graph_from_data_frame(edges, vertices = vertices)
ggplot2 users will be happy to hear that sizes, colors etc. follow exactly the same logic; you just have to use the `scale_edge_...` scales when you refer to edges.
ggraph(g) +
  geom_edge_link() +
  geom_node_text(aes(label = name, size = letters)) +
  scale_size_continuous(range = c(2, 4))
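The example above only scales the node labels. For the edge side, a minimal sketch could look like this. The `strength` column is invented for illustration and is not part of the example data.

```r
library(igraph)
library(ggraph)

# Made-up edge attribute `strength`, purely for illustration.
edges_w <- data.frame(
  from     = c("father", "father", "mother", "mother"),
  to       = c("me", "sister1", "me", "sister1"),
  strength = c(1, 2, 3, 4)
)
g_w <- graph_from_data_frame(edges_w)

p <- ggraph(g_w) +
  geom_edge_link(aes(edge_width = strength)) +  # map an edge attribute
  geom_node_text(aes(label = name)) +
  scale_edge_width(range = c(0.5, 2))           # edge-specific scale
```

Printing `p` draws the plot; the thicker the line, the larger the `strength` value.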
Enough with the basics, let’s look at real data.
The data comes from the Global Health Data Exchange website, where you can customize the data download. The site is really worth a visit: it contains country-level data around the Burden of Diseases, broken down by sex, age group and year (1990 to 2019).
For this example I downloaded a subset containing the percentage of different death causes per country in 2019.
location | cause | val |
---|---|---|
Armenia | Encephalitis | 0.000533832 |
Greece | Neonatal disorders | 0.001085751 |
Chad | Non-Hodgkin lymphoma | 0.001367366 |
Honduras | Other transport injuries | 0.001786898 |
Indonesia | Maternal disorders | 0.003084283 |
South Sudan | Diphtheria | 0.000201279 |
Oman | Esophageal cancer | 0.002769811 |
Slovenia | Stroke | 0.098549024 |
Bolivia (Plurinational State of) | Bladder cancer | 0.002728930 |
Switzerland | Bacterial skin diseases | 0.001770625 |
The dataset covers 133 causes of death and gives, for each of 213 countries, the share of total deaths attributed to each cause in 2019.
First, we will try to make a treemap to show each country’s profile. For this, we will need some hierarchy. It took some manual work for me to get the hierarchical data from the website (which groups together certain death causes into higher level families).
The hierarchy is on the second sheet of the Excel file in this blog post’s repository.
Cause | CauseL2 | CauseL3 |
---|---|---|
Diarrheal diseases | Enteric infections | Communicable, maternal, neonatal, and nutritional diseases |
Cysticercosis | Neglected tropical diseases and malaria | Communicable, maternal, neonatal, and nutritional diseases |
Falls | Unintentional injuries | Injuries |
Pneumoconiosis | Chronic respiratory diseases | Non-communicable diseases |
Adverse effects of medical treatment | Unintentional injuries | Injuries |
Self-harm | Self-harm and interpersonal violence | Injuries |
We will join the two datasets and filter for a country of interest.
country <- "Chile"
graph_data <- df %>%
  filter(location == country) %>%
  inner_join(causes, by = c("cause" = "Cause"))
In the introduction we were dealing with networks; here we are dealing with hierarchical data, but the idea is the same: we create edges between higher-level and lower-level features. In our case we have three levels and thus create connections between Level 3 and Level 2, and then between Level 2 and Level 1.
Exactly as in our mini example, the edges data.frame will have two columns (`from` and `to`).
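The code for this step is not shown at this point, but the function further down in this post builds the edges in the following way. The sketch below uses a tiny stand-in for `graph_data` (three rows taken from the hierarchy table above, under the made-up name `graph_data_demo`) so that it runs on its own.

```r
library(dplyr)

# Stand-in for graph_data: three rows from the hierarchy table above.
graph_data_demo <- data.frame(
  cause   = c("Falls", "Self-harm", "Pneumoconiosis"),
  CauseL2 = c("Unintentional injuries",
              "Self-harm and interpersonal violence",
              "Chronic respiratory diseases"),
  CauseL3 = c("Injuries", "Injuries", "Non-communicable diseases")
)

edges_demo <- graph_data_demo %>%
  distinct(from = CauseL3, to = CauseL2) %>%     # top level -> middle level
  rbind(graph_data_demo %>%
          distinct(from = CauseL2, to = cause))  # middle level -> single cause

edges_demo
```

Every row is one connection in the hierarchy, so the result has exactly the two columns that `graph_from_data_frame()` expects first.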
We do the same for the vertices. In theory, the vertices only require the names of all causes from the three levels. We cannot have causes with a value of 0 in the treemap (unless we also removed them from the edges), so I set those to a very small value.
In this code I am adding a few extra columns which help to create a better visual: `level`, so that not all the labels are displayed but only level 1 labels (here, the ten with the largest values). The label to display is stored in `new_label`, which is created in the last `mutate` call.
vertices <- graph_data %>%
  select(name = cause, val = val, parent = CauseL2, parent2 = CauseL3) %>%
  mutate(val = pmax(val, 0.000001), level = 1) %>%
  rbind(graph_data %>%
          distinct(name = CauseL2, parent = CauseL3, parent2 = NA) %>%
          mutate(val = 0, level = 2)) %>%
  rbind(graph_data %>%
          distinct(name = CauseL3, parent = NA, parent2 = NA) %>%
          mutate(val = 0, level = 3)) %>%
  mutate(rank = rank(-val, ties.method = "first"),
         new_label = ifelse(level == 1 & rank <= 10, name, NA)) %>%
  distinct(name, val, level, new_label, parent, parent2)
Let’s have a look at the data of the vertices:
name | val | parent | parent2 | level | new_label |
---|---|---|---|---|---|
Chronic kidney disease | 0.044547825 | Diabetes and kidney diseases | Non-communicable diseases | 1 | Chronic kidney disease |
Decubitus ulcer | 0.002074525 | Skin and subcutaneous diseases | Non-communicable diseases | 1 | NA |
Cardiovascular diseases | 0.000000000 | Non-communicable diseases | NA | 2 | NA |
Foreign body | 0.001681691 | Unintentional injuries | Injuries | 1 | NA |
Cardiomyopathy and myocarditis | 0.007762194 | Cardiovascular diseases | Non-communicable diseases | 1 | NA |
Falls | 0.013177099 | Unintentional injuries | Injuries | 1 | NA |
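Before building the graph, it can help to check that edges and vertices fit together: `graph_from_data_frame()` stops with an error if a node used in the edges is missing from the vertices, or if a vertex name appears twice. The helper below, `check_graph_inputs()`, is a hypothetical convenience function (it is not part of {igraph} or {ggraph}), shown here with toy data.

```r
# Hypothetical helper (not part of igraph or ggraph): lists the problems that
# would make graph_from_data_frame(edges, vertices = vertices) fail.
check_graph_inputs <- function(edges, vertices) {
  edge_nodes <- unique(c(as.character(edges[[1]]), as.character(edges[[2]])))
  node_names <- as.character(vertices[[1]])
  list(
    missing_in_vertices = setdiff(edge_nodes, node_names),
    duplicated_vertices = unique(node_names[duplicated(node_names)])
  )
}

# Toy data: "Falls" appears in the edges but not in the vertices.
e_chk <- data.frame(from = c("Injuries", "Unintentional injuries"),
                    to   = c("Unintentional injuries", "Falls"))
v_chk <- data.frame(name = c("Injuries", "Unintentional injuries"))

res_chk <- check_graph_inputs(e_chk, v_chk)
res_chk
```

If both elements of the result are empty, the two dataframes are consistent in this respect.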
Good! We are ready to take a look at our graph. Some of the causes have very long names, so I use `str_wrap` from {stringr} to break them into several lines. You can also replace that part with `new_label` and all labels will appear as they are.
graph <- graph_from_data_frame(edges, vertices = vertices)
ggraph(graph, 'treemap', weight = val) +
  geom_node_tile(aes(fill = parent2)) +
  geom_node_text(aes(label = stringr::str_wrap(new_label, 15), size = val)) +
  # "none" replaces the deprecated guides(size = FALSE)
  guides(size = "none") +
  labs(title = paste("Most frequent death causes in", country)) +
  theme(legend.position = "bottom")
Let’s put all of the above in a function and call it `get_country_profile`. Then we can easily create profiles for several countries and compare them. The final function looks like this:
get_country_profile <- function(country) {
  # Join the hierarchy and keep only the requested country
  graph_data <- df %>%
    inner_join(causes, by = c("cause" = "Cause")) %>%
    filter(location == country)
  # Edges: top level -> middle level, then middle level -> single cause
  edges <- graph_data %>%
    distinct(from = CauseL3, to = CauseL2) %>%
    rbind(graph_data %>%
            distinct(from = CauseL2,
                     to = cause))
  # Vertices: one row per cause and per higher-level group
  vertices <- graph_data %>%
    select(name = cause, val = val, parent = CauseL2, parent2 = CauseL3) %>%
    mutate(val = pmax(val, 0.000001), level = 4) %>%
    rbind(graph_data %>%
            distinct(name = CauseL2, parent = CauseL3, parent2 = NA) %>%
            mutate(val = 0, level = 3)) %>%
    rbind(graph_data %>%
            distinct(name = CauseL3, parent = country, parent2 = NA) %>%
            mutate(val = 0, level = 2)) %>%
    mutate(rank = rank(-val, ties.method = "first"),
           new_label = ifelse(level == 4 & rank <= 3, name, NA)) %>%
    distinct(name, val, level, new_label, parent, parent2)
  graph <- graph_from_data_frame(edges, vertices = vertices)
  ggraph(graph, 'treemap', weight = val) +
    geom_node_tile(aes(fill = parent2)) +
    #geom_node_text(aes(label = stringr::str_wrap(new_label,15), size = val)) +
    # "none" replaces the deprecated guides(size = FALSE)
    guides(size = "none") +
    harrypotter::scale_fill_hp_d(option = "HarryPotter") +
    labs(title = country)
}
p1 <- get_country_profile("Afghanistan")
p2 <- get_country_profile("Germany")
p3 <- get_country_profile("Chile")
p4 <- get_country_profile("Nigeria")
p5 <- get_country_profile("Japan")
p6 <- get_country_profile("Yemen")
p7 <- get_country_profile("New Zealand")
p8 <- get_country_profile("United States of America")
library(patchwork)
p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + plot_spacer() +
  plot_layout(guides = "collect") &
  theme(legend.title = element_blank(),
        legend.position = "bottom")
If creating the dataframes for edges and vertices by hand feels like too much work, you might be happy to hear that you can create great networks without that manual step.
For this we additionally need the package {widyr}.
Its `pairwise_similarity()` function computes a similarity score for every pair of countries, based on their shares of death causes.
all_sim <- df %>%
  pairwise_similarity(location, cause, val, upper = FALSE) %>%
  filter(similarity > 0.95)
item1 | item2 | similarity |
---|---|---|
Kenya | Zimbabwe | 0.9597491 |
Czechia | High-middle SDI | 0.9630934 |
Iran (Islamic Republic of) | Iraq | 0.9664777 |
Tokelau | Saint Vincent and the Grenadines | 0.9611687 |
France | Belgium | 0.9760450 |
Botswana | Eswatini | 0.9914667 |
Switzerland | Iceland | 0.9816005 |
Vanuatu | Palestine | 0.9575626 |
Kazakhstan | Turkmenistan | 0.9623283 |
Kyrgyzstan | Libya | 0.9519206 |
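To make the similarity score a bit more concrete: according to the {widyr} documentation, `pairwise_similarity()` computes the cosine similarity between items. Here is a minimal base R sketch of that measure. The helper name `cosine_similarity` and the three "countries" with their values are made up for illustration.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Made-up shares of three death causes in three hypothetical countries.
country_a <- c(0.50, 0.30, 0.20)
country_b <- c(0.45, 0.35, 0.20)
country_c <- c(0.05, 0.15, 0.80)

sim_ab <- cosine_similarity(country_a, country_b)  # close to 1: similar profiles
sim_ac <- cosine_similarity(country_a, country_c)  # clearly lower
```

With the filter `similarity > 0.95` used above, only pairs like `country_a` and `country_b` would end up as an edge in the network.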
net <- all_sim %>%
  graph_from_data_frame()
net %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = similarity)) +
  #geom_node_point() +
  geom_node_text(aes(label = name), size = 2, col = "red",
                 check_overlap = TRUE)
This was just to show how quickly you can generate a plot using {widyr} and {ggraph}. The plot probably contains too much information, but we can already see some interesting trends and connections between countries with similar health issues.
I hope this post has sparked some curiosity in you to use the ggraph package. Although the data structure with edges and vertices is somewhat new, it is all about getting used to this format and soon you will create better and better visuals. And remember: You do not have to learn everything on the first day or with the first visual. Repeat and add small pieces of knowledge to your toolbox every time you come across interesting data.
Again, check out the website of the package for many more examples.