Animations can help to show events over time. I found data from the RKI about daily COVID cases in Germany and want to describe the process of creating the animation. It involves fuzzy matching, as the names of the counties (Landkreise) are not identical in the RKI data and the shapefile I used.
I found a dataset about COVID cases in the different communes (LK) in Germany in an Excel file online (Link). It shows cases (Fälle) and the 7-Day-Incidence value (per 100,000 inhabitants).
(Note: unfortunately, the RKI changed the format of the data shortly after I downloaded the file, now the same data is only available on a broader Bundesland level. You might want to try this Link).
LK | Fälle | Inzidenz |
---|---|---|
SK Augsburg | 999 | 336.8377 |
SK Herne | 464 | 296.5823 |
SK Berlin Neukölln | 917 | 285.5604 |
SK Duisburg | 1401 | 280.9383 |
LK Rottal-Inn | 335 | 275.7156 |
LK Bautzen | 780 | 260.2099 |
The SK (Stadt) and LK (Landkreis) are important, as they indicate whether it is about the city or the county around the city. In some cases you have both.
LK | Fälle | Inzidenz |
---|---|---|
LK Würzburg | 237 | 146.02408 |
SK Würzburg | 163 | 127.40945 |
SK Aschaffenburg | 59 | 83.09625 |
LK Aschaffenburg | 130 | 74.62687 |
Now, if I want to show this information on a map, I need a shapefile with the communes. The good news is, there is one (Link). The bad news is, the names are in a different format than in the Excel file with the COVID cases.
GEN | BEZ |
---|---|
Flensburg | Kreisfreie Stadt |
Kiel | Kreisfreie Stadt |
Lübeck | Kreisfreie Stadt |
Neumünster | Kreisfreie Stadt |
Dithmarschen | Kreis |
Herzogtum Lauenburg | Kreis |
A first fix to bring the two formats closer together would be to convert BEZ into LK and SK and paste it together with GEN.
name |
---|
SK Flensburg |
SK Kiel |
SK Lübeck |
SK Neumünster |
LK Dithmarschen |
LK Herzogtum Lauenburg |
This already helps a lot. Let’s see if we can join the COVID indicence values to the shapefile with the communes.
This already looks quite good, but there are a lot of missing communes.
Let’s look into the Excel file with COVID cases which ones did not have a match.
LK | Fälle | Inzidenz |
---|---|---|
SK Berlin Neukölln | 917 | 285.5604 |
SK Offenbach | 330 | 253.3006 |
SK Mülheim a.d.Ruhr | 378 | 221.5294 |
SK Berlin Friedrichshain-Kreuzberg | 617 | 218.7486 |
SK Berlin Reinickendorf | 554 | 213.3321 |
LK Lindau | 162 | 197.6068 |
Can we find equivalent names in the shapefile? We search for Berlin, Offenbach, Mülheim and Lindau.
name |
---|
SK Mülheim an der Ruhr |
SK Offenbach am Main |
LK Offenbach |
LK Lindau (Bodensee) |
SK Berlin |
We notice a few things: Berlin has only one entry in the shapefile and a more detailed breakdown in the COVID cases document. SK Offenbach is called Offenbach am Main, Lindau is called Lindau (Bodensee). And Mülheim a.d.Ruhr is written Mülheim an der Ruhr.
One option would be to find all the matches manually and replace them. There are around 40, so this would be feasible but a little annoying. Maybe we can save this time of comparing each element and let the computer find the best match for us.
Fuzzy matching is doing exactly this: Based on string distances, it finds the closest match in the other source.
I wrote a function to do the calculation of the string distances and find the best fit. Additionally, we have the option to clean the input before, i.e. transform all letters to lowercase, remove or replace certain words which disturb the process. Usually such words would become clear after running the matching process once and noticing some undesired results.
original | best_fit | similarity |
---|---|---|
StadtRegion Aachen | LK Städteregion Aachen | 0.2315310 |
SK Berlin Friedrichshain-Kreuzberg | SK Berlin | 0.1470588 |
SK Berlin Steglitz-Zehlendorf | SK Berlin | 0.1379310 |
Region Hannover | LK Region Hannover | 0.1333333 |
SK Berlin Reinickendorf | SK Berlin | 0.1217391 |
LK Lindau | LK Lindau (Bodensee) | 0.1100000 |
LK Sankt Wendel | LK St. Wendel | 0.0970513 |
SK Berlin Spandau | SK Berlin | 0.0941176 |
SK Neustadt a.d.Weinstraße | SK Neustadt an der Weinstraße | 0.0915340 |
SK Mülheim a.d.Ruhr | SK Mülheim an der Ruhr | 0.0841542 |
SK Ludwigshafen | SK Ludwigshafen am Rhein | 0.0750000 |
SK Frankenthal | SK Frankenthal (Pfalz) | 0.0727273 |
SK Landau i.d.Pfalz | SK Landau in der Pfalz | 0.0665072 |
SK Halle | SK Halle (Saale) | 0.0545455 |
LK Neustadt a.d.Aisch-Bad Windsheim | LK Neustadt a.d. Aisch-Bad Windsheim | 0.0341270 |
SK Freiburg i.Breisgau | SK Freiburg im Breisgau | 0.0264822 |
LK Mühldorf a.Inn | LK Mühldorf a. Inn | 0.0111111 |
LK Neumarkt i.d.OPf. | LK Neumarkt i.d. OPf. | 0.0095238 |
LK Pfaffenhofen a.d.Ilm | LK Pfaffenhofen a.d. Ilm | 0.0083333 |
LK Wunsiedel i.Fichtelgebirge | LK Wunsiedel i. Fichtelgebirge | 0.0066667 |
This table serves as a control point. We can check that all the items were matched correctly. It also serves as a dictionary which we will use to replace the original names before joining.
To create the gif, we select dates of interest (in our case day by day from Feb 15 to Nov 25).
We loop through the dates and create one image for each day. Note how easy it is to create the Germany map with ggplot(aes(fill=cases7_per_bin))+geom_sf()
from the {sf} package. The rest is just changing the colors.
for (i in seq_along(sel_dates)) {
message(paste(" - image", i, "of", length(sel_dates)))
map <- final %>%
filter(date %in% sel_dates[i]) %>%
ggplot(aes(fill=cases7_per_bin))+geom_sf()+
labs(title = 'COVID in Germany',
subtitle = format(sel_dates[i],"%B"),
fill = "7 day incidence values",
caption = "Data from: NPGEO Corona, Hub RKI") +
scale_fill_manual(values=c("grey80","grey60",
"grey40","gold2","orangered1","red3","red4"),drop=FALSE)
png(filename=paste0("covid",i,".png"),width = 800,height = 800)
print(map)
dev.off()
}
After saving all the images, we can use {magick}’s image_write_gif
function to put them together and create our gif.
magick::image_write_gif(magick::image_read(img_frames),
path = "covid.gif",
delay = 1/10)
fuzzy_matches
performs the task of matching names from one datasource to another.