Whenever you have text data that was input manually by a human, there is a chance that it contains errors: typos, abbreviations, or different ways of writing can be challenges for your analysis. Fuzzy matching is a way to find inexact matches that mean the same thing, such as mcdonalds, McDonalds and McDonald’s Company.
The only packages you need are {dplyr} and {stringdist}.
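Both are attached the usual way (this assumes they are already installed):

library(dplyr)
library(stringdist)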
This method requires two lists as input. To distinguish them, we will call the one that contains the hand-typed input the “dirty list”. The reference list will be called the “clean list”. In this blogpost I will create the dirty list by hand with a few made-up examples of alternative company names.
names <- c("Haliburton", "ExxonMobile","ABBOTT LABORATORIES","Marrriott","Self","Activision Blizzard",
"Quest dianotstics","Unemployed","other company","burger king",
"MARRIOT","wall mart", "Illumin", "3M","NORTHROP TRUMMON","MCCormicks","MARSH MCLEANNON",
"FLO SERVE", "Kansas City Southern Fed.","MCDONALD'S","F5 Networks",
"McDonalds","MacKindsey","Oracle","Self-employed","None","Retired",
"f5 networks","Harley Davidson","Harly Davidson","HARLEY DAVIDSEN","DRHorton","D.R. Horten",
"cincinati fin","cincinnatti financials","cincinnati financial","CINCINATTI FINANCE",
"Mohaws Industry","Mowahk Industries","Mohawk Ind")
set.seed(64)
dirty_list <- sample(names, 50000, replace = TRUE)  # simulate 50,000 hand-typed entries
| dirty_list |
| --- |
| Haliburton |
| ExxonMobile |
| ABBOTT LABORATORIES |
| Marrriott |
| Self |
| Activision Blizzard |
| Quest dianotstics |
| Unemployed |
| other company |
| burger king |
As a clean list we will use the list of S&P 500 companies, which can be downloaded or scraped from the internet.
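One way to obtain it is to scrape the constituents table from Wikipedia. The snippet below is only a sketch: it assumes the {rvest} package, and that the first table on the page is the constituents table with a Security column holding the company names; the page layout may change over time.

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
clean_list <- read_html(url) %>%
  html_table() %>%   # all tables on the page as a list of data frames
  .[[1]] %>%         # assumption: the first one is the constituents table
  pull(Security)     # assumption: company names live in the Security column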
| clean_list |
| --- |
| 3M Company |
| Abbott Laboratories |
| AbbVie Inc. |
| ABIOMED Inc |
| Accenture plc |
| Activision Blizzard |
| Adobe Systems Inc |
| Advanced Micro Devices Inc |
| Advance Auto Parts |
| AES Corp |
Before we start, we will pre-process both lists: remove some common words and transform everything to lower case. If you prefer, you can also use the {stringr} package for this. One comment from my experience: the list of common words to remove is usually built iteratively. You check your final result, see which words are still causing problems, add them to the cleaner function, and run the process again until you are satisfied with the results.
cleaner <- function(vec) {
  # common words that carry no matching information; entries with
  # surrounding spaces only match as standalone words
  wordremove <- c(" and ", " comp ", " company", "companies", " corp ",
                  "corporation", " inc ", "[.]com")
  output <- vec %>%
    tolower() %>%
    {gsub(paste(wordremove, collapse = '|'), "", .)} %>%  # remove common words
    {gsub("[[:punct:]]", "", .)} %>%                      # remove punctuation
    {gsub("[[:blank:]]", "", .)}                          # remove spaces and tabs
  return(output)
}
control <- data.frame(original=dirty_list)
clean_list_cl <- cleaner(clean_list)
dirty_list_cl <- cleaner(dirty_list)
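To illustrate the effect: after cleaning, the dirty entry "ABBOTT LABORATORIES" and the clean entry "Abbott Laboratories" collapse to the same string, and "3M Company" loses its filler word:

cleaner(c("ABBOTT LABORATORIES", "Abbott Laboratories", "3M Company"))
[1] "abbottlaboratories" "abbottlaboratories" "3m"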
We calculate a matrix of string distances. The {stringdist} package has a lot of different methods implemented; they are described in the package documentation. After comparing some of the methods, I decided to go with the Jaro-Winkler distance, as it yields higher similarity for words that start with the same letters.
stringdistmatrix(c("other","words","otherexample","exapmle"),
c("example","other example","word"),
method='jw',p=0.1,useNames="strings")
example other example word
other 1.00000000 0.12307692 0.5166667
words 1.00000000 0.48205128 0.0400000
otherexample 0.28174603 0.01538462 0.4444444
exapmle 0.03333333 0.55799756 1.0000000
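The parameter p controls the Winkler prefix bonus: a common prefix of up to four characters scales the distance down by the factor 1 - p*l, where l is the prefix length. Comparing p = 0 (plain Jaro) with p = 0.1 makes the effect visible; the second call returns the smaller distance because the shared prefix "harl" is rewarded:

stringdist("harley davidson", "harly davidson", method = 'jw', p = 0)    # plain Jaro distance
stringdist("harley davidson", "harly davidson", method = 'jw', p = 0.1)  # smaller: prefix bonus applies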
Each row of the distance matrix corresponds to one string from the dirty list. We find the minimum of each row, which points to the best match from the clean list.
distmatrix <- stringdist::stringdistmatrix(dirty_list_cl, clean_list_cl, method = 'jw', p = 0.1)
best_fit <- apply(distmatrix, 1, which.min) %>% as.integer()  # column index of the best match per row
similarity <- apply(distmatrix, 1, min)                       # distance to that best match
control$best_fit <- clean_list[best_fit]
control$distance <- round(similarity, 3)
| original | best_fit | distance |
| --- | --- | --- |
| Marrriott | Marriott Int’l. | 0.089 |
| Retired | ResMed | 0.203 |
| Self | Sealed Air | 0.244 |
| Haliburton | Halliburton Co. | 0.054 |
| F5 Networks | F5 Networks | 0.000 |
| Marrriott | Marriott Int’l. | 0.089 |
| Illumin | Illumina Inc | 0.073 |
| Mohaws Industry | Mohawk Industries | 0.113 |
| MARSH MCLEANNON | Marsh & McLennan | 0.030 |
| Self-employed | Teleflex | 0.306 |
When we order the control dataframe by distance, we can find a suitable cutoff value (in this example 0.12) to separate real matches from false positives. The right cutoff value depends on the application.
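A quick way to eyeball that boundary (a sketch; any sorting approach works) is to look at the distinct matches ordered by distance and scan for the gap between plausible matches and nonsense:

control %>%
  distinct(original, best_fit, distance) %>%  # drop duplicated sampled rows
  arrange(distance)                           # real matches first, false positives at the bottom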
control$result <- ifelse(control$distance <= 0.12, control$best_fit, NA)
| original | best_fit | distance | result |
| --- | --- | --- | --- |
| Marrriott | Marriott Int’l. | 0.089 | Marriott Int’l. |
| Retired | ResMed | 0.203 | NA |
| Self | Sealed Air | 0.244 | NA |
| Haliburton | Halliburton Co. | 0.054 | Halliburton Co. |
| F5 Networks | F5 Networks | 0.000 | F5 Networks |
| Illumin | Illumina Inc | 0.073 | Illumina Inc |
| Mohaws Industry | Mohawk Industries | 0.113 | Mohawk Industries |
| MARSH MCLEANNON | Marsh & McLennan | 0.030 | Marsh & McLennan |
| Self-employed | Teleflex | 0.306 | NA |
| Mohawk Ind | Mohawk Industries | 0.088 | Mohawk Industries |
| Mowahk Industries | Mohawk Industries | 0.017 | Mohawk Industries |
| Harly Davidson | Harley-Davidson | 0.014 | Harley-Davidson |
| f5 networks | F5 Networks | 0.000 | F5 Networks |
| 3M | 3M Company | 0.000 | 3M Company |
| Oracle | Oracle Corp. | 0.080 | Oracle Corp. |
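As described above, the cleaning step is iterative. Once the results are in, reviewing the entries that did not clear the cutoff shows which words still cause problems and should be added to the cleaner function. A simple sketch:

control %>%
  filter(is.na(result)) %>%      # entries without an accepted match
  count(original, sort = TRUE)   # most frequent problem cases first

Entries like "Retired" or "Self-employed" are correctly left unmatched, while genuine company names that show up here point to words the cleaner should handle in the next iteration.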