Fuzzy matching example with company names

Whenever you have text data that was input manually by a human, there is a chance that it contains errors: typos, abbreviations, or different ways of writing can be challenges for your analysis. Fuzzy matching is a way to find inexact matches that mean the same thing, like mcdonalds, McDonalds and McDonald’s Company.

Richard Vogg https://github.com/richardvogg
10-09-2020

Packages

The only packages you need are {dplyr} and {stringdist}.

The data

This method requires two lists as input. To distinguish them, we will call the one that contains the hand-typed input the “dirty list”; the reference list will be called the “clean list”. In this blog post I will create the dirty list by hand with a few made-up examples of alternative company names.

names <- c("Haliburton", "ExxonMobile","ABBOTT LABORATORIES","Marrriott","Self","Activision Blizzard",
           "Quest dianotstics","Unemployed","other company","burger king",
           "MARRIOT","wall mart", "Illumin", "3M","NORTHROP TRUMMON","MCCormicks","MARSH MCLEANNON",
           "FLO SERVE", "Kansas City Southern Fed.","MCDONALD'S","F5 Networks",
           "McDonalds","MacKindsey","Oracle","Self-employed","None","Retired",
           "f5 networks","Harley Davidson","Harly Davidson","HARLEY DAVIDSEN","DRHorton","D.R. Horten",
           "cincinati fin","cincinnatti financials","cincinnati financial","CINCINATTI FINANCE",
           "Mohaws Industry","Mowahk Industries","Mohawk Ind")

set.seed(64)
dirty_list <- sample(names, 50000, replace = TRUE)
dirty_list
Haliburton
ExxonMobile
ABBOTT LABORATORIES
Marrriott
Self
Activision Blizzard
Quest dianotstics
Unemployed
other company
burger king

As a clean list we will use the list of S&P 500 companies. This can be downloaded or scraped from the internet.

clean_list
3M Company
Abbott Laboratories
AbbVie Inc.
ABIOMED Inc
Accenture plc
Activision Blizzard
Adobe Systems Inc
Advanced Micro Devices Inc
Advance Auto Parts
AES Corp

Before we start, we will pre-process both lists: remove some common words and transform everything to lower case. If you prefer, you can also use the {stringr} package for this. One comment from my experience: constructing the list of common words to remove is usually an iterative process. You check your final result and see which words are still causing problems, then you add them to the cleaner function and run the process again, until you are satisfied with the results.

cleaner <- function(vec) {
  # Common filler words to strip before comparing names
  wordremove <- c(" and "," comp "," company","companies"," corp ","corporation"," inc ","[.]com")
  output <- vec %>% tolower() %>% 
    {gsub(paste(wordremove,collapse='|'),"",.)} %>%  # drop filler words
    {gsub("[[:punct:]]","",.)} %>%                   # drop punctuation
    {gsub("[[:blank:]]","",.)}                       # drop spaces and tabs
  return(output)
}
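To see what the cleaning steps do, here is a base-R sketch of the same logic (written without pipes so it runs on its own, with no packages attached) applied to a few names. The function name `clean_name` is mine, chosen to avoid clashing with the `cleaner` above:

```r
# Base-R version of the same cleaning steps, for illustration only
clean_name <- function(vec) {
  wordremove <- c(" and ", " comp ", " company", "companies",
                  " corp ", "corporation", " inc ", "[.]com")
  out <- tolower(vec)
  out <- gsub(paste(wordremove, collapse = "|"), "", out)  # drop filler words
  out <- gsub("[[:punct:]]", "", out)                      # drop punctuation
  gsub("[[:blank:]]", "", out)                             # drop spaces and tabs
}

clean_name(c("3M Company", "D.R. Horten", "MARSH MCLEANNON"))
# [1] "3m"             "drhorten"       "marshmcleannon"
```

Note that dropping punctuation and blanks makes "D.R. Horten" and "DRHorton" nearly identical, which is exactly what we want before computing string distances.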

control <- data.frame(original=dirty_list)

clean_list_cl <- cleaner(clean_list)
dirty_list_cl <- cleaner(dirty_list)

Main process

We calculate a matrix of string distances. The {stringdist} package has a lot of different methods implemented, which are described in the package documentation. After comparing some of the methods I decided to go with the Jaro-Winkler distance, as it yields higher similarity for strings that start with the same letters.

Example

stringdist::stringdistmatrix(c("other","words","otherexample","exapmle"),
                             c("example","other example","word"),
                             method='jw',p=0.1,useNames="strings")
                example other example      word
other        1.00000000    0.12307692 0.5166667
words        1.00000000    0.48205128 0.0400000
otherexample 0.28174603    0.01538462 0.4444444
exapmle      0.03333333    0.55799756 1.0000000

Each row of the matrix of string distances corresponds to one string from the dirty list. We find the minimum in each row, which gives the best fit from the clean list.
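On a toy matrix, the row-wise minimum logic looks like this (the numbers are made up; rows stand for dirty strings, columns for clean strings):

```r
# Toy 2x3 distance matrix: rows = dirty strings, columns = clean strings
m <- matrix(c(0.5, 0.1, 0.9,
              0.2, 0.8, 0.0),
            nrow = 2, byrow = TRUE)

best_fit   <- apply(m, 1, which.min)  # column index of the closest clean string
similarity <- apply(m, 1, min)        # the corresponding distance

best_fit    # 2 3
similarity  # 0.1 0.0
```

So the first dirty string is matched to the second clean string (distance 0.1), the second to the third (distance 0.0, an exact match after cleaning).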

distmatrix <- stringdist::stringdistmatrix(dirty_list_cl,clean_list_cl,method='jw',p=0.1)
best_fit <- apply(distmatrix,1,which.min) %>% as.integer()  # index of the closest clean name
similarity <- apply(distmatrix,1,min)                       # distance to that name

control$best_fit <- clean_list[best_fit]
control$distance <- round(similarity,3)
original           best_fit             distance
Marrriott          Marriott Int’l.         0.089
Retired            ResMed                  0.203
Self               Sealed Air              0.244
Haliburton         Halliburton Co.         0.054
F5 Networks        F5 Networks             0.000
Marrriott          Marriott Int’l.         0.089
Illumin            Illumina Inc            0.073
Mohaws Industry    Mohawk Industries       0.113
MARSH MCLEANNON    Marsh & McLennan        0.030
Self-employed      Teleflex                0.306

Results

When we order the control data frame by distance, we can find a suitable cutoff value (in this example 0.12) to separate real matches from false positives. This cutoff value depends on the application.
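One way to look for that cutoff is to sort the control table by distance and eyeball where the real matches end and the noise begins. A sketch with a small hand-made stand-in for the `control` data frame:

```r
# Small hand-made stand-in for the control data frame from above
ctrl <- data.frame(
  original = c("Retired", "Haliburton", "f5 networks", "Self"),
  best_fit = c("ResMed", "Halliburton Co.", "F5 Networks", "Sealed Air"),
  distance = c(0.203, 0.054, 0.000, 0.244)
)

# Sorted by distance: good matches float to the top,
# false positives cluster at the bottom
ctrl[order(ctrl$distance), ]
```

In the sorted view, "Retired" and "Self" sit well above 0.12, while the genuine company names sit well below it.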

control$result <- ifelse(control$distance<=0.12,control$best_fit,NA)
original            best_fit             distance  result
Marrriott           Marriott Int’l.         0.089  Marriott Int’l.
Retired             ResMed                  0.203  NA
Self                Sealed Air              0.244  NA
Haliburton          Halliburton Co.         0.054  Halliburton Co.
F5 Networks         F5 Networks             0.000  F5 Networks
Illumin             Illumina Inc            0.073  Illumina Inc
Mohaws Industry     Mohawk Industries       0.113  Mohawk Industries
MARSH MCLEANNON     Marsh & McLennan        0.030  Marsh & McLennan
Self-employed       Teleflex                0.306  NA
Mohawk Ind          Mohawk Industries       0.088  Mohawk Industries
Mowahk Industries   Mohawk Industries       0.017  Mohawk Industries
Harly Davidson      Harley-Davidson         0.014  Harley-Davidson
f5 networks         F5 Networks             0.000  F5 Networks
3M                  3M Company              0.000  3M Company
Oracle              Oracle Corp.            0.080  Oracle Corp.

Next Steps and other resources