An overview of fuzzy string matching problems and solutions. After reading this article you will know in which situations fuzzy string matching can be helpful, and you will be familiar with variations like fuzzy deduplication and record linkage.
I started drawing a map with draw.io to show the structure of the ideas I want to write about. I want to keep this post on a conceptual level with examples. I plan to write about useful packages and how to use R to work with fuzzy string matching in future posts.
This map will serve as an outline for the post.
Data Analysts and Data Scientists know how to work with different types of variables. Knowing if a variable is numeric, a (categorical) factor or a boolean is important for preprocessing, visualization and modeling. Fuzzy string matching is helpful when working with text input, specifically imperfect text input.
In general, I would distinguish two different types of imperfection in the text variable.
The problem with imperfection in text data is that analysis is not easily possible: operations like grouping, counting or joining rely on exact string equality, so slightly different spellings of the same thing are treated as separate values.
When we have a list of imperfect strings, there are two ways to match them: compare them against a list of their “perfect” counterparts, or compare the imperfect strings with each other to find duplicates (fuzzy deduplication).
Many differences can already be removed by cleaning the data, for example by lowercasing, trimming whitespace, removing punctuation, and standardizing common abbreviations.
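To make this concrete, here is a minimal cleaning sketch in base R. Which steps are appropriate (and whether punctuation actually carries meaning) depends entirely on the data at hand.

```r
# Typical cleaning steps before fuzzy matching (adapt to your data)
clean_strings <- function(x) {
  x <- tolower(x)                   # harmonize upper/lower case
  x <- gsub("[[:punct:]]", " ", x)  # turn punctuation into spaces
  x <- gsub("\\s+", " ", x)         # collapse repeated whitespace
  trimws(x)                         # drop leading/trailing spaces
}

clean_strings(c("  Company ABC Int. ", "company-abc international"))
# "company abc int"   "company abc international"
```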
After all the initial steps, it comes down to the core of the matching: we take two strings, either one from the imperfect list and one from the perfect list, or a pair of strings from the imperfect list, and compare them.
We want to put a numerical value on the similarity of two strings (or alternatively on their distance, which is the opposite view: high similarity means low distance).
Intuitively, we “know” or “feel” that the strings baseball and basketball are rather similar and that breakfast and hello are not very similar.
What is the reasoning behind this feeling?
There are many algorithms which are used to measure similarity between strings. As always, too many choices can be intimidating, so I will name only two methods which I find useful for many applications.
Jaro-Winkler: takes into account the length of the strings, the number of shared letters, the number of transpositions of shared letters (e.g. example and exampel would be 1 transposition), and the length of the common prefix (up to the first four letters). Jaro-Winkler is strong in cases where the imperfections happen mainly at the end of the string, for example through additional words (Company ABC International vs Company ABC Int).
Levenshtein distance: also known as edit distance. This is the most famous and most widely used string distance. It counts how many insertions, deletions and replacements of letters are needed to get from one string to the other.
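In R, one option is the stringdist package (just a small preview here, I will go into packages in a later post). A minimal sketch of both measures, using the examples from above:

```r
library(stringdist)

# Levenshtein / edit distance: counts insertions, deletions and replacements
stringdist("example", "exampel", method = "lv")
stringdist("baseball", "basketball", method = "lv")

# Jaro-Winkler: a distance between 0 (identical) and 1 (nothing in common);
# the prefix factor p rewards agreement at the beginning of the strings
stringdist("Company ABC International", "Company ABC Int", method = "jw", p = 0.1)
stringdist("breakfast", "hello", method = "jw", p = 0.1)
```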
If you are interested in more details or more string distance measures, I recommend this blogpost.
Which string distance measure works best depends highly on the type of imperfection in the data. It would even be possible to calculate two distance measures and use them in combination to find a suitable threshold.
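A sketch of such a combination (the threshold values are made up purely for illustration): compute both measures and only accept a pair when both agree that the strings are close.

```r
library(stringdist)

a <- "Company ABC International"
b <- "Company ABC Int"

d_jw <- stringdist(a, b, method = "jw", p = 0.1)
# normalize the edit distance by the longer string so it also lies in [0, 1]
d_lv <- stringdist(a, b, method = "lv") / max(nchar(a), nchar(b))

# accept the pair only if both measures are below their (illustrative) threshold
is_match <- d_jw < 0.15 & d_lv < 0.5
```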
After making a lot of string-by-string comparisons, the fuzzy string matching process is almost over. For each imperfect string we will have a closest match or several closest matches and can review the process.
To review the results I usually create a small data frame, containing the original string, the best fit, and the distance between both. In this data frame, I order the results by the distance.
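One way to build such a review table is sketched below; the example strings and the choice of Jaro-Winkler are only for illustration.

```r
library(stringdist)

imperfect <- c("compny abc int", "mcdonnell", "basketbal")
reference <- c("company abc international", "mcdonald's", "basketball")

# distance from every imperfect string to every reference string
d <- stringdistmatrix(imperfect, reference, method = "jw", p = 0.1)

review <- data.frame(
  original = imperfect,
  best_fit = reference[apply(d, 1, which.min)],
  distance = apply(d, 1, min)
)

# order by distance: near-perfect matches first, doubtful ones last
review <- review[order(review$distance), ]
review
```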
By scrolling through the complete ordered list, we will see many cases with a small distance, i.e. almost perfect matches. In some applications, we will also see proposed matches that should not be matches (false positives), e.g. McDonnell and McDonald’s: the strings are very similar, but they refer to different things.
One way to avoid false positives is to set a threshold, i.e. we only consider a best fit to be a match if it is similar enough. This similarity threshold could be determined by eye (“at which level of similarity does the first false positive occur?”).
Doing this, we create a second problem: false negatives. If we set the threshold too strictly, we will miss a lot of actual matches.
This is a tough challenge. We have to find a compromise between being too strict (and losing a lot of opportunities for matches) and being too relaxed (and replacing many strings with matches that are actually wrong). The decision depends a lot on the cost of false positives and false negatives.
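Continuing the small review table from above, applying a threshold could look like this (the value 0.15 is purely illustrative and has to be tuned per dataset):

```r
threshold <- 0.15

# accept a best fit only when its distance is below the threshold
review$accepted <- review$distance < threshold

# rejected strings keep their original value instead of the best fit
review$final <- ifelse(review$accepted, review$best_fit, review$original)
```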
In many applications you will not find a satisfying threshold from the beginning. You will see that there are some early false positives, but also many actual matches with lower similarity.
One possibility to improve the result is to ignore some words which do not actually help to identify matches, but rather disturb the string similarity calculation.
There is no golden rule about which words to ignore, but frequent, low-information words are typical candidates: in company names, for example, legal forms and suffixes such as Inc, Ltd or International.
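A sketch of how such words could be removed before matching; the word list here is made up and would have to come from inspecting your own data.

```r
ignore_words <- c("international", "int", "inc", "ltd", "gmbh", "the")

drop_words <- function(x, words) {
  # remove the listed words (as whole words), then tidy up the whitespace
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE)))
}

drop_words("company abc international", ignore_words)
# "company abc"
```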
So far, we have been looking at matching one variable. There are many applications where it is necessary to compare several columns.
Record Linkage is a subtype of fuzzy string matching where you want to check the identity of a person using information coming from different systems. Banks have to make sure that their customers are who they claim to be, and invest a lot of money and effort into KYC (Know Your Customer) systems.
However, just checking the name will give tons of false positives: many people share the same name, so the names alone would be perfect matches. When you compare name, address and phone number, you can expect much better results. In addition to the points we mentioned about cleaning and fuzzy matching, you would also have to assign weights to different situations: what if name and address coincide and the phone number is missing in one source? What if the names are only slightly similar but the address is the same?
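As a rough sketch of how such weights could work; the field names, weights and handling of missing values are all assumptions that would have to be adapted to the real data and business rules.

```r
library(stringdist)

record_score <- function(name_a, name_b, addr_a, addr_b, phone_a, phone_b) {
  sim <- function(x, y) 1 - stringdist(x, y, method = "jw", p = 0.1)

  s_name  <- sim(name_a, name_b)
  s_addr  <- sim(addr_a, addr_b)
  # treat a missing phone number as "no evidence" rather than as a mismatch
  s_phone <- if (is.na(phone_a) || is.na(phone_b)) NA else sim(phone_a, phone_b)

  weights <- c(0.5, 0.3, 0.2)            # name, address, phone
  scores  <- c(s_name, s_addr, s_phone)
  keep    <- !is.na(scores)

  # renormalize the weights over the fields that are actually available
  sum(weights[keep] * scores[keep]) / sum(weights[keep])
}

record_score("Jon Smith", "John Smith",
             "12 Main St", "12 Main Street",
             NA, "555-1234")
```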
As before, there is some manual checking necessary to create a good fuzzy string matching system.