Strategies for String Matching for Record Linkage in Python
Free-form data entry of text often leads to sloppy fields (e.g. spelling errors). This becomes a problem when that text must be used to match records in other systems (i.e. record linkage). Fuzzy matching is one approach to solving this problem.
This post discusses two Python approaches to string matching for record linkage: a traditional method that calculates the Levenshtein distance between pairs of strings with the fuzzywuzzy library, and an NLP technique, term frequency-inverse document frequency (tf-idf), using scikit-learn.
String Matching
String matching is typically necessary whenever records from different systems must be associated. Even when both data sets have been “cleansed,” mismatches may still occur because of formatting differences. Common sources of differences include:
- Varying prefixes or suffixes (e.g. PT Indo Tambangraya Megah, Tbk vs Tamban graya Megah, Indonesia)
- Abbreviations (e.g. MSFT Corp vs Microsoft Corporation)
- Misspellings (e.g. Giordano International Limited, HK vs Gordano Intl’ Ltd., Hong Kong)
- Extra or missing whitespace (e.g. Chu Kong Passenger Transport Co Ltd. vs ChuKong Passenger Trans. Co.)
Fuzzy String Matching
Fuzzy string matching, also called approximate string matching, is one common method for linking strings. It is called such because it matches strings based on closeness rather than exactness. Closeness (or edit distance) is measured by how many basic operations are needed to convert a string or sub-string into an exact match. The basic string operations include:
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost
- transposition: cost → cots
Levenshtein Distance
Levenshtein Distance is one method used for measuring string edit distance. The distance is measured as the minimum number of single-character edits required to change one string into another. The Levenshtein Distance allows for substitution, insertion, deletion, but not transposition. For example, the strings “kitten” and “sitting” have a distance of 3:
- kitten → sitten
- sitten → sittin
- sittin → sitting
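To make the definition concrete, the following is a minimal sketch of the standard dynamic-programming algorithm for the Levenshtein distance; the libraries discussed below use optimized implementations rather than this one.

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3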
Besides the basic Levenshtein distance, there are other related edit distance metrics:
- Damerau-Levenshtein distance: like Levenshtein, but also allows transposition of two adjacent characters
- Longest common subsequence: allows only insertion and deletion, not substitution
- Hamming distance: allows only substitution, and applies only to strings of the same length
- Jaro distance: allows only transposition
FuzzyWuzzy: a Python Library for Fuzzy String Matching
FuzzyWuzzy is a string matching library that uses a Levenshtein distance library at its core. On some Linux distributions, it is available as a system package. It is also easily installed via pip:
pip install fuzzywuzzy
Once installed, a simple string match can be performed in Python with the following:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
This shows a match score of 91. FuzzyWuzzy scores range from 0 to 100, with higher numbers indicating a better match.
The process module in the library can be used to return the top matches for a string given a list of strings to search. The following code illustrates this:
>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
The first process call shows the default behavior of returning a single tuple with the top matching string and its score. The second call shows the option of returning the top two results.
One disadvantage to using the Levenshtein distance on a large set of records is that each string must be compared against all other records. That makes this a quadratic-time algorithm: O(n²). For a small record set, this may be acceptable, but for large sets (i.e. tens of thousands of records or more) the computational cost quickly becomes impractical. For example, extrapolating from a simple test on my laptop with 15,000 records to be matched from a list of 25,000 possibilities indicates a processing time of more than 10 hours. For a record set in the millions, this would quickly become intractable.
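For reference, the timing estimate above assumes a brute-force loop along these lines (a sketch only; the function name is illustrative):

from fuzzywuzzy import process

def brute_force_match(list1, list2):
    """Illustrative only: every string in list1 is scored against every
    string in list2, so the work grows with len(list1) * len(list2)."""
    return [(s,) + process.extractOne(s, list2) for s in list1]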
This raises the question, “Is there a faster way to perform the matching?”
Term Frequency, Inverse Document Frequency
Term frequency-inverse document frequency (tf-idf) is a natural language processing (NLP) technique for measuring how important a word is to a document in a collection of documents. For example, from a collection of documents, how important is the word “peanut”? The tf-idf value increases proportionally with the number of times a word appears in a document (the term frequency) and decreases with the number of documents that contain the word (inverse document frequency). The idf portion accounts for the fact that some words are common in general (for example, the word “is” adds little information).
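As a quick illustration of the idf part, scikit-learn's TfidfVectorizer exposes the learned idf weights after fitting; the tiny corpus below is made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["peanut butter is sticky",
        "the peanut crop is large",
        "the report is long"]

vec = TfidfVectorizer()
vec.fit(docs)

# Terms that appear in every document ("is") get the lowest idf weight;
# terms confined to a single document ("butter", "sticky") get the highest.
for term, idx in sorted(vec.vocabulary_.items()):
    print(term, round(vec.idf_[idx], 2))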
How does this relate to fuzzy string matching? A naive approach would be to use the words themselves, but this would not cope with misspellings or transpositions. Instead, matching is done on portions of words: groups of letters called “n-grams,” where n is the number of letters. For example:
to_be_or_not_to_be
split into 3-grams becomes the list
[to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be ]
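This list can be reproduced with the same zip idiom used in the ngrams() function below:

s = "to_be_or_not_to_be"
# Offset copies of the string line up so that each "column" is one 3-gram.
print([''.join(gram) for gram in zip(*[s[i:] for i in range(3)])])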
So n-grams are created from the list of strings that will be used for matching. For a large list, the number of distinct n-grams is much smaller than the number of distinct strings. The following function creates n-grams from each input string.
Creating n-grams
import re

def ngrams(string, n=3):
    string = string.encode("ascii", errors="ignore").decode()
    string = string.lower()
    chars_to_remove = [')', '(', '.', '|', '[', ']', '{', '}', "'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)  # remove the list of chars defined above
    string = string.replace('&', 'and')
    string = string.replace(',', ' ').replace('-', ' ')
    string = string.title()  # capital at start of each word
    string = re.sub(' +', ' ', string).strip()  # combine whitespace
    string = ' ' + string + ' '  # pad
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
[source]
One of the clever features of the function above is the capitalization at the start of each word, which helps improve the quality of the matches.
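A quick check of the function on one of the abbreviation examples from earlier (output from the version of the function shown above):

>>> ngrams('MSFT Corp.')
[' Ms', 'Msf', 'sft', 'ft ', 't C', ' Co', 'Cor', 'orp', 'rp ']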
The following function performs the tf-idf matching using k nearest neighbors from the sklearn library. The list from which the matches will be found (list2) is fit with the tf-idf vectorizer, using the ngrams function above as the analyzer. Then k nearest neighbors is run on the tf-idf transform of the list to be matched (list1). For each string in list1, a tuple is returned giving the distance, the string, and its match in list2. Note that for this method the scores are given as distances, meaning lower numbers are better. (This is in contrast to the method in the previous section, where larger numbers are better.)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def tfidf_match(list1, list2):
    """For each item in list1, find the match in list2"""
    vectorizer = TfidfVectorizer(analyzer=ngrams, lowercase=False)
    tfidf = vectorizer.fit_transform(list2)
    nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
    distances, indices = nbrs.kneighbors(vectorizer.transform(list1))
    matches = [(round(distances[i][0], 2), list1[i], list2[j[0]])
               for i, j in enumerate(indices)]
    matches = pd.DataFrame(matches, columns=['score', 'original', 'matched'])
    return matches
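A usage sketch reusing the example company names from earlier; the exact distance scores depend on the data, so none are shown here:

list1 = ["Gordano Intl' Ltd., Hong Kong", 'ChuKong Passenger Trans. Co.']
list2 = ['Giordano International Limited, HK',
         'Chu Kong Passenger Transport Co Ltd.',
         'Microsoft Corporation']

matches = tfidf_match(list1, list2)
print(matches)  # DataFrame with columns: score, original, matched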
The default nearest neighbor metric is Euclidean distance (the L2 norm). Depending on the data set, other metrics may be preferable, including Manhattan distance (the L1 norm) or cosine distance (an angular measure).
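For example, to switch to cosine distance, only the NearestNeighbors construction inside tfidf_match needs to change (shown as a sketch; the rest of the function stays the same):

from sklearn.neighbors import NearestNeighbors

# Cosine distance often suits tf-idf vectors, since it ignores differences
# in vector length that mostly reflect string length rather than content.
nbrs = NearestNeighbors(n_neighbors=1, metric='cosine', n_jobs=-1)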
Conclusion
For small data sets, the fuzzywuzzy Python library is a great way to perform fuzzy string matching between record sets. However, as the record sets grow, the required processing time can exceed what is acceptable for the application. An alternative is the NLP technique of tf-idf over n-grams combined with k nearest neighbors, which can match large data sets in a fraction of the time.
References