INTRODUCTION :
In this tutorial, you will learn how to approximately match strings and determine how similar they are by going over various examples.
A human may be able to distinguish the intention of a misspelled word with a quick glance. For a computer, the distinction is not as clear-cut.
Fuzzy string matching is the colloquial name used for approximate string matching – we will stick with the term fuzzy string matching for this tutorial. It’s a technique used to identify two elements of text strings that match partially but not exactly.
We typically see this phenomenon used in search engines. For example, if a user were to type “Londin” instead of “London” into Google, fuzzy string matching would identify that “London” was the intended word, and Google would return search results for that.
In this article, you will learn:
- How the fuzzy string matching algorithm determines the closeness of two strings using the Levenshtien edit distance.
- How to perform simple fuzzy string matching in Python using TheFuzz library.
- Some advanced fuzzy string matching techniques using TheFuzz advanced matches.
- How to integrate the TheFuzz library with Pandas.
Edits and edit distance
The fuzzy string matching algorithm seeks to determine the degree of closeness between two different strings. This is discovered using a distance metric known as the “edit distance.” The edit distance determines how close two strings are by finding the minimum number of “edits” required to transform one string to another.
If the edit distance counts the number of edit operations to tell us how many operations away one string is from another, an edit is an operation performed on a string to transform it into another string.
There are four main types of edits:
-
- Insert (add a letter)
- Delete (remove a letter)
- Switch (swap two adjacent letters)
- Replace (change one letter to another)
These four edit operations make it possible to modify any string.
Combining the edit operations together enables you to discover the list of possible strings that are N edits away, where N is the number of edit operations. For example, the edit distance between “London” and “Londin” is one since replacing the “i” with an “o” leads to an exact match.
But how specifically is the edit distance calculated, you ask?
There are different variations of how to calculate edit distance. For instance, there is the Levenshtein distance, Hamming distance, Jaro distance, and more.
The Levenshtein Distance
In this tutorial, we are only concerned with the Levenshtein distance.
It’s a metric named after Vladimir Levenshtein, who originally considered it in 1965 to measure the difference between two sequences of words. We can use it to discover the minimum number of edits that you need to do to change a one-word sequence into the other.
Here’s the formal calculation:
Wheredenotes 0 when a=b and 1 otherwise.
It is important to note that the rows on the minimum above correspond to a deletion, an insertion, and a substitution in that order.
It is also possible to calculate the Levenshtein similarity ratio based on the Levenshtein distance. This can be done using the following formula:
where |a| and |b| are the lengths of a sequence and b sequence, respectively.
Simple Fuzzy String Matching
One of the most popular packages for fuzzy string matching in Python was FuzzyWuzzy. However, FuzzyWuzzy was updated and renamed in 2021. It now goes by the name TheFuzz.
TheFuzz still holds as one of the most advanced open-source libraries for fuzzy string matching in Python. It was first developed by SeatGeek for the purpose of distinguishing whether two ticket listings with similar names were for the same event.
In accordance with FuzzyWuzzy, TheFuzz uses the Levenshtein edit distance to calculate the degree of closeness between two strings. It also provides features for determining string similarity in various situations, as you will see in this tutorial.
First, we must install thefuzz package. You can do this with pip using the following command :
!pip install thefuzz
Code Explanation :
This code installs the Python package called “thefuzz” using the pip package manager. The exclamation mark at the beginning of the line indicates that this is a command to be executed in the command line or terminal, rather than Python code. The “pip” command is used to install Python packages, and “thefuzz” is the name of the package being installed.
Simple ratio
We can determine the simple ratio between two strings using the ratio() method on the fuzz object. This simply calculates the edit distance based on the ordering of both input strings difflib.ratio()
– see the difflib documentation to learn more.
# Check the similarity score name = "Kurtis Pykes" full_name = "Kurtis K D Pykes" print(f"Similarity score: {fuzz.ratio(name, full_name)}") """ Similarity Score: 86 """
code explanation :
This code uses the fuzzywuzzy
library to calculate the similarity score between two strings.
First, two strings are defined: name
and full_name
. Then, the fuzz.ratio()
function is used to calculate the similarity score between these two strings. The similarity score is a number between 0 and 100, where 0 means the strings are completely dissimilar and 100 means they are identical.
Finally, the similarity score is printed using an f-string. In this case, the similarity score between “Kurtis Pykes” and “Kurtis K D Pykes” is 86, indicating a relatively high degree of similarity between the two strings.
In the code, we used two variations of my name to compare the similarity score, which was given as 86.
Let’s check this against the partial ratio.
Partial ratio
To check the partial ratio, all we must do to the code above is call partial_ratio()
on our fuzz object instead of ratio().
# Check the similarity score name = "Kurtis Pykes" full_name = "Kurtis K D Pykes" print(f"Similarity score: {fuzz.partial_ratio(name, full_name)}") """ Similarity Score: 67 """
Code Explanation :
This code uses the fuzz
module from the fuzzywuzzy
library to calculate the similarity score between two strings.
First, two strings are defined: name
and full_name
. Then, the partial_ratio()
function from the fuzz
module is used to calculate the similarity score between the two strings. The similarity score is a number between 0 and 100, where 0 means the strings are completely dissimilar and 100 means they are identical.
Finally, the similarity score is printed using an f-string. In this case, the similarity score between “Kurtis Pykes” and “Kurtis K D Pykes” is 67.
We see a decrease. What is going on?
Well, the partial_ratio()
seeks to find how partially similar two strings are. Two strings are partially similar if they have some of the words in a common order.
The partial_ratio()
calculates the similarity by taking the shortest string, which in this scenario is stored in the variable name, then compares it against the sub-strings of the same length in the longer string, which is stored in full_name.
Since order matters in partial ratio, our score dropped in this instance. Therefore, to get a 100% similarity match, you would have to move the “K D” part (signifying my middle name) to the end of the string. For example:
# Order matters with partial ratio # Check the similarity score name = "Kurtis Pykes" full_name = "Kurtis Pykes K D" print(f"Partial ratio similarity score: {fuzz.partial_ratio(name, full_name)}") # But order will not effect simple ratio if strings do not match print(f"Simple ratio similarity score: {fuzz.ratio(name, full_name)}") """ Partial ratio similarity score: 100 Simple ratio similarity score: 86 """
Code explanation :
This code uses the fuzzywuzzy library to compare the similarity between two strings, name
and full_name
.
The fuzz.partial_ratio()
function is used to calculate the similarity score between the two strings, taking into account the order of the words. In this case, since full_name
contains all the words in name
, the similarity score is 100.
The fuzz.ratio()
function is used to calculate the similarity score between the two strings, but it does not take into account the order of the words. In this case, since full_name
contains all the words in name
, the similarity score is still relatively high at 86.
The output of the code is the similarity scores for both functions, which are printed using f-strings.
So what if we want our fuzzy string matcher to ignore order?
Then you may want to use the “token sort ratio.”
Token sort ratio
Okay, so we want to ignore the ordering of the words in the strings but still determine how similar they are – token sort helps you do exactly that. Token sort doesn’t care about what order words occur in. It accounts for similar strings that aren’t in order as expressed above.
Thus, we should get a 100% score using token sort ratio with the most recent example:
# Check the similarity score full_name = "Kurtis K D Pykes" full_name_reordered = "Kurtis Pykes K D" # Order does not matter for token sort ratio print(f"Token sort ratio similarity score: {fuzz.token_sort_ratio(full_name_reordered, full_name)}") # Order matters for partial ratio print(f"Partial ratio similarity score: {fuzz.partial_ratio(full_name, full_name_reordered)}") # Order will not effect simple ratio if strings do not match print(f"Simple ratio similarity score: {fuzz.ratio(name, full_name)}") """ Token sort ratio similarity score: 100 Partial ratio similarity score: 75 Simple ratio similarity score: 86 """
This code uses the fuzzywuzzy library to compare the similarity between two strings.
First, it defines two variables full_name
and full_name_reordered
which contain two different versions of the same name.
Then, it uses the fuzz.token_sort_ratio()
function to compare the similarity between the two names. This function compares the two strings after sorting the individual words in each string alphabetically. Since the order of the words does not matter in this case, the similarity score is 100.
Next, it uses the fuzz.partial_ratio()
function to compare the similarity between the two names. This function compares the two strings based on the longest contiguous matching substring. Since the two names share the same first and last name, the similarity score is 75.
Finally, it uses the fuzz.ratio()
function to compare the similarity between the two names. This function compares the two strings based on the number of matching characters. Since the two names are not exactly the same, the similarity score is 86.
All three similarity scores are printed out using formatted strings.
Let’s go back to the original name and full_name variables. What do you think will happen if we use token sort now?
# Check the similarity score name = "Kurtis Pykes" full_name = "Kurtis K D Pykes" print(f"Token sort ratio similarity score: {fuzz.token_sort_ratio(name, full_name)}") """ Token sort ratio similarity score: 86 """
Code explanation :
This code uses the fuzzywuzzy
library to calculate the similarity score between two strings. The fuzz.token_sort_ratio()
function takes two string arguments and returns a score between 0 and 100, where 0 means no similarity and 100 means exact match.
In this code, the name
variable is set to “Kurtis Pykes” and the full_name
variable is set to “Kurtis K D Pykes”. The fuzz.token_sort_ratio()
function is then called with these two variables as arguments, and the resulting score is printed using an f-string.
The token_sort_ratio()
function works by first tokenizing the strings (i.e. breaking them up into individual words), sorting the tokens alphabetically, and then comparing the sorted tokens to calculate the similarity score. In this case, the score is 86, which indicates a relatively high degree of similarity between the two strings.
The score drops.
This is because token sort only disregards order. If there are words that are dissimilar words in the strings, it will negatively impact the similarity ratio, as we’ve seen above.
But there is a workaround.
Token set ratio
The token_set_ratio()
method is pretty similar to the token_sort_ratio()
, except it takes out common tokens before calculating how similar the strings are: this is extremely helpful when the strings are significantly different in length.
Since the name and full_name variables both have “Kurtis Pykes” in them, we can expect the token set ratio similarity to be 100%.
# Check the similarity score name = "Kurtis Pykes" full_name = "Kurtis K D Pykes" print(f"Token sort ratio similarity score: {fuzz.token_set_ratio(name, full_name)}") """ Token sort ratio similarity score: 100 """
Code explanation :
This code uses the fuzzywuzzy
library to calculate the similarity score between two strings. The fuzz.token_set_ratio()
function compares two strings and returns a score based on how similar they are. In this case, the name
variable is compared to the full_name
variable. The print()
function is used to display the similarity score as a formatted string. The output shows that the similarity score between the two strings is 100, which means they are an exact match.
Process
The process module enables users to extract text from a collection using fuzzy string matching. Calling the extract()
method on the process module returns the strings with a similarity score in a vector. For example:
from thefuzz import process collection = ["AFC Barcelona", "Barcelona AFC", "barcelona fc", "afc barcalona"] print(process.extract("barcelona", collection, scorer=fuzz.ratio)) """ [('barcelona fc', 86), ('AFC Barcelona', 82), ('Barcelona AFC', 82), ('afc barcalona', 73)] """
Code explanation :
This code uses the thefuzz
library to perform fuzzy string matching.
First, the process
function is imported from thefuzz
.
Then, a list of strings called collection
is defined, which contains variations of the string “Barcelona”.
Next, the process.extract
function is called with three arguments: the string to match (“barcelona”), the collection of strings to match against (collection
), and the scorer
argument set to fuzz.ratio
. The fuzz.ratio
scorer calculates the Levenshtein distance between two strings and returns a score between 0 and 100, where 100 is a perfect match.
The output of process.extract
is a list of tuples, where each tuple contains a string from the collection
and its corresponding score. The list is sorted in descending order by score.
Finally, the output is printed to the console, showing the four strings in collection
and their scores in descending order of similarity to the target string “barcelona”.
We can control the length of the vector returned by the extract() method by setting the limit parameter to the desired length.
print(process.extract("barcelona", collection, scorer=fuzz.ratio)) """ [('barcelona fc', 86), ('AFC Barcelona', 82), ('Barcelona AFC', 82)] """
Code explanation:
This code uses the fuzzywuzzy
library to perform fuzzy string matching.
The process.extract()
function takes three arguments: the string to match (“barcelona”), the collection of strings to match against (collection
), and the scorer
function to use for matching (in this case, fuzz.ratio
).
The fuzz.ratio
scorer calculates the Levenshtein distance between two strings (i.e. the number of edits needed to transform one string into the other) and returns a score between 0 and 100, where 100 is a perfect match.
The output of the process.extract()
function is a list of tuples, where each tuple contains a matching string from the collection and its corresponding score. In this case, the output shows that “barcelona fc” is the closest match with a score of 86, followed by “AFC Barcelona” and “Barcelona AFC” with scores of 82.
Finally, the print()
function is used to display the output to the user.
In this instance, the extract() returns the three closest matching strings based on the scorer we defined.
Fuzzy String Matching with pandas
In this section, we will see how to do fuzzy string matching on a pandas dataframe.
Let’s say you have some data you have exported into a pandas dataframe, and you would like to join it to the existing data you have.
import pandas as pd # Creating a dataframe dict_one = { "country": ["England", "Scotland", "Wales", "United Kingdom", "Northern Ireland"], "population_in_millions": [55.98, 5.45, 3.14, 67.33, 1.89] } dict_two = { "country": ["Northern Iland", "Wles", "Scotlnd", "Englnd", "United K."], "GDP_per_capita": [24900, 23882, 37460, 45101, 46510.28] } existing_data = pd.DataFrame(dict_one) exported_data = pd.DataFrame(dict_two) print(existing_data, exported_data, sep="\n\n") """ country population_in_millions 0 England 55.98 1 Scotland 5.45 2 Wales 3.14 3 United Kingdom 67.33 4 Northern Ireland 1.89 country GDP_per_capita 0 Northern Iland 24900.00 1 Wles 23882.00 2 Scotlnd 37460.00 3 Englnd 45101.00 4 United K. 46510.28 """
Code explanation :
This code imports the pandas library and creates two dictionaries, dict_one
and dict_two
, which are then used to create two separate dataframes, existing_data
and exported_data
, respectively.
The existing_data
dataframe has two columns, “country” and “population_in_millions”, while the exported_data
dataframe has two columns, “country” and “GDP_per_capita”.
Finally, the print()
function is used to display both dataframes side by side, with a separator of two new lines (\n\n
) between them.
There’s a major problem.
The existing data has the correct spellings for the countries, but the exported data does not. If we attempt to join the two dataframes on the country column, pandas would not recognize the misspelled words as being equal to the correctly spelled words. Thus, the result returned from the merge function will not be as expected.
Here is what would happen if we tried:
# Attempt to join the two dataframe data = pd.merge(existing_data, exported_data, on="country", how="left") print(data.head()) """ country population_in_millions GDP_per_capita 0 England 55.98 NaN 1 Scotland 5.45 NaN 2 Wales 3.14 NaN 3 United Kingdom 67.33 NaN 4 Northern Ireland 1.89 NaN """
Code explanation:
This code uses the pandas library in Python to merge two dataframes, existing_data
and exported_data
, based on a common column called “country”. The resulting merged dataframe is stored in a variable called data
.
The how
parameter is set to “left”, which means that all the rows from the existing_data
dataframe will be included in the merged dataframe, and only the matching rows from the exported_data
dataframe will be included. If there are any rows in existing_data
that do not have a match in exported_data
, those rows will still be included in the merged dataframe, but with missing values (NaN) in the columns from exported_data
.
The print
statement is used to display the first five rows of the merged dataframe using the head()
method. The resulting output shows the country names and the population data from existing_data
, but with missing values in the GDP_per_capita column from exported_data
This defeats the whole purpose of attempting to merge these dataframes together.
However, we can get around this problem with fuzzy string matching.
Let’s see how it looks in code:
# Rename the misspelled columns exported_data["country"] = exported_data["country"].apply( lambda x: process.extractOne(x, existing_data["country"], scorer=fuzz.partial_ratio)[0] ) # Attempt to join the two dataframe data = pd.merge(existing_data, exported_data, on="country", how="left") print(data.head()) """ country population_in_millions GDP_per_capita 0 England 55.98 45101.00 1 Scotland 5.45 37460.00 2 Wales 3.14 23882.00 3 United Kingdom 67.33 46510.28 4 Northern Ireland 1.89 24900.00 """
Code explanation :
This code is written in Python and it is used to merge two dataframes based on a common column “country”.
The first step is to rename the misspelled columns in the “exported_data” dataframe using the “apply” method. The “apply” method applies a function to each element of a pandas series. In this case, the function is a lambda function that uses the “extractOne” method from the “fuzzywuzzy” library to find the closest match for each country name in the “existing_data” dataframe. The “scorer” parameter is set to “fuzz.partial_ratio” which is a fuzzy matching algorithm that calculates the ratio of matching substrings between two strings.
The second step is to merge the two dataframes using the “merge” method from pandas. The “on” parameter specifies the common column to merge on, which is “country” in this case. The “how” parameter is set to “left” which means that all the rows from the “existing_data” dataframe will be included in the merged dataframe, and only the matching rows from the “exported_data” dataframe will be included.
Finally, the merged dataframe is printed using the “head” method to display the first five rows. The resulting dataframe contains the population in millions and GDP per capita for each country in the merged dataset.
In this code, we have renamed the misspelled values in country column of the exported data using a neat lambda function in association with the process.extractOne()
method. Note, we used an index of 0 on the result of the extractOne()
to return online the similar string instead of a list containing the string and similarity value.
Next, we merged the dataframes on the country column using a left join. The result is a single dataframe containing the correctly spelled countries (including the political union of the United Kingdom).
Final thoughts
In this tutorial, you have learned:
- Edits and edit distance
- The Levenshtein edit distance and how it works
- How to do fuzzy string matching in Python with thefuzz library
- How to do fuzzy string matching with Pandas dataframes.
The examples presented here may be simple, but they are enough to illustrate how to handle various cases of what a computer thinks are mismatching strings. There are several applications of fuzzy matching in areas such as spell-checking and bioinformatics, where fuzzy logic is used to match DNA sequences.