Identifying similarities between strings in two CSV files

In this tutorial, we will learn how to match strings across large datasets. When working with huge datasets, especially when the data comes from different sources, it is common to encounter slightly different versions of the same string. This post will guide you through performing fuzzy matching to identify and link similar strings between two large CSV files using Python. We will leverage the fuzzywuzzy library (with rapidfuzz as a faster alternative), which makes fuzzy matching easy to implement.

In short: to find matches between strings in two large CSV files, use fuzzy matching.

Dataset link provided below:

InputData Link

Solution/Approach:

Step 1: Install Libraries

# Installing Libraries
!pip install pandas rapidfuzz
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-3.9.6

!pip install fuzzywuzzy[speedup]
Installing collected packages: fuzzywuzzy, Levenshtein, python-levenshtein
Successfully installed Levenshtein-0.25.1 fuzzywuzzy-0.18.0 python-levenshtein-0.25.1
`pandas` is a robust Python data analysis and manipulation toolkit that is frequently used to handle structured data such as data frames. `rapidfuzz` is a fast string-matching library that provides methods comparable to `fuzzywuzzy`, but with better speed. Installing `fuzzywuzzy` with the `speedup` extra pulls in a C extension (`python-levenshtein`) that makes fuzzy string matching considerably faster. The `!` at the beginning of each command allows shell commands to run inside Jupyter notebooks or other IPython environments.
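As a quick illustration of what these libraries do: each exposes scorers that return a similarity score between 0 and 100 for a pair of strings. The snippet below is a minimal sketch using rapidfuzz with made-up example strings.

from rapidfuzz import fuzz, utils

# Scores range from 0 (completely different) to 100 (identical)
print(fuzz.ratio("Istanbul", "Istanbul"))   # 100.0
print(fuzz.ratio("Istanbul", "Istnbul"))    # ~93, still high despite the missing letter
print(fuzz.WRatio("ISTANBUL", "istanbul", processor=utils.default_process))  # 100.0 after normalisation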
Step 2: Load the Datasets
import pandas as pd
# The two datasets
df1 = pd.read_csv('/content/Customers_100K.csv', encoding='ISO-8859-1')
df2 = pd.read_csv('/content/Customers_1M.csv', encoding='ISO-8859-1')

For data manipulation, this code imports the `pandas` library as `pd`. After that, it reads two CSV files, `Customers_100K.csv` and `Customers_1M.csv`, into data frames `df1` and `df2`, respectively, handling special characters with the provided encoding, `ISO-8859-1`. The data frames `df1` and `df2` now contain the data from the corresponding CSV files for further processing.
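A quick sanity check with standard pandas calls confirms that both files loaded and shows how many rows and columns each data frame has:

# Inspect the size and the first few rows of each data frame
print(df1.shape)
print(df2.shape)
print(df1.head())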

Step 3: Print the Columns in Each Dataset
#print the columns in first dataset
print(df1.columns)
#print the columns in second dataset
print(df2.columns)
Index(['ID', 'NAME_', 'SURNAME', 'NAMESURNAME', 'GENDER', 'BIRTHDATE', 'EMAIL', 'TCNUMBER', 'TELNR', 'CITY', 'TOWN', 'DISTRICT', 'STREET', 'POSTALCODE', 'ADDRESSTEXT'], dtype='object') 

Index(['ID', 'NAME_', 'SURNAME', 'NAMESURNAME', 'GENDER', 'BIRTHDATE', 'EMAIL', 'TCNUMBER', 'TELNR', 'CITY', 'TOWN', 'DISTRICT', 'STREET', 'POSTALCODE', 'ADDRESSTEXT'], dtype='object')

The output shows that the column names in df1 and df2 are identical. Note, however, that the first column name, "ID", may carry a Byte Order Mark (BOM) artifact, which renders as a stray "ï»¿" prefix when a UTF-8 file saved with a BOM is read using ISO-8859-1. This BOM usually appears when files are saved as UTF-8 with a signature, and it can cause problems later, for example when selecting the column by name. The columns present in both data frames are: "ID" (possibly with the BOM prefix), "NAME_", "SURNAME", "NAMESURNAME", "GENDER", "BIRTHDATE", "EMAIL", "TCNUMBER", "TELNR", "CITY", "TOWN", "DISTRICT", "STREET", "POSTALCODE", and "ADDRESSTEXT".
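If the BOM prefix does show up in your column names, one way to clean it up (a sketch; the replacement strings cover both the raw BOM character and its ISO-8859-1 rendering) is to strip it from the column index, or, if the files are actually UTF-8, re-read them with the BOM-aware `utf-8-sig` codec:

# Strip a stray BOM prefix from the column names, if present
df1.columns = df1.columns.str.replace('\ufeff', '').str.replace('ï»¿', '')
df2.columns = df2.columns.str.replace('\ufeff', '').str.replace('ï»¿', '')

# Alternatively, re-read the files with a codec that consumes the BOM
# df1 = pd.read_csv('/content/Customers_100K.csv', encoding='utf-8-sig')
# df2 = pd.read_csv('/content/Customers_1M.csv', encoding='utf-8-sig')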

Step 4: Apply the fuzzy_match Function
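The snippet in this step calls a `fuzzy_match` helper that is not defined anywhere in the post. A minimal sketch of such a helper, assuming rapidfuzz's `process.extractOne` with a 90-point score cutoff (the function body and threshold here are illustrative, not part of the original code), could look like this:

from rapidfuzz import process, fuzz

def fuzzy_match(value, df_to_match, column_to_match, threshold=90):
    # Find the single best candidate for `value` among the strings in the target column;
    # extractOne returns None when no candidate reaches the score cutoff
    result = process.extractOne(value, df_to_match[column_to_match],
                                scorer=fuzz.WRatio, score_cutoff=threshold)
    return result[0] if result else None

Keep in mind that applying this row by row scans every candidate in df2 for each row of df1 (roughly 100K × 1M comparisons here), so it can take a very long time; testing on a sample of both data frames first is advisable.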

# Adjust this list based on the actual columns in your dataframe
columns_to_concat = ['CITY']

# Create concatenated string in both dataframes
df1['Concatenated'] = df1[columns_to_concat].astype(str).agg(' '.join, axis=1)
df2['Concatenated'] = df2[columns_to_concat].astype(str).agg(' '.join, axis=1)

# Apply the fuzzy_match function
df1['Match'] = df1['Concatenated'].apply(lambda x: fuzzy_match(x, df_to_match=df2, column_to_match='Concatenated'))

This code combines the values of the specified columns (in this example, just 'CITY') into a single string, creating a new column `Concatenated` in both `df1` and `df2`. It then applies the `fuzzy_match` function to `df1['Concatenated']`, searching the `Concatenated` column of `df2` for fuzzy matches, and stores the results in a new column called `Match` in `df1`.
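As the comment in the code suggests, you can adjust `columns_to_concat` to build a stricter match key, for example by combining several of the columns printed in Step 3 (the particular combination below is only an illustration):

# Build a stricter match key from several of the available columns
columns_to_concat = ['NAMESURNAME', 'CITY', 'TOWN']
df1['Concatenated'] = df1[columns_to_concat].astype(str).agg(' '.join, axis=1)
df2['Concatenated'] = df2[columns_to_concat].astype(str).agg(' '.join, axis=1)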

Step 5: Matched File

matched_df = df1[df1['Match'].notna()]

`matched_df = df1[df1['Match'].notna()]` filters `df1` so that only rows whose `Match` value is not `NaN` are kept. This produces a new data frame, `matched_df`, containing only the rows of `df1` for which a fuzzy match was found in `df2`.
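To keep the result, write the filtered data frame out to a new CSV file (the output path and preview below are just examples):

# Save the matched rows and preview the linked values
matched_df.to_csv('/content/matched_customers.csv', index=False)
print(matched_df[['Concatenated', 'Match']].head())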
