In this tutorial, we will learn how to match strings in large datasets. When working with huge datasets, especially when the data comes from different sources, it’s common to encounter slightly different versions of the same string. This post will guide you through performing fuzzy matching to identify and link similar strings between two large CSV files using Python. We’ll leverage the fuzzywuzzy library, which makes it easy to implement fuzzy matching.
The goal is to find matching strings across two CSV files by applying fuzzy matching to large datasets.
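To make "slightly different versions of the same string" concrete, here is a minimal sketch that scores string similarity. It uses Python's standard-library `difflib` as a stand-in rather than `fuzzywuzzy`, so it runs without any installation. The helper name `similarity` and the sample names are hypothetical and chosen only for illustration; the 0-100 scale follows the convention that fuzzy matching libraries commonly use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(ratio * 100)


# Hypothetical sample values, not taken from the tutorial's datasets.
pairs = [
    ("Jonathan Smith", "Jonathan Smith"),  # identical strings
    ("Jonathan Smith", "Jonathon Smyth"),  # same person, two typos
    ("Jonathan Smith", "Maria Garcia"),    # unrelated strings
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right)}")
```

Identical strings score 100, and the misspelled pair scores well above the unrelated pair, which is the property fuzzy matching relies on when linking records.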
Dataset Link Provided below:
`pandas` is a robust Python data analysis and manipulation library that is frequently used to handle structured data in data frames. `rapidfuzz` is a fast string matching library that provides methods comparable to `fuzzywuzzy`, but with improved speed. Installing `fuzzywuzzy` with the `speedup` option adds a C extension for quicker fuzzy string matching. Additionally, the `!` at the beginning of a command runs it as a shell command in IPython environments or Jupyter notebooks.
Step 2: Load the Datasets
import pandas as pd

# The two datasets
df1 = pd.read_csv('/content/Customers_100K.csv', encoding='ISO-8859-1')
df2 = pd.read_csv('/content/Customers_1M.csv', encoding='ISO-8859-1')
For data manipulation, this code imports the `pandas` library as `pd`. It then reads two CSV files, `Customers_100K.csv` and `Customers_1M.csv`, into the data frames `df1` and `df2`, respectively, using the `ISO-8859-1` encoding to handle special characters. The data frames `df1` and `df2` now hold the data from the corresponding CSV files for further processing.
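The effect of the `encoding` argument can be seen on a small in-memory file. This is a minimal sketch: the sample rows are made up, and `io.BytesIO` stands in for a CSV file on disk so the snippet runs without the tutorial's datasets.

```python
import io

import pandas as pd

# Hypothetical two-row CSV, stored as ISO-8859-1 (Latin-1) bytes.
raw = "ID,NAME_,CITY\n1,Gül,Düzce\n2,Özge,Ankara\n".encode("ISO-8859-1")

# Reading with the matching encoding decodes the accented characters correctly.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df)

# Without the argument, pandas assumes UTF-8 and cannot decode these bytes.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("Default UTF-8 read failed:", exc)
```

If a file loads without errors but names show up with garbled characters, a mismatched encoding is a likely cause, and it is worth checking before any string matching is attempted.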
Step 3: Print the Columns in Each Dataset
# Print the columns in the first dataset
print(df1.columns)

# Print the columns in the second dataset
print(df2.columns)
Index(['ID', 'NAME_', 'SURNAME', 'NAMESURNAME', 'GENDER', 'BIRTHDATE', 'EMAIL', 'TCNUMBER', 'TELNR', 'CITY', 'TOWN', 'DISTRICT', 'STREET', 'POSTALCODE', 'ADDRESSTEXT'], dtype='object')
Index(['ID', 'NAME_', 'SURNAME', 'NAMESURNAME', 'GENDER', 'BIRTHDATE', 'EMAIL', 'TCNUMBER', 'TELNR', 'CITY', 'TOWN', 'DISTRICT', 'STREET', 'POSTALCODE', 'ADDRESSTEXT'], dtype='object')
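Both files print the same column list, and a quick programmatic check can confirm that before any matching is done. The sketch below uses two tiny stand-in frames (hypothetical values, and only three of the columns) in place of the real `df1` and `df2`.

```python
import pandas as pd

# Tiny stand-ins for df1 and df2; the values are hypothetical.
df1 = pd.DataFrame({"ID": [1], "NAMESURNAME": ["Ali Kaya"], "CITY": ["Izmir"]})
df2 = pd.DataFrame({"ID": [7], "NAMESURNAME": ["Ali Kaya"], "CITY": ["Izmir"]})

# Index.equals is True only when both frames have the same columns in the same order.
same_columns = df1.columns.equals(df2.columns)
print("Identical columns:", same_columns)

# Columns present in both frames are the candidates for matching.
shared = df1.columns.intersection(df2.columns)
print("Shared columns:", list(shared))
```

When the two column sets differ, the shared columns are a reasonable starting point for deciding which field to compare.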