This Python script matches company names between two datasets using fuzzy matching with the thefuzz library. It standardizes the company names in both datasets and then, for each name in the first dataset, finds the best-scoring match in the second.
Key Features:
- Loads two CSV datasets containing company names.
- Standardizes the names by converting them to lowercase and stripping leading and trailing whitespace.
- Uses fuzzy matching (the token_sort_ratio scorer) to find the best match in the second dataset for each company name in the first.
- Exits with an error message if a file or the 'company_name' column is missing.
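The standardization and token-sort ideas in the list above can be illustrated with the standard library alone. This is a minimal sketch, not thefuzz's implementation: `normalize` and `token_sort_similarity` are hypothetical helper names, the score comes from `difflib.SequenceMatcher` and may differ from what `fuzz.token_sort_ratio` returns, and `normalize` also collapses internal runs of spaces, which the script itself does not do.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal runs of whitespace."""
    return " ".join(name.lower().split())


def token_sort_similarity(a: str, b: str) -> int:
    """Approximate token-sort score on a 0-100 scale (difflib, not thefuzz)."""
    # Sort the whitespace-separated tokens so word order no longer matters.
    a_sorted = " ".join(sorted(normalize(a).split()))
    b_sorted = " ".join(sorted(normalize(b).split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


print(normalize("  Apple   Inc. "))                       # apple inc.
print(token_sort_similarity("Inc. Apple", "apple inc."))  # 100
```

Sorting the tokens before comparing is what lets a name such as "Inc. Apple" score as identical to "apple inc." even though the words appear in a different order.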
CODE:
import sys

import pandas as pd
from thefuzz import fuzz, process

# Load the datasets
try:
    df1 = pd.read_csv('d:/code speedy/t1/company_dataset1.csv')  # Ensure the path is correct
    df2 = pd.read_csv('d:/code speedy/t1/company_dataset2.csv')  # Ensure the path is correct
except FileNotFoundError as e:
    print(f"File not found: {e}")
    sys.exit(1)  # Exit if files are not found

# Check if the required columns exist
if 'company_name' not in df1.columns or 'company_name' not in df2.columns:
    print("One of the datasets is missing the 'company_name' column.")
    sys.exit(1)

# Clean the company names: lowercase and strip leading/trailing whitespace
df1['company_clean'] = df1['company_name'].str.lower().str.strip()
df2['company_clean'] = df2['company_name'].str.lower().str.strip()

# Match companies
def match_companies(row):
    # With a Series as choices, extractOne returns (matched name, score, index in df2)
    match = process.extractOne(row['company_clean'], df2['company_clean'], scorer=fuzz.token_sort_ratio)
    return match

# Apply the matching function to every row of df1
df1['match'] = df1.apply(match_companies, axis=1)

# Display the results
print(df1[['company_clean', 'match']])
OUTPUT:
     company_clean                   match
0       apple inc.  (facebook inc., 57, 1)
1  microsoft corp.  (facebook inc., 38, 1)
2       google llc    (google llc, 100, 0)
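Because the script always returns the best-scoring candidate, low scores such as 57 and 38 in the output still come back as matches. One possible way to filter them is sketched below. It assumes each match is a (name, score, index) tuple as in the output; `split_match` and the `MIN_SCORE` cutoff of 80 are hypothetical and not part of the original script.

```python
from typing import Optional, Tuple

# Hypothetical cutoff; tune it against your own data.
MIN_SCORE = 80


def split_match(match: Optional[Tuple[str, int, int]], min_score: int = MIN_SCORE):
    """Return (name, score, index), or (None, score, None) below the cutoff."""
    if match is None:
        return (None, 0, None)
    name, score, index = match
    if score < min_score:
        # Keep the score for inspection, but drop the weak match itself.
        return (None, score, None)
    return (name, score, index)


# Match tuples copied from the output above.
matches = [("facebook inc.", 57, 1), ("facebook inc.", 38, 1), ("google llc", 100, 0)]
for m in matches:
    print(split_match(m))
```

With a cutoff of 80, only the "google llc" row keeps its match. As an alternative, `process.extractOne` accepts a `score_cutoff` argument and returns None when no candidate reaches it, which is why the sketch also handles a None input.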