Company name matching in Python from two CSV datasets

By Amshuman Gajula / November 24, 2024

This Python script matches company names between two datasets using fuzzy matching with the thefuzz library. It standardizes the names in both datasets and then finds, for each name in the first dataset, the closest name in the second.

Key Features:

Loads two CSV datasets containing company names.
Standardizes the names by converting them to lowercase and stripping leading and trailing whitespace.
Uses fuzzy matching to find the best match for each company name in the first dataset among the names in the second.
Exits with a clear message when a file or the company_name column is missing.
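
The standardization step can also be sketched on its own, outside pandas. The helper below is a hypothetical illustration (the name standardize_name is not part of the script): besides lowercasing and trimming, it collapses runs of internal whitespace and maps missing values to an empty string, two things the script itself does not do.

```python
def standardize_name(name) -> str:
    """Lowercase a company name, trim it, and collapse internal whitespace runs.

    Hypothetical helper for illustration; non-string values (for example a
    missing cell read from a CSV) are mapped to an empty string.
    """
    if not isinstance(name, str):
        return ""
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return " ".join(name.lower().split())


print(standardize_name("  Apple   Inc. "))  # → apple inc.
```

If this behavior is wanted in the script, one option would be to apply the helper with df1['company_name'].map(standardize_name) instead of chaining .str.lower().str.strip().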

CODE:

import pandas as pd
from thefuzz import fuzz, process

# Load the datasets
try:
    df1 = pd.read_csv('d:/code speedy/t1/company_dataset1.csv')  # Ensure the path is correct
    df2 = pd.read_csv('d:/code speedy/t1/company_dataset2.csv')  # Ensure the path is correct
except FileNotFoundError as e:
    print(f"File not found: {e}")
    raise SystemExit(1)  # Exit with a non-zero status if a file is missing

# Check if the required columns exist
if 'company_name' not in df1.columns or 'company_name' not in df2.columns:
    print("One of the datasets is missing the 'company_name' column.")
    raise SystemExit(1)

# Clean the company names
df1['company_clean'] = df1['company_name'].str.lower().str.strip()
df2['company_clean'] = df2['company_name'].str.lower().str.strip()

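# Match companies: find the closest name in df2 for one row of df1.
# Because the choices are a pandas Series, the result is a tuple of
# (matched name, score from 0 to 100, index label of the row in df2).
def match_companies(row):
    return process.extractOne(row['company_clean'], df2['company_clean'], scorer=fuzz.token_sort_ratio)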

# Apply the matching function
df1['match'] = df1.apply(match_companies, axis=1)

# Display the results
print(df1[['company_clean', 'match']])

OUTPUT:

     company_clean                   match
0       apple inc.  (facebook inc., 57, 1)
1  microsoft corp.  (facebook inc., 38, 1)
2       google llc    (google llc, 100, 0)
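
The output above includes low-scoring pairs (57 and 38) alongside an exact match (100), so a score cutoff may be useful before trusting a match. The following is a minimal sketch, not part of the original script: the function name accept_match, the column names, and the cutoff of 80 are all assumptions chosen for illustration.

```python
from typing import Optional, Tuple

MIN_SCORE = 80  # hypothetical cutoff; tune it against your own data


def accept_match(match: Optional[Tuple[str, int, int]], min_score: int = MIN_SCORE) -> dict:
    """Unpack a (name, score, index) tuple and drop the name if the score is too low."""
    if match is None:
        return {"matched_name": None, "score": 0, "df2_index": None}
    name, score, index = match
    if score < min_score:
        # Keep the score for inspection, but do not report a match
        return {"matched_name": None, "score": score, "df2_index": None}
    return {"matched_name": name, "score": score, "df2_index": index}


# Tuples copied from the output above
for m in [("facebook inc.", 57, 1), ("facebook inc.", 38, 1), ("google llc", 100, 0)]:
    print(accept_match(m))
```

With the script's DataFrame, something like pd.DataFrame(df1['match'].map(accept_match).tolist(), index=df1.index) should expand the tuples into separate columns. The extractOne function also accepts a score_cutoff argument and returns None when no candidate passes it, which is why the sketch handles None.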

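EXPLANATION: This program loads two datasets containing company names and cleans the names by converting them to lowercase and stripping leading and trailing whitespace. It then uses the token_sort_ratio scorer from thefuzz to find, for each company in the first dataset, the closest name in the second dataset. Each result is stored in a new match column as a tuple of the matched name, a similarity score from 0 to 100, and the index of the matching row in the second dataset. Finally, the results are printed for comparison. Because extractOne always returns the best available candidate, low scores such as the 57 and 38 in the output indicate that the second dataset has no good counterpart, not that the two names refer to the same company.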
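
To give a feel for what a token-sort comparison does, here is a simplified sketch built on the standard library's difflib. It is an approximation for illustration only, not thefuzz's implementation: the function names are hypothetical, and the scores it produces will generally differ from those of fuzz.token_sort_ratio. The idea it shows is that each name is split into words, the words are sorted, and the sorted strings are then compared, so word order stops mattering.

```python
from difflib import SequenceMatcher


def sort_tokens(name: str) -> str:
    """Lowercase a name, split it on whitespace, and rejoin the tokens in sorted order."""
    return " ".join(sorted(name.lower().split()))


def approx_token_sort_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score after sorting tokens (difflib stand-in)."""
    ratio = SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio()
    return round(100 * ratio)


# The same words in a different order compare as identical after sorting
print(sort_tokens("LLC Google"))
print(approx_token_sort_ratio("Google LLC", "llc google"))

# Different companies that share a suffix still get a partial score
print(approx_token_sort_ratio("apple inc.", "facebook inc."))
```

The last call illustrates why shared suffixes such as "inc." can inflate scores between unrelated companies, which is one reason to inspect the score before accepting a match.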
