This Python script matches company names between two datasets using fuzzy matching from the thefuzz library. It standardizes the names in both datasets, then finds the best match in the second dataset for each name in the first.
Key Features:
- Loads two CSV datasets containing company names.
- Standardizes the names by converting them to lowercase and stripping leading and trailing whitespace.
- Uses fuzzy matching (the token_sort_ratio scorer) to find, for each company name in the first dataset, the best match in the second.
- Exits with an error message if a file or the 'company_name' column is missing.
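The token-sort idea behind the fuzzy-matching step can be sketched with the standard library alone. This is only an illustration: difflib's SequenceMatcher stands in for the scorer, the helper names token_sort_key and approx_token_sort_score are hypothetical, and thefuzz's own preprocessing and scoring differ in detail, so the numbers will not match the script's output.

```python
from difflib import SequenceMatcher


def token_sort_key(name: str) -> str:
    """Lowercase, split on whitespace, sort the tokens, and rejoin them."""
    return " ".join(sorted(name.lower().split()))


def approx_token_sort_score(a: str, b: str) -> int:
    """Rough 0-100 similarity of the token-sorted strings.

    Assumption: difflib is a stand-in here, not the scorer thefuzz uses.
    """
    ratio = SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()
    return round(100 * ratio)


# Word order stops mattering once the tokens are sorted.
print(token_sort_key("Inc. Apple"))                          # apple inc.
print(approx_token_sort_score("Inc. Apple", "apple inc."))   # 100
```

Sorting the tokens first is what lets names such as "Inc. Apple" and "apple inc." compare as identical, while unrelated names still score lower.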
CODE:
import sys

import pandas as pd
from thefuzz import fuzz, process

# Load the datasets
try:
    df1 = pd.read_csv('d:/code speedy/t1/company_dataset1.csv')  # Ensure the path is correct
    df2 = pd.read_csv('d:/code speedy/t1/company_dataset2.csv')  # Ensure the path is correct
except FileNotFoundError as e:
    print(f"File not found: {e}")
    sys.exit(1)  # Exit if files are not found

# Check if the required columns exist
if 'company_name' not in df1.columns or 'company_name' not in df2.columns:
    print("One of the datasets is missing the 'company_name' column.")
    sys.exit(1)

# Clean the company names: lowercase, strip leading/trailing whitespace
df1['company_clean'] = df1['company_name'].str.lower().str.strip()
df2['company_clean'] = df2['company_name'].str.lower().str.strip()

# Match companies: find the best df2 candidate for one df1 row
def match_companies(row):
    match = process.extractOne(row['company_clean'], df2['company_clean'], scorer=fuzz.token_sort_ratio)
    return match

# Apply the matching function to every row of df1
df1['match'] = df1.apply(match_companies, axis=1)

# Display the results
print(df1[['company_clean', 'match']])
OUTPUT:
     company_clean                   match
0       apple inc.  (facebook inc., 57, 1)
1  microsoft corp.  (facebook inc., 38, 1)
2       google llc    (google llc, 100, 0)
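Each entry in the match column is a tuple of (matched name, score, row index in the second dataset). Low scores such as 57 and 38 are still reported as the "best" match, so it can help to reject matches below a cutoff. The sketch below uses the three tuples from the output above; the cutoff of 80 and the names MIN_SCORE and accept_match are assumptions for illustration, not part of the script.

```python
from typing import Optional, Tuple

# (matched name, score, row index in the second dataset)
Match = Tuple[str, int, int]

# Hypothetical cutoff; tune it against your own data.
MIN_SCORE = 80


def accept_match(match: Match, min_score: int = MIN_SCORE) -> Optional[str]:
    """Return the matched name if its score reaches the cutoff, else None."""
    name, score, _index = match
    return name if score >= min_score else None


# The match tuples from the output above, keyed by the cleaned name.
matches = {
    "apple inc.": ("facebook inc.", 57, 1),
    "microsoft corp.": ("facebook inc.", 38, 1),
    "google llc": ("google llc", 100, 0),
}

accepted = {name: accept_match(m) for name, m in matches.items()}
print(accepted)
# {'apple inc.': None, 'microsoft corp.': None, 'google llc': 'google llc'}
```

With this cutoff only the exact "google llc" match is kept; the two weak "facebook inc." matches are treated as no match.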