INTRODUCTION
This program helps users find company names by comparing their input against a dataset using fuzzy matching. It first normalizes the input and the dataset names into a consistent format, then scores each dataset name by its similarity to the input. The program displays the best match, and the user can choose to view the other top-scoring names.
PROGRAM
Creating the CSV File (Company Dataset)
company_name
Google LLC
Google Inc.
Microsoft Corporation
Microsoft Corp.
Apple Inc.
"Apple Computer, Inc."
"Amazon.com, Inc."
Amazon Inc.
Amazon Web Services (AWS)
"Meta Platforms, Inc."
"Facebook, Inc."
Intel Corporation
Intel Corp
Mercedes-Benz Group AG
Mercedes-Benz USA LLC
Hyundai Motor Company
Hyundai Motors Corp.
The Walt Disney Company
Disney Corp.
"Netflix, Inc."
Netflix Corporation
Spotify Technology S.A.
Spotify Inc
Vodafone Group PLC
Vodafone Inc.
Visa Inc.
Visa USA
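One way to produce this file is with Python's csv module, which quotes any name containing a comma so that the whole name stays in one column. The sketch below is an assumption about how the file might be created, not part of the original program; the helper name build_csv_text is hypothetical, and only a few of the names are included for brevity.

```python
import csv
import io

# A few names from the dataset; the full list would be written the same way.
names = ["Google LLC", "Amazon.com, Inc.", "Apple Computer, Inc.", "Intel Corp"]

def build_csv_text(company_names):
    """Return CSV text with a company_name header and one name per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["company_name"])
    for name in company_names:
        # csv.writer quotes fields that contain the delimiter (a comma).
        writer.writerow([name])
    return buffer.getvalue()

csv_text = build_csv_text(names)
print(csv_text)

# To save it for the matching program:
# with open("companies.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

Without the quotes, a row such as Amazon.com, Inc. would be split into two fields when the file is read back.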
Program for Company Name Matching from the Dataset in Python
import pandas as pd
from thefuzz import fuzz
from thefuzz import process

# Load the company dataset
df = pd.read_csv('companies.csv')

def normalize(name):
    # Missing values become an empty string
    if pd.isna(name):
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    # Keep only letters, digits, and spaces
    name = ''.join(e for e in name if e.isalnum() or e == ' ')
    return name

def match_company(name, target_df, limit=3):
    normalized = target_df['normalized_name'].tolist()
    matches = process.extract(name, normalized, scorer=fuzz.ratio, limit=limit)
    # Map each normalized match back to its original company name
    results = [
        (target_df.iloc[normalized.index(match)]['company_name'], score)
        for match, score in matches
    ]
    return results

df['normalized_name'] = df['company_name'].apply(normalize)
target_companies = df[['company_name', 'normalized_name']]

user_input = input("Enter The Company Name You Want To Match: ").strip()
normalized_input = normalize(user_input)

if normalized_input:
    similar_matches = match_company(normalized_input, target_companies, limit=3)
    if similar_matches:
        best_match, best_score = similar_matches[0]
        print(f" Best Match Found: {best_match} With A Match Score Of {best_score}\n")
        see_similar = input(" Would You Like To See Other Similar Company Names? (yes/no): ").lower()
        if see_similar == 'yes':
            print("\n Here Are The Most Similar Company Names:")
            for i, (match, score) in enumerate(similar_matches):
                print(f"{i+1}. {match} (Match Score: {score})")
            choice = input("\n Pick A Company By Entering The Number (1-3): ")
            try:
                # Convert the 1-based menu number to a 0-based index
                choice = int(choice) - 1
                if 0 <= choice < len(similar_matches):
                    selected_match, selected_score = similar_matches[choice]
                    print(f"\n You Selected: {selected_match} With A Match Score Of {selected_score}")
                else:
                    print("\n Invalid Selection. Please Run The Program Again And Select A Valid Number.")
            except ValueError:
                print("\n Invalid Input. Please Enter A Valid Number.")
    else:
        print("\n No Close Matches Found. Please Try Again With A Different Name.")
else:
    print("Empty Input. Please Enter A Valid Company Name.")

view_dataset = input("\n Would You Like To See The Whole Dataset? (yes/no): ").lower()
if view_dataset == 'yes':
    print("Here Is The Entire Dataset:")
    print(df['company_name'])
else:
    print("Exiting The Program!")
Let’s break down the program step by step to understand it clearly.
Importing Libraries and Reading the Data
import pandas as pd
from thefuzz import fuzz
from thefuzz import process

df = pd.read_csv('companies.csv')
- First, the program imports the pandas and thefuzz libraries. pandas handles data manipulation, while thefuzz provides the fuzzy string matching (its fuzz and process modules are used here).
- It then reads a CSV file called companies.csv into a pandas DataFrame (df), which stores the list of companies.
Defining the normalize Function
def normalize(name):
    if pd.isna(name):
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    name = ''.join(e for e in name if e.isalnum() or e == ' ')
    return name
This function is crucial for cleaning and standardizing the company names:
- It first checks whether the name is missing (pd.isna is true for values such as NaN or None) and, if so, returns an empty string.
- Next, it converts the name to lowercase and removes extra spaces from the start and end.
- It replaces the ampersand symbol (&) with “and”.
- Finally, it strips out all non-alphanumeric characters except spaces, leaving only letters, numbers, and spaces.
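The steps above can be tried on a few names. The sketch below uses a hypothetical pandas-free copy of the function, normalize_text, in which the pd.isna check is swapped for a plain None test so that it runs without any dependencies; "A&B Holdings" is a made-up name used only to show the ampersand rule.

```python
def normalize_text(name):
    """Pandas-free stand-in for normalize(): None is treated as missing."""
    if name is None:
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    # Keep only letters, digits, and spaces
    return ''.join(ch for ch in name if ch.isalnum() or ch == ' ')

examples = [
    "Amazon.com, Inc.",
    "  Mercedes-Benz Group AG ",
    "Spotify Technology S.A.",
    "A&B Holdings",  # hypothetical name, not in the dataset
    None,
]
for raw in examples:
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Punctuation is removed without inserting a space, so "Amazon.com, Inc." becomes "amazoncom inc" and "Mercedes-Benz" becomes "mercedesbenz".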
Defining the match_company Function
def match_company(name, target_df, limit=3):
    normalized = target_df['normalized_name'].tolist()
    matches = process.extract(name, normalized, scorer=fuzz.ratio, limit=limit)
    results = [
        (target_df.iloc[normalized.index(match)]['company_name'], score)
        for match, score in matches
    ]
    return results
This function handles the company name matching:
- It takes the normalized input name and compares it to a list of normalized names from the DataFrame.
- The process.extract() function from thefuzz scores the input against every normalized name using fuzz.ratio, a similarity score from 0 to 100.
- It returns up to limit of the best matches (the default is 3), ordered from the highest score to the lowest.
- For each match, it retrieves the original company name and the corresponding match score.
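To get a feel for where the scores come from, the sketch below computes a similarity by hand. It assumes that fuzz.ratio behaves like a normalized longest-common-subsequence ratio (twice the common length divided by the combined length, scaled to 100), which appears to be how recent thefuzz releases compute it; other versions may give slightly different numbers. The function names lcs_length and similarity are hypothetical and are not part of thefuzz.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Score from 0 to 100: twice the common length over the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * 2 * lcs_length(a, b) / total)

query = "amazon i"
for candidate in ["amazon inc", "amazoncom inc", "amazon web services aws"]:
    print(candidate, similarity(query, candidate))
# amazon inc 89
# amazoncom inc 76
# amazon web services aws 52
```

For the input "amazon i", these values agree with the scores in the sample run in the OUTPUT section.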
Normalizing the Company Names
df['normalized_name'] = df['company_name'].apply(normalize)
target_companies = df[['company_name', 'normalized_name']]
- Here, the program applies the normalize function to every company name in the DataFrame and stores the result in a new normalized_name column.
- It keeps both the original company names and their normalized versions in target_companies.
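Since match_company looks up the original name with normalized.index(match), which returns the first position only, it can be worth checking whether two different names normalize to the same string. The sketch below is one hypothetical way to do that on a list of (original, normalized) pairs; "Intel Corp." is a made-up extra row added to force a collision, and the dataset shown earlier does not contain one.

```python
from collections import defaultdict

# (original, normalized) pairs; "Intel Corp." is a hypothetical extra row.
pairs = [
    ("Intel Corporation", "intel corporation"),
    ("Intel Corp", "intel corp"),
    ("Intel Corp.", "intel corp"),
]

# Group the original names by their normalized form
groups = defaultdict(list)
for original, normalized in pairs:
    groups[normalized].append(original)

collisions = {key: names for key, names in groups.items() if len(names) > 1}
print(collisions)

# list.index returns the first position, so only "Intel Corp" is reachable
normalized_list = [normalized for _, normalized in pairs]
first_hit = normalized_list.index("intel corp")
print(pairs[first_hit][0])
```

If collisions turned up in a real dataset, the lookup would need to carry the row position along with each match instead of searching by value.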
Getting User Input
user_input = input("Enter The Company Name You Want To Match: ").strip()
normalized_input = normalize(user_input)
- The program prompts the user to input a company name.
- After that, it normalizes the input using the normalize function to prepare it for comparison.
Matching and Displaying Results
if normalized_input:
    similar_matches = match_company(normalized_input, target_companies, limit=3)
    if similar_matches:
        best_match, best_score = similar_matches[0]
        print(f" Best Match Found: {best_match} With A Match Score Of {best_score}\n")
- If the input is not empty, the program calls the match_company function to find similar names.
- It checks whether any matches were returned. If so, it prints the best match (the one with the highest similarity score).
Asking to See Similar Matches
        see_similar = input(" Would You Like To See Other Similar Company Names? (yes/no): ").lower()
        if see_similar == 'yes':
            print("\n Here Are The Most Similar Company Names:")
            for i, (match, score) in enumerate(similar_matches):
                print(f"{i+1}. {match} (Match Score: {score})")
- The program asks the user if they want to see other similar company names.
- If the user answers “yes”, it displays the top 3 matches along with their similarity scores.
Selecting a Match
            choice = input("\n Pick A Company By Entering The Number (1-3): ")
            try:
                choice = int(choice) - 1
                if 0 <= choice < len(similar_matches):
                    selected_match, selected_score = similar_matches[choice]
                    print(f"\n You Selected: {selected_match} With A Match Score Of {selected_score}")
                else:
                    print("\n Invalid Selection. Please Run The Program Again And Select A Valid Number.")
            except ValueError:
                print("\n Invalid Input. Please Enter A Valid Number.")
- The user is prompted to select a company by entering the number corresponding to one of the displayed matches.
- If the user makes a valid selection, the chosen company name and its score are displayed.
- If the input is invalid (out of range or non-numeric), the program prints an error message instead of crashing. It does not ask again; the user has to rerun the program to make another selection.
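The validation described above can also be pulled into a small helper, which makes it easier to test and to reuse in a loop that asks again. The sketch below is one possible shape; the name parse_choice is hypothetical and does not appear in the original program.

```python
def parse_choice(text, option_count):
    """Return a zero-based index for a 1-based menu choice, or None if invalid."""
    try:
        index = int(text) - 1
    except ValueError:
        # Non-numeric input such as "abc" or "2.5"
        return None
    if 0 <= index < option_count:
        return index
    # Numeric but outside the menu range
    return None

for text in ["2", "0", "4", "abc", " 3 "]:
    print(repr(text), "->", parse_choice(text, 3))
```

A caller could keep prompting until parse_choice returns something other than None.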
Handling No Matches or Empty Input
    else:
        print("\n No Close Matches Found. Please Try Again With A Different Name.")
else:
    print("Empty Input. Please Enter A Valid Company Name.")
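If match_company returns an empty list, or if the normalized input is empty (for example, when the user enters only punctuation), the program informs the user. Note that process.extract is called without a minimum score, so it returns up to limit candidates whenever the dataset is non-empty; as written, the "No Close Matches Found" message appears only when the dataset itself is empty.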
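The "No Close Matches Found" message is more useful if weak candidates are filtered out before the check. The sketch below assumes a hypothetical threshold, MIN_SCORE = 60, and a hypothetical helper, keep_close_matches, applied to the (name, score) pairs that match_company returns; the right threshold depends on the data.

```python
MIN_SCORE = 60  # hypothetical threshold, chosen only for illustration

def keep_close_matches(matches, min_score=MIN_SCORE):
    """Drop (name, score) pairs whose score is below min_score."""
    return [(name, score) for name, score in matches if score >= min_score]

sample = [
    ("Amazon Inc.", 89),
    ("Amazon.com, Inc.", 76),
    ("Amazon Web Services (AWS)", 52),
]
print(keep_close_matches(sample))                # keeps the first two
print(keep_close_matches(sample, min_score=95))  # empty list
```

With a filter like this in place, an empty result really does mean that nothing in the dataset is close to the input.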
Offering to View the Entire Dataset
view_dataset = input("\n Would You Like To See The Whole Dataset? (yes/no): ").lower()
if view_dataset == 'yes':
    print("Here Is The Entire Dataset:")
    print(df['company_name'])
else:
    print("Exiting The Program!")
- At the end, the program asks the user if they want to view the entire dataset.
- If the user chooses “yes”, it prints the list of all company names. If not, the program exits.
OUTPUT
Enter The Company Name You Want To Match: amazon i
Best Match Found: Amazon Inc. With A Match Score Of 89

Would You Like To See Other Similar Company Names? (yes/no): yes

Here Are The Most Similar Company Names:
1. Amazon Inc. (Match Score: 89)
2. Amazon.com, Inc. (Match Score: 76)
3. Amazon Web Services (AWS) (Match Score: 52)

Pick A Company By Entering The Number (1-3): 2

You Selected: Amazon.com, Inc. With A Match Score Of 76

Would You Like To See The Whole Dataset? (yes/no): yes
Here Is The Entire Dataset:
0     Google LLC
1     Google Inc.
2     Microsoft Corporation
3     Microsoft Corp.
4     Apple Inc.
5     Apple Computer, Inc.
6     Amazon.com, Inc.
7     Amazon Inc.
8     Amazon Web Services (AWS)
9     Meta Platforms, Inc.
10    Facebook, Inc.
11    Intel Corporation
12    Intel Corp
13    Mercedes-Benz Group AG
14    Mercedes-Benz USA LLC
15    Hyundai Motor Company
16    Hyundai Motors Corp.
17    The Walt Disney Company
18    Disney Corp.
19    Netflix, Inc.
20    Netflix Corporation
21    Spotify Technology S.A.
22    Spotify Inc
23    Vodafone Group PLC
24    Vodafone Inc.
25    Visa Inc.
26    Visa USA
Name: company_name, dtype: object
CONCLUSION
In conclusion, this program cleans and normalizes company names, then uses fuzzy matching to find the closest matches to the user’s input. It handles missing data, standardizes the names for consistency, reports the best match, and lets the user browse and select from the top three candidates. It also offers the option to view the entire dataset.