Company Name Matching From The Dataset In Python

INTRODUCTION

This program helps users find company names by comparing typed input against a dataset with fuzzy matching. It first normalizes the input (lowercasing, trimming, and removing punctuation), then scores every name in the dataset for similarity and displays the closest one. The user can then list the other top matches and pick one of them.

PROGRAM

Creating The CSV File (Company Dataset)

company_name
Google LLC
Google Inc.
Microsoft Corporation
Microsoft Corp.
Apple Inc.
"Apple Computer, Inc."
"Amazon.com, Inc."
Amazon Inc.
Amazon Web Services (AWS)
"Meta Platforms, Inc."
"Facebook, Inc."
Intel Corporation
Intel Corp
Mercedes-Benz Group AG
Mercedes-Benz USA LLC
Hyundai Motor Company
Hyundai Motors Corp.
The Walt Disney Company
Disney Corp.
"Netflix, Inc."
Netflix Corporation
Spotify Technology S.A.
Spotify Inc
Vodafone Group PLC
Vodafone Inc.
Visa Inc.
Visa USA
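
Several of the names above contain commas (for example Apple Computer, Inc.), so each such name has to be wrapped in double quotes in companies.csv, otherwise pd.read_csv is likely to treat the comma as a column separator. Below is a minimal sketch of generating the file with Python's csv module, which adds those quotes automatically. The function name write_company_csv and the sample file name are hypothetical, and only a few of the names are included to keep it short.

```python
import csv
import os
import tempfile


def write_company_csv(path, names):
    """Write one company name per row under a company_name header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # The default quoting mode (QUOTE_MINIMAL) quotes only the fields
        # that need it, such as names containing a comma.
        writer = csv.writer(f)
        writer.writerow(["company_name"])
        for name in names:
            writer.writerow([name])


# A few names from the dataset; the rest would be added the same way.
sample_names = ["Google LLC", "Apple Computer, Inc.", "Amazon.com, Inc.", "Visa USA"]

# Hypothetical location: a sample file in the temp directory.
sample_path = os.path.join(tempfile.gettempdir(), "companies_sample.csv")
write_company_csv(sample_path, sample_names)

with open(sample_path, newline="", encoding="utf-8") as f:
    print(f.read())
```

Names without a comma are written as-is, so the file stays readable in a text editor.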

Program For Company Name Matching From The Dataset In Python

import pandas as pd
from thefuzz import fuzz
from thefuzz import process

df = pd.read_csv('companies.csv')

def normalize(name):
    if pd.isna(name):  
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    name = ''.join(e for e in name if e.isalnum() or e == ' ')
    return name

def match_company(name, target_df, limit=3):
    normalized = target_df['normalized_name'].tolist()
    matches = process.extract(name, normalized, scorer=fuzz.ratio, limit=limit)
    results = [(target_df.iloc[normalized.index(match)]['company_name'], score) for match, score in matches]
    return results

df['normalized_name'] = df['company_name'].apply(normalize)
target_companies = df[['company_name', 'normalized_name']]

user_input = input("Enter The Company Name You Want To Match: ").strip()
normalized_input = normalize(user_input)

if normalized_input:
    similar_matches = match_company(normalized_input, target_companies, limit=3)
    
    if similar_matches:
        best_match, best_score = similar_matches[0]
        print(f" Best Match Found: {best_match} With A Match Score Of {best_score}\n")

        see_similar = input(" Would You Like To See Other Similar Company Names? (yes/no): ").lower()

        if see_similar == 'yes':
            print("\n Here Are The Most Similar Company Names:")
            for i, (match, score) in enumerate(similar_matches):
                print(f"{i+1}. {match} (Match Score: {score})")

            choice = input("\n Pick A Company By Entering The Number (1-3): ")

            try:
                choice = int(choice) - 1
                if 0 <= choice < len(similar_matches):
                    selected_match, selected_score = similar_matches[choice]
                    print(f"\n You Selected: {selected_match} With A Match Score Of {selected_score}")
                else:
                    print("\n Invalid Selection. Please Run The Program Again And Select A Valid Number.")
            except ValueError:
                print("\n Invalid Input. Please Enter A Valid Number.")
    else:
        print("\n No Close Matches Found. Please Try Again With A Different Name.")
else:
    print("Empty Input. Please Enter A Valid Company Name.")

view_dataset = input("\n Would You Like To See The Whole Dataset? (yes/no): ").lower()

if view_dataset == 'yes':
    print("Here Is The Entire Dataset:")
    print(df['company_name'])
else:
    print("Exiting The Program!")

Let’s break down the program step by step.

Importing Libraries and Reading the Data

import pandas as pd
from thefuzz import fuzz
from thefuzz import process

df = pd.read_csv('companies.csv')

  1. First, the program imports pandas and thefuzz. pandas handles data manipulation, while thefuzz provides fuzzy string matching.
  2. It then reads the CSV file companies.csv into a pandas DataFrame (df), which holds the list of companies.

Defining the normalize Function

def normalize(name):
    if pd.isna(name):  
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    name = ''.join(e for e in name if e.isalnum() or e == ' ')
    return name

This function cleans and standardizes a company name:

  1. It first checks whether the name is missing (pd.isna is true for NaN and None, which is how pandas represents empty cells) and, if so, returns an empty string.
  2. Next, it converts the name to lowercase and removes leading and trailing whitespace.
  3. It replaces the ampersand symbol (&) with “and”.
  4. Finally, it removes every character that is neither alphanumeric nor a space, leaving only letters, digits, and spaces.
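
To make these steps concrete, the sketch below runs a copy of normalize on a few names. To keep it free of third-party imports, the pd.isna check is replaced with a plain-Python stand-in, which is an assumption of this example rather than the original code. "Procter & Gamble" is not in the dataset and is included only to show the ampersand rule.

```python
def normalize(name):
    # Stand-in for pd.isna: treat None and float NaN as missing.
    if name is None or (isinstance(name, float) and name != name):
        return ''
    name = name.lower().strip()
    name = name.replace('&', 'and')
    # Keep letters, digits, and spaces; drop punctuation.
    return ''.join(ch for ch in name if ch.isalnum() or ch == ' ')


samples = [
    "Amazon.com, Inc.",
    "  Mercedes-Benz Group AG ",
    "Spotify Technology S.A.",
    "Procter & Gamble",  # hypothetical name, not in the dataset
    None,
]

for raw in samples:
    print(repr(raw), '->', repr(normalize(raw)))
```

Note that punctuation is removed rather than replaced with a space, so "Amazon.com" becomes "amazoncom" and "Mercedes-Benz" becomes "mercedesbenz".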

Defining the match_company Function

def match_company(name, target_df, limit=3):
    normalized = target_df['normalized_name'].tolist()
    matches = process.extract(name, normalized, scorer=fuzz.ratio, limit=limit)
    results = [(target_df.iloc[normalized.index(match)]['company_name'], score) for match, score in matches]
    return results

This function handles the company name matching:

  1. It takes the normalized input name and builds a list of the normalized names from the DataFrame.
  2. The process.extract() function from thefuzz scores the input against each name in that list using fuzz.ratio, which returns a similarity score from 0 to 100.
  3. It keeps the limit highest-scoring matches (the default is 3).
  4. For each match, it looks up the original company name by position and pairs it with the match score. Because the lookup uses list.index(), names that normalize to the same string all resolve to the first such row.
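
To see where scores such as 89, 76, and 52 in the sample output come from, the sketch below recomputes them by hand. It assumes that fuzz.ratio behaves like the normalized longest-common-subsequence similarity of rapidfuzz, which recent thefuzz releases delegate to as far as I know; older releases may score slightly differently. The helpers lcs_length and simple_ratio are hypothetical names for illustration and are not part of thefuzz.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def simple_ratio(a, b):
    """Similarity from 0 to 100: 2 * LCS / (len(a) + len(b)), rounded."""
    if not a and not b:
        return 100
    return round(200 * lcs_length(a, b) / (len(a) + len(b)))


query = "amazon i"
for candidate in ["amazon inc", "amazoncom inc", "amazon web services aws"]:
    print(candidate, simple_ratio(query, candidate))
```

All three candidates contain the eight characters of "amazon i" in order, so the score drops only because the longer names add unmatched characters.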

Normalizing the Company Names

df['normalized_name'] = df['company_name'].apply(normalize)
target_companies = df[['company_name', 'normalized_name']]

  1. Here, the program applies the normalize function to every company name in the DataFrame.
  2. It stores both the original company names and their normalized versions in target_companies.

Getting User Input

user_input = input("Enter The Company Name You Want To Match: ").strip()
normalized_input = normalize(user_input)

  1. The program prompts the user to input a company name.
  2. After that, it normalizes the input using the normalize function to prepare it for comparison.

Matching and Displaying Results

if normalized_input:
    similar_matches = match_company(normalized_input, target_companies, limit=3)
    
    if similar_matches:
        best_match, best_score = similar_matches[0]
        print(f" Best Match Found: {best_match} With A Match Score Of {best_score}\n")

  1. If the input is not empty, the program calls the match_company function to find similar names.
  2. It checks whether the returned list contains any matches:
    • If it does, it prints the first entry, which is the best match (the one with the highest similarity score).

Asking to See Similar Matches

see_similar = input(" Would You Like To See Other Similar Company Names? (yes/no): ").lower()

if see_similar == 'yes':
    print("\n Here Are The Most Similar Company Names:")
    for i, (match, score) in enumerate(similar_matches):
        print(f"{i+1}. {match} (Match Score: {score})")

  1. The program asks the user whether they want to see other similar company names.
  2. If the user answers “yes”, it displays the top 3 matches along with their similarity scores. Any other answer skips this step.
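
The check above compares the lowercased answer with 'yes' exactly, so "y" or an answer with a trailing space counts as a no. A slightly more forgiving check could look like the sketch below; is_yes is a hypothetical helper, not part of the original program.

```python
def is_yes(answer):
    """Return True for common affirmative answers, ignoring case and surrounding spaces."""
    return answer.strip().lower() in {"y", "yes"}


for raw in ["yes", " Yes ", "Y", "no", ""]:
    print(repr(raw), is_yes(raw))
```

In the program this would be used as is_yes(input(...)) in place of the == 'yes' comparison.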

Selecting a Match

choice = input("\n Pick A Company By Entering The Number (1-3): ")

try:
    choice = int(choice) - 1
    if 0 <= choice < len(similar_matches):
        selected_match, selected_score = similar_matches[choice]
        print(f"\n You Selected: {selected_match} With A Match Score Of {selected_score}")
    else:
        print("\n Invalid Selection. Please Run The Program Again And Select A Valid Number.")
except ValueError:
    print("\n Invalid Input. Please Enter A Valid Number.")

  1. The user is prompted to select a company by entering the number corresponding to one of the displayed matches.
  2. If the user makes a valid selection, the chosen company name and its score are displayed.
  3. If the input is non-numeric or out of range, the program prints an error message instead of crashing.
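
The same validation can be pulled into a small function, which makes it easy to test without typing into a prompt. This is a sketch under the assumption that the menu is numbered from 1, as in the program; parse_choice is a hypothetical name.

```python
def parse_choice(raw, num_options):
    """Convert a 1-based menu answer to a 0-based index, or None if invalid."""
    try:
        index = int(raw.strip()) - 1
    except ValueError:
        return None  # not a whole number
    if 0 <= index < num_options:
        return index
    return None  # number outside the menu


for raw in ["2", " 3 ", "0", "4", "abc"]:
    print(repr(raw), parse_choice(raw, 3))
```

Returning None for both kinds of invalid input lets the caller decide whether to print a message or ask again.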

Handling No Matches or Empty Input

    else:
        print("\n No Close Matches Found. Please Try Again With A Different Name.")
else:
    print("Empty Input. Please Enter A Valid Company Name.")

If the match list comes back empty, or if the input is empty after normalization, the program prints a message instead of results.
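
As far as I can tell, process.extract returns the top limit entries whatever their scores, so with a non-empty dataset the list is never empty and a poor match is still reported as the best match. One way to give "no close matches" a concrete meaning is a minimum score, sketched below. MIN_SCORE and keep_close_matches are hypothetical names, and the value 60 is an arbitrary choice for illustration.

```python
MIN_SCORE = 60  # hypothetical cutoff; tune it for your data


def keep_close_matches(matches, min_score=MIN_SCORE):
    """Drop (name, score) pairs whose score is below min_score."""
    return [(name, score) for name, score in matches if score >= min_score]


# Scores taken from the sample run for the input "amazon i".
matches = [
    ("Amazon Inc.", 89),
    ("Amazon.com, Inc.", 76),
    ("Amazon Web Services (AWS)", 52),
]

close = keep_close_matches(matches)
if close:
    print("Best match:", close[0])
else:
    print("No close matches found.")
```

With this filter applied to the result of match_company, the existing if/else would reach the "No Close Matches Found" message whenever every score falls below the cutoff.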

Offering to View the Entire Dataset

view_dataset = input("\n Would You Like To See The Whole Dataset? (yes/no): ").lower()

if view_dataset == 'yes':
    print("Here Is The Entire Dataset:")
    print(df['company_name'])
else:
    print("Exiting The Program!")

  1. At the end, the program asks the user whether they want to view the entire dataset.
  2. If the user chooses “yes”, it prints the company_name column, including the row index and a trailing Name/dtype line. If not, the program exits.

OUTPUT

Enter The Company Name You Want To Match: amazon i
Best Match Found: Amazon Inc. With A Match Score Of 89

Would You Like To See Other Similar Company Names? (yes/no): yes

Here Are The Most Similar Company Names:
1. Amazon Inc. (Match Score: 89)
2. Amazon.com, Inc. (Match Score: 76)
3. Amazon Web Services (AWS) (Match Score: 52)

Pick A Company By Entering The Number (1-3): 2

You Selected: Amazon.com, Inc. With A Match Score Of 76

Would You Like To See The Whole Dataset? (yes/no): yes
Here Is The Entire Dataset:
0 Google LLC
1 Google Inc.
2 Microsoft Corporation
3 Microsoft Corp.
4 Apple Inc.
5 Apple Computer, Inc.
6 Amazon.com, Inc.
7 Amazon Inc.
8 Amazon Web Services (AWS)
9 Meta Platforms, Inc.
10 Facebook, Inc.
11 Intel Corporation
12 Intel Corp
13 Mercedes-Benz Group AG
14 Mercedes-Benz USA LLC
15 Hyundai Motor Company
16 Hyundai Motors Corp.
17 The Walt Disney Company
18 Disney Corp.
19 Netflix, Inc.
20 Netflix Corporation
21 Spotify Technology S.A.
22 Spotify Inc
23 Vodafone Group PLC
24 Vodafone Inc.
25 Visa Inc.
26 Visa USA
Name: company_name, dtype: object

CONCLUSION

In conclusion, this program cleans and normalizes company names, then uses fuzzy matching to find the entries closest to the user input. It handles missing values, standardizes the names so they compare consistently, shows the best match, and lets the user browse and select from the top three. It also offers the option to print the entire dataset.
