Naive Bayes Classification using sklearn in Python

Hey there! Ready to explore the world of classification using machine learning? In this tutorial, we’ll learn how to use scikit-learn (sklearn) in Python to perform Naive Bayes classification. Naive Bayes is a simple yet effective algorithm that works particularly well for text classification. Let’s get started!

Building a Naive Bayes Classification Model with sklearn

Step 1: Setting up the environment

First, we need to set up our environment and import the necessary libraries. We will use Pandas for data manipulation and scikit-learn for building our model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset
df = pd.read_csv("spam.csv", encoding='latin-1')
df = df[['v1', 'v2']]  # Selecting only the relevant columns
df.columns = ['label', 'text']
df.head()

Output:

 label text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
Step 2: Exploring the Data

Next, let’s take a closer look at our data. This dataset contains SMS messages classified as ‘spam’ or ‘ham’ (not spam). We will inspect the first few rows to understand its structure.

print(df.head())

Output:

 label text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
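
Beyond the first few rows, it often helps to check how the two classes are balanced, since a heavily skewed dataset makes accuracy alone harder to interpret. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its messages are hypothetical stand-ins, not rows from spam.csv); the same `value_counts` call should work on the tutorial's `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real spam.csv
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "spam"],
    "text": [
        "Ok lar, see you later",
        "Are we still meeting today?",
        "Free entry! Win a prize now",
        "Call me when you get home",
        "Claim your free reward today",
    ],
})

# Count how many messages fall into each class
label_counts = toy_df["label"].value_counts()
print(label_counts)

# Same counts expressed as a fraction of all messages
label_share = toy_df["label"].value_counts(normalize=True)
print(label_share)
```

Here the toy data has 3 ham and 2 spam messages, so spam makes up 40% of the rows.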
Step 3: Preprocessing the Data

Before building our model, we need to prepare the data. We will encode the labels as numbers and convert the text into numerical features using CountVectorizer, which counts how often each word appears in each message.

# Encode the labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Convert text to numerical data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])

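# Split the data into training and testing sets
y = df['label']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Report the shapes of the feature matrix and the label vector
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")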

Output:

X shape: (5572, 8713)
y shape: (5572,)
Step 4: Building the Model

Now, we will create our Naive Bayes model. We will use the MultinomialNB class from sklearn, which is well-suited for text data.

# Initialize the model
model = MultinomialNB()

# Train the model
model.fit(x_train, y_train)

print("Model training completed.")

Output:

Model training completed.
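
If you are curious what the model learns during `fit`, the sketch below trains MultinomialNB on a tiny hand-made count matrix. The two features ("free" and "meeting") and the four rows are assumptions chosen for illustration. With the default `alpha=1.0` (Laplace smoothing), each per-class word probability is (count + 1) / (total count + number of features).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts; columns are the words "free" and "meeting"
toy_X = np.array([
    [2, 0],  # spam
    [1, 0],  # spam
    [0, 2],  # ham
    [0, 1],  # ham
])
toy_y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_X, toy_y)

# Smoothed word probabilities per class (rows: ham, spam)
word_probs = np.exp(toy_model.feature_log_prob_)
print(word_probs)

# Class probabilities for a new message containing "free" once
proba = toy_model.predict_proba(np.array([[1, 0]]))
print(proba)
```

For the spam class, "free" appears 3 times out of 3 words, so its smoothed probability is (3 + 1) / (3 + 2) = 0.8. With equal class priors, a message containing only "free" therefore gets a spam probability of 0.8.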
Step 5: Evaluating the model

After training our model, we need to evaluate its performance on the test data. We will use accuracy score, confusion matrix, and classification report to assess how well our model performs.

# Make predictions on the test data
y_pred = model.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

# Generate classification report
class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)

Output:

Accuracy: 0.9879
Confusion Matrix:
[[955 0]
[ 13 147]]
Classification Report:
                precision     recall     f1-score     support

           0         0.99       1.00         0.99         955
           1         1.00       0.92         0.96         160

    accuracy                                 0.99        1115
   macro avg         0.99       0.96         0.98        1115
weighted avg         0.99       0.99         0.99        1115
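
The precision, recall, and F1 values for the spam class (label 1) can be derived by hand from the confusion matrix. The sketch below uses the four numbers reported above, following scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are just for illustration.

```python
# Confusion matrix from the output above:
# rows = true class (ham, spam), columns = predicted class (ham, spam)
tn, fp = 955, 0    # true ham:  predicted ham, predicted spam
fn, tp = 13, 147   # true spam: predicted ham, predicted spam

# Of the messages flagged as spam, how many really were spam?
precision = tp / (tp + fp)

# Of the real spam messages, how many did the model catch?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.92 f1=0.96
```

In other words, the model never mislabels ham as spam here, but it misses 13 of the 160 spam messages.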
Step 6: Fine-Tuning and Improvements

Finally, if your model’s performance is not as high as you’d like, don’t worry! You can go back and tweak it. Trying different preprocessing techniques, adjusting the train-test split ratio, or experimenting with the other Naive Bayes variants available in sklearn, such as BernoulliNB or GaussianNB, can help improve the model. Keep in mind that GaussianNB expects a dense array, so the sparse matrix from CountVectorizer would need to be converted (for example with .toarray()) before fitting.
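
One way to compare such variations is to wrap the vectorizer and classifier in a pipeline and score each combination with cross-validation. The sketch below is only an illustration: the eight messages and their labels are made up, and with so little data the scores are not meaningful. On the real dataset you would pass `df['text']` and the encoded labels instead, and likely use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham)
toy_texts = [
    "free prize win now",
    "claim your free reward",
    "win cash free entry",
    "urgent free offer call now",
    "are we meeting today",
    "call me when you get home",
    "see you at lunch",
    "thanks for the notes today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each pipeline fits its vectorizer on the training folds only
candidates = {
    "counts + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tf-idf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "binary + BernoulliNB": make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
}

results = {}
for name, pipeline in candidates.items():
    # 2-fold stratified cross-validation; returns one accuracy per fold
    scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=2)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

Using a pipeline also keeps the vocabulary from being learned on the held-out fold, which gives a more honest estimate than vectorizing the whole dataset up front.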

Conclusion

We have walked through the entire process of building a Naive Bayes classification model using sklearn, from loading and preprocessing the data to training and evaluating the model. Naive Bayes is a simple yet powerful algorithm, and scikit-learn makes it easy to implement. Keep experimenting and improving your models, and soon you will be a pro at classification with Naive Bayes!

Happy coding!
