Hey there! Ready to explore classification with machine learning? In this tutorial, we’ll learn how to use scikit-learn (sklearn) in Python to perform Naive Bayes classification. Naive Bayes is a simple yet effective algorithm that works well for text classification. Let’s get started!
Building a Naive Bayes Classification Model with sklearn
Step 1: Setting Up the Environment
First, we need to set up our environment and import the necessary libraries. We will use pandas for data manipulation and scikit-learn for building our model. We also load the SMS dataset from spam.csv and keep only the label and text columns.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset
df = pd.read_csv("spam.csv", encoding='latin-1')
df = df[['v1', 'v2']]  # Selecting only the relevant columns
df.columns = ['label', 'text']
df.head()
Output:
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
Step 2: Exploring the Data
Next, let’s take a closer look at our data. This dataset contains SMS messages classified as ‘spam’ or ‘ham’ (not spam). We will inspect the first few rows to understand its structure.
print(df.head())
Output:
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
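Beyond the first few rows, it can help to check how many messages fall into each class and whether any values are missing. The sketch below is only an illustration: it uses a tiny made-up DataFrame (toy_df, a hypothetical stand-in with the same label and text columns) so that it runs without spam.csv. On the real data you would call the same methods on df.

```python
import pandas as pd

# Hypothetical stand-in for the SMS data: made-up rows with the same
# two columns ('label', 'text') used in this tutorial.
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "spam", "ham"],
    "text": [
        "see you at lunch",
        "ok, call me later",
        "free prize, claim now",
        "running late, sorry",
        "win cash today",
        "thanks for the lift",
    ],
})

# Number of messages in each class.
class_counts = toy_df["label"].value_counts()
print(class_counts)

# Share of each class; a lopsided split is a sign that accuracy alone
# may not tell the whole story.
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share.round(2))

# Number of missing values per column.
missing = toy_df.isna().sum()
print(missing)
```

If one class heavily outnumbers the other, precision and recall for the smaller class are usually worth watching alongside accuracy.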
Step 3: Preprocessing the Data
Before building our model, we need to prepare the data. We will convert the text into numerical features using CountVectorizer, encode the labels as 0 (ham) and 1 (spam), and split the data into training and testing sets.
# Encode the labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Convert text to numerical data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Report the shapes of the feature matrix and the label vector
print("X shape:", X.shape)
print("y shape:", y.shape)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Output:
X shape: (5572, 8713)
y shape: (5572,)
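To see what CountVectorizer actually produces, here is a small sketch on three made-up messages (toy_messages is a hypothetical example, not part of the dataset). Each column of the resulting matrix is one word from the vocabulary, and each cell holds how many times that word appears in the message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, used only to illustrate the transformation.
toy_messages = ["free prize now", "call me now", "free free entry"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['call', 'entry', 'free', 'me', 'now', 'prize']

# The matrix is sparse; toarray() makes it readable for a tiny example.
print(toy_counts.toarray())
# [[0 0 1 0 1 1]
#  [1 0 0 1 1 0]
#  [0 1 2 0 0 0]]
```

The same idea explains the shape printed above: one row per message and one column per distinct word found in the text.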
Step 4: Building the Model
Now, we will create our Naive Bayes model. We will use the MultinomialNB class from sklearn, which is well-suited for text data.
# Initialize the model
model = MultinomialNB()

# Train the model
model.fit(x_train, y_train)
print("Model training completed.")
Output:
Model training completed.
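Once a model is trained, classifying a new message means passing it through the same fitted vectorizer with transform (not fit_transform) before calling predict. The sketch below shows this on a handful of made-up messages; the names train_texts, vec, and clf are hypothetical and stand in for the dataset, vectorizer, and model used in this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free entry to win cash",
    "are we still meeting for lunch",
    "see you at home tonight",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer and the classifier on the training messages.
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)

# New messages must reuse the fitted vocabulary, so call transform here.
new_texts = ["free cash prize", "lunch at home"]
new_features = vec.transform(new_texts)

predictions = clf.predict(new_features)
probabilities = clf.predict_proba(new_features)

for text, label, probs in zip(new_texts, predictions, probabilities):
    # probs[1] is the estimated probability of the spam class.
    print(f"{text!r} -> {label} (spam probability {probs[1]:.2f})")
```

With the tutorial's objects, the equivalent call would be model.predict(vectorizer.transform(["some new message"])).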
Step 5: Evaluating the Model
After training our model, we need to evaluate its performance on the test data. We will use accuracy score, confusion matrix, and classification report to assess how well our model performs.
# Make predictions on the test data
y_pred = model.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

# Generate classification report
class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)
Output:
Accuracy: 0.9879
Confusion Matrix:
[[955   0]
 [ 13 147]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       955
           1       1.00      0.92      0.96       160

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.98      1115
weighted avg       0.99      0.99      0.99      1115
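The spam-class numbers in the classification report can be recovered by hand from the confusion matrix, which helps when reading it. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, in the order 0 (ham), 1 (spam). The sketch below simply re-enters the matrix printed above as a NumPy array rather than recomputing it from the data.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true label (0 = ham, 1 = spam); columns: predicted label.
conf_matrix = np.array([[955, 0],
                        [13, 147]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = conf_matrix.ravel()

# Precision: of the messages flagged as spam, how many really were spam.
precision_spam = tp / (tp + fp)
# Recall: of the real spam messages, how many were caught.
recall_spam = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"precision={precision_spam:.2f} recall={recall_spam:.2f} f1={f1_spam:.2f}")
# precision=1.00 recall=0.92 f1=0.96
```

Read this way, the matrix says that no ham message was flagged as spam, while 13 spam messages slipped through as ham.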
Step 6: Fine-Tuning and Improvements
Finally, if your model’s performance is not as high as you’d like, don’t worry! You can go back and tweak it. Trying different preprocessing techniques, adjusting the train-test split ratio, or experimenting with the other Naive Bayes variants available in sklearn, such as BernoulliNB or GaussianNB, can help improve the model.
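One way to try these ideas side by side is to wrap each vectorizer and classifier pair in a pipeline and compare cross-validated scores. The sketch below is an illustration under stated assumptions, not a benchmark: the messages are made up, the three candidate combinations are just examples, and scores on eight messages mean very little. GaussianNB is left out because it expects a dense array rather than the sparse matrix the vectorizers return.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "urgent free prize waiting",
    "are we still meeting for lunch",
    "see you at home tonight",
    "call me when you get home",
    "lunch was great, thanks",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each pipeline refits its vectorizer on the training folds only,
# so no vocabulary leaks in from the held-out messages.
candidates = {
    "counts + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tf-idf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "binary counts + BernoulliNB": make_pipeline(
        CountVectorizer(binary=True), BernoulliNB()
    ),
}

scores = {}
for name, pipeline in candidates.items():
    # 2-fold cross-validation; the default score for classifiers is accuracy.
    fold_scores = cross_val_score(pipeline, texts, labels, cv=2)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```

On the real dataset you would pass df['text'] and the encoded labels instead of the toy lists, and probably use more folds.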
Conclusion
We have walked through the entire process of building a Naive Bayes classification model using sklearn, from loading and preprocessing the data to building and evaluating the model. Naive Bayes is a powerful yet simple algorithm, and with scikit-learn it is easy to implement and start using. Keep experimenting and improving your models, and soon you will be a pro at classification with Naive Bayes!
Happy coding!