By Mihir Shri
This Python packet detects whether an e-mail is spam or ham (not spam) using the Naive Bayes algorithm, a supervised Machine Learning technique.
The user needs to enter:
1. The full text of the e-mail.
The classifier is the Naive Bayes implementation from sklearn (scikit-learn), a Machine Learning library for Python. It can be imported using the following line of code:
from sklearn import naive_bayes
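For intuition about what a Naive Bayes text classifier computes, here is a minimal pure-Python sketch of the multinomial variant with add-one (Laplace) smoothing. It is an illustration only, not sklearn's implementation: the helper names `train_nb` and `predict_nb` and the four toy messages are made up for this example.

```python
import math
from collections import Counter

def train_nb(messages, labels):
    """Collect per-class word counts, class frequencies and the vocabulary."""
    word_counts = {}                 # label -> Counter of words seen in that class
    class_counts = Counter(labels)   # label -> number of messages
    vocab = set()
    for text, label in zip(messages, labels):
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict_nb(text, word_counts, class_counts, vocab):
    """Return the label with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue                                 # skip words never seen in training
            smoothed = word_counts[label][word] + 1      # add-one smoothing
            score += math.log(smoothed / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data invented for this sketch
toy_messages = [
    "win a free prize now",
    "free cash offer",
    "meeting moved to monday",
    "lunch on monday",
]
toy_labels = ["spam", "spam", "ham", "ham"]

params = train_nb(toy_messages, toy_labels)
print(predict_nb("free prize", *params))       # spam
print(predict_nb("monday meeting", *params))   # ham
```

The "naive" part is the assumption that words are independent given the class, which is why the score is just a sum of per-word log probabilities.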
A spam.csv file is also attached, which contains about 5500 spam and ham e-mails along with their labels. The file is read with the read_csv('filename') function of the pandas library, which takes the filename as input and returns the contents of the csv file as a DataFrame.
import pandas as pd
df = pd.read_csv('spam.csv')
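To see what read_csv returns without needing the attached file, the sketch below parses a small in-memory CSV instead. The column names `Message` and `spam` are the ones the code in this write-up uses; the three rows are invented, and the real spam.csv may be laid out differently.

```python
import io
import pandas as pd

# Invented rows standing in for spam.csv; the real file's layout may differ.
csv_text = """Message,spam
"Win a free prize now",1
"Are we still meeting on Monday?",0
"Free cash offer, reply today",1
"""

df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)                    # (number of rows, number of columns)
print(df_demo.columns.tolist())         # column names taken from the header row
print(df_demo["spam"].value_counts())   # number of messages per label
```

Checking the shape and the label counts right after loading is a cheap way to confirm the file was parsed as expected.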
Text handling is done with CountVectorizer, a class also available in sklearn, which counts how many times each word appears in each document. Pipeline chains the steps passed to it, so the vectorizer and the classifier are fitted and applied together as a single model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', naive_bayes.MultinomialNB())
])
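As a small illustration of what CountVectorizer produces, the sketch below fits it on three made-up documents and prints the vocabulary together with the count matrix. The documents are invented, and the default settings (lowercasing, tokens of two or more characters) are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "meeting at noon", "free free meeting"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

# vocabulary_ maps each word to a column index; columns follow alphabetical order
vocab = sorted(vectorizer.vocabulary_)
print(vocab)             # ['at', 'free', 'meeting', 'noon', 'now', 'prize']
print(counts.toarray())  # one row per document, one column per word
```

Each row is one document and each column one word, so the third row holds a 2 in the 'free' column. This matrix of counts is what the Naive Bayes step in the pipeline receives.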
The data is split into train and test sets with an 80-20 split, and the model is trained on the train set using model.fit(). Accuracy is calculated on the test set using model.score(), and predictions are made using model.predict().
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array(df.Message)
y = np.array(df.spam)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
# emails: a list of raw e-mail texts to classify
predictions = model.predict(emails)
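The whole flow can be tried end to end without spam.csv. The sketch below runs the same split, fit, score and predict steps on ten invented messages. The toy data, the names `demo_model` and `new_emails`, and the `random_state` and `stratify` arguments are additions for this example: they make the split reproducible and keep both classes in each set. With so little data, the printed accuracy says nothing about real performance.

```python
from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = spam, 0 = ham
messages = [
    "win a free prize now", "free cash offer today", "claim your free reward",
    "cheap loans click here", "you won a lottery prize",
    "meeting moved to monday", "lunch on friday?", "see the attached report",
    "can you review my code", "call me when you are home",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80-20 split; stratify keeps one spam and one ham message in the test set
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=0, stratify=labels
)

demo_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", naive_bayes.MultinomialNB()),
])
demo_model.fit(X_train, y_train)

accuracy = demo_model.score(X_test, y_test)
print(accuracy)

# Raw strings go straight in; the pipeline vectorizes them before classifying
new_emails = ["free prize waiting for you", "are we still meeting on monday"]
predictions = demo_model.predict(new_emails)
print(predictions)
```

Because the vectorizer is part of the pipeline, predict() accepts raw e-mail text and applies the vocabulary learned during fit().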
With the Naive Bayes classifier, we achieve an accuracy of around 98% on our test set, which is pretty awesome.
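Accuracy alone can hide mistakes on the rarer class, so it can help to look at a confusion matrix, precision and recall as well. The sketch below uses sklearn.metrics on hand-written labels: `y_true` and `y_pred` are invented for illustration and are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 1 = spam, 0 = ham. One spam message is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # rows: true, columns: predicted
prec = precision_score(y_true, y_pred)                # of predicted spam, how much is spam
rec = recall_score(y_true, y_pred)                    # of actual spam, how much was caught

print(acc)    # 0.9
print(cm)     # [[8 0]
              #  [1 1]]
print(prec)   # 1.0
print(rec)    # 0.5
```

Here the accuracy is 90% even though half of the spam slipped through, which is the kind of gap the extra metrics make visible.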
Note: The user first needs to install the scikit-learn library (imported as "sklearn") by running "pip install scikit-learn", or by following the official scikit-learn documentation if not using pip.
Submitted by Mihir Shri (coe18b064)