Spam Classifier using Natural Language Processing in Python

spam.py

SMSSpamCollection

The messages are cleaned and stemmed, converted into a bag of words, and used to train a Naive Bayes classifier in Python, which predicts with good accuracy whether a message is spam or not.

Datasets -

The dataset is taken from the UCI Machine Learning Repository and consists of a tab-separated file with only 2 columns. You can download it from here.

Implementation -

Step 1 -

Using the pandas library, read the tab-separated file and name its two columns class and Text messages.
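A minimal sketch of this step. To keep it runnable without the download, it reads two made-up rows from an in-memory string; with the real data you would pass the path to SMSSpamCollection instead. The variable name messages and the column name message are assumptions made for illustration.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real tab-separated file (assumption).
sample = "ham\tSee you at lunch\nspam\tFree entry, win a prize now\n"

# The file has no header row, so the column names are supplied explicitly.
# For the real data, replace io.StringIO(sample) with "SMSSpamCollection".
messages = pd.read_csv(io.StringIO(sample), sep="\t", names=["class", "message"])

print(messages)
```

Because class is a reserved word in Python, the label column has to be accessed as messages["class"] rather than as an attribute.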

Step 2 -

Next comes data cleaning and preprocessing. Import the re and nltk libraries. For each message, replace special characters with spaces, lowercase the text, split it into words, remove the stop words (e.g. a, he, is, the), stem the remaining words, and join them back into a sentence. Store the resulting sentences in a new list.

Stemming can be done by -

rev = [ps.stem(word) for word in rev if word not in stopwords.words('english')]
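The whole cleaning step could look roughly like the sketch below. It assumes nltk is installed. So that it runs offline, it uses a small hand-written stop-word set in place of stopwords.words('english'), which needs a one-time nltk.download('stopwords'). The names clean_message, STOP_WORDS and raw_messages are hypothetical.

```python
import re

from nltk.stem.porter import PorterStemmer

# Tiny stand-in stop-word set (assumption); the article uses
# stopwords.words('english') from nltk.corpus instead.
STOP_WORDS = {"a", "an", "the", "is", "he", "you", "to", "at", "have"}

ps = PorterStemmer()


def clean_message(text, stop_words=STOP_WORDS):
    """Strip non-letters, lowercase, drop stop words and stem what is left."""
    rev = re.sub("[^a-zA-Z]", " ", text)  # replace special characters and digits with spaces
    rev = rev.lower().split()
    rev = [ps.stem(word) for word in rev if word not in stop_words]
    return " ".join(rev)


raw_messages = ["You have WON a prize!!! Call 0800 now", "See you at the lunch"]
corpus = [clean_message(m) for m in raw_messages]
print(corpus)
```

Looking words up in a set is much faster than calling stopwords.words('english') for every word, so with the real stop-word list it is worth building the set once before the loop.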

Step 3 -

Create a bag of words with CountVectorizer, which is imported from sklearn.feature_extraction.text, and store it in the x variable. Then convert the class values to a binary format and store them in the y variable.
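As an illustration, the sketch below builds x and y from three made-up, already cleaned messages. Encoding spam as 1 and ham as 0 is an assumption, since the article only says the classes are stored in a binary format. The names corpus and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned messages and their classes (assumption).
corpus = ["free prize call now", "see you at lunch", "free entry win prize"]
labels = ["spam", "ham", "spam"]

# Each row of x counts how often every vocabulary word occurs in one message.
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

# Binary target: 1 for spam, 0 for ham.
y = [1 if label == "spam" else 0 for label in labels]

print(x.shape)
print(y)
```

Each column of x corresponds to one word of the vocabulary that CountVectorizer learned, which is available afterwards as cv.vocabulary_.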

Step 4 -

Now split x and y into training and test sets with sklearn. Fit the Naive Bayes model on x_train and y_train, then predict on x_test to get y_pred. To evaluate the model, compute the confusion matrix, the accuracy score, or both from y_test and y_pred.
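A rough end-to-end sketch of this step on a tiny made-up corpus follows. The article does not name the Naive Bayes variant or the split ratio, so MultinomialNB, test_size=0.2, random_state=0 and the stratified split are all assumptions here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up cleaned messages with binary labels, 1 = spam and 0 = ham (assumption).
corpus = [
    "free prize call now", "win cash prize today", "claim free entry now",
    "urgent win free phone", "free cash claim call",
    "see you at lunch", "are you home yet", "call me after work",
    "lunch tomorrow sounds good", "running late see you soon",
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

x = CountVectorizer().fit_transform(corpus).toarray()

# Hold out 20% of the messages for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

# Fit on the training split, then predict the held-out messages.
model = MultinomialNB().fit(x_train, y_train)
y_pred = model.predict(x_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

The diagonal of the confusion matrix holds the correctly classified messages, so the accuracy score equals the sum of the diagonal divided by the number of test messages.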

Conclusion -

This model reached a reported accuracy of 0.9875. The confusion matrix shows that it classified 1098 test messages correctly and only 17 incorrectly, which corresponds to 1098 / 1115 ≈ 0.985.
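The relation between the confusion-matrix counts and the accuracy can be checked with a few lines of arithmetic. The sketch below uses only the two totals quoted above, and the helper name accuracy_from_counts is hypothetical.

```python
def accuracy_from_counts(correct, incorrect):
    """Accuracy is the share of correctly classified messages."""
    return correct / (correct + incorrect)


# Totals read from the confusion matrix in the conclusion.
acc = accuracy_from_counts(1098, 17)
print(round(acc, 4))
```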
