Sentiment Analysis in Python using LinearSVC

Here we will try to categorize sentiments for the IMDB dataset available on Kaggle using Support Vector Machines in Python.

Here we will do a simple sentiment analysis on the IMDB review dataset provided on Kaggle using support vector machines in Python. The dataset is fairly large (50,000 reviews), which gives the SVM plenty of examples to learn from. This problem is also commonly approached with RNNs and LSTMs, but here we will use LinearSVC.

Dataset

The dataset consists of reviews taken directly from IMDB. The dataset can be found here: imdb review. The data is messy, so we need to clean it before feeding it to the SVM.

Installing Dependencies

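Before importing the libraries we need to install the dependencies. The re module ships with Python, so it does not need to be installed.

pip install numpy
pip install pandas
pip install nltk
pip install scikit-learn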

Importing Libraries

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

Load the dataset

Check what the dataset looks like by printing the first 5 reviews.

df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

Clean the data

Using the stopwords corpus from NLTK we can remove stop words (common words such as "the" and "is" that are generally filtered out before processing in NLP). The corpus has to be downloaded once with nltk.download("stopwords").

We can see that the reviews are unclean and contain strings like "<br />" and unnecessary punctuation, which need to be filtered out. For this we use the re library, which is commonly used for cleaning text.

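# Build the stop-word set once instead of on every call.
stop = set(stopwords.words("english"))

def clean_rev(rev, remove_stopwords=True):
    # Lowercase the review and split it on whitespace.
    rev = rev.lower().split()

    if remove_stopwords:
        rev = [w for w in rev if w not in stop]
    rev = " ".join(rev)

    rev = re.sub(r"<br />", " ", rev)       # drop HTML line breaks
    rev = re.sub(r"[^a-z]", " ", rev)       # keep lowercase letters only
    rev = re.sub(r"\s+", " ", rev).strip()  # collapse repeated spaces

    return rev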

df['review'] = df['review'].apply(clean_rev)
df.head()

The code above applies the function to every review and then displays the first 5 cleaned reviews.
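To see what this kind of cleaning does to a single review, here is a minimal sketch that uses only the standard library. The sample review and the tiny STOP_WORDS set are made up for illustration (they stand in for NLTK's English list), and the function name clean_text is hypothetical. Note that this sketch strips tags and punctuation before removing stop words, which is a slightly different order from clean_rev.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny on purpose.
STOP_WORDS = {"the", "was", "a", "and", "it", "is", "this"}

def clean_text(text, stop_words=STOP_WORDS):
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)  # drop HTML line breaks
    text = re.sub(r"[^a-z]", " ", text)     # keep lowercase letters only
    # split() with no arguments also collapses repeated whitespace
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

sample = "The movie was GREAT!<br /><br />Loved it, 10/10."
print(clean_text(sample))  # movie great loved
```

Removing punctuation first means a token such as "it," is reduced to "it" before the stop-word check, so it is filtered out as well.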

Encoding the sentiment

Since the sentiment takes only two values, it is easier to encode them as numbers so that classification becomes straightforward.

encoder = LabelEncoder()
df['sentiment'] = encoder.fit_transform(df['sentiment'])
df.head()

This changes positive to 1 and negative to 0, because LabelEncoder assigns integers to the labels in sorted order.
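The mapping can be checked on a small, made-up list of labels. This is only a sketch on toy data; the attribute classes_ and the method inverse_transform are part of scikit-learn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration, not taken from the dataset.
labels = ["positive", "negative", "negative", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

classes = encoder.classes_.tolist()                   # labels in sorted order
decoded = encoder.inverse_transform([0, 1]).tolist()  # map integers back to labels

print(classes)
print(encoded.tolist())
print(decoded)
```

The position of each label in classes_ is the integer it is encoded as, so inverse_transform can be used later to turn predictions back into readable labels.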

Train and Test split

Split the data into training and test sets. We have a total of 50,000 reviews, which we split into 40,000 for training and 10,000 for testing.

rev = df['review']
sent = df['sentiment']

x_train, x_test, y_train, y_test = train_test_split(rev, sent, test_size = 0.2, random_state = 0)

Tokenize

The CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.

# Convert the pandas Series to plain lists of strings.
train = x_train.tolist()
test = x_test.tolist()

cv = CountVectorizer()
cv_train = cv.fit_transform(train)
cv_test = cv.transform(test)
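A small sketch on made-up documents shows what fit_transform and transform produce. The documents here are illustrative only. The vocabulary is learned from the training documents, and words that appear only in the test documents are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents for illustration.
train_docs = ["good movie", "bad movie", "good good acting"]
test_docs = ["good plot"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_docs)  # learn the vocabulary and count words
X_test = cv.transform(test_docs)        # reuse the learned vocabulary

vocab = sorted(cv.vocabulary_)          # column order of the count matrix
print(vocab)
print(X_train.toarray())
print(X_test.toarray())
```

Each row is one document and each column is one vocabulary word. The word "plot" was never seen during fitting, so it has no column and the test row only counts "good".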

Support Vector Machines

SVMs are supervised learning models generally used for classification and regression. Here we use a linear SVM (LinearSVC), which suits a two-class problem like this one and scales well to a large number of samples.

svc = LinearSVC(random_state=0, max_iter=15000)
svc.fit(cv_train, y_train)

y_pred = svc.predict(cv_test)

Setting random_state fixes the seed so that the results are reproducible; it does not need to match the value used in train_test_split.

We fit the model on the training set, allowing up to 15,000 iterations.

Finally, we test the model by predicting the labels of the test set.
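When the trained model is later applied to raw text, the new reviews have to go through the same vectorizer first. One way to keep the two steps together is scikit-learn's make_pipeline. The sketch below is not part of the original workflow, and it uses a handful of made-up reviews rather than the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up reviews (1 = positive, 0 = negative).
texts = [
    "loved great film",
    "great acting loved",
    "terrible boring film",
    "boring terrible plot",
]
labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then feeds the counts to the classifier.
model = make_pipeline(
    CountVectorizer(),
    LinearSVC(random_state=0, max_iter=15000),
)
model.fit(texts, labels)

new_reviews = ["great film", "boring plot"]
predictions = model.predict(new_reviews)
print(predictions)
```

Because the vectorizer lives inside the pipeline, predict accepts raw strings and there is no separate transform call to forget.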

Comparing the test and predicted values

print(classification_report(y_test, y_pred))
print("Accuracy is",accuracy_score(y_test, y_pred))
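As a small illustration of what these metrics report, the sketch below scores a made-up set of five predictions. The values are invented for the example, and confusion_matrix is an extra function from sklearn.metrics that the code above does not use.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)  # 0.8
print(cm)
```

Four of the five predictions match, so the accuracy is 0.8. The confusion matrix shows where the single mistake is: one positive review was predicted as negative.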

Finally, we compare the predicted and actual values and find that the model is 86% accurate. To improve the accuracy, you can clean the data more thoroughly and remove words that occur only once or twice to shrink the vocabulary.
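One possible way to drop rare words is the min_df parameter of CountVectorizer. This is a sketch on made-up documents, not a tuned setting for the IMDB data. Note that min_df counts the number of documents a word appears in, not its total number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents for illustration.
docs = ["good movie", "good acting", "good plot twist", "bad movie"]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(min_df=2).fit(docs)  # keep words found in at least 2 documents

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)

print(full_vocab)
print(pruned_vocab)
```

Only "good" and "movie" appear in two or more documents, so the pruned vocabulary keeps just those two words.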
