By Mirza Yusuf
Here we will do a simple Sentiment Analysis on the IMDB review dataset available on Kaggle using Support Vector Machines in Python. The dataset is quite big, which suits an SVM well. This problem could also be approached with RNNs and LSTMs, but in this approach we will use a Linear SVC.
The dataset consists of reviews taken directly from IMDB. The dataset can be found here: imdb review. The data is messy and unclean, so we need to clean it before feeding it to the SVM.
Before importing libraries we need to install the dependencies.
pip install numpy
pip install pandas
pip install nltk
pip install scikit-learn

('re' is part of the Python standard library, so it does not need to be installed.)
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

If you have not used NLTK's stopword list before, download it once with nltk.download("stopwords").
Load the dataset
Check how the dataset looks like by printing the top 5 reviews.
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()
Clean the data
Using NLTK's stopwords corpus we can remove stopwords (common words that are generally filtered out before processing in NLP).
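The filtering itself is just a membership test against the stop list. A minimal sketch, using a tiny hand-picked stop set so the snippet needs no NLTK download (the actual code below uses stopwords.words("english")):

```python
# Tiny illustrative stop list; NLTK's English list is much larger.
stop = {"this", "was", "a", "the", "of", "my"}
words = "this was a waste of my time".lower().split()
filtered = [w for w in words if w not in stop]
print(filtered)  # ['waste', 'time']
```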
We can see that the reviews are unclean and contain strings like "<br />" and unnecessary punctuation, which need to be filtered out. For this we use the 're' library, which is commonly used for cleaning text.

def clean_rev(rev, remove_stopwords=True):
    rev = rev.lower().split()
    if remove_stopwords:
        stop = set(stopwords.words("english"))
        rev = [w for w in rev if w not in stop]
    rev = " ".join(rev)
    rev = re.sub(r"<br />", " ", rev)
    rev = re.sub(r"[^a-z]", " ", rev)
    rev = re.sub(r"\s+", " ", rev)
    return rev
df['review'] = df['review'].apply(clean_rev)
Applying the function and then checking the first 5 reviews.
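The regex steps can be sketched in isolation on one made-up messy review (stopword removal skipped so the snippet needs no NLTK data):

```python
import re

# A hypothetical messy review; real IMDB reviews contain literal "<br />" tags.
rev = "Great movie!<br /><br />Loved it... 10/10".lower()
rev = re.sub(r"<br />", " ", rev)        # drop HTML line breaks
rev = re.sub(r"[^a-z]", " ", rev)        # keep lowercase letters only
rev = re.sub(r"\s+", " ", rev).strip()   # collapse runs of whitespace
print(rev)  # great movie loved it
```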
Encoding the sentiment
Since the sentiment takes only two values, it is easier to encode them as numbers so that classification becomes straightforward.
encoder = LabelEncoder()
df['sentiment'] = encoder.fit_transform(df['sentiment'])
df.head()
This maps positive to 1 and negative to 0.
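A quick standalone sketch of what LabelEncoder does on a toy list of labels (it assigns integers to the classes in alphabetical order, which is why negative comes out as 0):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
labels = enc.fit_transform(["positive", "negative", "negative", "positive"])
print(list(enc.classes_))  # ['negative', 'positive'] -- sorted alphabetically
print(list(labels))        # [1, 0, 0, 1]
```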
Train and Test split
Split the data into train and test sets. We have a total of 50,000 reviews; with a test size of 0.2 we get 40,000 for training and 10,000 for testing.
rev = df['review']
sent = df['sentiment']
x_train, x_test, y_train, y_test = train_test_split(rev, sent, test_size=0.2, random_state=0)
The CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.
train = []
test = []
for i in x_train.index:
    train.append(x_train[i])
for j in x_test.index:
    test.append(x_test[j])
cv = CountVectorizer()
cv_train = cv.fit_transform(train)
cv_test = cv.transform(test)
Support Vector Machines
SVMs are supervised learning models generally used for classification and regression tasks. Here we will use a linear SVM, since we only have two classes to separate.
svc = LinearSVC(random_state=0, max_iter=15000)
svc.fit(cv_train, y_train)
y_pred = svc.predict(cv_test)
We set random_state so the results are reproducible, and allow up to 15,000 iterations so the solver can converge. We fit the model on the training set and finally test it by predicting on the held-out test reviews.
Comparing the test and predicted values
print(classification_report(y_test, y_pred))
print("Accuracy is", accuracy_score(y_test, y_pred))
Finally, we compare the predicted and actual values and find that the model is about 86% accurate. To improve the accuracy, you can clean the data further and remove words that occur only once or twice to improve the vocabulary.