Sentiment Analysis on Movie Reviews using Python

Sentiment_Analysis.ipynb

We train many different ML models with different Parameters on the rotten tomatoes dataset to classify reviews into 5 classes: negative, somewhat negative, neutral, somewhat positive, positive.

About the Dataset:

The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each Sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
test.tsv contains just phrases. You must assign a sentiment label to each phrase.

The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

Preprocessing:

Split the data into train, validation and test data. Further, we split the sentences into words and we also remove unwanted words (called stopwords) as well as we remove the punctuations. After this, we then split the train, validation and test sets into their respective X and Y values so that we can use it for training the model.

Models:

We try many different models and each of them with different parameters so that we can know which model is the best for us.

Firstly we use Logistic Regression along with count vectorizer and we get the following accuracy and precision:

0.7304348754752445 0.7281238636213457