In this project, we will make use of TensorFlow to create a simple NLP model to predict if a movie review is positive or negative.
Natural Language Processing (NLP) is a rapidly growing area of Deep Learning, and it is important because it lets us work with text and language data.
A few common tasks in NLP are natural language generation, sentiment analysis, and sequence completion. In this project, we are going to analyze movie reviews from the IMDb dataset and classify them as positive or negative.
We are going to make use of the TensorFlow Datasets (tensorflow_datasets) library, which provides the IMDb reviews dataset we will be using.
Before we get started, let us first install the required dependencies.
We will require TensorFlow and TensorFlow Datasets for loading the data as well as for creating our model. Apart from this, we will make use of the NumPy library.
To install these dependencies, open your command prompt or terminal and type the following commands:
pip install tensorflow
pip install tensorflow-datasets
pip install numpy
Let us import the required libraries:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Next, let us load our dataset.
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
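If you are curious about what was downloaded, the info object returned by tfds.load describes the dataset; a quick optional check could look like this:

# Optional: inspect the dataset metadata returned by tfds.load
print(info.features)                       # description of the text and label features
print(info.splits['train'].num_examples)   # 25,000 training reviews
print(info.splits['test'].num_examples)    # 25,000 test reviews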
Our data is contained in the imdb variable. The dataset already comes with train and test splits, so we now separate them and extract the sentences and labels.
This can be done as follows:
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# str(s.numpy()) is needed in Python 3 instead of just s.numpy()
for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())

for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
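As a quick sanity check, you can print one review and its label (1 means positive, 0 means negative):

# Peek at the first training example and its label
print(training_sentences[0][:200])   # first 200 characters of the review text
print(training_labels_final[0])      # 1 = positive, 0 = negative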
Our data is almost ready, but we still have a little preprocessing to do. We need to tokenize the text, i.e., convert the words to numbers so that the reviews can be stored as NumPy arrays and fed into our model. We also need to build a vocabulary from the dataset on which our model can be trained.
We can use the following code snippet to tokenize our data and prepare our vocabulary:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)  # Create a tokenizer
tokenizer.fit_on_texts(training_sentences)  # Build the vocabulary from the training sentences
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)  # Convert the sentences into token sequences
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)  # Pad to ensure equal length of all sentences

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
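Every review is now a fixed-length sequence of integer token IDs, which you can verify quickly:

# Each review is now a sequence of max_length (120) token IDs
print(padded.shape)            # (25000, 120)
print(testing_padded.shape)    # (25000, 120)
print(padded[0][:10])          # first ten token IDs of the first training review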
Additionally, let us create a function that can be used to convert the tokenized data to words. We do this so that it is easier to understand and make changes to our model if needed. The function is as follows:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
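For example, you can compare a padded, tokenized review with the original text:

# Decode the first padded review back into words and compare with the original
print(decode_review(padded[0]))
print(training_sentences[0])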
We will build a simple model using an Embedding layer for our vocabulary followed by a bidirectional Gated Recurrent Unit (GRU) layer. A GRU handles the vanishing gradient problem better than a plain recurrent layer. There are more advanced architectures such as LSTMs and Transformers, but we won't get into those for now. Our model is implemented as follows:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Training the model
num_epochs = 30
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
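If you are running this in a Jupyter Notebook, plotting the training history makes the behaviour discussed below easier to see. This assumes matplotlib is installed; depending on your TensorFlow version, the history keys may be 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'.

# Plot training vs. validation accuracy over the epochs (requires matplotlib)
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()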
We use the well-known Adam optimizer to update the model's weights. The loss is binary cross-entropy, since we only have to classify each review as either positive or negative, i.e., a binary classification. The model is trained for 30 epochs, which will take some time; if you have access to a GPU, feel free to use it, and training should finish in about 10 minutes. We obtain an almost perfect training accuracy of close to 100% with a very small loss of about 10^-6. However, there is some overfitting: the validation accuracy fluctuates in the low 80s, and the validation loss moves up and down as well. You can expect a validation accuracy of roughly 80%. This is not too shabby considering we ran the model only once, the architecture was simple and straightforward, and we used a fairly small vocabulary. Taking all of this into account, we have managed to come up with a decent model.
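Once training has finished, you can try the model on your own sentences by pushing them through the same tokenizer and padding. The two example reviews below are made up purely for illustration:

# Classify a couple of hand-written reviews with the trained model
sample_reviews = [
    "This movie was fantastic, I loved every minute of it",
    "A complete waste of time, the plot made no sense"
]
sample_sequences = tokenizer.texts_to_sequences(sample_reviews)
sample_padded = pad_sequences(sample_sequences, maxlen=max_length, truncating=trunc_type)

predictions = model.predict(sample_padded)
for review, score in zip(sample_reviews, predictions):
    # Scores closer to 1 indicate a positive review, closer to 0 a negative one
    print(f"{score[0]:.3f}  {review}")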
You can further improve the model by using LSTMs or Transformers, or by making use of pre-trained weights and vocabularies, as sketched below.
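As one possible variation (not part of the original script), the GRU layer can be swapped for a bidirectional LSTM while keeping the rest of the pipeline unchanged; a minimal sketch might look like this:

# A sketch of the same model with a bidirectional LSTM in place of the GRU
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])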
This Python Script was written and executed in a Jupyter Notebook.
Submitted by Praatibh Surana (praatibhsurana)