In this project, we will make use of TensorFlow to create a simple NLP model to predict if a movie review is positive or negative.
Natural Language Processing (NLP) is a rapidly growing area of Deep Learning, and it is important because it lets us work with text and language data.
A few common tasks in NLP are natural language generation, sentiment analysis, and sequence completion. In this project, we are going to analyze movie reviews from the IMDb dataset and classify them as positive or negative.
We are going to make use of the TensorFlow Datasets (tensorflow_datasets) library, which provides the IMDb reviews dataset we will be using.
Before we get started, let us first install the required dependencies.
We will require TensorFlow and TensorFlow Datasets for loading the data as well as for creating our model. Apart from this, we will make use of the NumPy library.
To install these dependencies, open your command prompt or terminal and type the following commands:
pip install tensorflow
pip install tensorflow-datasets
pip install numpy
Let us import the required libraries:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Next, let us load our dataset.
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
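If you are curious about what was downloaded, the info object returned by tfds.load describes the dataset; a quick optional check could look like this:

# Optional: inspect the dataset metadata returned by tfds.load
print(info.features)                       # description of the text and label features
print(info.splits['train'].num_examples)   # 25,000 training reviews
print(info.splits['test'].num_examples)    # 25,000 test reviews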
Our data is contained in the imdb variable. The dataset already comes with train and test splits, so we now separate them and extract the sentences and labels.
This can be done as follows:
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# str(s.numpy()) is needed in Python 3 instead of just s.numpy()
for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())

for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
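As a quick sanity check, you can print one review and its label (1 means positive, 0 means negative):

# Peek at the first training example and its label
print(training_sentences[0][:200])   # first 200 characters of the review text
print(training_labels_final[0])      # 1 = positive, 0 = negative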
Our data is almost ready, but we still have a little preprocessing to do. We need to tokenize the text, i.e., convert the words to numbers so that the reviews can be stored as NumPy arrays and fed into our model. We also need to build a vocabulary from the dataset on which our model can be trained.
We can use the following code snippet to tokenize our data and prepare our vocabulary:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)  # Create a tokenizer
tokenizer.fit_on_texts(training_sentences)  # Build the vocabulary from the training sentences
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)  # Convert the sentences into token sequences
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)  # Pad to ensure equal length of all sentences

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
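Every review is now a fixed-length sequence of integer token IDs, which you can verify quickly:

# Each review is now a sequence of max_length (120) token IDs
print(padded.shape)            # (25000, 120)
print(testing_padded.shape)    # (25000, 120)
print(padded[0][:10])          # first ten token IDs of the first training review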
Additionally, let us create a function that can be used to convert the tokenized data to words. We do this so that it is easier to understand and make changes to our model if needed. The function is as follows:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
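For example, you can compare a padded, tokenized review with the original text:

# Decode the first padded review back into words and compare with the original
print(decode_review(padded[0]))
print(training_sentences[0])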
We will build a simple model using an Embedding layer for our vocabulary followed by a bidirectional Gated Recurrent Unit (GRU) layer. A GRU handles the vanishing gradient problem better than a plain recurrent layer. There are more advanced architectures such as LSTMs and Transformers, but we won't get into those for now. Our model is implemented as follows:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Training the model
num_epochs = 30
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
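If you are running this in a Jupyter Notebook, plotting the training history makes the behaviour discussed below easier to see. This assumes matplotlib is installed; depending on your TensorFlow version, the history keys may be 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'.

# Plot training vs. validation accuracy over the epochs (requires matplotlib)
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()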
We use the well-known Adam optimizer to update the model's weights. The loss is binary cross-entropy, since we only have to classify each review as either positive or negative, i.e., a binary classification. The model is trained for 30 epochs, which will take some time; if you have access to a GPU, feel free to use it, and training should finish in about 10 minutes. We obtain an almost perfect training accuracy of close to 100% with a very small loss of about 10^-6. However, there is some overfitting: the validation accuracy fluctuates in the low 80s, and the validation loss moves up and down as well. You can expect a validation accuracy of roughly 80%. This is not too shabby considering we ran the model only once, the architecture was simple and straightforward, and we used a fairly small vocabulary. Taking all of this into account, we have managed to come up with a decent model.
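Once training has finished, you can try the model on your own sentences by pushing them through the same tokenizer and padding. The two example reviews below are made up purely for illustration:

# Classify a couple of hand-written reviews with the trained model
sample_reviews = [
    "This movie was fantastic, I loved every minute of it",
    "A complete waste of time, the plot made no sense"
]
sample_sequences = tokenizer.texts_to_sequences(sample_reviews)
sample_padded = pad_sequences(sample_sequences, maxlen=max_length, truncating=trunc_type)

predictions = model.predict(sample_padded)
for review, score in zip(sample_reviews, predictions):
    # Scores closer to 1 indicate a positive review, closer to 0 a negative one
    print(f"{score[0]:.3f}  {review}")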
You can further improve the model by using LSTMs or Transformers, or by making use of pre-trained weights and vocabularies, as sketched below.
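As one possible variation (not part of the original script), the GRU layer can be swapped for a bidirectional LSTM while keeping the rest of the pipeline unchanged; a minimal sketch might look like this:

# A sketch of the same model with a bidirectional LSTM in place of the GRU
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])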
This Python Script was written and executed in a Jupyter Notebook.
Submitted by Praatibh Surana (praatibhsurana)