
Language modelling in Python to classify tweets using Keras and GloVe embeddings.

By Saurabh Damle

This project classifies tweets as disaster-related or not based on the tweet contents. It is written in Python and uses the Keras framework together with pretrained GloVe embeddings.

In this project we will build a language model in Python using Keras that determines whether a tweet is related to a disaster or not.

The project is written in Python and uses the following libraries:

1. NumPy

2. Matplotlib

3. Seaborn

4. tqdm

5. pandas

6. NLTK

7. scikit-learn

8. Keras

The steps followed in the project are:

1. Exploring the dataset

2. Processing (cleaning) the dataset

3. Building the data pipeline for the model

4. Using Keras to build the model and training

5. Model evaluation and tuning.

STEPS:

To begin with, we import all the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import re
import string
from collections import Counter, defaultdict
from tqdm import tqdm
from keras.initializers import Constant
from sklearn.model_selection import train_test_split

# 'stopwords' provides the stop-word list; 'punkt' is required by word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
stop = set(stopwords.words('english'))

Then, let's import the dataset.

train = pd.read_csv('/content/dataset/train.csv')
test = pd.read_csv('/content/dataset/test.csv')

Then, we check the dataset attributes.

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB

We now check for null values in the dataset.

train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
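The keyword and location columns have missing values, but only the text and target columns are used in the rest of this project, so those gaps can safely be ignored. If you prefer a tidy frame, the unused columns can optionally be dropped:

# Optional: keep only the columns used downstream
train = train[['text', 'target']]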

The tweets contain a lot of noise that we first need to clean, such as emojis, HTML tags, URLs, and text like 'RT' (retweet).

def clean(tweet):
  # remove URLs
  tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)
  # remove the 'RT' (retweet) marker
  tweet = re.sub(r'RT', '', tweet)
  # remove HTML tags
  tweet = re.sub(r'<.*?>', '', tweet)
  # remove emojis and other pictographs
  tweet = re.sub("["
                 u"\U0001F600-\U0001F64F"
                 u"\U0001F300-\U0001F5FF"
                 u"\U0001F680-\U0001F6FF"
                 u"\U0001F1E0-\U0001F1FF"
                 u"\U00002702-\U000027B0"
                 u"\U000024C2-\U0001F251"
                 "]+", '', tweet)
  # remove punctuation
  table = str.maketrans('', '', string.punctuation)
  cleaned = tweet.translate(table)
  return cleaned

train['text'] = train['text'].apply(clean)
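As a quick sanity check, we can run the cleaner on a made-up sample tweet (hypothetical text, not taken from the dataset):

sample = "RT Forest fire near the highway! 🔥 https://t.co/abc123 <b>breaking</b>"
print(clean(sample))
# the URL, HTML tag, emoji, punctuation and the 'RT' marker should all be stripped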

Now let's create the corpus that will be used for tokenization.

def create_corpus(df):
  corpus = []
  for tweet in tqdm(df['text']):
    # keep alphabetic tokens only, lowercase them and drop stop words
    words = [word.lower() for word in word_tokenize(tweet)
             if word.isalpha() and word.lower() not in stop]
    corpus.append(words)
  return corpus

corpus = create_corpus(train)


100%|██████████| 7613/7613 [00:00<00:00, 7882.50it/s]
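Before tokenizing, it is worth peeking at the corpus to confirm the tweets were split into lowercased tokens:

print(len(corpus))   # should match the number of training tweets (7613)
print(corpus[0])     # list of tokens for the first tweet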

The next step is to tokenize the tweets. Here we use post-padding and truncation with a maximum sequence length of 50.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)

tweet_pad = pad_sequences(sequences, maxlen=50, truncating='post', padding='post')
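A quick look at the vocabulary size and the padded output confirms the shape the model will receive:

print(len(tokenizer.word_index))  # number of unique tokens in the corpus
print(tweet_pad.shape)            # (number of tweets, 50)
print(tweet_pad[0])               # first tweet as a zero-padded sequence of word indices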

Now it's time for the embedding. Here we use the pretrained GloVe embeddings with 100 dimensions, loaded into a dictionary where the keys are words and the values are the embedding vectors.

embedding_dict = {}
# the with-block closes the file automatically
with open('/content/drive/My Drive/Embedding/glove_100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors
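Optionally, we can check how much of our vocabulary is covered by the pretrained vectors; words missing from GloVe will later keep an all-zero row in the embedding matrix. A minimal sketch:

# fraction of our vocabulary that has a pretrained GloVe vector
covered = sum(1 for word in tokenizer.word_index if word in embedding_dict)
print(covered / len(tokenizer.word_index))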

The next step is to get the input data ready for the model.

word_index = tokenizer.word_index
num_words = len(word_index) + 1

# build the embedding matrix: row i holds the GloVe vector for the word with index i
embedding_matrix = np.zeros((num_words, 100))

for word, i in tqdm(word_index.items()):
    if i >= num_words:
        continue

    emb_vec = embedding_dict.get(word)
    if emb_vec is not None:
        # words without a pretrained vector keep a zero row
        embedding_matrix[i] = emb_vec

train_tp = tweet_pad[:train.shape[0]]
X_train, X_test, y_train, y_test = train_test_split(train_tp, train['target'].values, test_size=0.10)
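After the split, roughly 90% of the tweets are used for training and 10% are held out for validation:

print(X_train.shape, y_train.shape)  # ~90% of the padded tweets and their labels
print(X_test.shape, y_test.shape)    # 10% held out for validation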

Now, the next step is building the model.

import tensorflow.keras as keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=num_words, output_dim=100,
                           embeddings_initializer=Constant(embedding_matrix),
                           input_length=50, trainable=False),
    keras.layers.SpatialDropout1D(0.01),
    keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1, activation='sigmoid')
])

The optimizer we will use is Adam, and since the labels are binary (1 or 0), the loss function will be binary cross-entropy.

model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 50, 100)           1623900   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 50, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65        
=================================================================
Total params: 1,666,205
Trainable params: 42,305
Non-trainable params: 1,623,900
_________________________________________________________________

Now we train the model for 15 epochs.

history = model.fit(X_train, y_train, batch_size=32, epochs=15,
                    validation_data=(X_test, y_test), verbose=1)

Epoch 1/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.6742 - accuracy: 0.5954 - val_loss: 0.5665 - val_accuracy: 0.7625
Epoch 2/15
1713/1713 [==============================] - 313s 183ms/step - loss: 0.5458 - accuracy: 0.7597 - val_loss: 0.4996 - val_accuracy: 0.7927
Epoch 3/15
1713/1713 [==============================] - 311s 181ms/step - loss: 0.5206 - accuracy: 0.7678 - val_loss: 0.4758 - val_accuracy: 0.8018
Epoch 4/15
1713/1713 [==============================] - 302s 176ms/step - loss: 0.5063 - accuracy: 0.7722 - val_loss: 0.4655 - val_accuracy: 0.8045
Epoch 5/15
1713/1713 [==============================] - 302s 176ms/step - loss: 0.4939 - accuracy: 0.7773 - val_loss: 0.4559 - val_accuracy: 0.8084
Epoch 6/15
1713/1713 [==============================] - 309s 180ms/step - loss: 0.4862 - accuracy: 0.7832 - val_loss: 0.4497 - val_accuracy: 0.8058
Epoch 7/15
1713/1713 [==============================] - 309s 181ms/step - loss: 0.4767 - accuracy: 0.7929 - val_loss: 0.4440 - val_accuracy: 0.8045
Epoch 8/15
1713/1713 [==============================] - 309s 180ms/step - loss: 0.4750 - accuracy: 0.7921 - val_loss: 0.4397 - val_accuracy: 0.8071
Epoch 9/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.4772 - accuracy: 0.7860 - val_loss: 0.4371 - val_accuracy: 0.8097
Epoch 10/15
1713/1713 [==============================] - 309s 181ms/step - loss: 0.4648 - accuracy: 0.7971 - val_loss: 0.4329 - val_accuracy: 0.8071
Epoch 11/15
1713/1713 [==============================] - 308s 180ms/step - loss: 0.4654 - accuracy: 0.7962 - val_loss: 0.4308 - val_accuracy: 0.8150
Epoch 12/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.4640 - accuracy: 0.7945 - val_loss: 0.4287 - val_accuracy: 0.8176
Epoch 13/15
1713/1713 [==============================] - 304s 178ms/step - loss: 0.4577 - accuracy: 0.8002 - val_loss: 0.4267 - val_accuracy: 0.8163
Epoch 14/15
1713/1713 [==============================] - 289s 169ms/step - loss: 0.4576 - accuracy: 0.8028 - val_loss: 0.4241 - val_accuracy: 0.8150
Epoch 15/15
1713/1713 [==============================] - 288s 168ms/step - loss: 0.4585 - accuracy: 0.7973 - val_loss: 0.4249 - val_accuracy: 0.8123

Finally, we plot the training accuracy and loss curves:

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel('epoch')
plt.legend()
plt.show()
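For step 5 (model evaluation), here is a minimal sketch of how the validation split can be scored, assuming the sigmoid output is thresholded at 0.5 and that target=1 marks a disaster tweet (the label names below are illustrative), using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# predicted probabilities on the held-out split, thresholded at 0.5
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5).astype(int).ravel()
# label names assume target=1 means "disaster"
print(classification_report(y_test, y_pred, target_names=['not disaster', 'disaster']))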

