This project classifies tweets according to whether or not they are related to a real disaster, based only on the tweet text. We will build the classifier using the Keras framework together with pre-trained GloVe word embeddings.
The project is written in Python and uses the following libraries:
1. NumPy
2. Matplotlib
3. Seaborn
4. tqdm
5. pandas
6. NLTK
7. scikit-learn
8. Keras
The steps followed in the project are:
1. Exploring the dataset
2. Processing (cleaning) the dataset
3. Building the data pipeline for the model
4. Building and training the model with Keras
5. Model evaluation and tuning.
STEPS:
First, we import all the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.initializers import Constant
from tqdm import tqdm

# Download the NLTK resources used later: stop words for filtering and
# the 'punkt' tokenizer models required by word_tokenize.
nltk.download('stopwords')
nltk.download('punkt')
stop = set(stopwords.words('english'))
Next, let's load the dataset.
train = pd.read_csv('/content/dataset/train.csv')
test = pd.read_csv('/content/dataset/test.csv')
Then, we check the dataset attributes.
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        7613 non-null   int64
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
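Before cleaning, it is also worth looking at how balanced the target labels are. The snippet below is a small exploratory sketch using standard pandas/Seaborn calls; it is not part of the original pipeline.

# Count how many tweets are labelled disaster (1) vs. non-disaster (0)
print(train['target'].value_counts())

# Optional: visualise the class balance
sns.countplot(x='target', data=train)
plt.title('Class distribution')
plt.show()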
We now check for null values in the dataset.
train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
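Only the text column is used as model input, so the missing keyword and location values do not affect training. If you want to keep those columns around anyway, one simple option (not part of the original pipeline) is to fill the gaps with a placeholder string:

# Replace missing keyword/location entries with an explicit placeholder
train['keyword'] = train['keyword'].fillna('no_keyword')
train['location'] = train['location'].fillna('no_location')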
The tweets contain a lot of noise that we first need to clean, such as emojis, HTML tags, URLs, and text like 'RT' (retweet).
def clean(tweet):
    # Remove URLs
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)
    # Remove the 'RT' (retweet) marker
    tweet = re.sub(r'RT', '', tweet)
    # Remove HTML tags
    tweet = re.sub(r'<.*?>', '', tweet)
    # Remove emojis and other pictographic symbols
    tweet = re.sub("["
                   u"\U0001F600-\U0001F64F"
                   u"\U0001F300-\U0001F5FF"
                   u"\U0001F680-\U0001F6FF"
                   u"\U0001F1E0-\U0001F1FF"
                   u"\U00002702-\U000027B0"
                   u"\U000024C2-\U0001F251"
                   "]+", '', tweet)
    # Strip punctuation
    table = str.maketrans('', '', string.punctuation)
    cleaned = tweet.translate(table)
    return cleaned
train['text'] = train['text'].apply(lambda x: clean(x))
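As a quick sanity check, here is what clean() does to a made-up example tweet (the text below is invented for illustration, not taken from the dataset):

# Hypothetical example tweet
sample = "RT Disaster relief teams deployed! https://example.com <b>breaking</b>"
print(clean(sample))
# The URL, HTML tags, 'RT' marker and punctuation should all be stripped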
Now let's create the corpus that will be used for tokenization.
def create_corpus(df):
    corpus = []
    for tweet in tqdm(df['text']):
        # Keep lower-cased alphabetic tokens that are not stop words
        words = [word.lower() for word in word_tokenize(tweet)
                 if word.isalpha() and word not in stop]
        corpus.append(words)
    return corpus

corpus = create_corpus(train)

100%|██████████| 7613/7613 [00:00<00:00, 7882.50it/s]
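Counter was imported earlier and is handy here: the snippet below is a small exploratory aside (not required by the pipeline) that lists the most frequent tokens in the cleaned corpus.

# Flatten the corpus and count token frequencies
token_counts = Counter(word for tweet in corpus for word in tweet)
print(token_counts.most_common(10))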
The next step is to tokenize the tweets. Here we pad every sequence to a maximum length of 50 tokens.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
tweet_pad = pad_sequences(sequences, maxlen=50, truncating='post', padding='post')
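It is worth checking what the tokenizer produced. The following is just an inspection sketch, printing the vocabulary size and one padded sequence:

print('Vocabulary size:', len(tokenizer.word_index))
print('Padded shape:', tweet_pad.shape)   # (7613, 50)
print('First padded tweet:', tweet_pad[0])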
Now it's time for the embedding. We are going to use GloVe embeddings with 100 dimensions. The dictionary keys are the words and the values are their embedding vectors.
embedding_dict = {}
with open('/content/drive/My Drive/Embedding/glove_100d.txt', 'r') as f:
    for line in f:
        # Each line is a word followed by its 100-dimensional vector
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors
The next step is to get the input data ready for the model.
word_index = tokenizer.word_index
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, 100))

for word, i in tqdm(word_index.items()):
    if i > num_words:
        continue
    # Look up the GloVe vector for this word; rows stay zero for unknown words
    emb_vec = embedding_dict.get(word)
    if emb_vec is not None:
        embedding_matrix[i] = emb_vec

train_tp = tweet_pad[:train.shape[0]]
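A quick way to see how well GloVe covers the tweet vocabulary is to count how many rows of the embedding matrix were actually filled. This is an optional diagnostic, not part of the original write-up:

# Rows that stayed all-zero correspond to words with no GloVe vector
covered = np.count_nonzero(np.any(embedding_matrix, axis=1))
print(f'{covered} of {num_words - 1} vocabulary words have a GloVe vector')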
X_train, X_test, y_train, y_test = train_test_split(train_tp, train['target'].values, test_size=0.10)
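A small optional check that the split looks sensible:

print('Train:', X_train.shape, ' Validation:', X_test.shape)
# Proportion of disaster tweets in each split
print('Train positives:', y_train.mean(), ' Validation positives:', y_test.mean())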
Now, the next step is to build the model.
import tensorflow.keras as keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=num_words, output_dim=100,
                           embeddings_initializer=Constant(embedding_matrix),
                           input_length=50, trainable=False),
    keras.layers.SpatialDropout1D(0.01),
    keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1, activation='sigmoid')
])
The optimizer that we will be using is Adam, and as the labels are binary (1 or 0), the loss function will be binary cross-entropy.
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 50, 100)           1623900
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 50, 100)           0
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                42240
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65
=================================================================
Total params: 1,666,205
Trainable params: 42,305
Non-trainable params: 1,623,900
_________________________________________________________________
Now we train the model.
history = model.fit(X_train, y_train, batch_size=32, epochs=15, validation_data=(X_test, y_test), verbose=1)

Epoch 1/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.6742 - accuracy: 0.5954 - val_loss: 0.5665 - val_accuracy: 0.7625
Epoch 2/15
1713/1713 [==============================] - 313s 183ms/step - loss: 0.5458 - accuracy: 0.7597 - val_loss: 0.4996 - val_accuracy: 0.7927
Epoch 3/15
1713/1713 [==============================] - 311s 181ms/step - loss: 0.5206 - accuracy: 0.7678 - val_loss: 0.4758 - val_accuracy: 0.8018
Epoch 4/15
1713/1713 [==============================] - 302s 176ms/step - loss: 0.5063 - accuracy: 0.7722 - val_loss: 0.4655 - val_accuracy: 0.8045
Epoch 5/15
1713/1713 [==============================] - 302s 176ms/step - loss: 0.4939 - accuracy: 0.7773 - val_loss: 0.4559 - val_accuracy: 0.8084
Epoch 6/15
1713/1713 [==============================] - 309s 180ms/step - loss: 0.4862 - accuracy: 0.7832 - val_loss: 0.4497 - val_accuracy: 0.8058
Epoch 7/15
1713/1713 [==============================] - 309s 181ms/step - loss: 0.4767 - accuracy: 0.7929 - val_loss: 0.4440 - val_accuracy: 0.8045
Epoch 8/15
1713/1713 [==============================] - 309s 180ms/step - loss: 0.4750 - accuracy: 0.7921 - val_loss: 0.4397 - val_accuracy: 0.8071
Epoch 9/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.4772 - accuracy: 0.7860 - val_loss: 0.4371 - val_accuracy: 0.8097
Epoch 10/15
1713/1713 [==============================] - 309s 181ms/step - loss: 0.4648 - accuracy: 0.7971 - val_loss: 0.4329 - val_accuracy: 0.8071
Epoch 11/15
1713/1713 [==============================] - 308s 180ms/step - loss: 0.4654 - accuracy: 0.7962 - val_loss: 0.4308 - val_accuracy: 0.8150
Epoch 12/15
1713/1713 [==============================] - 310s 181ms/step - loss: 0.4640 - accuracy: 0.7945 - val_loss: 0.4287 - val_accuracy: 0.8176
Epoch 13/15
1713/1713 [==============================] - 304s 178ms/step - loss: 0.4577 - accuracy: 0.8002 - val_loss: 0.4267 - val_accuracy: 0.8163
Epoch 14/15
1713/1713 [==============================] - 289s 169ms/step - loss: 0.4576 - accuracy: 0.8028 - val_loss: 0.4241 - val_accuracy: 0.8150
Epoch 15/15
1713/1713 [==============================] - 288s 168ms/step - loss: 0.4585 - accuracy: 0.7973 - val_loss: 0.4249 - val_accuracy: 0.8123
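Validation accuracy plateaus around 81-82%, so as a possible tuning step the training could be run with a callback that stops once the validation loss stops improving. The following is only a sketch of that idea, not something the run above used:

from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 3 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, y_train, batch_size=32, epochs=15,
                    validation_data=(X_test, y_test), verbose=1,
                    callbacks=[early_stop])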
Finally, we plot the training accuracy and loss curves.

plt.plot(history.history['accuracy'])
plt.plot(history.history['loss'])
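The test set loaded at the start has not been used yet. Below is a minimal sketch of how it could be scored with the trained model, reusing the same cleaning, tokenizer and padding; the 0.5 threshold and the submission column names are assumptions for illustration.

# Apply the same preprocessing to the test tweets
test['text'] = test['text'].apply(clean)
test_corpus = create_corpus(test)
test_sequences = tokenizer.texts_to_sequences(test_corpus)
test_pad = pad_sequences(test_sequences, maxlen=50, truncating='post', padding='post')

# Predict probabilities and threshold at 0.5 to get 0/1 labels
pred = (model.predict(test_pad) > 0.5).astype(int).ravel()
submission = pd.DataFrame({'id': test['id'], 'target': pred})
submission.to_csv('submission.csv', index=False)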
Submitted by Saurabh Damle (saurabhdamle)