Text Preprocessing in Python Using glove.6B.50d Word Embeddings

nlp_preprocessing.py

NLP_preprocessing.ipynb

This code packet preprocesses text data in Python: it removes uninformative and infrequent words, vectorizes the text, and creates an embedding matrix as a NumPy array.
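The word-removal step described above can be sketched as follows. This is only an illustration, not the packet's actual code: the stop-word list, the regular-expression tokenizer, the function names, and the min_count threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the packet's real list is not shown.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def clean_corpus(texts, min_count=2):
    """Drop stop words, then drop words seen fewer than min_count times overall."""
    tokenized = [[w for w in tokenize(t) if w not in STOP_WORDS] for t in texts]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]


docs = clean_corpus(["The cat sat.", "A cat ran to the dog.", "The dog sat"])
print(docs)  # [['cat', 'sat'], ['cat', 'dog'], ['dog', 'sat']]
```

Here "ran" is dropped because it occurs only once in the whole corpus, while the stop words are removed regardless of how often they occur.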

The zip folder contains two files:

1. nlp_preprocessing.py: This script checks whether glove.6B.50d.txt is present in the current directory. If it is not, the script downloads it from http://nlp.stanford.edu/data/glove.6B.zip. It then checks which of the train, validation, and test sets are present and applies the preprocessing to each of them.

2. NLP_preprocessing.ipynb: This notebook provides the same functionality as the script above, but runs as an interactive Python notebook.
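The presence check and download performed by the script might look roughly like the sketch below. It uses only the standard library; the function names and the downloader parameter (included so the download can be swapped out) are hypothetical, and the real script may be organized differently.

```python
import os
import urllib.request
import zipfile

GLOVE_URL = "http://nlp.stanford.edu/data/glove.6B.zip"
GLOVE_FILE = "glove.6B.50d.txt"


def ensure_glove(directory=".", downloader=urllib.request.urlretrieve):
    """Return the path to glove.6B.50d.txt, downloading the archive if it is missing."""
    target = os.path.join(directory, GLOVE_FILE)
    if os.path.exists(target):
        return target  # already present, nothing to download
    archive = os.path.join(directory, "glove.6B.zip")
    downloader(GLOVE_URL, archive)
    with zipfile.ZipFile(archive) as zf:
        # Extract only the 50-dimensional file from the archive.
        zf.extract(GLOVE_FILE, path=directory)
    return target


def available_splits(directory="."):
    """Return which of train/val/test have a CSV file in the directory."""
    return [s for s in ("train", "val", "test")
            if os.path.exists(os.path.join(directory, s + ".csv"))]
```

Calling ensure_glove() and then looping over available_splits() reproduces the order of operations described above: embeddings first, then whichever data sets exist.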

File Format
1. All files should be in the same directory as the code.
2. Training data should be in train.csv, validation data in val.csv, and test data in test.csv.
3. If the glove.6B.50d embeddings are already on your system, the file should be named glove.6B.50d.txt.

The input should include the name of the column to be transformed/preprocessed.
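As a minimal sketch of how a column name can select the text to preprocess, the helper below reads one named column from a CSV file with the standard csv module. The function name read_column is hypothetical, and the packet itself may read the data differently (for example with pandas).

```python
import csv


def read_column(path, column):
    """Return every value of the named column from a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise KeyError("column %r not found in %s" % (column, path))
        return [row[column] for row in reader]
```

For example, read_column("train.csv", "text") would return the raw strings of a column named "text", assuming train.csv has such a column.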

Output Files Description
1. index_to_word.pkl: It contains the mapping from integer labels to words.
2. word_to_index.pkl: It contains the mapping from words to integer labels.
3. embed_matrix.npy: It is a NumPy array holding the embedding matrix built from the GloVe embeddings.
4. train_encoded.csv/val_encoded.csv/test_encoded.csv: These contain the processed and vectorized form of the train/validation/test data.
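The output files listed above could be produced along the lines of the following sketch. It assumes the text has already been tokenized into lists of words; reserving index 0 for padding, leaving all-zero rows for words missing from GloVe, and the function names are assumptions for illustration, not confirmed details of the packet.

```python
import os
import pickle

import numpy as np

EMBED_DIM = 50  # glove.6B.50d vectors have 50 dimensions


def load_glove(path):
    """Parse a GloVe text file into a dict of word -> float32 vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_outputs(docs, vectors, dim=EMBED_DIM):
    """Build both mappings, the embedding matrix, and the integer-encoded docs."""
    vocab = sorted({w for doc in docs for w in doc})
    # Assumption: index 0 is reserved (e.g. for padding), so words start at 1.
    word_to_index = {w: i for i, w in enumerate(vocab, start=1)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    embed_matrix = np.zeros((len(vocab) + 1, dim), dtype="float32")
    for w, i in word_to_index.items():
        if w in vectors:  # words missing from GloVe keep an all-zero row
            embed_matrix[i] = vectors[w]
    encoded = [[word_to_index[w] for w in doc] for doc in docs]
    return word_to_index, index_to_word, embed_matrix, encoded


def save_outputs(word_to_index, index_to_word, embed_matrix, directory="."):
    """Write the two pickle files and the .npy file named in the list above."""
    with open(os.path.join(directory, "word_to_index.pkl"), "wb") as f:
        pickle.dump(word_to_index, f)
    with open(os.path.join(directory, "index_to_word.pkl"), "wb") as f:
        pickle.dump(index_to_word, f)
    np.save(os.path.join(directory, "embed_matrix.npy"), embed_matrix)
```

Row i of the saved matrix is the vector for index_to_word[i], so the matrix can be loaded with np.load and used directly to initialize an embedding layer.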
