This project performs sentiment analysis on Twitter data: tweets are classified as negative or positive (binary classification) in Python, using NLP techniques and a Stochastic Gradient Descent (SGD) classifier.
The dataset used was taken from Kaggle.
The data consists of three columns which are -
The dependencies necessary for the project are Pandas, NumPy, Seaborn, Matplotlib, NLTK, Regex, and scikit-learn.
Workflow of the project consists of the following stages -
1. Loading and exploring the dataset - Exploring the dataset before manipulating it makes later transformations more precise and makes it easier to understand how each operation changes the data.
2. Preprocessing - This cleans the data and makes it more useful. The following steps are carried out -
   1. Removal of punctuation, hashtags, and mentions
   2. Removal of stopwords, and lemmatization
   3. Removal of links and numbers
3. Feature extraction and transformation of data - This is done with both CountVectorizer and TF-IDF Vectorizer, and the classification results obtained with each are compared.
4. Classification - Binary classification is performed with the Stochastic Gradient Descent (SGD) classifier: each tweet is labeled 0 (negative) or 1 (positive).
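The preprocessing steps listed in stage 2 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_tweet`, the regular expressions, the tiny `STOPWORDS` set, and the optional `lemmatize` hook are all assumptions made for the example. In the project itself, NLTK would presumably supply the full stopword list and the lemmatizer.

```python
import re

# Tiny illustrative stopword set (assumption); a full list would normally
# come from NLTK, which the project lists as a dependency.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it", "this"}


def clean_tweet(text, stopwords=STOPWORDS, lemmatize=None):
    """Return a cleaned, space-joined version of a raw tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if lemmatize is not None:
        # Hypothetical hook: any callable mapping a token to its lemma.
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


print(clean_tweet("@user Loving the new update!!! https://example.com #happy 2024"))
# -> loving new update
```

Links are stripped before punctuation so that URLs are removed whole instead of being broken into fragments. Whether hashtags should be dropped entirely or only lose the `#` symbol is a design choice; this sketch drops them entirely.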
The SGD classifier is used along with both CountVectorizer and TF-IDF vectorizer.
1. Using CountVectorizer - An accuracy of 75% was achieved on this dataset.
2. Using TF-IDF Vectorizer - An accuracy of 72.8% was achieved on this dataset.
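A comparison of this kind could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions: the toy `texts` and `labels`, the helper name `evaluate`, the 75/25 split, and the default `SGDClassifier` settings are all illustrative and are not taken from the project, so the accuracies it prints will not match the figures above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); the project uses the Kaggle dataset.
texts = [
    "love this phone", "what a great day", "really happy with it",
    "best movie ever", "hate this weather", "worst service ever",
    "so sad and angry", "terrible awful experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative


def evaluate(vectorizer, texts, labels, seed=0):
    """Train an SGD classifier on the vectorized text; return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    # The pipeline fits the vectorizer on the training split only.
    model = make_pipeline(vectorizer, SGDClassifier(random_state=seed))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


scores = {
    "CountVectorizer": evaluate(CountVectorizer(), texts, labels),
    "TfidfVectorizer": evaluate(TfidfVectorizer(), texts, labels),
}
for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary and IDF weights from being learned on the test split, which would otherwise inflate the reported accuracy.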
Classification reports for both of the above methods were also generated to provide more insight into the performance of the model.
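For reference, a classification report can be produced with scikit-learn's `classification_report`, as in the small sketch below. The label lists `y_true` and `y_pred` are made-up values for illustration only, not results from this project.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Human-readable table with per-class precision, recall, F1, and support.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# The same numbers as a nested dict, convenient for further processing.
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"], output_dict=True
)
```

Unlike a single accuracy figure, the per-class precision and recall show whether the model favors one sentiment class over the other.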
Important files in the project -
1. Binary Classification of tweets.py - Contains the source code for the project.
2. Dataset.csv - Contains the dataset used in the project.
Output - 1:
Output - 2: