Coders Packet

Sentiment Analysis of Tweets using Binary Classification in Python

By Srinivasan

In this project, sentiment analysis of Twitter data is carried out through binary classification of tweets in Python using a Stochastic Gradient Descent (SGD) classifier.

Overview and Explanation of the Project

This project was implemented in Python using NLP techniques and the Stochastic Gradient Descent model for classification.

Brief explanation about the data and the dataset used - 

The dataset used was taken from Kaggle.

The data consists of three columns - 

1. ItemID - This field contains the ID of the tweet. The value for this field is numerical.
2. Sentiment - This field contains the sentiment of the tweet as a binary value: 0 represents negative and 1 represents positive.
3. SentimentText - This field contains the tweet itself in its raw form, including letters, numbers, hashtags, mentions, and URLs.
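
The column layout above can be checked when the data is loaded. The following is a minimal sketch, assuming the file is a plain comma-separated CSV with a header row. The helper name `load_tweets` and the four sample rows are hypothetical stand-ins, not taken from the project or the Kaggle dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in rows in the three-column layout described above.
# In the project itself the data would be read from Dataset.csv instead.
SAMPLE_CSV = """ItemID,Sentiment,SentimentText
1,1,"Loving the new update! #happy"
2,0,"@support my order never arrived :("
3,1,"Great day at the beach http://example.com/pic"
4,0,"Worst service ever, waited 45 minutes"
"""

EXPECTED_COLUMNS = ["ItemID", "Sentiment", "SentimentText"]


def load_tweets(source):
    """Read the CSV and verify the columns and the binary labels."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not df["Sentiment"].isin([0, 1]).all():
        raise ValueError("Sentiment must contain only 0 or 1")
    return df


df = load_tweets(io.StringIO(SAMPLE_CSV))
print(df.shape)                        # number of rows and columns
print(df["Sentiment"].value_counts())  # class balance
```

Passing a file path such as `"Dataset.csv"` to `load_tweets` instead of the in-memory sample should work the same way, provided the file uses the same header.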

Necessary Dependencies for the project - 

The project is implemented in Python and depends on Pandas, NumPy, Seaborn, Matplotlib, NLTK, Regex, and Scikit-Learn.

Workflow of the project consists of the following stages - 

1. Loading and exploring the dataset - It is essential to explore and understand the dataset before manipulating it. This makes any later manipulation more precise and shows how each operation changes the data.

2. Preprocessing - This is done to clean the data and make it more useful for classification.
   Here, the following are carried out -
   1. Removal of punctuation, hashtags, and mentions
   2. Removal of stopwords, and lemmatization
   3. Removal of links and numbers

3. Feature extraction and transformation of data - This is done with both CountVectorizer and the TF-IDF vectorizer, and the classification results obtained with each are compared.

4. Classification - Binary classification is done using the Stochastic Gradient Descent (SGD) classifier, and each tweet is classified as 0 or 1 (negative or positive, respectively).
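
The preprocessing stage (step 2 above) can be sketched with regular expressions. This is an illustration under stated assumptions, not the project's actual code: the function name `clean_tweet`, the lowercasing step, the choice to drop whole hashtag and mention tokens, and the tiny stopword set are all assumptions. The project uses NLTK for stopwords and lemmatization; here the lemmatizer is a pluggable callable that defaults to the identity so the sketch runs without downloading any NLTK corpora.

```python
import re

# Tiny illustrative stopword set (an assumption); with NLTK installed and its
# corpus downloaded, nltk.corpus.stopwords.words("english") could be used.
STOPWORDS = {"a", "an", "the", "is", "at", "my", "to", "and", "of", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # links
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")    # @mentions and #hashtags
NUMBER_RE = re.compile(r"\d+")                 # runs of digits
PUNCT_RE = re.compile(r"[^\w\s]|_")            # punctuation and underscores


def clean_tweet(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Return a cleaned, space-joined version of a raw tweet."""
    text = text.lower()                       # assumption: case is folded
    text = URL_RE.sub(" ", text)              # links go first, while still intact
    text = MENTION_HASHTAG_RE.sub(" ", text)  # assumption: drop the whole token
    text = NUMBER_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = [lemmatize(tok) for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(clean_tweet("@support my order #late never arrived at 10pm!! http://example.com/x"))
```

To plug in NLTK's lemmatizer, `lemmatize=WordNetLemmatizer().lemmatize` can be passed once the WordNet corpus has been downloaded.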

Performance of the model -

The SGD classifier is used along with both CountVectorizer and TF-IDF vectorizer.

1. Using CountVectorizer - An accuracy of 75% was achieved on this dataset.

2. Using TF-IDF Vectorizer - An accuracy of 72.8% was achieved on this dataset.

Classification reports for both of the above methods were also generated to provide more insight into the performance of the model.
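
The comparison above can be sketched with Scikit-Learn as follows. This is a minimal sketch, not the project's code: the toy tweets, the 75/25 split, the `random_state` values, and the default hyperparameters are all assumptions, since the source does not state them. On toy data like this the scores will not match the 75% and 72.8% figures reported for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy tweets (1 = positive, 0 = negative).
texts = [
    "love new phone", "great day beach friends", "happy wonderful news",
    "best coffee ever", "amazing concert tonight", "really enjoyed movie",
    "feeling great today", "awesome weekend trip",
    "hate waiting line", "worst service ever", "sad miss old friends",
    "terrible traffic morning", "awful headache today",
    "really disappointed food", "feeling sick tired", "horrible weather weekend",
]
labels = [1] * 8 + [0] * 8

# Assumed split: 75% train, 25% test, stratified so both classes appear in each.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

results = {}
for name, vectorizer in [
    ("CountVectorizer", CountVectorizer()),
    ("TfidfVectorizer", TfidfVectorizer()),
]:
    # Same classifier for both runs; only the feature extraction differs.
    model = make_pipeline(vectorizer, SGDClassifier(random_state=42))
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = accuracy_score(y_test, predictions)
    print(f"{name}: accuracy = {results[name]:.3f}")
    print(classification_report(
        y_test, predictions, labels=[0, 1],
        target_names=["negative", "positive"], zero_division=0,
    ))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training split only, so the test tweets do not leak into feature extraction.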

Important files in the project - 

1. Binary Classification of - Contains the source code for the project.
2. Dataset.csv - Contains the dataset used in the project.



Submitted by Srinivasan (Srinedhi)
