Natural Language Processing in Python using Scikit-Learn


Implementing Natural Language Processing in Python using the Natural Language Toolkit (NLTK) library, the Naive Bayes classifier from scikit-learn, and TF-IDF for weighting and normalization.

The project is written in Python 3. In this project, we will perform natural language processing using NLTK (Natural Language Toolkit), a Python library for symbolic and statistical NLP of English text.

For weighting and normalization, the TF-IDF method will be used. This is done with scikit-learn's TfidfTransformer.
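A minimal sketch of how these pieces might fit together. It assumes raw word counts from CountVectorizer are passed to TfidfTransformer and then to a Naive Bayes classifier. The choice of MultinomialNB, the use of Pipeline, and the toy texts and labels are all assumptions made for illustration, not details taken from the project itself.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for this sketch; a real project would load its own dataset.
texts = [
    "win a free prize now",
    "free cash offer claim now",
    "are we meeting for lunch today",
    "see you at the meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# counts -> tf-idf weights -> Naive Bayes (MultinomialNB is an assumed choice)
model = Pipeline([
    ("counts", CountVectorizer()),   # bag-of-words term counts
    ("tfidf", TfidfTransformer()),   # reweight and normalize the counts
    ("clf", MultinomialNB()),        # classify the weighted vectors
])
model.fit(texts, labels)

print(model.predict(["free prize", "lunch meeting today"]))
```

Wrapping the steps in a Pipeline keeps the same vocabulary and weights in use at training and prediction time, which is why it is used here instead of calling each step by hand.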

TF-IDF (term frequency-inverse document frequency) assigns each word a weight that is often used in information retrieval. The weight measures how important a word is to a document within a collection, or corpus: it rises with how often the word appears in that document and falls with the number of documents in the corpus that contain the word.
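To make the weight concrete, here is a small sketch of the textbook form, tf * log(N / df), in plain Python. The corpus and the function name tf_idf are hypothetical. scikit-learn's TfidfTransformer uses a smoothed variant of the idf term by default and also normalizes each document vector, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]


def tf_idf(term, document, documents):
    """Textbook tf-idf: relative term frequency times log(N / document frequency)."""
    words = document.split()
    tf = Counter(words)[term] / len(words)                  # share of the document
    df = sum(1 for d in documents if term in d.split())     # documents containing term
    if df == 0:
        return 0.0                                          # term never seen in corpus
    idf = math.log(len(documents) / df)                     # rarer terms score higher
    return tf * idf


# "the" occurs in every document, so its idf is log(1) = 0 and its weight vanishes,
# while "cat" occurs in only one document and gets a positive weight.
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("cat", corpus[0], corpus))
```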

Installing Dependencies

pip install nltk
pip install pandas
pip install matplotlib
pip install seaborn 
pip install scikit-learn

We will also need the stopwords corpus from NLTK. To install it, run the following in the Python CLI:

import nltk
nltk.download()

After this, a pop-up window appears.

First, click on Corpora and then select stopwords, and after that click on Download. As I have already installed it on my system, it shows as installed.

You can also carry out the above installation from the ipynb notebook, where the same steps are included.
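If you would rather skip the pop-up window, for example inside a notebook, the corpus can also be fetched by name. The sketch below is one possible way to do it: the helper names ensure_stopwords and remove_stopwords are hypothetical, and the filtering step is only a guess at how the stopwords might be used later in the project.

```python
import nltk
from nltk.corpus import stopwords


def ensure_stopwords():
    """Download the NLTK stopwords corpus only if it is not already installed."""
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # Same effect as choosing Corpora > stopwords in the downloader window.
        nltk.download("stopwords")


def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words (compared in lowercase)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Typical use (needs network access the first time the corpus is fetched):
#   ensure_stopwords()
#   english = set(stopwords.words("english"))
#   remove_stopwords("this is a simple example".split(), english)
```

Checking with nltk.data.find first avoids contacting the download server again when the corpus is already on disk.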

Coders Packet
