Text Extraction from Images and PDFs Using Python and Tesseract


Text extraction from both images and PDFs in Python using Tesseract OCR (accessed through pytesseract).

Description:

This project extracts text from images (.png and .jpg) and from PDFs using the Tesseract OCR engine. It can be modified further to extract only particular pieces of text from an image or PDF.
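
As one way to picture the "particular text" idea, the sketch below filters OCR output with a regular expression. It is only an illustration: the function name find_dates, the DD/MM/YYYY date format, and the sample string are assumptions and not part of the project's code.

```python
import re

# Hypothetical pattern: dates written as DD/MM/YYYY.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def find_dates(ocr_text):
    """Return every DD/MM/YYYY-style date found in the OCR output, in order."""
    return DATE_PATTERN.findall(ocr_text)


# Stand-in for text that an OCR pass might return.
sample = "Invoice dated 12/03/2021, payment due 01/04/2021."
print(find_dates(sample))
```

The same approach would work for other targets (invoice numbers, email addresses) by swapping in a different pattern.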

Requirements:

Tesseract OCR:

Installation details for Tesseract: https://github.com/tesseract-ocr/tesseract/wiki#windows

(The path to the Tesseract installation folder must be added to the PATH environment variable.)

Pytesseract:

pip install pytesseract

PyMuPDF:

pip install PyMuPDF

Pillow:

pip install Pillow
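
Before running the script, it may help to confirm that the requirements above are in place. The following is a minimal sketch and not part of the project: the helper names missing_modules and tesseract_on_path are hypothetical, and it assumes the Tesseract executable is named "tesseract".

```python
import importlib.util
import shutil

# Import names differ from the pip package names:
# PyMuPDF is imported as "fitz" and Pillow as "PIL".
REQUIRED_MODULES = ["pytesseract", "fitz", "PIL"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return the full path of the tesseract executable, or None if PATH lacks it."""
    return shutil.which("tesseract")


print("Missing Python modules:", missing_modules() or "none")
print("Tesseract executable:", tesseract_on_path() or "not found on PATH")
```

If Tesseract is installed but cannot be added to PATH, pytesseract also allows pointing at the executable directly by assigning its full path to pytesseract.pytesseract.tesseract_cmd.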

Usage:

Open a command prompt (terminal) in the folder that contains text_extractor.py and run:

python text_extractor.py --file path_to_file

For example: python text_extractor.py --file test.pdf
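
The project's own text_extractor.py is not reproduced here. As a rough sketch of how a script with this command-line interface could be put together, the code below parses --file, picks a branch from the file extension, and runs OCR. Everything in it is an assumption for illustration: the helper names, the choice to render each PDF page to an image before OCR, and the 2x zoom factor. It also assumes a recent PyMuPDF release where the page method is named get_pixmap.

```python
import argparse
import os

IMAGE_EXTENSIONS = {".png", ".jpg"}


def build_parser():
    """Build the command-line parser with the single --file option."""
    parser = argparse.ArgumentParser(description="Extract text from an image or PDF.")
    parser.add_argument("--file", required=True, help="path to a .png, .jpg, or .pdf file")
    return parser


def file_kind(path):
    """Classify a path as 'image' or 'pdf' from its extension (case-insensitive)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"
    if ext == ".pdf":
        return "pdf"
    raise ValueError(f"Unsupported file type: {ext or path}")


def extract_text(path):
    """Run Tesseract on an image, or on every rendered page of a PDF."""
    # Third-party imports are local so the helpers above work without them.
    import pytesseract
    from PIL import Image

    if file_kind(path) == "image":
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)

    import fitz  # PyMuPDF

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # Render the page at 2x zoom (an assumed setting) to help OCR accuracy.
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def main(argv=None):
    """Parse the arguments and print the extracted text."""
    args = build_parser().parse_args(argv)
    print(extract_text(args.file))
```

Calling main() under an if __name__ == "__main__": guard at the bottom of the file would give the command-line behaviour shown in the example above.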


Submitted by Ruparna Mukherjee (rupu3097)
