Coders Packet

Text Extraction From Image And PDF Using Python and Tesseract

By Ruparna Mukherjee

Text extraction from both images and PDFs in Python using Tesseract OCR (accessed through pytesseract)

Description:

Extracts text from both images (.png and .jpg types) and PDFs using the Tesseract OCR engine. This project can be further modified to extract specific text from an image or PDF.

 

Requirements:

Tesseract OCR:

       Installation details for Tesseract: https://github.com/tesseract-ocr/tesseract/wiki#windows

       (The path to the Tesseract installation folder must be added to the PATH environment variable.)
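
A quick way to confirm that the environment variable is set correctly is to ask Python whether it can find the Tesseract executable. The sketch below is only an illustration and is not part of the packet: the helper name find_tesseract is hypothetical, and it assumes the executable is called "tesseract" (on Windows, shutil.which also matches tesseract.exe).

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH.

    `find_tesseract` is a hypothetical helper written for this example.
    """
    return shutil.which(binary_name)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("Tesseract was not found on PATH; check the environment variables.")
    else:
        print("Tesseract found at:", location)
```

If the executable cannot be placed on PATH, pytesseract also lets a script point at it directly by assigning the full path to pytesseract.pytesseract.tesseract_cmd.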

Pytesseract:

       pip install pytesseract

PyMuPDF:

      pip install PyMuPDF

Pillow:

      pip install Pillow
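
After installing, it can help to check that all three packages are importable before running the extractor. This is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED_MODULES table are made up for this example. It relies on the fact that PyMuPDF is imported as fitz and Pillow as PIL.

```python
import importlib.util

# pip package name -> module name used in "import ..."
REQUIRED_MODULES = {
    "pytesseract": "pytesseract",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required Python packages are installed.")
```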

 

Usage:

Go to the folder that contains text_extractor.py and open a command prompt (terminal) there. Then type:

python text_extractor.py --file path_to_file

For example: python text_extractor.py --file test.pdf
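
The packet's own text_extractor.py is not reproduced on this page, so the following is only a rough sketch of how a script with this command line could be organised. Everything in it is an assumption made for illustration: the function names (classify, extract_from_image, extract_from_pdf, parse_args, main), the choice to render each PDF page to an image with PyMuPDF and pass it to Tesseract, and the zoom factor of 2. The actual code in the packet may work differently.

```python
import argparse
from pathlib import Path

# Image types named in the description (.jpeg added as a common variant of .jpg).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def classify(path):
    """Return "pdf" or "image" based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Unsupported file type: {path}")


def extract_from_image(path):
    """Run Tesseract OCR on a single image file."""
    # Imported here so the rest of the module loads without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def extract_from_pdf(path, zoom=2.0):
    """Render each PDF page to an image and OCR it (assumed approach)."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    page_texts = []
    with fitz.open(path) as doc:
        for page in doc:
            # A zoom above 1 renders at higher resolution, which tends to help OCR.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            page_texts.append(pytesseract.image_to_string(img))
    return "\n".join(page_texts)


def parse_args(argv=None):
    """Parse the --file option shown in the usage line."""
    parser = argparse.ArgumentParser(description="Extract text from an image or PDF.")
    parser.add_argument("--file", required=True, help="path to a .png, .jpg or .pdf file")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    if classify(args.file) == "pdf":
        print(extract_from_pdf(args.file))
    else:
        print(extract_from_image(args.file))

# To use this as a script, call main() under an
# `if __name__ == "__main__":` guard at the end of the file.
```

Rendering pages to images means scanned PDFs are handled the same way as photographs. For PDFs that already contain a text layer, PyMuPDF's page.get_text() would be a faster alternative that skips OCR.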


Submitted by Ruparna Mukherjee (rupu3097)
