Text Extraction from Images and PDFs in Python Using Tesseract OCR (via pytesseract)
Extracts text from images (.png and .jpg) and PDFs using the Tesseract OCR engine. The project can be extended to extract specific pieces of text from an image or PDF.
Tesseract OCR:
Installation details for Tesseract: https://github.com/tesseract-ocr/tesseract/wiki#windows
(The path to the Tesseract installation folder must be added to the PATH environment variable.)
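If Tesseract is not on PATH, pytesseract can also be pointed at the executable directly through its pytesseract.pytesseract.tesseract_cmd attribute. The following is a minimal sketch, not part of the original script: the helper names resolve_tesseract_cmd and configure_pytesseract are hypothetical, and the Windows path is only an assumed default install location that may differ on your machine.

```python
import os
import shutil

# Assumed default install location on Windows; adjust to your own setup.
ASSUMED_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=ASSUMED_WINDOWS_PATH):
    """Return the path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")  # honours the PATH environment variable
    if on_path:
        return on_path
    # Fall back to an explicit location only if that file really exists.
    return fallback if os.path.isfile(fallback) else None


def configure_pytesseract():
    """Point pytesseract at the resolved executable; raise if none is found."""
    cmd = resolve_tesseract_cmd()
    if cmd is None:
        raise RuntimeError("Tesseract not found; install it or add it to PATH.")
    import pytesseract  # imported lazily so the lookup above has no dependencies

    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure_pytesseract() once, before any OCR call, should be enough in this sketch.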
Pytesseract:
pip install pytesseract
PyMuPDF:
pip install PyMuPDF
Pillow:
pip install Pillow
Go to the project folder, open a command prompt (terminal), and run:
python text_extractor.py --file path_to_file
For example: python text_extractor.py --file test.pdf
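The source of text_extractor.py is not shown here, so the following is only a sketch of how such a script might be organised, not the author's implementation. It assumes a recent PyMuPDF release (imported as fitz), renders each PDF page to an image before passing it to Tesseract, and accepts .jpeg in addition to the extensions listed above. The function names detect_kind, ocr_image, ocr_pdf and build_parser are hypothetical.

```python
import argparse
import io
import os

# .png and .jpg come from the project description; .jpeg is an added assumption.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def detect_kind(path):
    """Classify a path as 'pdf' or 'image' based on its file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file type: {path}")


def ocr_image(path):
    """Run Tesseract on a single image file and return the recognised text."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def ocr_pdf(path, zoom=2.0):
    """Render each PDF page to a PNG in memory, OCR it, and join the results."""
    import fitz  # PyMuPDF
    from PIL import Image
    import pytesseract

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # A zoom factor above 1 renders at higher resolution, which tends to help OCR.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            with Image.open(io.BytesIO(pix.tobytes("png"))) as img:
                pages.append(pytesseract.image_to_string(img))
    return "\n".join(pages)


def build_parser():
    """Build the command-line parser with the --file option used above."""
    parser = argparse.ArgumentParser(description="Extract text from an image or PDF.")
    parser.add_argument("--file", help="path to a .png, .jpg, or .pdf file")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.file:
        parser.print_help()
        return
    kind = detect_kind(args.file)
    text = ocr_pdf(args.file) if kind == "pdf" else ocr_image(args.file)
    print(text)


if __name__ == "__main__":
    main()
```

Dispatching on the file extension keeps the command line identical for images and PDFs, at the cost of trusting that the extension matches the actual file content.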
Submitted by Ruparna Mukherjee (rupu3097)