Image Text Searcher Using Python & Tesseract

By Utsav Tripathi

This project aims to create a tool using Python & Tesseract OCR that identifies the words in a given image and counts the occurrences of a given target word.

The main aim of the project, a tool for extracting text data from a given image, has been achieved.

This project:

1. reads an image with text data

2. identifies text using Tesseract OCR

3. searches for the target word

4. prints the number of occurrences
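The four steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the pytesseract wrapper and Pillow are installed alongside the Tesseract binary, and the function names count_occurrences and count_word_in_image are hypothetical. Matching is assumed to be whole-word and case-insensitive, which the project may or may not do.

```python
import re


def count_occurrences(text, target, ignore_case=True):
    """Count whole-word occurrences of target in text (assumed matching rule)."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b anchors keep "cat" from matching inside "concatenate".
    pattern = r"\b" + re.escape(target) + r"\b"
    return len(re.findall(pattern, text, flags))


def count_word_in_image(image_path, target):
    """Read an image, run Tesseract OCR on it, and count the target word."""
    # Imported lazily so count_occurrences can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path))
    return count_occurrences(text, target)
```

Calling count_word_in_image("sample.png", "python") would then return the count for a hypothetical image file sample.png. OCR errors in the recognised text will affect the count, so the result should be treated as approximate.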


An additional use case for this project is a CSV converter that reads an image of a table and outputs the data as a Comma Separated Values (CSV) file.

This project has been attempted in Python using the Tesseract OCR module.

FUTURE SCOPE (shortcomings still to be overcome for the CSV converter):
1. The extracted data is a single long string, so a way is needed to isolate the columns and their contents.

2. The most basic attempt was to split the string into a list of detected words, but the list has the same shortcoming: it does not show which column a word belongs to.

3. Further steps may include:

          1. Using OpenCV to create a bounding box around the columns and subsequently the cells, so as to identify the data in a cell as accurately as possible.

          2. Using Artificial Neural Networks to train a model that identifies the cells in an image, reads them in left-to-right, top-to-bottom order, and creates a list accordingly.

4. The second approach, i.e., 3.2, is the most promising approach but not the only one.
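As one further possibility, distinct from both 3.1 and 3.2, the word positions that Tesseract already reports could be used to rebuild rows and cells. The sketch below is an assumption-laden illustration rather than part of this project: it relies on pytesseract's image_to_data call, and the helper names and the pixel thresholds (row_tolerance, gap) are hypothetical values that would need tuning per image.

```python
import csv
import io


def ocr_word_boxes(image_path):
    """Return (left, top, width, text) for each word Tesseract detects."""
    # Imported lazily so the grouping helpers work without the OCR stack.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    boxes = zip(data["left"], data["top"], data["width"], data["text"])
    return [(l, t, w, s) for l, t, w, s in boxes if s.strip()]


def group_into_rows(words, row_tolerance=10):
    """Group word boxes into rows when their top coordinates are close."""
    rows = []
    for left, top, width, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["words"].append((left, width, text))
        else:
            rows.append({"top": top, "words": [(left, width, text)]})
    # Order the words of each row from left to right.
    return [sorted(row["words"]) for row in rows]


def row_to_cells(row, gap=25):
    """Merge neighbouring words into one cell when the gap between them is small."""
    cells = []
    prev_right = None
    for left, width, text in row:
        if prev_right is not None and left - prev_right <= gap:
            cells[-1] += " " + text
        else:
            cells.append(text)
        prev_right = left + width
    return cells


def words_to_csv(words):
    """Turn word boxes into CSV text, one line per detected row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in group_into_rows(words):
        writer.writerow(row_to_cells(row))
    return buffer.getvalue()
```

With a hypothetical image table.png, words_to_csv(ocr_word_boxes("table.png")) would produce the CSV text. This heuristic does not detect empty cells, so columns can shift when a table has gaps; the bounding-box approach in 3.1 would handle that case better.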

