This project aims to create a tool, built with Python and Tesseract OCR, that identifies the words in a given image and counts the occurrences of a given target word.
The main goal, extracting text data from a given image, has been achieved. The tool:
1. Reads an image containing text.
2. Identifies the text using Tesseract OCR.
3. Searches for the target word.
4. Prints the number of occurrences.
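The four steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names are hypothetical, the `pytesseract` and Pillow packages are assumed to be the OCR wrapper and image loader, and matching is assumed to be case-insensitive and limited to whole words.

```python
import re


def count_occurrences(text: str, target: str) -> int:
    """Count whole-word, case-insensitive matches of target in text.

    Hypothetical helper; the matching rules are an assumption.
    """
    if not target:
        return 0
    pattern = r"\b" + re.escape(target) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_word_in_image(image_path: str, target: str) -> int:
    """Run OCR on an image file and count occurrences of target."""
    # Imported lazily so the counting helper works without the OCR stack.
    # Assumes pytesseract, Pillow, and the Tesseract binary are installed.
    from PIL import Image
    import pytesseract

    # Steps 1 and 2: read the image and extract its text as one string.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Steps 3 and 4: search for the target word and return the count.
    return count_occurrences(text, target)
```

Calling `print(count_word_in_image("sample.png", "python"))` would then print the count for a hypothetical file `sample.png`. Whole-word matching keeps a target such as "cat" from being counted inside "concatenate"; a plain substring count may be preferable if OCR tends to merge adjacent words.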
An additional use case for this project is a CSV converter that reads an image of a table and outputs the data as a comma-separated values (CSV) file.
This project has been attempted in Python using the Tesseract OCR module.
FUTURE SCOPE (shortcomings still to be overcome for the CSV converter):
1. The extracted data comes back as one long string, so a way is needed to isolate the columns and the content within each.
2. The most basic attempt was to split the string into a list of detected words, but the list has the same shortcoming: it carries no column information.
3. Further steps may include:
   1. Using OpenCV to draw bounding boxes around the columns, and then the cells, so that the data in each cell can be identified as accurately as possible.
   2. Using artificial neural networks to train a model that identifies cells in an image, reads them in left-to-right, top-to-bottom order, and builds a list accordingly.
4. The second approach (3.2) is the most promising, but not the only one.
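As a rough illustration of how position information could address the long-string problem, the sketch below groups word bounding boxes into rows and writes them out as CSV. It is neither of the approaches listed above: instead of OpenCV or a neural network, it assumes the word-level boxes that `pytesseract.image_to_data` reports. The function names and the `row_tolerance` value are hypothetical, and each detected word is treated as its own cell, so multi-word cells would still be split.

```python
import csv
import io


def group_words_into_rows(words, row_tolerance=10):
    """Group (left, top, text) word boxes into rows of text.

    Rows are ordered top to bottom and words within a row left to right.
    row_tolerance (in pixels) is an assumed value that needs tuning.
    """
    rows = []
    for word in sorted(words, key=lambda w: w[1]):
        # Join the current row if this word's top edge is close to the
        # top edge of the first word in that row; otherwise start a new row.
        if rows and abs(word[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Tuples sort by their first element, the left coordinate.
    return [[text for _, _, text in sorted(row)] for row in rows]


def rows_to_csv(rows):
    """Serialize rows of cell strings to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def words_from_image(image_path):
    """Collect (left, top, text) boxes for each word Tesseract detects."""
    # Lazy imports: assumes pytesseract, Pillow, and Tesseract are installed.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    boxes = zip(data["left"], data["top"], data["text"])
    # Tesseract also emits entries with empty text for blocks and lines.
    return [(left, top, text) for left, top, text in boxes if text.strip()]
```

For a hypothetical file `table.png`, `rows_to_csv(group_words_into_rows(words_from_image("table.png")))` would return the CSV text. Grouping by the top coordinate alone assumes the table is not skewed; a rotated scan would need deskewing first.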