Coders Packet

Text Extraction from live frame using Tesseract in Python

By Rachit R Jindal

This packet uses Pytesseract, a Python wrapper for the Tesseract OCR (Optical Character Recognition) engine, to recognize characters in a live video frame.

This packet uses two Python modules: cv2 (OpenCV) for capturing and reading the live stream, and Pytesseract for character recognition and extraction. Before continuing, install the Tesseract engine on your system, then install the pytesseract and opencv-python packages in a virtual environment.

import pytesseract
from pytesseract import Output
import cv2
from PIL import Image

Pytesseract's Output class selects the format in which results are returned (here, a dictionary). The returned data describes each piece of detected text: its position (left, top, width, height), its confidence score, and so on. cv2 handles reading the frame and preprocessing it so that recognition is easier.

Capturing, Reading, and Processing frame:

cap = cv2.VideoCapture(0)

while True:

    # Reading the frame
    ret,frame = cap.read()

    # Converting into gray frame
    gray_frame = cv2.cvtColor(frame,cv2.COLOR_BGR2GRAY)

The code above opens the default camera (index 0) as 'cap' and, on each loop iteration, reads one frame into 'frame'; 'ret' reports whether the read succeeded. The frame is then converted to grayscale, which removes color variation before the frame is passed to Tesseract for text extraction.

Text Extraction:

    # Extracting the data from the frame in the form of a dictionary
    frame_data = pytesseract.image_to_data(gray_frame,output_type=Output.DICT)

The code above stores the result in the 'frame_data' variable, a dictionary whose keys include left, top, width, height, conf, and text. Each key maps to a list with one entry per detected element, so index i across the lists describes the same element: its text, its confidence, and its position in the frame.
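To make the shape of that dictionary concrete, here is a small sketch that filters a hand-written sample. The values in sample_data are made up for illustration, and confident_words is a hypothetical helper name. Entries that are not words (pages, blocks, lines) come back with empty text and a confidence of -1, and the type of 'conf' can vary between pytesseract versions, so the sketch converts it with float().

```python
# Hand-written sample in the shape returned by
# pytesseract.image_to_data(..., output_type=Output.DICT).
# All values here are invented for illustration.
sample_data = {
    "left":   [0, 12, 80, 150],
    "top":    [0, 20, 20, 22],
    "width":  [640, 60, 62, 30],
    "height": [480, 18, 18, 16],
    "conf":   ["-1", "96.4", "91.0", "7.5"],
    "text":   ["", "Hello", "World", "~|"],
}


def confident_words(data, min_conf=10.0):
    """Return (text, (x, y, w, h)) for entries above the confidence threshold."""
    words = []
    for i, raw_text in enumerate(data["text"]):
        text = raw_text.strip()
        conf = float(data["conf"][i])  # may be int, float, or str
        if text and conf > min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append((text, box))
    return words


words = confident_words(sample_data)
print(words)
```

Only "Hello" and "World" pass the filter here: the first entry has no text, and the last one falls below the threshold.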

Showing text on the frame:

    # Setting the coordinates of the text scanned
    for i in range(len(frame_data['text'])):

        # x -> coordinate from left, y -> coordinate from top,
        # w -> width of the text, h -> height of the text
        x = frame_data['left'][i]
        y = frame_data['top'][i]
        w = frame_data['width'][i]
        h = frame_data['height'][i]

        accuracy = frame_data['conf'][i]

        # Showing data only if the confidence is more than 10%
        # You can change this threshold, but a good value depends heavily on
        # the quality of the scanned frame and the data
        # float() is used because 'conf' may be a number or a string
        if float(accuracy) > 10:
            # Setting the text
            text = frame_data['text'][i].strip()
            # Placing the text on the frame
            # (font, scale, color, and thickness are illustrative choices)
            cv2.putText(frame, text, (x, y), cv2.FONT_HERSHEY_SIMPLEX,
                        0.8, (0, 255, 0), 2)
    # Showing the frame
    cv2.imshow("Text Frame",frame)

    # Exiting the loop when 'q' is pressed
    if cv2.waitKey(1) & 0xff==ord('q'):
        break

# Releasing the camera and closing the frame window
cap.release()
cv2.destroyAllWindows()

The above code stores the left and top coordinates along with the width and height of each detected element, which together give the text's bounding box. Next, 'accuracy' holds the confidence Tesseract reports for that element. If the confidence is above 10%, the text is stripped of surrounding whitespace and placed on the frame at those coordinates. You can raise the threshold, but doing so usually calls for preprocessing such as thresholding or Gaussian blur to produce a cleaner, less noisy frame.

After the text has been placed, the frame is displayed with the recognized text shown on it. The program exits only when 'q' is pressed, as checked by ord('q'); otherwise capturing continues indefinitely.
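The key check can be read in isolation. cv2.waitKey returns -1 when no key was pressed during the wait, and masking with 0xFF keeps only the low 8 bits of the returned code before comparing it with ord('q'). The helper name is_quit_key below is hypothetical and only illustrates the comparison.

```python
def is_quit_key(key_code, quit_char="q"):
    """Return True if a waitKey-style key code corresponds to quit_char."""
    # Keep only the low 8 bits, then compare with the character's code.
    return (key_code & 0xFF) == ord(quit_char)


print(is_quit_key(ord("q")))  # True
print(is_quit_key(-1))        # False: no key was pressed
print(is_quit_key(ord("a")))  # False
```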

Finally, the camera is released and the frame window is closed. With stronger preprocessing and cleaner input frames, this approach could serve as the basis for a document scanner.
