Coders Packet

Image to Text: Generating Captions with Vision-Encoder-Decoder model

By Amanpreet Singh

This project generates text captions from input images and serves them through a Gradio interface.

In this project, we will learn how to generate a caption (text) from an image using the Vision Encoder-Decoder model.


The VisionEncoderDecoderModel is an image-to-text model that combines the characteristics learned by a Transformer-based vision model (encoder) with the language comprehension skills of a pre-trained language model (decoder).

The role of the vision model is to extract information from the input picture, while the language model is in charge of producing captions based on these data. The VisionEncoderDecoderModel can generate captions that are both descriptive and linguistically correct by merging these two models.

There are many pre-trained vision models and language models that may be used as the encoder and decoder, respectively, allowing for flexibility and modification based on the demands of the given use case.

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei demonstrates the usefulness of employing pre-trained checkpoints to initialize image-to-text-sequence models.

VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture, with one of the library's base vision model classes serving as the encoder and a language model class serving as the decoder.
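As a rough sketch of that pairing, the Transformers library offers VisionEncoderDecoderModel.from_encoder_decoder_pretrained, which builds one model from two separate checkpoints. The helper name build_captioner and the two default checkpoint names below are assumptions chosen for illustration; they are not the checkpoints used later in this article. A model paired this way gets freshly initialized cross-attention weights, so it would need fine-tuning before it produces useful captions.

```python
def build_captioner(encoder_name="google/vit-base-patch16-224-in21k", decoder_name="gpt2"):
    """Pair a pre-trained vision encoder with a pre-trained language decoder.

    The function name and default checkpoint names are illustrative assumptions.
    """
    # Imported here so that defining the helper does not load the library.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_name)

    # GPT-2 ships without a padding token, so reuse end-of-text for padding.
    tokenizer.pad_token = tokenizer.eos_token
    # Tell the decoder which token starts a caption and which one pads a batch.
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling build_captioner() downloads both checkpoints, so it is best run once in a notebook cell.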

Example code:

import requests
from PIL import Image
from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel

# Load the image captioning model, tokenizer, and image processor
v_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning") # Model
v_tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning") # GPT-2 tokenizer
image_process = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning") # Processes visual inputs

inf_url = "" # Image URL
img = Image.open(requests.get(inf_url, stream=True).raw) # Open the image in PIL format
pixel_values = image_process(img, return_tensors="pt").pixel_values # Preprocess the image

Some important parameters related to this model:

inf_url: URL of the image.

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. They can be obtained with an image processor (for example, if you use ViT as the encoder, use AutoImageProcessor or ViTImageProcessor). See ViTImageProcessor.__call__() for details.
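To make the (batch size, num channels, height, width) layout concrete, here is a minimal pure-Python sketch of the reordering and scaling an image processor performs. The function to_pixel_values is hypothetical, and the mean and standard deviation of 0.5 are assumptions based on common ViT defaults; the checkpoint's own preprocessing settings may differ. A real image processor also resizes the image (to 224 x 224 for ViT base models) and returns a tensor rather than nested lists.

```python
def to_pixel_values(image_hwc, mean=0.5, std=0.5):
    """Convert one height x width x channel image with 0-255 values
    into a nested list laid out as (batch, channel, height, width)."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    chw = [
        [
            # Rescale to 0-1, then normalize with the assumed mean and std.
            [(image_hwc[y][x][c] / 255.0 - mean) / std for x in range(width)]
            for y in range(height)
        ]
        for c in range(channels)
    ]
    return [chw]  # leading batch dimension of size 1


# A tiny 2 x 3 RGB image: the top row is white, the bottom row is black.
tiny_image = [
    [[255, 255, 255]] * 3,
    [[0, 0, 0]] * 3,
]
pixel_values = to_pixel_values(tiny_image)
shape = (
    len(pixel_values),
    len(pixel_values[0]),
    len(pixel_values[0][0]),
    len(pixel_values[0][0][0]),
)
print(shape)  # (1, 3, 2, 3)
```

White pixels map to 1.0 and black pixels to -1.0 under these assumed settings.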

The from_pretrained method loads the pre-trained tokenizer, including its vocabulary and settings, from the named checkpoint. The tokenizer converts between text and the numerical token IDs that the GPT-2 decoder works with; during captioning it decodes the generated token IDs back into text.

ViTImageProcessor is a Hugging Face Transformers class that prepares visual inputs (resizing, rescaling, and normalizing images) for the ViT encoder. Here it is loaded with the settings stored in the nlpconnect/vit-gpt2-image-captioning checkpoint.

GPT2TokenizerFast is the Transformers library's fast GPT-2 tokenizer class. Here it handles the text side of the GPT-2 decoder used for image captioning.
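The decoding step can be pictured with a toy example. The vocabulary below is entirely hypothetical (a real GPT-2 tokenizer has about 50,000 byte-level BPE tokens), and decode and batch_decode are simplified stand-ins written for this sketch, not the library's implementation. The sketch only shows the idea: token IDs map back to text pieces, and special tokens such as end-of-text are dropped.

```python
# Hypothetical toy vocabulary, for illustration only.
ID_TO_TOKEN = {0: "<|endoftext|>", 1: "a", 2: " bridge", 3: " over", 4: " water"}
SPECIAL_IDS = {0}


def decode(ids, skip_special_tokens=True):
    """Join the text pieces for a list of token IDs, optionally dropping special tokens."""
    pieces = [
        ID_TO_TOKEN[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return "".join(pieces)


def batch_decode(batch, skip_special_tokens=True):
    """Decode every sequence in a batch of token ID lists."""
    return [decode(ids, skip_special_tokens) for ids in batch]


output_ids = [[0, 1, 2, 3, 4, 0]]  # one generated sequence, wrapped in end-of-text tokens
captions = [caption.strip() for caption in batch_decode(output_ids)]
print(captions[0])  # a bridge over water
```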

To read more, see the Vision Encoder-Decoder Model documentation.



In this project, we use Google Colab with Python 3.9. The following libraries need to be installed (in a Colab notebook cell, prefix each command with "!"):

pip install torch

pip install gradio

pip install transformers
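After installing, a quick check like the one below can confirm that the three packages are visible to the running Python environment. The helper name installed_versions is made up for this sketch; it relies only on the standard library's importlib.metadata, which is available in Python 3.8 and later.

```python
from importlib import metadata


def installed_versions(packages=("torch", "gradio", "transformers")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```

A value of None for any package means the corresponding pip install step still needs to be run.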


Import Libraries

Let's import all the required libraries in order to generate text from images.

import torch
from transformers import ViTFeatureExtractor, AutoTokenizer, VisionEncoderDecoderModel
import gradio as gr

Load Model

Let's load the pre-trained model, together with its feature extractor and tokenizer, using the Transformers library.

locc = "ydshieh/vit-gpt2-coco-en"

feature_extractor = ViTFeatureExtractor.from_pretrained(locc)
tokenizer = AutoTokenizer.from_pretrained(locc)
model = VisionEncoderDecoderModel.from_pretrained(locc)

The first time this runs, the model files are downloaded into the environment.

Generate Caption

To caption images, the code below uses the Hugging Face Transformers library. It relies on the three pieces loaded earlier: a ViT (Vision Transformer) feature extractor for image preprocessing (ViTFeatureExtractor, the older name for ViTImageProcessor), an AutoTokenizer for decoding, and a VisionEncoderDecoderModel for caption generation.

The predict function accepts an image and returns its caption. The image is passed through the feature extractor to obtain the pixel values, which are then fed into the VisionEncoderDecoderModel's generate method. Here generation uses beam search with 4 beams (num_beams=4) and a maximum output length of 16 tokens (max_length=16). The output token IDs are decoded with the tokenizer, and the first caption is returned after leading and trailing whitespace is removed.

def predict(image):

    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=16, num_beams=4, return_dict_in_generate=True).sequences

    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]

    return preds[0]
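To give a feel for what num_beams and max_length control in the generate call above, here is a simplified beam search in pure Python. It is a sketch of the general idea, not the Transformers implementation, which adds length penalties and other refinements. The toy_model function and its probabilities are invented for illustration and stand in for the GPT-2 decoder.

```python
import math


def beam_search(next_token_probs, num_beams=4, max_length=16, eos="<eos>"):
    """Return the highest-scoring token sequence found by a simplified beam search.

    next_token_probs(prefix) must return a dict mapping each candidate next
    token to its probability, given the tuple of tokens generated so far.
    """
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the num_beams best continuations at this step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off by max_length
    return max(finished, key=lambda item: item[1])[0]


def toy_model(prefix):
    """Hypothetical next-token distribution, standing in for the decoder."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.4, "cat": 0.3, "<eos>": 0.3},
        ("the",): {"bridge": 0.9, "<eos>": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


greedy = beam_search(toy_model, num_beams=1)
wider = beam_search(toy_model, num_beams=2)
print(greedy)  # ('a', 'dog', '<eos>')
print(wider)   # ('the', 'bridge', '<eos>')
```

With a single beam the search commits to the most likely first word and ends with a total probability of 0.24, while two beams keep the second-best first word alive and find a sequence with probability 0.36. Larger num_beams values explore more alternatives at a higher compute cost.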

Build Interface with Gradio

Let's build an Interface that takes input as an image and displays output as text using Gradio.

  • Call the predict() function which we created earlier.
  • Make an input box that uploads an image and an output box that shows the predicted caption.
  • Define the title="Image to Text".
  • Define description in Interface.
  • Launch the webserver.

To learn more about Gradio, see the Gradio documentation.


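inputs = gr.Image()  # input box for uploading an image
outputs = gr.Textbox()  # output box showing the predicted caption

golden_gate = ["/content/golden_gate_bridge.jpeg", "/content/the_great_wave.jpeg"]
joshua_tree = ["/content/joshua_tree.jpeg", "/content/starry_night.jpeg"]
glacier = ["/content/glacier_national_park.jpeg", "/content/the_scream.jpg"]

# The interface has a single image input, so each example row holds one image path.
examples = [[path] for group in (glacier, golden_gate, joshua_tree) for path in group]

interface = gr.Interface(
    fn=predict,
    inputs=inputs,
    outputs=outputs,
    title="Image to Text",
    description="Generate text from images with a pre-trained Transformer-based vision model.",
    examples=examples,
)

interface.launch()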






Submitted by Amanpreet Singh (Aman9868)
