This project focuses on generating text from an input image through a Gradio interface.
In this project, we will learn how to generate captions/text from images using the Vision Encoder-Decoder Model.
The VisionEncoderDecoderModel is an image-to-text model that combines the characteristics learned by a Transformer-based vision model (encoder) with the language comprehension skills of a pre-trained language model (decoder).
The role of the vision model is to extract information from the input picture, while the language model is in charge of producing captions based on this information. By merging these two models, the VisionEncoderDecoderModel can generate captions that are both descriptive and linguistically correct.
Many pre-trained vision models and language models can be used as the encoder and decoder, respectively, allowing for flexibility and customization based on the demands of the given use case.
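As a minimal sketch of this flexibility, any pre-trained vision encoder can be paired with a pre-trained language decoder; the checkpoints below are only illustrative choices (not the ones used later in this project), and the resulting cross-attention layers are newly initialized, so the combined model would still need fine-tuning on captioning data.

# Illustrative sketch: pair an arbitrary vision encoder with a language decoder
from transformers import VisionEncoderDecoderModel

combined = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder checkpoint
    "gpt2",                               # language decoder checkpoint
)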
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei demonstrates the usefulness of employing pre-trained checkpoints to initialize image-to-text-sequence models.
VisionEncoderDecoderModel is a generic model class that instantiates a transformer architecture in which one of the library's base vision model classes serves as the encoder and another serves as the decoder.
Example code:
import requests
from PIL import Image
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# Load the image-captioning model, tokenizer and image processor
v_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")  # model
v_tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")      # GPT-2 tokenizer
image_process = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")    # processes visual inputs

inf_url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # image URL
img = Image.open(requests.get(inf_url, stream=True).raw)             # open image in PIL format
pixel_values = image_process(img, return_tensors="pt").pixel_values  # preprocess image
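The snippet above only preprocesses the image. To actually produce a caption, the pixel values are passed to the model's generate method and the resulting token ids are decoded with the tokenizer (a minimal sketch reusing the variables defined above):

# Generate caption token ids and decode them back into text
generated_ids = v_model.generate(pixel_values, max_length=16, num_beams=4)
caption = v_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)  # prints the generated caption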
Some important parameters related to this model:
inf_url: URL of the image.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. They can be obtained using an image processor (e.g., if you use ViT as the encoder, you should use ViTImageProcessor or AutoImageProcessor). See ViTImageProcessor.__call__() for more information.
The "from pretrained" method loads the pre-trained tokenizer along with other parameters, allowing it to tokenize GPT-2 model captions. The tokenizer is used to translate the produced captions into numerical representations that the GPT-2 model may use.
ViTImageProcessor is an NLP-Connect library custom class used to handle visual inputs for a pre-trained GPT-2 model for image captioning.
GPT2TokenizerFast is an NLP-Connect library custom class used to tokenize text for use with the GPT-2 model for picture captioning.
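For intuition, here is a short sketch reusing v_tokenizer, image_process and pixel_values from the example above; the printed tensor shape assumes the default 224x224 ViT input size, and the sample sentence is arbitrary.

print(pixel_values.shape)                                  # e.g. torch.Size([1, 3, 224, 224]) for one RGB image
ids = v_tokenizer("a cat sitting on a couch").input_ids    # text -> token ids
print(v_tokenizer.decode(ids))                             # token ids -> text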
To read more, see the Vision Encoder-Decoder Model documentation.
In this project, we are using the Google Colab IDE with Python version 3.9. In our IDE, we need to install the following libraries:
pip install torch
pip install gradio
pip install transformers
Let's import all the required libraries in order to generate text from images.
import torch
from transformers import ViTFeatureExtractor, AutoTokenizer, VisionEncoderDecoderModel
import gradio as gr
Let's load the pre-trained model from the Transformers library.
locc = "ydshieh/vit-gpt2-coco-en" feature_extractor = ViTFeatureExtractor.from_pretrained(locc) tokenizer = AutoTokenizer.from_pretrained(locc) model = VisionEncoderDecoderModel.from_pretrained(locc) model.eval()
The model weights will be downloaded into the environment the first time this cell runs.
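As an optional quick check, the encoder/decoder composition of the loaded checkpoint can be inspected from the model config:

print(model.config.encoder.model_type, "->", model.config.decoder.model_type)  # expected: vit -> gpt2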
To caption images, the code below employs the Hugging Face Transformers library. It uses the pre-trained ViT (Vision Transformer) feature extractor for preprocessing, an AutoTokenizer for tokenization, and a VisionEncoderDecoderModel for caption generation.
The predict function accepts an image and returns its caption. The image is passed through the feature extractor to obtain the pixel values, which are then fed into the VisionEncoderDecoderModel's generate method to produce caption token ids. The model's output is then decoded using the tokenizer, and the first caption is returned after any leading or trailing whitespace is removed.
def predict(image):
    # Preprocess the input image into pixel-value tensors
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        # Generate caption token ids with beam search
        output_ids = model.generate(
            pixel_values, max_length=16, num_beams=4, return_dict_in_generate=True
        ).sequences
    # Decode token ids into text and strip surrounding whitespace
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds[0]
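Before wiring this into Gradio, the function can be sanity-checked on a single picture. This is a minimal sketch; "/content/sample.jpeg" is a hypothetical path standing in for any local test image.

from PIL import Image
test_img = Image.open("/content/sample.jpeg")  # hypothetical local test image
print(predict(test_img))                       # prints the generated caption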
Let's build a Gradio Interface that takes an image as input and displays the generated text as output.
To learn more about Gradio, visit: https://gradio.app/
Code
inputs = gr.inputs.Image()
outputs = gr.outputs.Textbox()

# Each example supplies one image path, matching the single image input
examples = [
    ["/content/glacier_national_park.jpeg"],
    ["/content/golden_gate_bridge.jpeg"],
    ["/content/joshua_tree.jpeg"],
    ["/content/the_great_wave.jpeg"],
    ["/content/starry_night.jpeg"],
    ["/content/the_scream.jpg"],
]

interface = gr.Interface(
    predict,
    inputs,
    outputs,
    title="Image-to-Text",
    description="Generate text from images with a pre-trained Transformer-based vision model.",
    examples=examples,
)
interface.launch(debug=True)
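launch(debug=True) keeps the cell running and shows errors inline, which is convenient in Colab. If you also want a temporary public URL to share the demo, Gradio supports a share flag:

interface.launch(debug=True, share=True)  # share=True creates a temporary public link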
Output
Submitted by Amanpreet Singh (Aman9868)
Download packets of source code on Coders Packet