Researchers from Adobe and the University of North Carolina (UNC) have open-sourced CLIP-S, an AI image-captioning model that produces detailed descriptions of images. In evaluations comparing its output with captions generated by other models, human judges preferred those generated by CLIP-S most of the time.
The model and experiments were described in a paper accepted at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). CLIP-S uses a Transformer model to generate captions for an input image. During training, the model uses CLIP to score how well the generated caption describes the image; this score is used as a reward signal for reinforcement learning (RL). To improve the grammar of the generated captions, the team fine-tuned CLIP with negative example captions, which were generated by randomly modifying the reference captions. To address the shortcomings of existing image-caption evaluation methods, the team also developed a new benchmark dataset, FineCapEval, which contains more fine-grained image captions describing image backgrounds and the relationships between objects. According to the research team:
The reference captions of public datasets often describe only the most prominent objects in the images. This causes models trained to maximize textual similarity to reference captions to generate less distinctive captions that ignore the fine-grained aspects of an image that set it apart from others.
Many image-captioning models are trained on datasets consisting of input images and reference captions; the training target measures the similarity of the generated caption to the reference caption, using metrics such as BLEU. However, this often results in models generating generic captions that describe only the prominent objects in the image and ignore the fine details that make the image distinctive.
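To make the limitation concrete, the snippet below scores an invented caption against an invented reference with NLTK's sentence-level BLEU, standing in for the reference-similarity metrics mentioned above; the captions and the choice of smoothing are illustrative assumptions, not from the paper.

```python
# Minimal sketch of a reference-similarity metric such as BLEU, using NLTK's
# sentence-level BLEU; the captions below are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog runs across a grassy field".split()
generated = "a brown dog runs across a field chasing a red frisbee".split()

# BLEU only rewards n-gram overlap with the reference; the extra fine-grained
# detail ("brown", "red frisbee") earns no credit because it is absent from
# the reference caption.
score = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

A model optimized only against such a target therefore has little incentive to mention details that the reference caption leaves out.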
To address this issue, the Adobe team chose to use OpenAI’s CLIP model to measure the accuracy of the generated captions. CLIP measures the similarity between an image and a text string; the better the text describes the image, the closer the match. The researchers used this CLIP score to create a reward function, CLIP-S, for the RL training of their caption model.
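A minimal sketch of such an image-text reward and of how it could drive an RL update is shown below, using the Hugging Face transformers CLIP implementation as a stand-in; the checkpoint name, the 2.5 scaling weight, and the self-critical baseline subtraction are assumptions based on the common CLIPScore and self-critical sequence-training formulations, not Adobe's released code.

```python
# Hedged sketch of a CLIP-based caption reward plus a policy-gradient loss.
# Checkpoint, 2.5 weight, and the self-critical baseline are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, caption: str, weight: float = 2.5) -> float:
    """Score how well `caption` describes `image` via CLIP cosine similarity."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between L2-normalized embeddings, clipped at zero.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum(dim=-1).item()
    return weight * max(cosine, 0.0)

def rl_caption_loss(log_probs: torch.Tensor,
                    sampled_reward: torch.Tensor,
                    baseline_reward: torch.Tensor) -> torch.Tensor:
    """Self-critical-style policy gradient: raise the likelihood of sampled
    captions that score better than a baseline (e.g. greedy) caption."""
    advantage = (sampled_reward - baseline_reward).detach()
    return -(advantage * log_probs).mean()
```

In this setup the CLIP-based reward replaces the text-similarity target described earlier, so the captioner is pushed toward descriptions that CLIP judges as better matches for the image rather than closer copies of the reference caption.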
However, the team found that this model often generated grammatically incorrect captions, for example by repeating words: “multiple rows of aircraft parked outside a terminal window area with fog outside a terminal window motion position area motion.” Their solution was to fine-tune the text-encoder portion of CLIP with negative examples containing randomly repeated, inserted, or shuffled tokens. They also introduced a two-layer perceptron classification head that detects whether a sentence is grammatically correct, training it jointly with the fine-tuning of the text encoder.
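A rough sketch of this kind of negative-example generation appears below; the corruption operations follow the description above, but the sampling details are illustrative assumptions rather than the paper's exact recipe.

```python
# Illustrative generator of grammatically-broken negative captions by randomly
# repeating, inserting, or shuffling tokens; sampling details are assumptions.
import random

def make_negative_caption(caption: str, rng: random.Random) -> str:
    tokens = caption.split()
    op = rng.choice(["repeat", "insert", "shuffle"])
    if op == "repeat":
        # Duplicate a randomly chosen token in place.
        i = rng.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    elif op == "insert":
        # Copy a random token into a random position.
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(tokens))
    else:
        # Scramble the word order.
        rng.shuffle(tokens)
    return " ".join(tokens)

rng = random.Random(0)
print(make_negative_caption("several rows of aircraft parked outside a terminal", rng))
```

During fine-tuning, the text encoder with its added classification head would then be trained to accept the original caption as grammatical and reject such corrupted variants.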
The team also created FineCapEval, a new benchmark dataset for evaluating fine-grained image-captioning models. The dataset contains 500 images from the MS COCO test split and 500 from the Conceptual Captions validation split. For each image, five human workers wrote descriptions of: the background of the image; the objects in the image, including their shape and color; the relationships between the objects, such as spatial relationships; and a detailed caption covering all three of the previous aspects. In total, the dataset contains 1k images with 5k captions for each of these four criteria.
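To make that structure concrete, the snippet below sketches what a single FineCapEval entry might look like; the field names and captions are hypothetical, and only the four annotation criteria come from the article.

```python
# Hypothetical shape of one FineCapEval entry; field names and captions are
# invented for illustration. In the real dataset each criterion has five
# annotator-written captions per image; only one is shown here.
example_entry = {
    "image_id": "coco_test_0001",  # hypothetical identifier
    "background": ["an airport apron on a foggy morning"],
    "object": ["several white passenger jets with blue tail fins"],
    "relation": ["the jets are parked in a row beside the terminal building"],
    "overall": ["several white jets parked in a row beside the terminal "
                "on a foggy morning"],
}
```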
To evaluate their model, the team compared its captions with those of several baseline models, using the COCO dataset as a benchmark. While a baseline model outperformed CLIP-S on text-based metrics such as BLEU, CLIP-S did better on image-text-based metrics and on text-to-image retrieval metrics. It also “significantly” outperformed the baselines on the team’s new FineCapEval benchmark. Finally, human judges “strongly” preferred captions generated by CLIP-S over those generated by the baseline models.
Multimodal image-text AI models are an active research topic. InfoQ recently reported on DeepMind’s Flamingo model, which features state-of-the-art few-shot learning capabilities for a variety of image-text tasks, including image captioning. Last year InfoQ covered Google’s ALIGN model as well as Alibaba’s M6 model, both of which can perform a variety of image-text tasks.
The CLIP-S code and the FineCapEval dataset are available on GitHub.