Amazon AI researchers develop a new technique for training neural networks to learn better joint representations of images and text

Co-embedding of images and text is the foundation of most vision-and-language (V+L) tasks, which process multimodal input to arrive at a combined visual and textual understanding. The input is typically an image together with a written description of that image. Aligning image and text features, and training neural networks to produce joint representations of images and their accompanying text, have both received tremendous attention. These representations are useful for a variety of computer vision applications, including text and image search.

A new method of aligning images and texts treats them as different “views” of the same thing and uses a codebook of cluster centers to span the combined vision-language embedding space.

Joint image-text models are often trained through contrastive learning. The model is given training samples in pairs, one positive and one negative, and learns to pull positive pairs together and push negative pairs apart in the representation space. Trained, for example, on photos paired with two text labels, one correct and one chosen at random, the model learns to associate images with the appropriate labels in a shared multimodal representation space.
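To make this concrete, here is a minimal sketch of a symmetric image-text contrastive (InfoNCE) objective in PyTorch. It is not the code from either Amazon paper; the encoder outputs, embedding dimension, and temperature value are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings from separate encoders.
    For each image, the matching text in the batch is the positive;
    every other text is a negative (and vice versa).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together and push mismatched pairs apart,
    # in both the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

The temperature controls how sharply the loss penalizes hard negatives; values around 0.05 to 0.1 are common in practice.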

Forcing alignment between different modalities through contrastive learning alone can degrade the learned features. Image-text representation learning aims to cluster images together with their corresponding texts, whereas a neural network trained on multimodal inputs naturally prefers to group data of the same modality in the representation space. Amazon researchers are exploring ways of adding more structure to the representation space in order to learn more reliable image-text alignments.

The new method of matching images and texts treats an image and a text as two different “views” of the same underlying concept and uses a codebook of cluster centers to span the combined vision-language coding space. Each center serves as the anchor for a group of related ideas, whether those ideas are expressed visually or textually. The paper “Multimodal alignment using representation codebook” proposes using these cluster representations to align images and text at a higher, more stable level, counteracting the network's tendency to group data by modality.
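A rough sketch of how such a codebook might look, assuming a learnable set of cluster centers shared by both modalities and a soft assignment of each embedding to those centers. This illustrates the general idea rather than the exact formulation of the “Multimodal alignment using representation codebook” paper; the codebook size, temperature, and loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationCodebook(nn.Module):
    """Shared codebook of cluster centers spanning the joint vision-language space.

    Image and text embeddings are softly assigned to the same set of codewords,
    and alignment is encouraged at the level of those assignments rather than
    the raw features. (Illustrative sketch; sizes and weighting are assumptions.)
    """
    def __init__(self, num_codes=1024, dim=256, temperature=0.05):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.temperature = temperature

    def assign(self, features):
        # Soft assignment of each feature vector to the cluster centers.
        features = F.normalize(features, dim=-1)
        codes = F.normalize(self.codebook, dim=-1)
        return F.softmax(features @ codes.t() / self.temperature, dim=-1)

    def alignment_loss(self, image_emb, text_emb):
        p_img = self.assign(image_emb)   # (batch, num_codes)
        p_txt = self.assign(text_emb)    # (batch, num_codes)
        # Encourage an image and its paired text to fall into the same clusters:
        # symmetric cross-entropy between the two assignment distributions,
        # with a stop-gradient on the side used as the target.
        loss = -(p_txt.detach() * torch.log(p_img + 1e-8)).sum(-1).mean() \
               - (p_img.detach() * torch.log(p_txt + 1e-8)).sum(-1).mean()
        return 0.5 * loss
```

Comparing cluster assignments rather than individual embeddings is one way to realize the “higher, more stable level” of alignment the paper describes: an image and its text only need to land in the same region of the codebook, not at the same point.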

During training, the researchers contrast positive and negative samples through their cluster assignments while simultaneously optimizing the cluster centers, and the resulting models are competitive on several additional transfer tasks. The effectiveness of contrastive learning for training image-text alignment models has been credited to its ability to maximize the mutual information between image and text pairs, that is, the extent to which the features of one modality can be predicted from the other. However, cross-modal alignment (CMA) alone ignores potentially significant relationships within each modality: while CMA maps image-text pairs close together in the embedding space, it does not ensure that similar inputs of the same modality remain close to each other. If the pre-training data is noisy, the problem gets worse. The researchers therefore use triple contrastive learning (TCL) for vision-language pre-training, combining cross-modal and intra-modal self-supervision, i.e., training on designed tasks so that labeled training examples are not necessary.
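A hedged sketch of how cross-modal alignment can be combined with intra-modal self-supervision, reusing the hypothetical contrastive_loss helper from the earlier sketch. The augmented views (for example, a second image crop or the output of a momentum encoder) and the loss weighting are assumptions for illustration, not details taken from the TCL paper.

```python
def triple_contrastive_loss(img_emb, txt_emb, img_emb_aug, txt_emb_aug,
                            temperature=0.07):
    """Combine cross-modal alignment with intra-modal self-supervision.

    img_emb_aug / txt_emb_aug are embeddings of augmented views of the same
    inputs, so no labeled examples are needed for the intra-modal terms.
    """
    # Cross-modal alignment (CMA): match each image with its paired text.
    cma = contrastive_loss(img_emb, txt_emb, temperature)
    # Intra-modal contrast: preserve structure within each modality, so that
    # similar inputs of the same modality stay close to each other.
    imc_image = contrastive_loss(img_emb, img_emb_aug, temperature)
    imc_text = contrastive_loss(txt_emb, txt_emb_aug, temperature)
    return cma + 0.5 * (imc_image + imc_text)
```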

The method aims to maximize the average mutual information between local regions of the image or text and their global summary. The use of localized and structural information from the image and text inputs is what makes this research innovative. Experimental evaluations show that the technique is competitive and achieves new state-of-the-art results on several common downstream vision-language tasks, including image-text retrieval and visual question answering.
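One common way to approximate such an objective is to contrast each local region or token feature against the global summary of its own sample, treating other samples' summaries as negatives. The sketch below is a hypothetical illustration under that assumption; the shapes, temperature, and loss form are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def local_mi_loss(local_feats, global_summary, temperature=0.07):
    """Contrastive proxy for maximizing local-global mutual information.

    local_feats: (batch, num_regions, dim) image-patch or text-token features.
    global_summary: (batch, dim) global representation of the paired input.
    """
    b, n, d = local_feats.shape
    local_feats = F.normalize(local_feats, dim=-1).reshape(b * n, d)
    global_summary = F.normalize(global_summary, dim=-1)

    # Each local feature scores against every sample's global summary.
    logits = local_feats @ global_summary.t() / temperature  # (b*n, b)
    # Local feature i*n + j belongs to sample i, which is its only positive.
    targets = torch.arange(b, device=logits.device).repeat_interleave(n)
    return F.cross_entropy(logits, targets)
```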

This article is written as a summary by Marktechpost staff based on Amazon's research papers: Paper 1 and Paper 2. All credit for this research goes to the researchers on this project.

