NVIDIA researchers propose Global Context Vision Transformer (GC ViT): a new architecture that makes more efficient use of parameters and compute

The Vision Transformer (ViT), adapted from the transformer architectures that dominate natural language processing, has become one of the leading architectures for computer vision (CV) problems. Compared to traditional CNNs, transformer-based models show exceptional ability to model both short- and long-range dependencies. The fundamental limitation holding back further development and application of ViT is the quadratic computational complexity of self-attention, which makes modeling high-resolution images prohibitive. A team of NVIDIA researchers has proposed a simple yet novel hierarchical ViT design called the Global Context Vision Transformer (GC ViT). Its global self-attention and token generation modules enable effective modeling without expensive computation, while delivering strong performance across a variety of computer vision tasks. The team presents this architecture in their recent paper, Global Context Vision Transformers.
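To make the scaling problem concrete, here is a small back-of-the-envelope Python snippet (not from the paper) showing how the attention matrix of a plain ViT grows quadratically with the number of patch tokens; the 16x16 patch size is an illustrative assumption.

```python
# Back-of-the-envelope illustration (not from the paper): with full self-attention,
# the N x N attention matrix grows quadratically in the number of patch tokens N.
patch = 16  # assumed ViT-style 16x16 patches, for illustration only
for side in (224, 512, 1024):
    tokens = (side // patch) ** 2       # number of patch tokens N
    attn_entries = tokens ** 2          # entries in the N x N attention matrix
    print(f"{side}x{side} image -> {tokens:5d} tokens -> {attn_entries:,} attention entries")
```

Quadrupling the image area quadruples the token count and multiplies the attention cost by sixteen, which is why full global attention quickly becomes impractical at high resolution.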

The GC ViT architecture has a hierarchical framework that captures feature representations at multiple resolutions. Given an input image, the model applies a convolutional layer with appropriate padding to produce overlapping patches. According to the research team, this design can serve as a general backbone for various computer vision tasks, including classification, detection, and instance segmentation. The model's simple construction, which models both short- and long-range dependencies by capturing global contextual information, reduces the need for costly computation. The proposed GC ViT outperforms both CNN- and ViT-based models by a wide margin and sets new state-of-the-art results on the ImageNet-1K dataset across model sizes and FLOP budgets. GC ViT also achieves SOTA performance on the MS COCO and ADE20K datasets for object detection and semantic segmentation.
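As a rough illustration of what an overlapping patch-embedding stem can look like, here is a minimal PyTorch sketch; the 3x3 kernel, stride of 2, and embedding width are assumptions for illustration, not the exact GC ViT settings.

```python
import torch
import torch.nn as nn

# Hedged sketch of an overlapping patch-embedding stem: a strided convolution
# with padding so that neighboring patches share pixels. Kernel size, stride,
# and embedding width here are illustrative choices, not the GC ViT settings.
class OverlappingPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.proj(x)  # (B, embed_dim, H/2, W/2)

x = torch.randn(1, 3, 224, 224)
print(OverlappingPatchEmbed()(x).shape)  # torch.Size([1, 64, 112, 112])
```

Because the kernel is wider than the stride, each output location sees pixels shared with its neighbors, unlike the non-overlapping patch slicing of the original ViT.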

Source: https://arxiv.org/pdf/2206.09959.pdf

Each GC ViT processing stage alternates between local and global self-attention modules to extract spatial features. The global self-attention mechanism has access to global tokens produced by a novel Global Token Generator. The resulting features are then passed through average pooling and a linear layer to produce an embedding for downstream tasks. In their empirical experiments, the researchers evaluated GC ViT on CV tasks such as image classification, object detection, instance segmentation, and semantic segmentation. In summary, the proposed architecture efficiently captures global context and achieves SOTA performance on computer vision tasks. Although GC ViT does not increase computational cost, training remains somewhat expensive, as with any transformer architecture. The researchers add that strategies such as reduced precision or quantization could make GC ViT training more efficient. The GC ViT code can also be accessed on the project's GitHub page.
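For readers who want a concrete picture of this stage-level flow, the following heavily simplified PyTorch sketch alternates a "local" attention block with a "global" one that also attends to tokens from a toy global token generator, then applies average pooling and a linear head. All module names, dimensions, and attention details (the real model restricts local attention to windows) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GlobalTokenGenerator(nn.Module):
    """Toy stand-in: summarizes the full token sequence into a few global context tokens."""
    def __init__(self, dim, num_global_tokens=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_global_tokens)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                                  # tokens: (B, N, C)
        g = self.pool(tokens.transpose(1, 2)).transpose(1, 2)   # (B, G, C)
        return self.proj(g)

class AttentionBlock(nn.Module):
    """Pre-norm multi-head attention with a residual connection."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q_tokens, kv_tokens):
        q = self.norm(q_tokens)
        out, _ = self.attn(q, kv_tokens, kv_tokens)
        return q_tokens + out

class ToyGCViTStage(nn.Module):
    """Alternates a 'local' block with a block that also sees global tokens."""
    def __init__(self, dim=64, depth=2):
        super().__init__()
        self.global_gen = GlobalTokenGenerator(dim)
        self.local_blocks = nn.ModuleList([AttentionBlock(dim) for _ in range(depth)])
        self.global_blocks = nn.ModuleList([AttentionBlock(dim) for _ in range(depth)])

    def forward(self, tokens):                                   # (B, N, C)
        g = self.global_gen(tokens)                              # (B, G, C) global context tokens
        for local_blk, global_blk in zip(self.local_blocks, self.global_blocks):
            # Simplified "local" step (the real model restricts this to windows).
            tokens = local_blk(tokens, tokens)
            # "Global" step: queries can also attend to the global context tokens.
            tokens = global_blk(tokens, torch.cat([tokens, g], dim=1))
        return tokens

class ClassifierHead(nn.Module):
    """Average pooling over tokens followed by a linear classifier."""
    def __init__(self, dim=64, num_classes=1000):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        return self.fc(tokens.mean(dim=1))

tokens = torch.randn(2, 196, 64)   # e.g. 14x14 = 196 patch tokens, 64 channels
logits = ClassifierHead()(ToyGCViTStage()(tokens))
print(logits.shape)                # torch.Size([2, 1000])
```

The point of the sketch is the data flow: global tokens are computed once per stage and reused by every global attention block, which is how global context can be injected without paying full quadratic attention over the entire high-resolution feature map.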

This article is written as a summary by Marktechpost Staff based on the research paper 'Global Context Vision Transformers'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub.

