Meta AI and University of Texas at Austin Researchers Open-Source Three New ML Models for Audiovisual Understanding of Human Speech and Sounds in Video Designed for AR and VR

Acoustics strongly influence how we perceive a moment. As society moves toward mixed and virtual reality, researchers are working to produce high-quality sound that accurately reflects a person’s environment. How we perceive audio depends on the geometry of the physical space, the materials and surfaces nearby, and the distance from the sound source, so AI models must be able to understand a person’s physical environment from both how it looks and how it sounds. Researchers from the University of Texas at Austin and Meta’s Reality Labs have open-sourced three new models for audiovisual understanding of human speech and sounds in video, which they expect to bring that reality closer. Each model addresses a different audiovisual task.

The first, a Visual Acoustic Matching model called AViTAR, takes an audio clip and an image of a target environment as input and transforms the clip to sound as if it had been recorded in that environment. It is a cross-modal transformer whose input consists of both images and audio, which lets it reason across modalities and produce realistic audio output that matches the visual input. Its self-supervised training objective learns acoustic matching from in-the-wild web videos, even though those videos are unlabeled and contain no acoustically mismatched audio. The transformer-based model was trained on two datasets.
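To make the idea of cross-modal reasoning more concrete, here is a minimal NumPy sketch of a single cross-attention step in which audio tokens attend to image tokens. It is an illustration of the general mechanism only, not AViTAR's actual architecture: the function name `cross_modal_attention`, the token counts, the feature sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_tokens, image_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention: audio queries, image keys and values.

    Hypothetical helper for illustration; not taken from the AViTAR code.
    """
    q = audio_tokens @ w_q                    # (num_audio, d_model)
    k = image_tokens @ w_k                    # (num_image, d_model)
    v = image_tokens @ w_v                    # (num_image, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_audio, num_image)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)

# Assumed shapes: 50 audio frames with 64 features each,
# and a 7x7 grid of image patches with 128 features each.
audio_tokens = rng.normal(size=(50, 64))
image_tokens = rng.normal(size=(49, 128))

d_model = 32  # assumed shared embedding size
w_q = rng.normal(size=(64, d_model)) / np.sqrt(64)
w_k = rng.normal(size=(128, d_model)) / np.sqrt(128)
w_v = rng.normal(size=(128, d_model)) / np.sqrt(128)

fused, weights = cross_modal_attention(audio_tokens, image_tokens, w_q, w_k, w_v)
print(fused.shape)    # (50, 32)
print(weights.shape)  # (50, 49)
```

Each audio frame ends up with a weighted summary of the image patches, which is one way a model could let visual information about a room inform how it reshapes the sound. A trained model would learn the projection weights rather than draw them at random.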

In some situations, reverberation must be removed to improve hearing and comprehension. The second model, Visually-Informed Dereverberation (VIDA), does the reverse of the first: it removes reverberation using the recorded sounds together with the visual cues of a space. The method was evaluated on speech enhancement, speech recognition, and speaker identification using both simulated and real imagery. VIDA is reported to deliver state-of-the-art performance and to be a significant improvement over conventional audio-only techniques. The team sees this as an important step toward realism in mixed and virtual reality.

The third model, VisualVoice, uses both visual and audio cues to separate speech from background sounds and other voices. It was created to support machine understanding tasks, such as improving subtitles or socializing at a party in virtual reality. The model is designed to look and listen as humans do in order to understand speech in complex environments.
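For readers unfamiliar with how speech can be pulled out of a mixture, the sketch below shows frequency-domain masking, a common formulation in the source-separation literature. It is not VisualVoice's method: the two sine waves stand in for a voice and a background sound, and the mask is an "oracle" computed from the known sources, whereas a real system would have to estimate it from the mixture and, in the audiovisual case, from visual cues. The names `ratio_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

sample_rate = 8000                   # samples per second (assumed)
n = sample_rate                      # one second of audio
t = np.arange(n) / sample_rate

# Toy stand-ins for a target voice and a background sound.
voice = np.sin(2 * np.pi * 220 * t)
background = 0.5 * np.sin(2 * np.pi * 1000 * t)
mixture = voice + background

def ratio_mask(target, interference, eps=1e-8):
    """Oracle ratio mask: per-frequency share of magnitude owned by the target."""
    t_mag = np.abs(np.fft.rfft(target))
    i_mag = np.abs(np.fft.rfft(interference))
    return t_mag / (t_mag + i_mag + eps)

def apply_mask(mixed, mask):
    """Scale the mixture spectrum by the mask and return a time-domain signal."""
    spectrum = np.fft.rfft(mixed)
    return np.fft.irfft(mask * spectrum, n=len(mixed))

mask = ratio_mask(voice, background)
estimate = apply_mask(mixture, mask)

# Largest sample-wise difference between the estimate and the true voice.
error = np.max(np.abs(estimate - voice))
print(error < 1e-6)
```

The toy case is easy because the two sources occupy different frequencies. Real speech overlaps heavily with other voices in frequency, which is where additional cues such as a speaker's face and lip movements can help a model decide which parts of the mixture belong to whom.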

These three models are the focus of Meta AI’s audiovisual experience research. Today’s AI models already perform well at understanding images and video. Extending such feats to AR and VR, however, requires multimodal models that can process audio, video, and text simultaneously to gain a much richer understanding of the world. The researchers hope soon to offer users a multimodal, immersive experience in which they can relive a memory in the virtual world, with the visuals and sound reproduced as they were originally experienced. Future work on multimodal AI will use video and other dynamics to capture the acoustic properties of a space. The team is excited to share its findings with the open source community.

Visual Acoustic Matching: Research paper | Project page

Visually-Informed Dereverberation: Research paper | Project page

VisualVoice: Research paper | Project page
