Mike Young @mikeyoung44

Software Engineers Can Learn From Audio-Visual Representation Models

Computers can learn to connect images and sounds with a self-supervised technique that separates "chirp" (environmental sounds) from "chat" (speech), enabling a richer understanding of the multimodal world.

This is a Plain English Papers summary of a research paper called Separating the Chirp from the Chat: Self-supervised Visual Grounding of Sound and Language. If you like this kind of analysis, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.

Overview

This paper introduces a self-supervised approach for learning visual representations from audio-visual correspondence.
The method aims to separate "chirp" (environmental sounds) from "chat" (speech) and learn visual representations that are grounded in both types of audio.
The learned representations can be...
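The core idea — aligning visual features with two distinct audio streams — is a form of contrastive audio-visual correspondence learning. As a rough illustration only (this is not the paper's code, and the embedding setup here is a stand-in), the sketch below uses an InfoNCE-style loss to pull matched image/audio pairs together while treating speech and environmental-sound embeddings as separate branches projected into a shared space:

```python
# Hypothetical sketch (not the paper's implementation): contrastive
# audio-visual alignment with two separate audio branches, one for
# speech ("chat") and one for environmental sound ("chirp").
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(visual, audio, temperature=0.07):
    """InfoNCE loss: matched (visual, audio) pairs sit on the diagonal."""
    logits = visual @ audio.T / temperature           # (B, B) pairwise similarity
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # maximize matched-pair prob

batch, dim = 8, 32
visual = l2_normalize(rng.normal(size=(batch, dim)))

# Stand-ins for the two audio branches; in the real method these would
# be outputs of learned encoders, not noisy copies of the visual features.
speech_emb = l2_normalize(visual + 0.1 * rng.normal(size=(batch, dim)))
env_emb = l2_normalize(visual + 0.1 * rng.normal(size=(batch, dim)))

loss = info_nce(visual, speech_emb) + info_nce(visual, env_emb)
print(loss)
```

The key property is that the loss drops as matched pairs become more similar than mismatched ones, so a model trained this way learns visual features grounded in each audio stream without any labels.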