is DeepMind’s new general-purpose architecture for processing a wide variety of input modalities — like images, videos, 3D point clouds, and sounds — into output vectors.
(without the IO) scaled Transformers’ concept of attention to much larger input sizes, “without introducing domain-specific assumptions,” by first encoding the inputs to a small fixed-size latent array and attending over that.
Now, Perceiver IO (arXiv
) extends this by also applying attention to the decoding side, so that one input can produce multiple outputs and both the inputs and outputs can be a mix of modalities.
“This opens the door for all sorts of applications, like understanding the meaning of a text from each of its characters, tracking the movement of all points in an image, processing the sound, images, and labels that make up a video, and even playing games, all while using a single architecture that’s simpler than the alternatives.” With OpenAI releasing DALL·E and CLIP
and Stanford HAI launching the Foundation Models research center
, both also this year, these large multimodal networks have become a central focus of leading AI labs.