Wikipedia-based image-text dataset
Big AI labs’ focus on multimodal neural networks (architectures that combine different types of input and output data, like images and text) continues. After OpenAI’s DALL·E and CLIP, Stanford HAI’s Foundation Models, and DeepMind’s Perceiver IO, Google AI has now announced WIT: a Wikipedia-based image-text dataset. Bridging the gap between human-annotated image captions (too labor-intensive to scale) and broad web-scraped ones (too messy and too English-centric), Srinivasan et al. (2021) created WIT by “extracting multiple different text selections associated with an image from Wikipedia articles and Wikimedia image links.” The result is a Creative Commons-licensed dataset of 37.5 million image-text examples, covering 11.5 million unique images across 108 languages. Until now, these big multimodal models have mostly been trained by large private labs on proprietary datasets; an open dataset of this scale should lower the barrier to entry for university labs that want to research similar models.
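For readers who want to poke at the data themselves, here is a minimal sketch of loading one released TSV shard with pandas and filtering it down to English reference captions. The shard filename is a placeholder and the column names are taken from the dataset’s published schema, so verify both against the WIT GitHub repo (google-research-datasets/wit) before building on this.

```python
# Minimal sketch: explore one downloaded WIT shard with pandas.
# Assumptions: you have fetched a gzipped TSV training shard from the WIT
# release to a local path, and the column names below match the documented
# schema (language, image_url, caption_reference_description, ...).
import pandas as pd

SHARD_PATH = "wit_v1.train.all-00000-of-00010.tsv.gz"  # hypothetical local filename

# Each row pairs one image URL with the texts extracted from the Wikipedia
# page that embeds it (reference caption, alt text, surrounding context, ...).
df = pd.read_csv(SHARD_PATH, sep="\t", compression="gzip")

# Keep rows that have a human-written reference caption (the cleanest text
# field) and restrict to one language for a monolingual experiment.
en = df[(df["language"] == "en") & df["caption_reference_description"].notna()]

print(f"{len(en):,} English image-caption pairs in this shard")
print(en[["image_url", "caption_reference_description"]].head())
```

The same filtering step generalizes to any of the 108 languages by swapping the `language` value, which is part of what makes WIT attractive for multilingual image-text research.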