Google Translate 600B transformer
Today in gargantuan language models: Google’s new state-of-the-art model for translating from 100 languages to English has 600 billion parameters. Compare this to OpenAI’s GPT-3 at 175 billion parameters from June (see DT #42) and Microsoft’s Turing-NLG at 17 billion parameters from February (DT #33). Google’s 600-billion-parameter Transformer took four days to train on 2048 (!) TPU v3 accelerators, which is actually remarkably fast for a model of that size. This efficient training process is the main focus of the paper describing the model: Lepikhin et al. (2020) introduce GShard, “an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code.”
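The core idea is that you annotate a few tensors with how they should be split across devices, and the compiler figures out the rest of the partitioning and the communication between chips. GShard itself is built on TensorFlow and the XLA compiler, but the same annotation-style approach now also shows up in JAX. Here’s a minimal sketch of what that looks like in JAX (this is not GShard’s actual API, just an illustration of the pattern; the mesh shape, layer sizes, and function names are made up for the example):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a logical device mesh. On a laptop this is a 1x1 mesh;
# on a TPU pod the same code would span thousands of cores.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Toy "model": one big weight matrix, split column-wise along the
# "model" axis of the mesh via a sharding annotation.
w = jnp.zeros((1024, 4096))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    # Annotate the activation layout; the compiler inserts whatever
    # cross-device communication is needed to honor the constraint.
    x = jax.lax.with_sharding_constraint(x, NamedSharding(mesh, P("data", None)))
    return jnp.dot(x, w)

x = jnp.ones((8, 1024))
y = forward(x, w)
print(y.shape)  # (8, 4096)
```

The appeal is that the model code barely changes: you write the math once, sprinkle in a few layout annotations, and the same program scales from one chip to a pod.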