Microsoft's DeepSpeed update
Microsoft has updated DeepSpeed, its open-source library for efficiently training massive ML models (see DT #34, #40), with four big improvements: 3D parallelism for training trillion-parameter models; ZeRO-Offload for training up to 10x bigger models on a single GPU; Sparse Attention kernels for handling 10x longer input sequences in Transformers; and 1-bit Adam for reducing communication overhead in multi-GPU training. My work focuses on tiny models rather than huge ones, so I haven’t had a chance to try DeepSpeed myself, but if any of you have, I’d love to hear about your experience!
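
For a feel of what using it looks like: I haven’t run this myself, but going by DeepSpeed’s docs, enabling something like ZeRO-Offload is mostly a matter of passing a config dict to `deepspeed.initialize`. Treat the sketch below as just that, a sketch — the toy model and the specific config keys (like `offload_optimizer`) are my assumptions and may differ between DeepSpeed versions:

```python
# Rough, untested sketch of wrapping a PyTorch model with DeepSpeed so that
# optimizer state is offloaded to CPU memory (ZeRO-Offload). Config keys are
# based on DeepSpeed's documentation and may vary by version.
import torch
import deepspeed

# Stand-in for a real model; in practice this would be something much larger.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        # ZeRO-Offload: keep optimizer state in CPU RAM instead of on the GPU.
        "offload_optimizer": {"device": "cpu"},
    },
}

# deepspeed.initialize returns an engine that handles the forward/backward/step
# loop, with partitioning and offloading applied behind the scenes.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

The nice part, at least on paper, is that the training loop itself barely changes: you call `model_engine(batch)`, `model_engine.backward(loss)` and `model_engine.step()`, and the memory savings come from the config rather than from rewriting your model code.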