By sharding the training data across multiple GPUs and training mini-batches in parallel, we aimed to significantly reduce training time. We present empirical results showing the reduction in training time we observed as we scaled from 1 to N GPUs, and share some future directions we are considering in our continued effort to speed up model training.
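The post doesn't reproduce code here, but as a rough illustration of the data-parallel pattern it describes, a minimal sketch using PyTorch `DistributedDataParallel` might look like the following. The framework, model, and dataset are illustrative assumptions; the post does not specify the exact stack, and this is not the authors' implementation.

```python
# Sketch only: data-parallel training where each GPU processes its own shard
# of the data and gradients are averaged across processes. Model and dataset
# are toy stand-ins for the real training job.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy data and model; in practice these come from the real pipeline.
    dataset = TensorDataset(torch.randn(10_000, 128),
                            torch.randint(0, 2, (10_000,)))
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # DistributedSampler shards the dataset so each GPU sees a distinct
    # slice of every epoch; DDP averages gradients across GPUs.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shard assignment each epoch
        for features, labels in loader:
            features = features.cuda(local_rank)
            labels = labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()   # gradients are all-reduced here by DDP
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=N train.py`, each of the N GPUs runs one copy of this script over its own shard of the data, which is what lets multiple mini-batches train in parallel.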
Read more: https://multithreaded.stitchfix.com/blog/2023/06/08/distributed-model-training/