1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10118-10129, 2021.
Abstract
Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.
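The abstract's key idea can be illustrated with a minimal sketch: during the compression phase, the second-moment (variance) estimate is frozen at its warmup value and used as a fixed preconditioner, so only the momentum needs to be communicated, in 1-bit form with error feedback. The function and variable names below are illustrative, not the paper's actual implementation, and the communication step is elided.

```python
import numpy as np

def one_bit_compress(x):
    """1-bit compression: keep only the sign of each entry, scaled by the
    mean magnitude so the compressed tensor preserves overall scale."""
    scale = np.abs(x).mean()
    return scale * np.sign(x)

def adam_step_1bit(param, grad, m, v_frozen, error, lr=1e-3,
                   beta1=0.9, eps=1e-8):
    """One compression-phase step of a 1-bit-Adam-style update (sketch).

    `v_frozen` is the variance estimate frozen at the end of the warmup
    phase. Only the momentum would be communicated between workers, in
    1-bit form; `error` accumulates the compression residual so it is
    re-injected at the next step (error compensation).
    """
    m = beta1 * m + (1 - beta1) * grad         # local momentum update
    compressed = one_bit_compress(m + error)   # what goes over the wire
    error = (m + error) - compressed           # residual carried forward
    # Fixed preconditioner: sqrt of the frozen variance, as in plain Adam
    param = param - lr * compressed / (np.sqrt(v_frozen) + eps)
    return param, m, error
```

Because the residual is added back before the next compression, the quantization error does not accumulate over steps, which is what lets sign-based 1-bit compression retain full-precision convergence behavior.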
Cite this Paper
BibTeX
@InProceedings{pmlr-v139-tang21a,
  title     = {1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed},
  author    = {Tang, Hanlin and Gan, Shaoduo and Awan, Ammar Ahmad and Rajbhandari, Samyam and Li, Conglong and Lian, Xiangru and Liu, Ji and Zhang, Ce and He, Yuxiong},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {10118--10129},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {https://pmlr.com.cn/v139/tang21a/tang21a.pdf},
  url       = {https://pmlr.com.cn/v139/tang21a.html},
  abstract  = {Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.}
}
Endnote
%0 Conference Paper %T 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed %A Hanlin Tang %A Shaoduo Gan %A Ammar Ahmad Awan %A Samyam Rajbhandari %A Conglong Li %A Xiangru Lian %A Ji Liu %A Ce Zhang %A Yuxiong He %B Proceedings of the 38th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2021 %E Marina Meila %E Tong Zhang %F pmlr-v139-tang21a %I PMLR %P 10118--10129 %U https://pmlr.com.cn/v139/tang21a.html %V 139 %X Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5x, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). We performed experiments on up to 256 GPUs and show that 1-bit Adam enables up to 3.3x higher throughput for BERT-Large pre-training and up to 2.9x higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for 1-bit Adam.
APA
Tang, H., Gan, S., Awan, A. A., Rajbhandari, S., Li, C., Lian, X., Liu, J., Zhang, C. & He, Y. (2021). 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10118-10129. Available from https://pmlr.com.cn/v139/tang21a.html.