DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:18332-18346, 2022.
Abstract
As the training of giant dense models hits the limits of the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. Their training cost savings, previously demonstrated for encoder-decoder models (prior work), extend to a 5x saving for auto-regressive language models (this work). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution that includes novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency to serve massive MoE models, with inference up to 4.5x faster and 9x cheaper than quality-equivalent dense models. We hope our innovations and systems open a promising path to a new direction in the large-model landscape: a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
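The sparse routing that makes MoE cheaper than a quality-equivalent dense model can be illustrated with a minimal top-1 gated MoE layer. The sketch below is a generic PyTorch illustration under assumed names (TopOneMoE and its parameters are hypothetical); it is not DeepSpeed-MoE's actual implementation. Each token is routed to a single expert, so per-token compute stays close to one dense feed-forward layer while total parameter count grows with the number of experts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneMoE(nn.Module):
    """Top-1 gated MoE feed-forward layer (illustrative sketch, not DeepSpeed-MoE)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Router: one score per expert for each token.
        self.gate = nn.Linear(d_model, num_experts)
        # Independent feed-forward "experts"; parameters grow with num_experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Each token activates exactly one expert,
        # so per-token FLOPs stay close to a single dense FFN.
        probs = F.softmax(self.gate(x), dim=-1)      # (num_tokens, num_experts)
        gate_val, expert_idx = probs.max(dim=-1)     # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                   # tokens routed to expert e
            if mask.any():
                # Scale each expert's output by its gate probability.
                out[mask] = gate_val[mask].unsqueeze(1) * expert(x[mask])
        return out

moe = TopOneMoE(d_model=512, d_hidden=2048, num_experts=8)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512])

A production system replaces the readability-oriented Python loop above with batched token dispatch and expert parallelism (experts sharded across devices with all-to-all communication). The training and inference savings the abstract cites stem from this sparsity: total parameters scale with the number of experts while compute per token remains roughly constant.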
Cite this Paper
BibTeX
@InProceedings{pmlr-v162-rajbhandari22a,
  title     = {{D}eep{S}peed-{M}o{E}: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation {AI} Scale},
  author    = {Rajbhandari, Samyam and Li, Conglong and Yao, Zhewei and Zhang, Minjia and Aminabadi, Reza Yazdani and Awan, Ammar Ahmad and Rasley, Jeff and He, Yuxiong},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {18332--18346},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://pmlr.com.cn/v162/rajbhandari22a/rajbhandari22a.pdf},
  url       = {https://pmlr.com.cn/v162/rajbhandari22a.html},
  abstract  = {As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. Their training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.}
}
Endnote
%0 Conference Paper
%T DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
%A Samyam Rajbhandari
%A Conglong Li
%A Zhewei Yao
%A Minjia Zhang
%A Reza Yazdani Aminabadi
%A Ammar Ahmad Awan
%A Jeff Rasley
%A Yuxiong He
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-rajbhandari22a
%I PMLR
%P 18332--18346
%U https://pmlr.com.cn/v162/rajbhandari22a.html
%V 162
%X As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. Their training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
APA
Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R.Y., Awan, A.A., Rasley, J. & He, Y. (2022). DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:18332-18346. Available from https://pmlr.com.cn/v162/rajbhandari22a.html.