LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Xiaonan Nie1 Qibin Liu1 Fangcheng Fu1 Shenhan Zhu1 Xupeng Miao2
Xiaoyang Li3 Yang Zhang3 Shouda Liu3 Bin Cui1

arXiv:2411.08446v1 [cs.DC] 13 Nov 2024

1Peking University 2Purdue University 3ByteDance
1{xiaonan.nie, 2101212782, ccchengff, shenhan.zhu, bin.cui}@.cn
2xupeng@ 3{lixiaoyang.x, zhangyang.elfin, liushouda}@
Abstract
Larger transformer models always perform better on various tasks but require more cost to scale up the model size. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widely adopted, which consists of a gate network and a series of experts and keeps the training cost constant by routing the input data to a fixed number of experts instead of all. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, and thus input data requires additional all-to-all communications to access the target experts and conduct corresponding computations. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the efficiency and scalability of training MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which utilizes cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our methods, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving a speedup of 1.28×-2.2×.
1 Introduction
In recent years, large-scale pre-trained models have significantly advanced the performance of deep learning across various complex tasks, including computer vision [8, 20], natural language processing [3, 7, 28], and multi-modal learning [19]. Commonly referred to as foundation models, these pre-trained models are primarily built on Transformer architectures [34] and undergo extensive pre-training on large datasets, utilizing substantial GPU resources. OpenAI has validated the scaling law for large language models [15] and suggests that increasing the model's parameter size, the volume of training data, and the duration of training can significantly enhance the model's performance. However, this approach results in a considerable rise in training costs, making the development of foundation models extremely expensive.
Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, and Bin Cui are with the School of Computer Science and Key Lab of High Confidence Software Technologies (MOE), Peking University. Bin Cui is also with the Institute of Computational Social Science, Peking University (Qingdao).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
To reduce the high computational costs, the sparse mixture-of-experts (MoE) architecture is often adopted, which comprises a sparse gate network and a series of expert networks. This architecture routes input data to only a subset of experts, resulting in sparse activation of the experts and thereby reducing the model's computational FLOPs (floating point operations) as well as training costs. Prominent models such as Google's Switch-Transformer [9], ST-MoE [41], Meta's Hash Layer [31], and Mistral AI's Mixtral models [14] have successfully implemented this design, demonstrating improvements in both performance and efficiency with MoE models.
Meanwhile, effectively scaling the training of MoE models across hundreds or even thousands of GPUs remains a significant challenge. Researchers from Google have proposed the expert parallelism approach [17], which replicates the gating network on each GPU and distributes different experts across multiple GPUs for parallel processing. Specifically, each input token is initially processed by the gating network to select the appropriate expert, after which it is routed to the designated experts via peer-to-peer (P2P) network communication. Once the designated experts complete their computation, the token is returned to the original GPU for further processing through an additional P2P communication. Since each GPU typically needs to exchange data with many other GPUs, these P2P transmissions result in an all-to-all communication pattern. Moreover, because the computation of the expert network relies on the outcomes of these communications, the communications cannot be effectively overlapped with ongoing computations. This dependency creates a significant performance bottleneck in model training across most commonly used GPU clusters. We conducted experiments on three widely-used MoE models, including RoBERTa-MoE, GPT-MoE, and Swin-MoE, on four A100 servers, each with a cross-machine bandwidth of 200 Gb/s. The results, as shown in Figure 3, reveal that the time cost of all-to-all communication constitutes an average of 45% and can reach up to 67% of the total model training time.
Existing methods to improve distributed MoE training on bandwidth-limited clusters tackle communication challenges in various ways. TA-MoE [4] reduces cross-machine communication by adjusting the gating network to favor experts on the same server, while Pre-gated MoE [13] reduces the dependency between communication and computation through a pre-gating mechanism that plans token routing in advance. However, both approaches require modifications to the gating mechanism and model structure, limiting their universal applicability. DeepSpeed-MoE [29] introduces PR-MoE, which selects one expert plus a shared expert, halving the all-to-all communication load. SCoMoE [40] organizes all-to-all communication by structuring data transfers along different dimensions and controlling data volumes across network levels, and also clusters tokens to improve routing. However, none of these works consider reducing the all-to-all communication volume in MoE training by compressing the forward activations. Therefore, they can be integrated with our method for further improvement.
In this paper, we present LSH-MoE, a communication-efficient MoE training framework that leverages locality-sensitive hashing to group similar tokens. Our key contributions are as follows:
• We begin by identifying key challenges in scaling MoE training in existing systems, noting that all-to-all communication constitutes an average of 45% of the total training time. Additionally, we investigate the potential of using token similarity to facilitate data compression to reduce communication costs.
• We propose an efficient LSH-based compression technique that employs cross-polytope hashing for rapid clustering. This approach transmits only the clustering centroids, significantly reducing communication costs. To further enhance accuracy, we implement a residual-based error compensation scheme to mitigate the negative effects of compression.
• Through extensive experiments with language models (RoBERTa-MoE, GPT-MoE, and T5-MoE) and vision models (Swin-MoE), across both pre-training and fine-tuning tasks, we demonstrate that our method maintains model quality while achieving a speedup of 1.28×-2.2× in end-to-end training time.
2 Background

2.1 Mixture-of-Experts Architecture
To enhance the training efficiency of Transformer models, William et al. (2022) [9] introduced an innovative paradigm, the sparse mixture-of-experts (MoE) architecture, illustrated in Figure 1.
Figure 1: Mixture-of-Experts on a single GPU.

Figure 2: Training Mixture-of-Experts on multiple GPUs as expert parallelism.
This architecture effectively balances parameter capacity and training costs, and comprises two key components: an expert network (E) and a sparse gate network (G). It is evident that MoE models, with an equal number of active parameters per input, can significantly surpass the performance of dense models. This breakthrough has also catalyzed further research and their application across various industries, as highlighted by numerous subsequent studies [5, 14, 22, 23, 25, 30, 39].
The expert network E is composed of multiple specialized and separate networks, commonly referred to as experts, denoted as $\{E_i\}_{i=1}^{N}$, where N represents the number of experts. Additionally, $E_i(x)$ denotes the output produced when the input x is processed by the i-th expert. Each expert is trained to excel in a specific sub-task, such as in multi-task learning, or to handle specific segments of data, as seen in language modeling and multi-modal learning, thereby increasing the overall model capacity. In foundational models, the MoE layer often serves as a substitute for the traditional feed-forward network (FFN) layer. Within each MoE layer, each FFN function works as an individual expert, significantly enhancing the model's capability to process diverse and complex data inputs.
The gating network G plays a crucial role in the sparse MoE architecture. For example, in a K-way gated MoE system, the gating network outputs a set of integers as in Equation 1 to determine which experts should be activated. This decision is based on the characteristics of the input itself, allowing for a dynamic and efficient allocation of computational resources. By only processing each input token with a selected subset of the expert network, the MoE model achieves computation sparsity, effectively decoupling parameter capacity from training costs.

$$G: \mathbb{R}^M \rightarrow [1, N]^K \quad (1)$$
Through the integration of multiple specialized experts, as described by Equation 2, the sparse MoE model is capable of delivering more accurate and efficient predictions as f(x). This is achieved by leveraging the specialized knowledge embedded within each expert, combined with the strategic input allocation managed by the gating network.

$$f(x) = \sum_{i \in G(x)} E_i(x) \quad (2)$$
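To make Equations 1 and 2 concrete, the following is a minimal single-GPU sketch of a sparse MoE layer in PyTorch; the top-k softmax gate and the two-layer FFN experts are illustrative assumptions, not the exact design used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sketch: a gate G selects K of N experts per token (Eq. 1) and the
    layer output sums the selected experts' outputs E_i(x) (Eq. 2)."""

    def __init__(self, d_model: int, d_ffn: int, num_experts: int, k: int = 1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)        # sparse gate network G
        self.experts = nn.ModuleList([                     # expert networks E_1 .. E_N (FFNs)
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        topk_idx = scores.topk(self.k, dim=-1).indices     # G: R^M -> [1, N]^K  (Eq. 1)
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                routed = topk_idx[:, slot] == expert_id    # tokens routed to this expert
                if routed.any():
                    out[routed] += expert(x[routed])       # f(x) = sum_{i in G(x)} E_i(x)  (Eq. 2)
        return out

# Example: 16 experts with top-1 routing, matching several models evaluated later.
layer = SparseMoELayer(d_model=768, d_ffn=3072, num_experts=16, k=1)
y = layer(torch.randn(32, 768))
```

In practice many implementations additionally weight each expert output by its gate score; the unweighted sum above follows Equation 2 as stated.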
WhileMoE’sprimaryadvantageisdecouplingparametercapacityfromnetworkcost,akeychallengeliesinlearningthegatingparameterseffectively,astheoutput’ssparsitymakesitnon-differentiable.Consequently,muchoftheresearchintheMoEfieldhascenteredondevelopingmethodsforlearninggatingfunctions.
Thesemethodsfallintothreemaincategories,asoutlinedin[6]:routingvia
learnableweighting[9,
24,
30],deterministichashrouting[31],andreinforcementlearning-based
routing[2,
32,
33]
.TheseapproachesprimarilydifferinthedesignofthegatingnetworkGratherthantheexpertnetworkE,andthereforeallencountersimilarscalingchallenges.
2.2 Challenges of Scaling MoE Model Training
While MoE models were initially developed to facilitate efficient scaling during training, deploying these large-scale models in practical GPU-intensive environments poses significant challenges in distributed computing.
Figure 3: Proportion of all-to-all communication time relative to total training duration across different configurations for RoBERTa-MoE, GPT-MoE, and Swin-MoE-L: (a) 16 GPUs; (b) 32 GPUs, scaling the number of training servers (Figure 3(b)); (c) 16 GPUs with double the number of experts, scaling the parameter size of models (Figure 3(c)).
Specifically, the MoE layer harbors a considerably higher number of parameters and requires additional memory, yet it maintains almost the same computational demands as the dense layer. This leads to a uniquely low compute density, defined as the ratio of the layer's FLOPs (floating point operations) to its number of parameters. Therefore, traditional parallelism methods such as tensor parallelism and pipeline parallelism are insufficient for achieving effective parallelism in the scenarios of MoE training.
To improve the efficiency and scalability of training large-scale MoE models, expert parallelism [17] has been introduced as a specialized model parallelism strategy. This approach distributes experts within an MoE layer across multiple GPUs, while leveraging data parallelism for replicating non-MoE layers, thus efficiently managing the training workload of MoE models. The workflow of distributed training for an MoE layer is depicted in Figure 2. Once the target expert for each token is determined, an all-to-all communication process is triggered to distribute tokens to their corresponding target experts for computations, denoted as $E_i(x)$. Subsequently, another round of all-to-all communication is executed to gather the outputs from all experts, which produces the MoE layer's output (represented as f(x), Equation 2). Subsequent operations involve executing the data-parallel non-MoE layers.
We first profiled the training process of three popular MoE models employing expert parallelism (detailed in Table 1) on a cluster comprised of four A100 machines, each equipped with an interconnect RDMA bandwidth of 200 Gb/s. The proportion of all-to-all communication time relative to the total training duration is illustrated in Figure 3(a). We then double the number of machines and the number of experts to increase the model scale. The results are shown in Figure 3(b) and 3(c), respectively. Our findings reveal that all-to-all communication accounted for a substantial portion of the total time: approximately 30% in GPT-MoE (15B), 40% in RoBERTa-MoE, and 70% in Swin-MoE-L. This overhead remains nearly constant in larger models and at larger machine scales. These results highlight a significant bottleneck that hampers the scalability of the training process. Consequently, the duration of all-to-all communication substantially constrains training with expert parallelism, leading to reduced overall throughput and limiting the potential to scale up the number of experts effectively.
2.3 Locality-Sensitive Hashing Algorithms
Locality-Sensitive Hashing (LSH) is a probabilistic method primarily used for approximate nearest neighbor search in high-dimensional spaces, which reduces the dimensionality of data by mapping similar data to the same "buckets" with high probability using hash functions. This approach offers a substantial reduction in computational complexity, which is particularly beneficial for large-scale data applications. The key operations in LSH include:
Mapping Data into Buckets: The core of LSH is a family of hash functions that maximize the probability of nearby points in the original space staying close in the hashed space, while distant points are likely to end up in different buckets. Each hash function h is characterized by the property $P[h(x) = h(y)] = 1 - d(x, y)/D$, where d(x, y) is the distance between points x and y, and D denotes the diameter of the space. To map similar data into the same bucket, multiple hash functions from this family are selected based on the specific attributes of the data (e.g., Euclidean distance, cosine similarity) and the desired granularity of the buckets. Data points are then hashed by these functions, and each point is assigned to buckets according to its hash values, effectively categorizing similar items together for clustering.

Figure 5: Schematic of MoE training with Locality-Sensitive Hashing (LSH-MoE).
Calculating Cluster Centroids: By grouping data points into buckets as determined by their hash values, data points are effectively clustered. Each bucket represents a cluster of data points, and the centroid of each cluster is then calculated as the mean of all points within that cluster, formulated as $C_j = \frac{1}{n_j} \sum_{x_i \in \text{bucket}_j} x_i$, where $C_j$ is the centroid of the j-th bucket, $n_j$ is the number of points in the j-th bucket, and $x_i$ are the data points in the bucket.
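As a small self-contained illustration of these two operations, the sketch below buckets points with sign-random-projection hashing (a simple LSH family for cosine similarity, used here only for brevity) and then averages each bucket to obtain the centroids $C_j$; the paper's framework uses cross-polytope hashing instead (Section 3.2).

```python
import torch

def lsh_bucket_and_centroids(x: torch.Tensor, num_hashes: int = 8, seed: int = 0):
    """Sketch of the two core LSH operations: (1) map points into buckets with a
    locality-sensitive hash, (2) average each bucket to obtain its centroid C_j."""
    gen = torch.Generator().manual_seed(seed)
    planes = torch.randn(x.shape[1], num_hashes, generator=gen)   # random hyperplanes
    bits = (x @ planes > 0).long()                                # one sign bit per plane
    powers = 2 ** torch.arange(num_hashes)
    bucket_ids = (bits * powers).sum(dim=1)                       # integer bucket id per point

    buckets, inverse = torch.unique(bucket_ids, return_inverse=True)
    counts = torch.zeros(len(buckets)).index_add_(0, inverse, torch.ones(len(x)))
    sums = torch.zeros(len(buckets), x.shape[1]).index_add_(0, inverse, x)
    centroids = sums / counts.unsqueeze(1)                        # C_j = (1/n_j) * sum of bucket j
    return inverse, centroids                                     # inverse maps each point to its bucket

tokens = torch.randn(1024, 768)
bucket_of_token, centroids = lsh_bucket_and_centroids(tokens)
```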
3 Methodology

3.1 The Motivation of Token Similarity
To explore the potential optimization for all-to-all communications in MoE training, we conducted an in-depth analysis of the data involved in these all-to-all communications, identifying a high degree of similarity, termed token similarity. Specifically, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the input tokens of all-to-all communications and observed a distinct clustering phenomenon, as illustrated in Figure 4.
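A minimal sketch of this kind of analysis (the random tensor stands in for activations captured at the all-to-all input; the capture mechanism and plotting details are not specified in the paper):

```python
import torch
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Stand-in for activations captured just before the all-to-all dispatch,
# shape [num_tokens, hidden_size]; in practice these come from a forward hook.
tokens = torch.randn(4096, 768)

proj = PCA(n_components=2).fit_transform(tokens.numpy())  # project tokens to 2-D
plt.scatter(proj[:, 0], proj[:, 1], s=2)
plt.title("PCA of all-to-all input tokens")
plt.show()
```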
Our analysis suggests that the observed similarity among tokens may stem from two primary factors:

• Data-Related Influences: The similarity is partially due to the nature of real-world data, which often adheres to Zipf's Law [18]. This results in a skewed distribution, with certain data elements appearing more frequently than others.
• Model-Structure-Related Influences: The design of the Transformer architecture [34], especially its attention mechanisms, significantly impacts token similarity. In models like BERT [7], attention layers are designed to capture and integrate context information across tokens, thus homogenizing token representations and emphasizing their shared semantic relationships at the sentence level.
Figure 4: Principal Component Analysis (PCA) visualization of input tokens involved in all-to-all communication.
3.2 LSH-MoE
Motivated by the Token Similarity observed in Section 3.1, we introduce LSH-MoE, a novel MoE training framework that integrates locality-sensitive hashing (LSH) for rapid clustering of input tokens. Our method transmits only the clustering centroids, significantly reducing communication volumes. To counteract the negative effects of compression, we also implement a residual-based error compensation scheme.
As depicted in Figure 5, LSH-MoE initially employs (1) an LSH-based clustering method to compress tokens into centroids for subsequent processing, effectively reducing communication overhead. It then sequentially executes (2) all-to-all communication, expert computation, and another (3) all-to-all communication to produce the processed outputs E(centroids). Finally, it introduces (4) a residual-based error compensation method to approximate the expert-processed results E(tokens) by integrating E(centroids) with residuals. Meanwhile, we also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1.
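The following condensed sketch mirrors the four numbered steps of Figure 5; `lsh_cluster`, `all_to_all`, and `expert` are hypothetical callables standing in for the clustering routine, the collective communication, and the remote expert computation.

```python
import torch

def lsh_moe_layer(tokens, lsh_cluster, all_to_all, expert):
    """Sketch of the LSH-MoE forward path:
    (1) cluster tokens with LSH and keep one centroid per bucket,
    (2) all-to-all sends only the centroids to the experts,
    (3) a second all-to-all returns E(centroids),
    (4) residual-based error compensation approximates E(tokens)."""
    # (1) LSH-based clustering: bucket id per token + one centroid per bucket
    bucket_of_token, centroids = lsh_cluster(tokens)
    residuals = tokens - centroids[bucket_of_token]      # kept locally for step (4)

    # (2) + (3) only centroids travel through the two all-to-all exchanges
    expert_centroids = all_to_all(expert(all_to_all(centroids)))

    # (4) restore per-token outputs: E(token) ~ E(centroid) + residual
    return expert_centroids[bucket_of_token] + residuals
```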
The key components of our LSH-MoE framework include an efficient LSH-based clustering algorithm for rapid processing and a residual-based error compensation scheme to minimize quality degradation.
Efficient LSH-based Clustering Algorithm. Since the data to be compressed (the input data for all-to-all communication) is generated dynamically and in real time, pre-compressing it or overlapping compression time with other processing tasks is not feasible. Consequently, selecting an efficient online compression algorithm is crucial. Traditional clustering algorithms, such as K-Means, often encounter computational challenges and efficiency limitations. Locality-sensitive hashing (LSH) addresses these issues by hashing similar data points into the same buckets, enabling faster similarity detection in high-dimensional spaces.
Numerous LSH algorithms have been developed, each employing a unique hashing approach for mapping data onto buckets. We conducted experiments to evaluate several popular hashing algorithms, including cross-polytope hashing and spherical hashing. Based on our evaluations in Section 4.5, we selected cross-polytope hashing as the optimal algorithm for our application. Cross-polytope hashing stands out for its method of mapping input vectors to the nearest vertex on a cross-polytope. This process is facilitated by applying randomly rotated cross-polytopes, which effectively segment the surface of the unit sphere. The algorithm can be mathematically represented as follows:

$$\mathrm{LSH}(x) = \arg\max_{i \in \{\pm 1, \pm 2, \ldots, \pm d\}} |Rx|_i \quad (3)$$

where R is a random rotation matrix, d is the dimensionality of the space, and $|Rx|_i$ denotes the absolute value of the i-th component of the rotated vector Rx.

This formula encapsulates how the input vector x is transformed by the rotation matrix R and then mapped to the nearest vertex of the cross-polytope by selecting the dimension i that maximizes the absolute value of the components of Rx. This method effectively segments the high-dimensional space and enhances the clustering efficiency by rapidly identifying similar data points.
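A minimal sketch of Equation 3, with the random rotation R approximated by a random orthogonal matrix obtained from a QR decomposition (an assumption for illustration) and the signed axis index encoded as a non-negative bucket id:

```python
import torch

def cross_polytope_hash(x: torch.Tensor, rotation: torch.Tensor) -> torch.Tensor:
    """LSH(x) = argmax_{i in {±1,...,±d}} |Rx|_i (Eq. 3): rotate each vector and map
    it to the nearest signed coordinate axis, i.e., a vertex of the cross-polytope."""
    rotated = x @ rotation.T                          # Rx for every row of x
    idx = rotated.abs().argmax(dim=-1)                # coordinate with the largest magnitude
    sign = torch.sign(rotated.gather(1, idx.unsqueeze(1))).squeeze(1)
    # Encode the signed axis ±(idx+1) as a bucket id in [0, 2d).
    return torch.where(sign >= 0, idx, idx + x.shape[1])

d = 768
rotation, _ = torch.linalg.qr(torch.randn(d, d))      # random orthogonal "rotation" matrix R
buckets = cross_polytope_hash(torch.randn(1024, d), rotation)
```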
Residual-based Error Compensation Scheme. In our LSH-MoE framework, we compress the intermediate activation values within the model network. Unlike gradient compression, this process does not tolerate errors well. Therefore, it is essential to minimize compression-induced errors to ensure minimal impact on model performance. To address this, we implement a novel residual-based error compensation strategy, outlined as follows:
1. We first capture the residual for each data point relative to its cluster centroid, defined by the equation:

$$\Delta \mathrm{Cluster}_j^k = X_j^k - \mathrm{cluster}_j, \quad k = 1, 2, \ldots, N_j \quad (4)$$
2. After the expert network computes outputs for the cluster centers, the final step is to restore the processed result for each token by adding back the previously recorded residual, as sketched below:
$$Y_{ij} \leftarrow \{E(\mathrm{cluster}_j) + \Delta \mathrm{Cluster}_j^k \mid k = 1, 2, \ldots, N_j\}. \quad (5)$$
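In code, the two compensation steps wrap around the expert call as in this sketch (the tensor shapes and the `expert_fn` callable are assumptions for illustration):

```python
import torch

def compensate(tokens, bucket_of_token, centroids, expert_fn):
    """Residual-based error compensation: record each token's residual to its cluster
    centroid before communication (Eq. 4), then add it back onto the expert output
    for that centroid to approximate E(token) (Eq. 5)."""
    residuals = tokens - centroids[bucket_of_token]        # ΔCluster_j^k = x - cluster_j  (Eq. 4)
    expert_centroids = expert_fn(centroids)                # E(centroids); sent via all-to-all in practice
    return expert_centroids[bucket_of_token] + residuals   # Y ~ E(cluster_j) + ΔCluster_j^k  (Eq. 5)
```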
This error compensation scheme effectively mitigates potential accuracy loss caused by data compression in all-to-all communication, ensuring the fidelity and robustness of the LSH-MoE framework. The experimental results in Section 4 show that implementing this compensation mechanism enables the model trained with LSH-MoE to achieve an accuracy comparable to that of a model trained without compression.
Table 1: Models for evaluation, where "-" indicates that the values are different across layers.

Model           #Layer   d_model   d_ffn   #Experts   #Params. (MoE)   #Params. (Total)
RoBERTa-MoE     12       768       3072    16         302M             394M
T5-MoE          16       1024      16384   16         8594M            9288M
GPT-MoE (15B)   12       768       3072    512        14507M           14629M
GPT-MoE (52B)   24       1024      4096    512        51539M           51740M
Swin-MoE-L      24       -         -       32         -                946M
This outcome highlights the effectiveness of our proposed error compensation strategy in preserving model performance despite the challenges posed by data compression in all-to-all communication.
3.3 Scalability Analysis of LSH-MoE
To effectively demonstrate the scalability of our approach, particularly in terms of its applicability to both larger models and larger computational clusters, we conducted a theoretical analysis. This analysis primarily focuses on the computation overhead and the communication costs associated with Mixture-of-Experts (MoE), specifically considering all-to-all communication overhead. We derived the ratio of communication time to computation time, highlighting how this ratio evolves as both the scale of the servers and the model size increase. This relationship is crucial for understanding scalability and can be formally expressed as follows:
where k represents the number of experts activated per token, FLOPs and B_inter denote the GPU's computation ability and the network performance, w is the number of GPU servers, and h is the hidden size of the model. Notably, the first term, . Additionally, scaling MoE models typically emphasizes increasing the number of layers and experts,
while the growth in hidden size (h) tends to be gradual, as seen in models like Switch-Transformer [9]. Consequently, when both the model scale and the number of training servers grow, the proportion of all-to-all communication time remains nearly unchanged. This insight underpins the scalability of the LSH-MoE method, demonstrating its robustness in larger-scale settings and supporting its potential in future large-scale applications. For a detailed derivation, please refer to Appendix A.2.
4 Experiment

4.1 Implementation
Our LSH-MoE comprises a data compression/restoration component and a communication component. We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Additionally, our method is framework-independent and can be easily applied to other MoE training frameworks such as Hetu-MoE [21, 26], DeepSpeed-MoE [29], and Tutel [12].
4.2 Benchmarks and Datasets
Our evaluations are conducted by scaling pre-trained models equipped with the MoE architecture across various application domains. This includes models like RoBERTa-MoE, T5-MoE, and GPT-MoE in natural language processing (NLP), as well as Swin-MoE in computer vision (CV). Among these models, RoBERTa-MoE and T5-MoE are evaluated on pre-training tasks, while GPT-MoE and Swin-MoE undergo fine-tuning evaluation based on their official open-sourced model checkpoints. We also evaluated the zero-shot accuracy of the pre-trained T5-MoE. Model configurations are detailed in Table 1.
1/facebookr