Threading & Simultaneous Multithreading
Slides adapted from David Patterson, UC-Berkeley cs252-s06

Outline
- Thread Level Parallelism
- Multithreading
- Simultaneous Multithreading
- Power 4 vs. Power 5
- Head to Head: VLIW vs. Superscalar vs. SMT
- Commentary
- Conclusion

Performance beyond single thread ILP
- ILP for arbitrary code is now limited to 3 to 6 issues/cycle, but there can be much higher natural parallelism in some applications (e.g., database or scientific codes)
- Explicit (specified by compiler) Thread Level Parallelism or Data Level Parallelism
- Thread: a process with its own instructions and data (or, much harder on the compiler: carefully selected code segments in the same process that rarely interact)
  - A thread may be one process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data Level Parallelism: perform identical (lock-step) operations on data when there is lots of data.

Thread Level Parallelism (TLP)
- ILP (last lectures) exploits implicitly parallel operations within a loop or straight-line code segment
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
- Goal: use many instruction streams to improve
  - Throughput of computers that run many programs
  - Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP for many applications.

New Approach: Multithreaded Execution
- Multithreading: multiple threads share the functional units of one processor via overlapped execution
  - processor must duplicate independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, if running as independent programs, a separate page table
  - memory shared through the virtual memory mechanisms, which already support multiple processes
  - HW for fast thread switch (0.1 to 10 clocks) is much faster than a full process switch (100s to 1000s of clocks) that copies state (state = registers, memory, and file access tables)
- When to switch among threads?
  - Alternate instructions from new threads (fine grain)
  - When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
  - In cache-less multiprocessors, at start of each memory access

For most applications, the processing unit(s) stall 80% or more of the time during “execution”
- From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995. (From U Wash.)
- Just 18% of issue slots OK for an 8-way superscalar.
- [Figure: 18% of CPU issue slots usefully busy]

Multithreading Categories
[Figure: issue slots over time (processor cycles) for Threads 1-5 and idle slots, one panel per category]
- Superscalar (Pipes: 1 2 3 4)
- Fine-Grained (new thread/cycle)
- Coarse-Grained (many cycles/thread)
- Multiprocessing (separate jobs)
- Simultaneous Multithreading (FUs: 1 2 3 4)
- Issue slots busy, in the order given on the slide: 16/48 = 33.3%, 27/48 = 56.3%, 27/48 = 56.3%, 29/48 = 60.4%, 42/48 = 87.5%

Fine-Grained Multithreading
- Switches between threads on each instruction cycle, causing the execution of multiple threads to be interleaved
- Usually done in a round-robin fashion, skipping any stalled threads
- CPU must be able to switch threads every clock
- Advantage is that it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
- Disadvantage is that it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun’s Niagara chip (with 8 cores, will see later)

Coarse-Grained Multithreading
- Switches threads only on costly stalls, such as L2 cache misses (or on any data memory reference if no caches)
- Advantages
  - Relieves need to have very fast thread-switching (if caches are used).
  - Does not slow down any thread, since instructions from other threads are issued only when the active thread encounters a costly stall
- Disadvantage is that it is hard to overcome throughput losses from shorter stalls, because of pipeline start-up costs
  - Since the CPU normally issues instructions from just one thread, when a stall occurs, the pipeline must be emptied or frozen
  - New thread must fill pipeline before instructions can complete
- Because of this start-up overhead, coarse-grained multithreading is efficient for reducing the penalty only of high-cost stalls, where stall time >> pipeline refill time
- Used in IBM AS/400 (1988, for small to medium businesses)

(U Wash => Intel) Simultaneous Multi-threading … “Hyper-threading”
[Figure: issue slots per cycle (cycles 1-9) across 8 functional units: M, M, FX, FX, FP, FP, BR, CC]
- M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
- One thread, 8 func units: Busy 13/72 = 18.1%
- Two threads, 8 units: Busy 30/72 = 41.7%

Use both ILP and TLP? (U Wash: “Yes”)
- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented toward ILP be used to exploit TLP?
  - functional units are often idle in datapaths designed for ILP because of either stalls or dependences in the code
- Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?
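
A toy model can make the idle-slot argument concrete. The sketch below is not taken from the slides: the issue width, the per-cycle `READY` trace, and all function names are invented for illustration. It also assumes, as a simplification, that instructions a thread could not issue in a cycle are dropped and not carried over. It counts how many issue slots get filled when one thread runs alone, when one thread is chosen per cycle round-robin (fine-grained), and when slots may be filled from any mix of threads (SMT).

```python
from typing import List

# Hypothetical machine width: issue slots available each cycle.
ISSUE_WIDTH = 4

# READY[t][c]: instructions thread t could issue in cycle c (0 = stalled).
# These numbers are invented for illustration only.
READY: List[List[int]] = [
    [2, 0, 0, 3, 1, 0],
    [1, 2, 0, 0, 2, 1],
    [0, 1, 3, 1, 0, 2],
]


def issued_single_thread(ready: List[List[int]], width: int, thread: int = 0) -> int:
    """Superscalar with one thread: only that thread can fill slots."""
    return sum(min(r, width) for r in ready[thread])


def issued_fine_grained(ready: List[List[int]], width: int) -> int:
    """One thread per cycle, chosen round-robin, skipping stalled threads."""
    n_threads, n_cycles = len(ready), len(ready[0])
    next_thread, total = 0, 0
    for cycle in range(n_cycles):
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready[t][cycle] > 0:
                total += min(ready[t][cycle], width)
                next_thread = (t + 1) % n_threads
                break  # only one thread issues in a given cycle
    return total


def issued_smt(ready: List[List[int]], width: int) -> int:
    """SMT: slots in a cycle may be filled from any mix of threads."""
    n_cycles = len(ready[0])
    return sum(min(width, sum(thread[c] for thread in ready))
               for c in range(n_cycles))


def utilization(issued: int, width: int, n_cycles: int) -> float:
    """Fraction of issue slots that did useful work."""
    return issued / (width * n_cycles)


if __name__ == "__main__":
    cycles = len(READY[0])
    for name, fn in [("single thread", issued_single_thread),
                     ("fine-grained", issued_fine_grained),
                     ("SMT", issued_smt)]:
        n = fn(READY, ISSUE_WIDTH)
        print(f"{name:14s} {n}/{ISSUE_WIDTH * cycles} = "
              f"{utilization(n, ISSUE_WIDTH, cycles):.1%}")
```

On this made-up trace the three policies fill 6, 14, and 19 of the 24 slots. The ordering, not the exact values, is the point: extra threads supply independent instructions for slots that a single thread leaves idle.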

Simultaneous Multithreading (SMT)
- Simultaneous multithreading (SMT): insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  - Large set of virtual registers that can be used to hold the register sets of independent threads
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  - Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
- Just need to add a per-thread renaming table and keep separate PCs
  - Independent commitment can be supported by “logically” keeping a separate reorder buffer for each thread
- Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”

Design Challenges in SMT
- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - Does designating a preferred thread allow sacrificing neither throughput nor single-thread performance?
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
- Larger register file is needed to hold multiple contexts
- Try not to affect clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensure that cache and TLB conflicts generated by SMT do not degrade performance

Multithreading Categories
[Figure: same multithreading-categories chart and issue-slot percentages as shown earlier]

Power 4
- Single-threaded predecessor to Power 5. Eight execution units in an out-of-order engine; each unit may issue one instruction each cycle.
- [Figure: Power 4 instruction pipeline] (IF: instruction fetch, IC: instruction cache, BP: branch predict, D0: decode stage 0, Xfer: transfer, GD: group dispatch, MP: mapping, ISS: instruction issue, RF: register file read, EX: execute, EA: compute address, DC: data caches, F6: six-cycle floating-point execution pipe, Fmt: data format, WB: write back, and CP: group commit)

Power 4 - 1 thread, Power 5 - 2 threads
- 2 fetch (PC), 2 initial decodes
- 2 completes (architected register sets)
- See /servers/eserver/pseries/news/related/2004/m2040.pdf
- [Figure: Power5 instruction pipeline] (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit) Page 43.

Power 5 data flow ...
- Why only 2 threads? With 4, some shared resource (physical registers, cache, memory bandwidth) would often bottleneck
- LSU = load/store unit, FXU = fixed-point execution unit, FPU = floating-point unit, BXU = branch execution unit, and CRL = condition register logical execution unit.

Power 5 thread performance ...
- Relative priority of each thread controllable in hardware.
- For balanced operation, both threads run slower than if they “owned” the machine.

Changes in Power 5 to support SMT
- Increased associativity of L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4 with each of 26 SPEC benchmarks paired with every other (26*26 runs) gave speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 “same” copies of each application gave speedups from 0.89 to 1.41, compared to 1.01 and 1.07 averages for Pentium 4.
  - Most gained some
  - Floating Pt. applications had most cache conflicts and least gains

Head to Head ILP competition

| Processor | Microarchitecture | Fetch/Issue/Execute | Funct. Units | Clock Rate (GHz) | Transistors | Die size | Power |
|---|---|---|---|---|---|---|---|
| Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int. 1 FP | 3.8 | 125 M | 122 mm² | 115 W |
| AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int. 3 FP | 2.8 | 114 M | 115 mm² | 104 W |
| IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int. 2 FP | 1.9 | 200 M | 300 mm² (est.) | 80 W (est.) |
| Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int. 2 FP | 1.6 | 592 M | 423 mm² | 130 W |

Performance on SPECint2000
[Figure]

Performance on SPECfp2000
[Figure]

Normalized Performance: Efficiency

| Rank | Itanium 2 | Pentium 4 | Athlon | Power5 |
|---|---|---|---|---|
| Int/Trans | 4 | 2 | 1 | 3 |
| FP/Trans | 4 | 2 | 1 | 3 |
| Int/area | 4 | 2 | 1 | 3 |
| FP/area | 4 | 2 | 1 | 3 |
| Int/Watt | 4 | 3 | 1 | 2 |
| FP/Watt | 2 | 4 | 3 | 1 |

No Silver Bullet for ILP
- No obvious over-all leader in performance
- The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
- Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
- Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)
- Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
- IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Limits to ILP
- Doubling issue rates above today’s 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - issue 3 or 4 data memory accesses per cycle,
  - resolve 2 or 3 branches per cycle,
  - rename and access more than 20 registers per cycle, and
  - fetch 12 to 24 instructions per cycle.
- The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate
  - E.g., widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most electrical power!

- Most techniques for increasing performance increase power consumption
- The key question is whether a technique is energy efficient: does it increase performance faster than it increases power consumption?
- Multiple-issue processor techniques all are energy inefficient:
  - Issuing multiple instructions incurs some overhead in logic that grows faster (I²) than the issue rate grows
  - Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate), and performance = f(sustained rate), growing gap between peak and sustained performance increasing energy p
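
The energy-efficiency test stated in the slides (does performance grow faster than power?) can be written down in a few lines. The sketch below is illustrative only: the function names are invented, the speedup and power figures in the example are hypothetical and not measurements of any processor, and the quadratic scaling of issue logic is simply the I² assumption quoted above.

```python
def perf_per_watt_change(speedup: float, power_ratio: float) -> float:
    """New performance-per-watt divided by old (> 1.0 means an improvement)."""
    return speedup / power_ratio


def is_energy_efficient(speedup: float, power_ratio: float) -> bool:
    """The slides' criterion: performance must grow faster than power."""
    return speedup > power_ratio


def issue_logic_growth(old_width: int, new_width: int) -> float:
    """Relative issue-logic overhead, ASSUMING it scales as width squared (I^2)."""
    return (new_width / old_width) ** 2


if __name__ == "__main__":
    # Hypothetical technique A: 20% faster for 10% more power.
    print(is_energy_efficient(1.20, 1.10), perf_per_watt_change(1.20, 1.10))
    # Hypothetical technique B: 30% faster for 60% more power.
    print(is_energy_efficient(1.30, 1.60), perf_per_watt_change(1.30, 1.60))
    # Doubling issue width from 4 to 8 under the I^2 assumption.
    print(issue_logic_growth(4, 8))
```

Under the I² assumption, doubling the issue width quadruples the issue-logic overhead while the peak issue rate only doubles. This is one way to read the claim that multiple-issue techniques are energy inefficient.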