Computer Architecture: A Quantitative Approach
Chapter 4 (2)

Instruction-Level Parallelism: Software Approaches

Wang Yi (王奕), Estelle.ywang@

Lecture for ILP: Software Approaches

- Basic compiler techniques for exposing ILP
    - Loop unrolling
    - Static branch prediction
    - Static multiple issue: VLIW
- Advanced compiler support for exposing and exploiting ILP
    - Software pipelining
    - Global code scheduling
- Hardware support for exposing more parallelism at compile time
    - Conditional or predicated instructions
    - Compiler speculation with hardware support

FP Loop: Where Are the Hazards?

Loop:   LD    F0,0(R1)      ; F0 = vector element
        ADDD  F4,F0,F2      ; add scalar from F2
        SD    0(R1),F4      ; store result
        SUBI  R1,R1,#8      ; decrement pointer by 8 bytes (one double word)
        BNEZ  R1,Loop       ; branch if R1 != zero
        NOP                 ; delayed branch slot

Assumed latencies of the FP operations:

Instruction producing result    Instruction using result    Latency in cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Integer op                      Integer op                  0

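Where are the stalls?

Reducing Stalls by Scheduling Within the Basic Block and Filling the Delayed Branch Slot

Original loop:

Loop:   LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
        SUBI  R1,R1,#8
        BNEZ  R1,Loop

[Pipeline timing diagram: stages F, D, X, M, W, with A1 to A4 for the FP add and s marking a stall cycle.]
10 clock cycles per iteration.

Scheduled loop:

Loop:   LD    F0,0(R1)
        SUBI  R1,R1,#8
        ADDD  F4,F0,F2
        BNEZ  R1,Loop
        SD    +8(R1),F4     ; fills the branch delay slot; offset is +8 because SUBI has already run

[Pipeline timing diagram, same notation.]
6 clock cycles per iteration.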
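The two cycle counts quoted for this loop (10 unscheduled, 6 scheduled) can be reproduced with a very small issue model. The following is a minimal sketch, not the lecture's own method: it assumes a single-issue, in-order pipeline, tracks only read-after-write dependences, and ignores structural hazards. The first five latency entries come from the latency table in the slides; the integer-op-to-branch latency of 1 is an assumption added here, and the names `LATENCY`, `cycles`, `UNSCHEDULED` and `SCHEDULED` are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,   # from the slides' latency table
    ("fp_alu", "store"): 2,    # from the slides' latency table
    ("load", "fp_alu"): 1,     # from the slides' latency table
    ("load", "store"): 0,      # from the slides' latency table
    ("int", "int"): 0,         # from the slides' latency table
    ("int", "branch"): 1,      # ASSUMPTION: the branch needs R1 one cycle early
}


def cycles(instrs):
    """Return the cycle in which the last instruction issues.

    Each instruction is (text, kind, dest, srcs). Instructions issue in
    order, one per cycle, and wait for their source operands.
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for _text, kind, dest, srcs in instrs:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                issued, producer_kind = last_writer[reg]
                latency = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + latency + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


UNSCHEDULED = [
    ("LD   F0,0(R1)", "load", "F0", ("R1",)),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    ("SD   0(R1),F4", "store", None, ("R1", "F4")),
    ("SUBI R1,R1,#8", "int", "R1", ("R1",)),
    ("BNEZ R1,Loop", "branch", None, ("R1",)),
    ("NOP", "nop", None, ()),
]

SCHEDULED = [
    ("LD   F0,0(R1)", "load", "F0", ("R1",)),
    ("SUBI R1,R1,#8", "int", "R1", ("R1",)),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    ("BNEZ R1,Loop", "branch", None, ("R1",)),
    ("SD   +8(R1),F4", "store", None, ("R1", "F4")),
]

print(cycles(UNSCHEDULED))  # 10
print(cycles(SCHEDULED))    # 6
```

Under these assumptions the gain from scheduling comes entirely from moving SUBI and BNEZ into cycles that would otherwise be stalls, and from using the store to fill the branch delay slot.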

Unroll Loop Four Times (straightforward way)

Rewrite loop to minimize stalls?

Loop:   LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4      ; drop SUBI & BNEZ
        LD    F6,-8(R1)
        ADDD  F8,F6,F2
        SD    -8(R1),F8     ; drop SUBI & BNEZ
        LD    F10,-16(R1)
        ADDD  F12,F10,F2
        SD    -16(R1),F12   ; drop SUBI & BNEZ
        LD    F14,-24(R1)
        ADDD  F16,F14,F2
        SUBI  R1,R1,#32     ; altered to 4*8
        SD    +8(R1),F16    ; 8 - 32 = -24
        BNEZ  R1,Loop
        NOP

Each LD is followed by a 1-cycle stall and each ADDD by a 2-cycle stall.
15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration.
Assumes the iteration count is a multiple of 4 (that is, R1 starts as a multiple of 32).

Unrolled Loop That Minimizes Stalls

What assumptions were made when the code was moved?
- OK to move a store past SUBI even though SUBI changes the register the store uses (the store offset is adjusted).
- OK to move loads before stores: do they still get the right data?
- When is it safe for the compiler to make such changes?

Loop:   LD    F0,0(R1)
        LD    F6,-8(R1)
        LD    F10,-16(R1)
        LD    F14,-24(R1)
        ADDD  F4,F0,F2
        ADDD  F8,F6,F2
        ADDD  F12,F10,F2
        ADDD  F16,F14,F2
        SD    0(R1),F4
        SD    -8(R1),F8
        SUBI  R1,R1,#32
        SD    +16(R1),F12   ; 16 - 32 = -16
        BNEZ  R1,Loop
        SD    8(R1),F16     ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.

Using Loop Unrolling and Scheduling with Static Multiple Issue

        Integer instruction     FP instruction          Clock cycle
Loop:   L.D    F0,0(R1)                                 1
        L.D    F6,-8(R1)                                2
        L.D    F10,-16(R1)      ADD.D  F4,F0,F2         3
        L.D    F14,-24(R1)      ADD.D  F8,F6,F2         4
        L.D    F18,-32(R1)      ADD.D  F12,F10,F2       5
        S.D    F4,0(R1)         ADD.D  F16,F14,F2       6
        S.D    F8,-8(R1)        ADD.D  F20,F18,F2       7
        S.D    F12,-16(R1)                              8
        DADDUI R1,R1,#-40                               9
        S.D    F16,16(R1)                               10
        BNE    R1,R2,Loop                               11
        S.D    F20,8(R1)                                12

Static Branch Prediction

Static branch predictors are used in processors where branch behavior is expected to be highly predictable at compile time.

Several different methods:
- Always predict a branch as taken, or always as not taken.
- Predict on the basis of branch direction:
    - backward-going branches are predicted taken;
    - forward-going branches are predicted not taken.
- Profile-based prediction: predict from profile information collected on earlier runs of the program.
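The three static schemes listed above can be compared on a toy branch trace. This is a minimal sketch under stated assumptions: the trace, its addresses and the helper names (`predict_by_direction`, `build_profile`, `accuracy`) are all hypothetical, and the profile is built from the same trace it is scored on, whereas a real compiler would use a separate profiling run.

```python
def predict_by_direction(branch_pc, target_pc):
    """Backward taken, forward not taken: taken iff the target precedes the branch."""
    return target_pc < branch_pc


def build_profile(trace):
    """Profile-based: remember each branch's majority outcome from an earlier run."""
    counts = {}
    for pc, _target, taken in trace:
        t, n = counts.get(pc, (0, 0))
        counts[pc] = (t + 1, n) if taken else (t, n + 1)
    return {pc: t >= n for pc, (t, n) in counts.items()}


def accuracy(trace, predictor):
    """Fraction of dynamic branches whose outcome matches predictor(pc, target)."""
    hits = sum(predictor(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)


# Hypothetical trace entries: (branch address, target address, actually taken?)
TRACE = (
    [(0x40, 0x10, True)] * 9 + [(0x40, 0x10, False)]          # loop-closing branch
    + [(0x20, 0x30, False)] * 7 + [(0x20, 0x30, True)] * 3    # forward if-branch
)

profile = build_profile(TRACE)
print(accuracy(TRACE, lambda pc, target: True))         # always taken: 0.6
print(accuracy(TRACE, predict_by_direction))            # by direction: 0.8
print(accuracy(TRACE, lambda pc, target: profile[pc]))  # profile-based: 0.8
```

On this trace the direction heuristic and the profile agree; they diverge only for branches whose usual outcome runs against their direction, which is the case profile-based prediction is meant to catch.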

Static Multiple Issue: VLIW

VLIW: Very Long Instruction Word
- Each “instruction” has explicit coding for multiple operations.
    - In EPIC, the grouping is called a “packet”.
    - In Transmeta, the grouping is called a “molecule” (with “atoms” as the operations).
- Trades instruction space for simple decoding.
    - The long instruction word has room for many operations.
    - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
    - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch.
        - 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide.
- Needs a compiling technique that schedules across several branches.

Loop Unrolling in VLIW

Memory reference 1  Memory reference 2  FP operation 1    FP operation 2    Int. op / branch  Clock
LD F0,0(R1)         LD F6,-8(R1)                                                              1
LD F10,-16(R1)      LD F14,-24(R1)                                                            2
LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2     ADDD F8,F6,F2                       3
LD F26,-48(R1)                          ADDD F12,F10,F2   ADDD F16,F14,F2                     4
                                        ADDD F20,F18,F2   ADDD F24,F22,F2                     5
SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2                                       6
SD -16(R1),F12      SD -24(R1),F16                                                            7
SD -32(R1),F20      SD -40(R1),F24                                          SUBI R1,R1,#56    8
SD 8(R1),F28                                                                BNEZ R1,LOOP      9

(R1 is decremented by 7 x 8 = 56 bytes per pass, so the last store, issued after SUBI, uses offset -48 + 56 = 8.)
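The schedule in the table can be summarized with a little arithmetic. The sketch below is an illustration only: it counts the operations issued in each clock as read off the table, assumes the five-slot word shown in the table header (two memory, two FP, one integer/branch), and uses the hypothetical names `SCHEDULE` and `summarize`. The notes that follow the table quote rounded versions of the same figures.

```python
# One row per clock: (memory references, FP operations, integer/branch ops).
SCHEDULE = [
    (2, 0, 0),  # clock 1
    (2, 0, 0),  # clock 2
    (2, 2, 0),  # clock 3
    (1, 2, 0),  # clock 4
    (0, 2, 0),  # clock 5
    (2, 1, 0),  # clock 6
    (2, 0, 0),  # clock 7
    (2, 0, 1),  # clock 8
    (1, 0, 1),  # clock 9
]


def summarize(schedule, slots_per_word, iterations):
    """Return clocks per iteration, operations per clock and slot utilization."""
    clocks = len(schedule)
    ops = sum(sum(row) for row in schedule)
    return {
        "clocks_per_iteration": clocks / iterations,
        "ops_per_clock": ops / clocks,
        "slot_utilization": ops / (clocks * slots_per_word),
    }


stats = summarize(SCHEDULE, slots_per_word=5, iterations=7)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With 23 operations in 9 clocks and 45 available slots, roughly half of the instruction word is empty, which is the code-size cost listed among the problems for VLIW.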

- Unrolled 7 times to avoid delays.
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
- Average: 2.5 operations per clock, 50% efficiency.
- Note: VLIW needs more registers (15 vs. 6 in the superscalar version).

Problems for VLIW

Technical problems
- Increase in code size
    - Loop unrolling
    - Unused function slots
- Limitations of lockstep operation
    - A stall in any functional unit may cause the entire processor to stall.

Logistical problem
- Binary code compatibility

Major challenge for all multiple-issue processors
- Exploiting large amounts of ILP

Advanced Compiler Support for Exploiting ILP

- Detecting and enhancing loop-level parallelism
- Eliminating dependent computations
- Software pipelining: symbolic loop unrolling
- Global code scheduling
    - Trace scheduling: focus on the critical path

    - Superblocks

Detecting and Enhancing Loop-Level Parallelism

Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations.

A loop is parallel if it can be written without a cycle in the dependences.

Assumption: all array indices are affine. A one-dimensional array index is affine if it can be written in the form a*i + b, where a and b are constants and i is the loop index.

A dependence exists if two conditions hold:
- There are two iteration indices, j and k, both within the limits of the loop.
- The loop stores into element E[a*j + b] and later fetches from the same element as E[c*k + d]; that is, a*j + b = c*k + d.

GCD (greatest common divisor) test: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b). For example, if a loop stores into E[2*i + 3] and loads from E[2*i], then a = 2, b = 3, c = 2, d = 0; GCD(2, 2) = 2 does not divide d - b = -3, so no dependence is possible.

Eliminating Dependent Computations

Before                      After
DADDUI R1,R2,#4             DADDUI R1,R2,#8
DADDUI R1,R1,#4

ADD    R1,R2,R3             ADD    R1,R2,R3
ADD    R4,R1,R6             ADD    R4,R6,R7      ; R1 and R7 swap positions
ADD    R8,R4,R7             ADD    R8,R1,R4
(R8 = R2 + R3 + R6 + R7 in both versions.)

SUM = SUM + X, unrolled five times:
SUM = SUM + X1 + X2 + X3 + X4 + X5
SUM = ((SUM + X1) + (X2 + X3)) + (X4 + X5)
(The regrouped form needs three dependent additions instead of five.)

Software Pipelining

Observation: if the iterations of a loop are independent, more ILP can be obtained by taking instructions from different iterations.
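The observation can be made concrete with the lecture's running loop, x[i] = x[i] + s. The sketch below is an illustration under assumptions, not the compiler's actual transformation: it walks the array upward rather than counting a pointer down to zero, and the function names are hypothetical. Each pass of the kernel stores for one iteration, adds for the next, and loads for the one after that, so the three operations in a pass belong to three different original iterations.

```python
def original_loop(x, s):
    """x[i] = x[i] + s, written as load / add / store per iteration."""
    out = list(x)
    for i in range(len(out)):
        loaded = out[i]        # LD
        summed = loaded + s    # ADDD
        out[i] = summed        # SD
    return out


def software_pipelined(x, s):
    """Same result, with load, add and store taken from different iterations."""
    out = list(x)
    n = len(out)
    if n < 2:
        return original_loop(x, s)   # too short to overlap anything

    # Prologue: fill the software pipeline.
    loaded = out[0]                  # LD   for iteration 0
    summed = loaded + s              # ADDD for iteration 0
    loaded = out[1]                  # LD   for iteration 1

    # Kernel: one store, one add, one load, each from a different iteration.
    for i in range(n - 2):
        out[i] = summed              # SD   for iteration i
        summed = loaded + s          # ADDD for iteration i + 1
        loaded = out[i + 2]          # LD   for iteration i + 2

    # Epilogue: drain the software pipeline.
    out[n - 2] = summed              # SD   for iteration n - 2
    summed = loaded + s              # ADDD for iteration n - 1
    out[n - 1] = summed              # SD   for iteration n - 1
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(software_pipelined(data, 10.0) == original_loop(data, 10.0))  # True
```

Within the kernel no operation consumes a value produced in the same pass, which is what removes the load-to-add and add-to-store stalls without unrolling the loop body.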

Software pipelining reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop (roughly, Tomasulo's algorithm done in software). The new loop gives a multiple-issue processor a steady stream of independent instructions.

Software Pipelining Example

Before: unrolled 3 times

LOOP:   LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
        LD    F6,-8(R1)
        ADDD  F8,F6,F2
        SD    -8(R1),F8
        LD    F10,-16(R1)
        ADDD  F12,F10,F2
        SD    -16(R1),F12
        SUBI  R1,R1,#24
        BNEZ  R1,LOOP

After: software pipelined

LOOP:   SD    0(R1),F4      ; stores M[i]
        ADDD  F4,F0,F2      ; adds to M[i-1]
        LD    F0,-16(R1)    ; loads M[i-2]
        SUBI  R1,R1,#8
        BNEZ  R1,LOOP

Symbolic loop unrolling:
- Maximizes the result-use distance.
- Uses less code space than unrolling.
- Fills and drains the pipeline only once per loop, vs. once per unrolled iteration in loop unrolling.

5 cycles per iteration.

[Figure: number of overlapped operations over time, for the software-pipelined loop and for the unrolled loop.]

Trace Scheduling (used for VLIW)

- Parallelism across IF branches vs. LOOP branches.

Two steps:
- Trace selection
    - Find a likely sequence of basic blocks (the trace), chosen by static or profile-based prediction, that forms a long sequence of straight-line code.
- Trace compaction
    - Squeeze the trace into a few VLIW instructions.
    - Bookkeeping code is needed in case the prediction is wrong.

This is a form of compiler-generated speculation.
- The compiler must generate “fixup” code to handle cases in which the trace is not the path actually taken.
- Needs extra registers: a bad guess is undone by discarding the results.

Example of Trace Scheduling

[Figure: the original code and the code after trace scheduling.]

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

HW advantages:
- HW is better at memory disambiguation, since it knows the actual addresses.
- HW is better at branch prediction, since it has lower overhead.
- HW maintains a precise exception model.
- HW does not execute bookkeeping (compensation) instructions.
- The same software works across multiple implementations.
- Smaller code size (not as many no-ops filling blank instruction slots).

SW advantages:
- The window of instructions examined for parallelism is much larger.
- Much less hardware is involved in VLIW (unless you are Intel…!).
- More involved types of speculation can be done more easily.
- Speculation can be based on large-scale program behavior, not just local information.

Superscalar v. VLIW

Superscalar:
- Smaller code size
- Binary compatibility across generations of hardware

VLIW:
- Simplified hardware for decoding and issuing instructions
- No interlock hardware (compiler checks?)
- More registers, but simplified hardware for register ports (multiple independent register files?)

Hardware Support for Exploiting ILP at Compile Time

Conditional (predicated) instructions

A conditional instruction refers to a condition that is evaluated as part of the instruction's execution.

Example: if (A == 0) { S = T; }, with A, S and T held in R1, R2 and R3.

With a branch:
        BNEZ  R1,L
        ADDU  R2,R3,R0
L:      ……

With a conditional move:
        CMOV  R2,R3,R1      ; copy R3 into R2 only if R1 == 0

The CPU always executes the instruction but writes the result only if the condition is met.
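The difference between the two code sequences can be mimicked in a few lines. This is only a sketch of the semantics, with hypothetical function names; Python has no conditional-move instruction, so the second function models it by always computing the candidate value and then gating the write on the condition.

```python
def with_branch(a, s, t):
    """BNEZ R1,L / ADDU R2,R3,R0 / L: the assignment is skipped by a branch."""
    if a == 0:          # control dependence: whether the move runs depends on the branch
        s = t
    return s


def with_conditional_move(a, s, t):
    """CMOV R2,R3,R1: the move always executes, the write-back is conditional."""
    candidate = t                   # the operation itself is unconditional
    write_enable = int(a == 0)      # the condition is just another input (data dependence)
    return write_enable * candidate + (1 - write_enable) * s


for a in (0, 1, 7):
    assert with_branch(a, 5, 9) == with_conditional_move(a, 5, 9)
print(with_conditional_move(0, 5, 9), with_conditional_move(3, 5, 9))  # 9 5
```

In the second form the result depends on `a` only as an operand, so there is no branch left for the hardware to predict.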

A conditional instruction converts a control dependence into a data dependence.

Conditional Instructions

- The execution of every instruction is controlled by a predicate; when the predicate is false, the instruction becomes a no-op.
- A simple way to convert small blocks of code that are branch dependent.
- Eliminates non-loop branches.
- Can be used to speculatively move an instruction that is time critical.

Compiler Speculation with Hardware Support

Move speculated instructions not only before the branch, but also before the condition evaluation.

Four methods for supporting ambitious speculation:
1. Hardware and OS cooperatively ignore exceptions for speculative instructions.
2. Only speculative instructions that never raise exceptions are used.
3. Poison bits are attached to the result registers written by speculative instructions.
4. A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the result until the instruction is no longer speculative.
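The poison-bit method in the list above can be sketched as a small register-file model. This is a simplified illustration with hypothetical names (`Registers`, `speculative_op`, `normal_op`, `PoisonedUse`), not a description of any particular processor: a speculative instruction that faults sets the poison bit on its destination instead of raising, poison propagates through later speculative instructions, and the exception surfaces only when a non-speculative instruction reads a poisoned register.

```python
class PoisonedUse(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""


class Registers:
    def __init__(self, initial=None):
        self.value = dict(initial or {})
        self.poison = set()          # names of registers whose poison bit is set

    def speculative_op(self, dest, srcs, compute):
        """Execute a speculated instruction; defer any fault by poisoning dest."""
        if any(r in self.poison for r in srcs):
            self.poison.add(dest)    # poison propagates to the result
            return
        try:
            self.value[dest] = compute(*(self.value[r] for r in srcs))
            self.poison.discard(dest)
        except Exception:
            self.poison.add(dest)    # fault is recorded, not raised

    def normal_op(self, dest, srcs, compute):
        """Execute a non-speculative instruction; poisoned inputs fault here."""
        for r in srcs:
            if r in self.poison:
                raise PoisonedUse(r)
        self.value[dest] = compute(*(self.value[r] for r in srcs))
        self.poison.discard(dest)


regs = Registers({"R1": 0, "R2": 10})
# Speculated above its branch: would divide by zero, so R3 is poisoned instead.
regs.speculative_op("R3", ("R2", "R1"), lambda a, b: a // b)
print("R3" in regs.poison)  # True
# If the branch goes the other way, R3 is never read and no exception occurs.
# If the speculation was needed, the first normal use raises PoisonedUse.
```

The design choice this models is that a wrongly speculated instruction costs nothing beyond a wasted slot, while a correctly speculated one still reports its exception, only later than it would have without speculation.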

