
Exploiting Thread-Level Parallelism in General Purpose Applications

Pen-Chung Yew (游本中)
Department of Computer Science and Engineering, University of Minnesota
http:///Agassiz
2023/1/11, PCYew-Taiwan

Three cobblers are better than one Zhuge Liang (三個臭皮匠勝過一個諸葛亮)

Three Zhuge Liangs cannot beat one cobbler (三個諸葛亮勝不過一個臭皮匠)

Impact of Hardware Technology on Computer Architectures

  • Performance improvement of microprocessors so far has been driven primarily by higher clock rates: smaller feature sizes (Moore's conjecture), higher power density, higher cooling cost
  • Results? Intel cancelled two Pentium projects recently
  • VLSI technology allows more than 1 billion transistors on a single chip => plenty of gates, what to do with them?
  • Superscalar is hard to scale beyond ~10 instructions per clock cycle: inherent ILP (Instruction-Level Parallelism) limitation in application programs, long wire delays, high power density
  • The memory wall between the CPU and storage devices is getting higher
  • Improving single-program performance is still very important

Parallel Processing Comes to the Rescue - Finally?

  • Parallel processing has been proposed to salvage the clock rate limitation => for the past thirty years!!!
  • Finally? => multiple cores in Intel's roadmap as well as in most embedded processors today
  • What is new here?

      • Use thread-level parallelism (TLP) to improve instruction-level parallelism (ILP) for general-purpose applications

ILP vs. TLP

  [Figure: operations op1 to op24 plotted against time (t1, t2, t3, ... t6), once for a superscalar processor (ILP) and once divided among threads Th1 to Th4 (TLP)]

Parallel Processing Comes to the Rescue - Finally?

  • Is there enough TLP in general-purpose application programs (to improve ILP)? => much harder than scientific applications (floating-point-intensive)

TLP Challenges in General-Purpose Applications

  • Mostly do-while loops
      • Need thread-level control speculation
  • Parallelism exists mostly in outer loops
      • Not good for VLIW (i.e. software pipelining), or vector processing => need thread-level support
  • Pointers complicate alias and data dependence analysis
      • Need runtime disambiguation and data speculation
  • Many small loops and doacross loops
      • Need fast and low overhead communication
  • Small basic blocks - need to exploit both ILP and TLP

  Need new approaches to apply parallel processing to such applications!!

Outline

  • Multi-threaded architectures
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Multi-Threaded Architectures

  • To improve single-program speedup
      • Multiscalar
      • Superthreaded processors
      • Trace processor
      • Multiprocessor on a chip
  • To improve resource utilization, throughput
      • Simultaneous Multithreading (SMT)
  • To hide memory latency
      • Tera computer, Hyperthreading
  • To support system/application functionality

  Reference: Speculative Execution in High Performance Computer Architectures, edited by Kaeli and Yew, CRC Press, 2005

Superthreaded Architectures

  • Exploit thread-level parallelism to enhance ILP
      • Multiple vs. single instruction windows (not for scalability as in traditional parallel processing)
      • Control speculation (not stopped by branch instructions)
      • Data speculation (not stopped by data dependences between threads)
  • Fast communication => small task granularity
  • High cache hit rates, automatic data prefetching
  • Need new hardware and compiler/software support

  Reference: The Superthreaded Processor Architecture, Tsai, et al., IEEE Trans. on Computers, Sept 1999
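As a rough illustration of the thread-level control speculation mentioned above for do-while loops, the following Python sketch simulates a few thread units that each start one loop iteration before it is known whether an earlier iteration has ended the loop. Each iteration keeps its writes in a private buffer, buffers are written back in iteration order, and iterations past the loop exit are squashed. The function name `speculative_do_while`, the `num_units` parameter, and the squaring loop body are illustrative assumptions, not details of the superthreaded design, and the sketch assumes the iterations carry no data dependences on each other.

```python
from typing import Dict, List, Tuple


def speculative_do_while(
    values: List[int], num_units: int = 4
) -> Tuple[Dict[int, int], int, int]:
    """Simulate: i = 0; do { out[i] = values[i] ** 2; i += 1 } while (values[i-1] >= 0).

    Returns (committed memory, committed iterations, squashed iterations).
    """
    memory: Dict[int, int] = {}  # committed (non-speculative) state
    committed = squashed = 0
    start = 0
    loop_exited = False

    while not loop_exited and start < len(values):
        # Fork: each thread unit starts one iteration without waiting for
        # the loop-exit test of the iterations before it.
        window = range(start, min(start + num_units, len(values)))
        results = []
        for i in window:
            write_buffer = {i: values[i] ** 2}  # writes stay private for now
            exits_loop = values[i] < 0          # this iteration's exit test
            results.append((write_buffer, exits_loop))

        # Write back in original iteration order; once one iteration has
        # left the loop, every later speculative iteration is discarded.
        for write_buffer, exits_loop in results:
            if loop_exited:
                squashed += 1
                continue
            memory.update(write_buffer)
            committed += 1
            loop_exited = exits_loop

        start += num_units

    return memory, committed, squashed


if __name__ == "__main__":
    print(speculative_do_while([3, 1, 4, 1, 5, -9, 2, 6]))
```

With the input `[3, 1, 4, 1, 5, -9, 2, 6]` and four units, six iterations commit and the two started after the exiting iteration are squashed, so committed memory matches what the sequential loop would have produced.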

  [Figure: superthreaded processor organization. An instruction cache, a data cache, and four thread processing units, each with an execution unit, a communication unit, a memory buffer, and a writeback unit]

Speculation: Breaking Program Dependency

  • Control and data dependences limit program performance
  • However,
      • Most branches have good predictability
      • Most data dependences happen infrequently at runtime
  • Speculation is an effective approach to break dependences
      • Optimize program execution by ignoring infrequent data dependences, or taking predicted paths
      • Check possible violation (mis-speculation) at runtime
      • Recover if violation occurs

Types of Speculation

  • Control speculation
      • Speculate on the program control flow path
  • Data speculation
      • Speculate on how likely memory references are to the same memory location (address)
  • Value speculation
      • Speculate on the result value of an operation

Outline

  • Multi-threaded architectures
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Speculation on Intel IA64

  • Both control and data speculation are supported on Intel IA64
      • Special instructions and hardware are provided
  • Memory load operation is targeted for speculation
      • Memory delay is usually the bottleneck of performance
      • Memory load is usually the start of speculative operations

Speculating on Data Dependence

  More speculative optimizations

      I1: … = *q
      I2: *p = b
      I3: … = *q
      I4: *r = …
      I5: … = *p
      I6: *r = …

  • Speculate on this dependence (presumably the possible dependence of I3 on the store in I2: if p and q rarely point to the same location, I3 can reuse the value loaded by I1)
  • Redundancy elimination opportunity

Speculate on Data Dependences

  More speculative optimizations (same sequence I1 to I6)

  • Speculate on this dependence (presumably the possible dependence of I5 on the store in I4: if r and p rarely point to the same location, I5 can use b from I2 directly)
  • Copy propagation opportunity

Speculate on Data Dependences

  More speculative optimizations (same sequence I1 to I6)

  • Speculate on this dependence (presumably the possible dependence of I5 on the store in I4: if p and r rarely point to the same location, the store in I4 is overwritten by I6 before it is read)
  • Dead store elimination opportunity

Observations

  • Speculative optimization opportunities exist in many applications (originally, speculation was used only for memory latency hiding during code scheduling)
  • A general compiler framework is needed to support both control and data speculation in optimizations
  • Need to generate recovery code for mis-speculation
  • Need extensive support for data dependence, alias, and value profiling (no longer conservative analysis)

  Reference: A Compiler Framework for Speculative Analysis and Optimizations, ACM/SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2003, also in ACM Trans. on Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271

A Compiler Framework: Intel Open Research Compiler (ORC)

Performance Improvement of Speculative Register Promotion

  • Based on alias profile and compared with -O3 with type-based alias analysis on the Intel ORC compiler

Value Speculation

  • Value locality: likelihood of a previously-seen value recurring within a storage location
  • Observed in any storage location
      • Registers
      • Cache memory
      • Main memory
  • Most work focuses on values stored in registers to break potential data dependences: register value locality

Performance of Value Predictors

  • Predictability of Data Values, Sazeides and Smith, Micro-30, 1997
      • Last value prediction varies from 23% to 61%, average about 40%
      • Stride prediction varies from 38% to 80%, average about 56%
      • FCM (finite context method) with an order of 3 varies from 56% to over 90%, with an average of about 78%
      • Improvement diminishes as order increases
      • Less sensitive to different types of instructions

Outline

  • Introduction
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analyses
  • Conclusion

Compiler Optimizations for Speculative Threads

  • Without compiler optimization, there is limited TLP even under perfect hardware support. [Oplinger, PACT 99]
  • The compiler has to decide
      • Which loops/regions to transform into threads
      • Whether to use synchronization or speculation
      • How to schedule the code to improve overlaps
      • What transformations to use
      • When/how to generate recovery code
  • Profile-based analysis could be very efficient

Loop Selection

  [Figure: program speedup]

  • Carefully selected loops can improve performance significantly!

Speculative Code Motion

  [Figure: stores (*p = ) and loads ( = *p) before code motion and after code motion, with the stall, the critical path, and other computation marked]

Outline

  • Introduction
  • Speculation to break dependency
  • Speculative execution on single-threaded processors
  • Speculative execution on multi-threaded processors
  • Profile-based analysis
  • Conclusion

Crucial Considerations in Dependence Profiling

  • Program coverage => need compiler's support or use heuristic rules
  • Input sensitivity
  • Profiling overhead (space and time)
  • Using alias and data dependence profiles is inherently speculative => need hardware support for correct execution

Alias Profiling vs. Static Analysis

  • Most possible data dependences reported by the compiler do not occur at runtime
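To make the contrast with static analysis concrete, here is a minimal Python sketch of an alias profile for one may-alias pair, the store `I2: *p = b` and the load `I3: … = *q` from the earlier code sequence. An instrumented run records the address of each store and load and counts how often the pair really touches the same location. The class `AliasProfiler`, the helper `profile_run`, and the address trace are hypothetical and chosen only for illustration; they are not the ORC instrumentation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (store site, load site), e.g. ("I2", "I3")


class AliasProfiler:
    """Counts how often a may-alias (store, load) pair hits the same address."""

    def __init__(self, may_alias_pairs: Iterable[Pair]) -> None:
        self.pairs = set(may_alias_pairs)          # pairs reported by static analysis
        self.last_store_addr: Dict[str, int] = {}  # store site -> latest address written
        self.checked: Counter = Counter()          # times the load ran after the store
        self.conflicts: Counter = Counter()        # times the addresses actually matched

    def on_store(self, site: str, addr: int) -> None:
        self.last_store_addr[site] = addr

    def on_load(self, site: str, addr: int) -> None:
        for store_site, load_site in self.pairs:
            if load_site == site and store_site in self.last_store_addr:
                self.checked[(store_site, load_site)] += 1
                if self.last_store_addr[store_site] == addr:
                    self.conflicts[(store_site, load_site)] += 1

    def probability(self, pair: Pair) -> float:
        """Observed frequency with which the dependence occurred."""
        n = self.checked[pair]
        return self.conflicts[pair] / n if n else 0.0


def profile_run(num_iterations: int = 100) -> AliasProfiler:
    """Instrumented run of `I2: *p = b; I3: ... = *q` on a made-up address trace."""
    profiler = AliasProfiler([("I2", "I3")])
    q_addr = 0x2000
    for i in range(num_iterations):
        # Hypothetical input: p equals q only on the first iteration.
        p_addr = q_addr if i == 0 else 0x1000 + 8 * i
        profiler.on_store("I2", p_addr)
        profiler.on_load("I3", q_addr)
    return profiler


if __name__ == "__main__":
    prof = profile_run()
    print(prof.conflicts[("I2", "I3")], "of", prof.checked[("I2", "I3")])
```

On this trace the pair conflicts once in 100 executions. A static analysis that cannot prove p and q distinct must still treat the dependence as always present, whereas the profile suggests it is a good candidate for speculation.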

Data Dependence Profiling

  • Data dependence edges among memory references and function calls
  • Detailed information
      • type: flow, anti, output, or input
      • probability: frequency of occurrence
  • When loops are targeted
      • dependence distance: limited

Overhead of Profiling

  [Figure: profiling slowdown (X times slower) per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, average) for four profilers: alias, DD without distance, DD for innermost loops, DD 4-level loops. The axis is marked 0, 20, 40, 60, 80; data labels that appear to read 96, 110, 102, 121, and 120 are also present]

  • Compiler: ORC version 2.0
  • Machine: Itanium2, 900MHz and 2G memory
  • Benchmarks: SPEC CPU2000 Int
  • Instrumentation optimization has been done

Techniques to Reduce Profiling Overhead

  • Reduce the space requirement by hash table
      • Larger granularity of address
      • Smaller iteration counter
  • Sampling
      • Sample the snapshots of procedures or loops instead of individual references
      • Use instrumentation-based sampling framework
      • Switch at procedures or loops

Conclusions

  • Microprocessors caught up with supercomputers in the '90s and have gone multi-core
  • It is non-trivial to apply current supercomputing technologies to general-purpose applications
  • New architectural support such as thread-level speculative execution, and new compiler techniques such as speculative optimizations using alias and data dependence profiling, even dynamic optimization at runtime, are crucial - as always
  • A very exciting new era for parallel processing might have arrived (especially in embedded systems) - finally!

References

  (1) J. Lin et al., A Compiler Framework for Speculative Analysis and Optimizations, Proc. of ACM/SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), June 2003, also in ACM Trans. on Architecture and Code Optimization (TACO), Vol. 1, No. 3, Sept. 2004, pp. 247-271
  (2) J. Lin et al., Recovery Code Generation for General Speculative Optimizations, to appear in ACM Trans. on Architecture and Code Optimization (TACO), 2005
  (3) J. Lin et al., Speculative Register Promotion Using Advanced Load Address Table (ALAT), Proc. of IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), March 2003
  (4) T. Chen et al., Data Dependence Profiling for Speculative Optimizations, Proc. of Int'l Conf. on Compiler Construction (CC), March 2004
  (5) T. Chen et al., An Empirical Study on the Granularity of Pointer Analysis in C Programs, Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), August 2002
  (6) J. Y. Tsai et al., The Superthreaded Processor Architecture, IEEE Trans. on Computers, special issue on Multithreaded Architecture, Vol. 48, No. 9, Sept 1999

Control Speculation

  • ld.s: move the load operation across the barri
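On IA-64, a control-speculative load (ld.s) is paired with a later check (chk.s): the load defers any exception by marking its result, and the check branches to recovery code if the mark is set. The Python sketch below only simulates that general idea, hoisting a load above the branch that guards it. The names `NAT`, `speculative_load`, `check`, `original`, and `speculated`, and the dictionary used as memory, are illustrative assumptions and not IA-64 code.

```python
from typing import Dict, Optional

NAT = object()  # stand-in for the deferred-exception mark on a loaded value


def speculative_load(memory: Dict[int, int], addr: Optional[int]):
    """Load that never faults: an invalid address yields NAT instead of raising."""
    if addr is None or addr not in memory:
        return NAT
    return memory[addr]


def check(memory: Dict[int, int], addr: int, value):
    """Check placed where the load originally stood; on NAT, run recovery."""
    if value is NAT:
        # Recovery: redo the load non-speculatively, so a genuine fault
        # surfaces here, exactly where the original program would fault.
        return memory[addr]
    return value


def original(memory: Dict[int, int], ptr: Optional[int]) -> int:
    """Original code: the load is guarded by the branch."""
    if ptr is not None:
        return memory[ptr] + 1
    return 0


def speculated(memory: Dict[int, int], ptr: Optional[int]) -> int:
    """Same computation with the load started before the branch is resolved."""
    value = speculative_load(memory, ptr)  # hoisted above the branch
    if ptr is not None:
        return check(memory, ptr, value) + 1
    return 0


if __name__ == "__main__":
    mem = {16: 41}
    print(speculated(mem, 16), speculated(mem, None))
```

The point of the design is that the hoisted load is harmless on the path that never needed it: with a null pointer the speculative load yields the mark, the guarded path is not taken, and no exception is ever raised.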
