Parallel Computer Architecture

Lecture 13, May 18, 2009
Wu Junmin (jmwu@)

Overview
- Review of Lec 11
- Synchronization in SMPs
- MPP
- Introduction to current high-performance computers
- The future of high-performance computers

Preliminary Design Issues
- Design of cache controller and tags
  - Both processor and bus need to look up
- How and when to present snoop results on bus
- Dealing with write-backs
- Overall set of actions for memory operation not atomic
  - Can introduce race conditions
- Atomic operations
- New issues: deadlock, livelock, starvation, serialization, etc.

Contention for Cache Tags
- Cache controller must monitor bus and processor
  - Can view as two controllers: bus-side and processor-side
  - With single-level cache: dual tags (not data) or dual-ported tag RAM
    - must reconcile when updated, but usually only looked up
  - Respond to bus transactions

Reporting Snoop Results: How?
- Collective response from caches must appear on bus
- Example: in MESI protocol, need to know
  - Is block dirty; i.e. should memory respond or not?
  - Is block shared; i.e. transition to E or S state on read miss?
- Three wired-OR signals
  - Shared: asserted if any cache has a copy
  - Dirty: asserted if some cache has a dirty copy
    - needn't know which, since it will do what's necessary
  - Snoop-valid: asserted when OK to check other two signals
    - actually inhibit until OK to check
- Illinois MESI requires priority scheme for cache-to-cache transfers
  - Which cache should supply data when in shared state?
  - Commercial implementations allow memory to provide data

Reporting Snoop Results: When?
- Memory needs to know what, if anything, to do
- Fixed number of clocks from address appearing on bus
  - Dual tags required to reduce contention with processor
  - Still must be conservative (update both on write: E -> M)
  - Pentium Pro, HP servers, Sun Enterprise
- Variable delay
  - Memory assumes cache will supply data till all say "sorry"
  - Less conservative, more flexible, more complex
  - Memory can fetch data and hold just in case (SGI Challenge)
- Immediately: bit-per-block in memory
  - Extra hardware complexity in commodity main memory system

Basic Design

Non-Atomic State Transitions
- Memory operation involves many actions by many entities, incl. bus
  - Look up cache tags, bus arbitration, actions by other controllers, ...
  - Even if bus is atomic, overall set of actions is not
  - Can have race conditions among components of different operations
- Suppose P1 and P2 attempt to write cached block A simultaneously
  - Each decides to issue BusUpgr to allow S -> M
- Issues
  - Must handle requests for other blocks while waiting to acquire bus
  - Must handle requests for this block A
    - e.g. if P2 wins, P1 must invalidate copy and modify request to BusRdX

Handling Non-atomicity: Transient States
- Increases complexity
  - e.g. don't use BusUpgr, rather other mechanisms to avoid data transfer
- Two types of states
  - Stable (e.g. MESI)
  - Transient or intermediate

Multilevel Cache Hierarchies
- Independent snoop hardware for each level?
  - processor pins for shared bus
  - contention for processor cache access?
- Snoop only at L2 and propagate relevant transactions
- Inclusion property
  - (1) contents of L1 is a subset of L2
  - (2) any block in modified state in L1 is in modified state in L2
  - (1) => all transactions relevant to L1 are relevant to L2
  - (2) => on BusRd, L2 can wave off memory access and inform L1
[Figure: processors P, each with L1 and L2 caches; snoop at L2, snoop at L1 on the processor chip marked "???"]

Maintaining Inclusion
- L1: associativity a1, block size b1, number of sets n1, capacity S1 = a1 * b1 * n1
- L2: associativity a2, block size b2, number of sets n2
- The two caches (L1, L2) may choose to replace different blocks
  - Differences in reference history
    - set-associative first-level cache with LRU replacement
    - example: blocks m1, m2, m3 fall in same set of L1 cache...
  - Split higher-level caches
    - instruction, data blocks go in different caches at L1, but may collide in L2
    - what if L2 is set-associative?
  - Differences in block size
- But a common case works automatically
  - L1 direct-mapped, fewer sets than in L2, and block size same

Preserving Inclusion Explicitly
- Propagate lower-level (L2) replacements to higher-level (L1)
  - Invalidate or flush (if dirty) messages
- Propagate bus transactions from L2 to L1
  - Propagate all L2 transactions?
  - use inclusion bits?
- Propagate modified state from L1 to L2 on writes?
  - if L1 is write-through, just invalidate
  - if L1 is write-back
    - add extra state to L2 (dirty-but-stale)
    - request flush from L1 on BusRd

Role of Synchronization
- "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Types of synchronization
  - Mutual exclusion
  - Event synchronization
    - point-to-point
    - group
    - global (barriers)
- How much hardware support?
  - high-level operations?
  - atomic instructions?
  - specialized interconnect?

Mini-Instruction Set Debate
- atomic read-modify-write instructions
  - IBM 370: included atomic compare&swap for multiprogramming
  - x86: read-modify-write instructions with a memory operand can be prefixed with a lock modifier
  - High-level language advocates want hardware locks/barriers
    - but it goes against the "RISC" flow, and has other problems
  - SPARC: atomic register-memory ops (swap, compare&swap)
  - MIPS, IBM Power: no atomic operations but pair of instructions
    - load-locked, store-conditional
    - later used by PowerPC and DEC Alpha too
- Rich set of tradeoffs

Other Forms of Hardware Support
- Separate lock lines on the bus
- Lock locations in memory
- Lock registers (Cray X-MP)
- Hardware full/empty bits (Tera)
- Bus support for interrupt dispatch

Components of a Synchronization Event
- Acquire method
  - Acquire right to the synch: enter critical section, go past event
- Waiting algorithm
  - Wait for synch to become available when it isn't
  - busy-waiting, blocking, or hybrid
- Release method
  - Enable other processors to acquire right to the synch
- Waiting algorithm is independent of type of synchronization
  - makes no sense to put in hardware

Strawman Lock

    lock:   ld   register, location   /* copy location to register */
            cmp  register, #0         /* compare with 0 */
            bnz  lock                 /* if not 0, try again */
            st   location, #1         /* store 1 to mark it locked */
            ret                       /* return control to caller */
    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */

Busy-Wait
- Why doesn't the acquire method work? Release method?
  - The load and the store are separate instructions, so two processors can both read 0 and both store 1

Atomic Instructions
- Specifies a location, register, & atomic operation
  - Value in location read into a register
  - Another value (function of value read or not) stored into location
- Many variants
  - Varying degrees of flexibility in second part
- Simple example: test&set
  - Value in location read into a specified register
  - Constant 1 stored into location
  - Successful if value loaded into register is 0
  - Other constants could be used instead of 1 and 0

Simple Test&Set Lock

    lock:   t&s  register, location
            bnz  lock                 /* if not 0, try again */
            ret                       /* return control to caller */
    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */
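For readers who prefer C to pseudo-assembly, the test&set lock above can be sketched with C11 atomics. This is a minimal sketch, not the lecture's code: the type `ts_lock_t` and the function names are hypothetical, and `atomic_exchange` stands in for the t&s instruction (it atomically returns the old value and stores 1).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: location is 0 when free, 1 when held. */
typedef struct { atomic_int location; } ts_lock_t;

/* test&set: atomically read the old value and store 1. */
static int test_and_set(ts_lock_t *l) {
    return atomic_exchange_explicit(&l->location, 1, memory_order_acquire);
}

/* One acquire attempt; succeeds only if the value read was 0. */
static bool ts_try_lock(ts_lock_t *l) {
    return test_and_set(l) == 0;
}

/* Spin until acquired, like the "bnz lock" loop. */
static void ts_lock(ts_lock_t *l) {
    while (!ts_try_lock(l)) {
        /* old value was 1: someone else holds the lock, try again */
    }
}

/* Release: write 0 to location. */
static void ts_unlock(ts_lock_t *l) {
    atomic_store_explicit(&l->location, 0, memory_order_release);
}
```

Because the read and the write happen in one atomic step, two processors can no longer both observe 0, which is exactly what the strawman lock could not guarantee.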

Other Read-Modify-Write Primitives
- Swap
- Fetch&op
- Compare&swap
  - Three operands: location, register to compare with, register to swap with
  - Not commonly supported by RISC instruction sets
- cacheable or uncacheable

Performance Criteria for Synch. Ops
- Latency (time per op)
  - especially when light contention
- Bandwidth (ops per sec)
  - especially under high contention
- Traffic
  - load on critical resources
  - especially on failures under contention
- Storage
- Fairness

Enhancements to Simple Lock
- Reduce frequency of issuing test&sets while waiting
  - Test&set lock with backoff
  - Don't back off too much or will be backed off when lock becomes free
  - Exponential backoff works quite well empirically: ith time = k * c^i
- Busy-wait with read operations rather than test&set
  - Test-and-test&set lock
  - Keep testing with ordinary load
    - cached lock variable will be invalidated when release occurs
  - When value changes (to 0), try to obtain lock with test&set
    - only one attemptor will succeed; others will fail and start testing again

Improved Hardware Primitives: LL-SC
- Goals:
  - Test with reads
  - Failed read-modify-write attempts don't generate invalidations
  - Nice if single primitive can implement range of r-m-w operations
- Load-Locked (or -linked), Store-Conditional
  - LL reads variable into register
  - Follow with arbitrary instructions to manipulate its value
  - SC tries to store back to location
  - succeed if and only if no other write to the variable since this processor's LL
    - indicated by condition codes
- If SC succeeds, all three steps happened atomically
- If fails, doesn't write or generate invalidations
  - must retry acquire

Simple Lock with LL-SC

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if already locked, try again */
                                  /* other operations */
            sc   location, reg2   /* SC reg2 into location */
            beqz lock             /* if failed, start again */
            ret
    unlock: st   location, #0     /* write 0 to location */
            ret

- Can do more fancy atomic ops by changing what's between LL & SC
  - But keep it small so SC likely to succeed
  - Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting transaction on bus) if:
  - Detects intervening write even before trying to get bus
  - Tries to get bus but another processor's SC gets bus first
- LL, SC are not lock, unlock respectively
  - Only guarantee no conflicting write to lock variable between them
  - But can use directly to implement simple operations on shared variables

Implementing LL-SC
- Lock flag and lock address register at each processor
- LL reads block, sets lock flag, puts block address in register
- Incoming invalidations checked against address: if match, reset flag
  - Also if block is replaced and at context switches
- SC checks lock flag as indicator of intervening conflicting write
  - If reset, fail; if not, succeed
- Livelock considerations
  - Don't allow replacement of lock variable between LL and SC
    - split or set-assoc. cache, and don't allow memory accesses between LL, SC
    - (also don't allow reordering of accesses across LL or SC)
  - Don't allow failing SC to generate invalidations (not an ordinary write)
- Performance: both LL and SC can miss in cache
  - Prefetch block in exclusive state at LL
  - But exclusive request reintroduces livelock possibility: use backoff

Trade-offs So Far
- Latency?
- Bandwidth?
- Traffic?
- Storage?
- Fairness?
- What happens when several processors spinning on lock and it is released?
  - traffic per P lock operations?

Ticket Lock
- Only one r-m-w per acquire
- Two counters per lock (next_ticket, now_serving)
  - Acquire: fetch&inc next_ticket; wait for now_serving to equal the ticket obtained
    - atomic op when arrive at lock, not when it's free (so less contention)
  - Release: increment now_serving
- Performance
  - low latency for low contention, if fetch&inc cacheable
  - O(p) read misses at release, since all spin on same variable
  - FIFO order
    - like simple LL-SC lock, but no inval when SC succeeds, and fair
  - Backoff?
- Wouldn't it be nice to poll different locations...
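The ticket lock described above is short enough to sketch in C11. This is an illustrative sketch under assumptions: the type `ticket_lock_t` and the function names are hypothetical, and `atomic_fetch_add` plays the role of the fetch&inc primitive.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: the two counters named on the slide. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Acquire: a single fetch&inc, then spin with ordinary loads.
   Returns the ticket that was obtained. */
static unsigned ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        /* all waiters read the same variable: O(p) misses at release */
    }
    return my;
}

/* Release: increment now_serving; only the lock holder writes it. */
static void ticket_release(ticket_lock_t *l) {
    unsigned cur = atomic_load_explicit(&l->now_serving,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

Tickets are served in the order they were taken, which gives the FIFO fairness the slide mentions; the cost is that every waiter spins on `now_serving`.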
Array-based Queuing Locks
- Waiting processes poll on different locations in an array of size p
  - Acquire
    - fetch&inc to obtain address on which to spin (next array element)
    - ensure that these addresses are in different cache lines or memories
  - Release
    - set next location in array, thus waking up process spinning on it
  - O(1) traffic per acquire with coherent caches
  - FIFO ordering, as in ticket lock, but O(p) space per lock
  - Not so great for non-cache-coherent machines with distributed memory
    - array location I spin on not necessarily in my local memory (solution later)

Lock Performance on SGI Challenge
- Loop: lock; delay(c); unlock; delay(d);
[Figure: time (µs) versus number of processors for five locks (array-based; LL-SC; LL-SC, exponential; ticket; ticket, proportional), in three panels: (a) Null (c = 0, d = 0); (b) Critical-section (c = 3.64 µs, d = 0); (c) Delay (c = 3.64 µs, d = 1.29 µs)]
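The array-based queuing lock can be sketched in the same style. Everything named here is an assumption made for illustration: `MAX_PROCS` stands for p, `CACHE_LINE` is an assumed 64-byte line size, and the padding is one way to keep each spin location in its own cache line.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_PROCS  8    /* p: hypothetical maximum number of waiters */
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

typedef struct {
    struct {
        atomic_int can_go;                            /* 1 = holder may enter */
        char pad[CACHE_LINE - sizeof(atomic_int)];    /* one slot per line */
    } slot[MAX_PROCS];
    atomic_uint next_slot;   /* fetch&inc hands out spin locations */
} array_lock_t;

static void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->slot[i].can_go, i == 0 ? 1 : 0);  /* first arrival enters */
    atomic_init(&l->next_slot, 0);
}

/* Acquire: fetch&inc picks my slot; spin only on that slot. Returns the slot. */
static unsigned array_lock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % MAX_PROCS;
    while (!atomic_load_explicit(&l->slot[me].can_go, memory_order_acquire)) {
        /* spin on a location no other waiter is polling */
    }
    /* re-arm the slot so it can be reused after the index wraps around */
    atomic_store_explicit(&l->slot[me].can_go, 0, memory_order_relaxed);
    return me;
}

/* Release: set the next location, waking the process spinning on it. */
static void array_lock_release(array_lock_t *l, unsigned me) {
    atomic_store_explicit(&l->slot[(me + 1) % MAX_PROCS].can_go, 1,
                          memory_order_release);
}
```

The sketch only works while at most `MAX_PROCS` processes contend at once, which mirrors the O(p) space cost noted on the slide.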

Point-to-Point Event Synchronization
- Software methods:
  - Interrupts
  - Busy-waiting: use ordinary variables as flags
  - Blocking: use semaphores
- Full hardware support: full-empty bit with each word in memory
  - Set when word is "full" with newly produced data (i.e. when written)
  - Unset when word is "empty" due to being consumed (i.e. when read)
  - Natural for word-level producer-consumer synchronization
    - producer: write if empty, set to full; consumer: read if full, set to empty
  - Hardware preserves atomicity of bit manipulation with read or write
  - Problem: flexibility
    - multiple consumers, or multiple writes before consumer reads?
    - needs language support to specify when to use
    - composite data structures?

Barriers
- Software algorithms implemented using locks, flags, counters
- Hardware barriers
  - Wired-AND line separate from address/data bus
    - Set input high when arrive, wait for output to be high to leave
  - In practice, multiple wires to allow reuse
  - Useful when barriers are global and very frequent
  - Difficult to support arbitrary subset of processors
    - even harder with multiple processes per processor
  - Difficult to dynamically change number and identity of participants
    - e.g. latter due to process migration
  - Not common today on bus-based machines

A Simple Centralized Barrier
- Shared counter maintains number of processes that have arrived
  - increment when arrive (lock), check until reaches numprocs
- Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag;                           /* initially 0 */
    } bar_name;

    BARRIER(bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* reset flag if first to reach */
        mycount = ++bar_name.counter;       /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        } else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work
  - Must prevent process from entering until all have left previous instance
  - Could use another counter, but increases latency and contention
- Sense reversal: wait for flag to take different value consecutive times
  - Toggle this value only when all processes reach

    BARRIER(bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            UNLOCK(bar_name.lock);
            bar_name.flag = local_sense;    /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};  /* busy wait for release */
        }
    }
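A sense-reversing barrier can also be sketched with C11 atomics. Note the substitution: instead of the LOCK/UNLOCK pair used in the pseudocode above, this sketch counts arrivals with an atomic fetch&inc. The type `barrier_t` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical centralized barrier with sense reversal. */
typedef struct {
    atomic_int counter;   /* arrivals in the current episode */
    atomic_int flag;      /* release flag, compared with each private sense */
    int p;                /* number of participants */
} barrier_t;

static void barrier_init(barrier_t *b, int p) {
    atomic_init(&b->counter, 0);
    atomic_init(&b->flag, 0);
    b->p = p;
}

/* local_sense is private to each participant and must start at 0. */
static void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;                        /* toggle private sense */
    int arrived = atomic_fetch_add(&b->counter, 1) + 1;  /* fetch&inc, no lock */
    if (arrived == b->p) {                               /* last to arrive */
        atomic_store(&b->counter, 0);                    /* reset before release */
        atomic_store(&b->flag, *local_sense);            /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense) {
            /* busy-wait until the flag takes this episode's sense */
        }
    }
}
```

Because the counter is reset before the flag is flipped, a fast participant that re-enters the barrier cannot be counted into the previous episode, which is the failure the simple barrier suffers from.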

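Centralized Barrier Performance
- Latency
  - Centralized has critical path length at least proportional to p
- Traffic
  - About 3p bus transactions
- Storage cost
  - Very low: centralized counter and flag
- Fairness
  - Same processor should not always be last to exit barrier
  - No such bias in centralized
- Key problems for centralized barrier are latency and traffic
  - Especially with distributed memory, traffic goes to same node

Barrier Performance on SGI Challenge
- Centralized does quite well
  - Will discuss fancier barrier algorithms for distributed machines
- Helpful hardware support: piggybacking of read misses on bus
  - Also for spinning on highly contended locks
[Figure: barrier performance plot, axis ticks 1 to 8]

Synchronization Summary
- Rich interaction of hardware-software tradeoffs
- Must evaluate hardware primitives and software algorithms together
  - primitives determine which algorithms perform well
- Evaluation methodology is challenging
  - Use of delays, microbenchmarks
  - Should use both microbenchmarks and real workloads
- Simple software algorithms with common hardware primitives do well on bus
  - Will see more sophisticated techniques for distributed machines
  - Hardware support still subject of debate
- Theoretical research argues for swap or compare&swap, not fetch&op
  - Algorithms that ensure constant-time access, but complex

Chapter 5: Massively Parallel Processor (MPP) Systems

MPP Overview
- A massively parallel processor (MPP) usually means a large-scale parallel computer system with the following characteristics:
  - nodes use commodity microprocessors, one or more per node;
  - memory is physically distributed among the nodes;
  - an interconnection network with high communication bandwidth and low latency couples the nodes tightly;
  - it scales to hundreds or thousands of processors;
  - it is an asynchronous multiple-instruction, multiple-data (MIMD) machine
- Intel Paragon, IBM SP2, Intel TFLOPS, and China's Dawning-1000 are all MPPs
- Two implementation approaches
  - NCC-NUMA architecture: Cray T3E
  - NORMA architecture: Intel/Sandia ASCI Option Red
- The boundary with clusters is blurry
  - the differences are shrinking
  - the key difference lies in inter-node communication

MPP Structure
[Figure: structure of an MPP]

MPP Characteristics
- Scalability:
  - architecture with physically distributed main memory
  - balanced processing and memory capability
  - balanced computation and parallel-interaction capability
- System cost:
  - use existing commodity CMOS microprocessors
    - physical address space not large enough
    - TLB not large enough
    - non-blocking caches
    - exception handling and boundary protection
  - use a stable architecture to support scalability across generations: the shell architecture
  - use an architecture with physically distributed main memory
  - use SMP nodes

MPP Characteristics (cont'd)
- Generality and usability:
  - support the general asynchronous MIMD mode;
  - support popular standard programming models such as message passing (PVM, MPI) and data parallelism (HPF);
  - nodes are assigned to several "pools" that serve different jobs;
  - the internal interconnect topology is transparent to users, who see only a fully connected set of nodes;
  - support a single system image (SSI): tightly coupled MPPs usually run a distributed operating system that provides a single system image at the hardware and OS levels;
  - high-availability techniques are mandatory
- Main memory and I/O performance
  - very large total main-memory and disk capacity
  - commercial MPPs pay particular attention to high-speed I/O systems
  - provide a scalable I/O subsystem

Comparison of MPP Models
MPP model | Intel/Sandia ASCI Option Red | IBM SP2 | SGI/Cray Origin2000
A large sample configuration | 9072 processors, 1.8 Tflop/s (SNL) | 400 processors, 100 Gflop/s (MHPCC) | 128 processors, 51 Gflop/s (NCSA)
Introduced | December 1996 | September 1994 | October 1996
Processor type | 200 MHz, 200 Mflop/s Pentium Pro | 67 MHz, 267 Mflop/s POWER2 | 200 MHz, 400 Mflop/s MIPS R10000
Node architecture and data storage | 2 processors, 32 to 256 MB main memory, shared disks | 1 processor, 64 MB to 2 GB local main memory, 1 GB to 14.5 GB local disk | 2 processors, 64 MB to 256 MB distributed shared main memory, shared disks
Interconnect and memory model | split 2D mesh, NORMA | multistage network, NORMA | fat hypercube network, CC-NUMA
Node operating system | lightweight kernel (LWK) | full AIX (IBM UNIX) | microkernel Cellular IRIX
Native programming mechanism | MPI based on PUMA Portals | MPI and PVM | Power C, Power Fortran
Other programming models | NX, PVM, HPF | HPF, Linda | MPI, PVM

Main Problems Facing MPP Systems
- Poor delivered performance: the performance an MPP actually sustains is usually far below its peak;
- Programmability: parallel programs are hard to develop, automatic conversion of serial programs into parallel ones works poorly, and porting parallel programs efficiently between platforms is also difficult.
- High power consumption, with demanding cooling and ventilation requirements
- Large floor space

Case Study 1: Cray T3E Architecture
- Characteristics
  - A multiprocessor with distributed shared main memory (NCC-NUMA).
  - Multiple processing elements (PEs) are interconnected by a three-dimensional bidirectional torus
  - A number of GigaRing channels provide the connections to I/O devices
- Architectural attributes of the T3E. The T3E-900 is the enhanced T3E released at the end of 1996.

Attribute | T3E | T3E-900
Processor clock frequency (MHz) | 300 | 450
Peak processor speed (Mflops) | 600 | 900
Number of processors | 6 to 2048 | 6 to 2048
Peak system speed (Gflops) | 3.6 to 1228 | 5.4 to 1843
Physical main memory capacity (GB) | 1 to 4096 | 1 to 4096
Total peak main-memory bandwidth (GB/s) | 7.2 to 2450 | 7.2 to 2450
Maximum number of I/O channels | 1 to 128 | 1 to 128
Total peak I/O bandwidth (GB/s) | 1 to 128 | 1 to 128
Peak 3D torus link bandwidth (MB/s) | 600 | 600

ASCI/MPP Systems
- ASCI (Accelerated Strategic Computing Initiative): DOE, 1994
  - a ten-year, one-billion-dollar program to build Tflop/s supercomputer systems
  - Advanced Simulation and Computing Program
  - Lawrence Livermore, Los Alamos, and Sandia national laboratories
  - Shift from test-based confidence to simulation-based confidence.
  - Computer manufacturers: Intel, IBM, SGI/Cray, HP
  - Five universities: CalTech / Stanford / University of Chicago / University of Illinois at Urbana-Champaign / University of Utah
- The Los Alamos ASCI Q, dedicated in May 2002

Rank | Manufacturer | Computer / processors | Rmax / Rpeak | Site | Country / year
 | Hewlett-Packard | ASCI Q - AlphaServer SC ES45/1.25 GHz / 4096 | 7727.00 / 10240.00 | Los Alamos National Laboratory | USA / 2002
3 | Hewlett-Packard | ASCI Q - AlphaServer SC ES45/1.25 GHz / 4096 | 7727.00 / 10240.00 | Los Alamos National Laboratory | USA / 2002
4 | IBM | ASCI White, SP Power3 375 MHz / 8192 | 7226.00 / 12288.00 | Lawrence Livermore National Laboratory | USA / 2000

[Figures: ASCI Red; ASCI Blue Pacific; ASCI Blue Mountain]

ASCI Scalable Design Strategy
- Accelerated development
  - a 1 Tflop/s system in 1996, a 10 to 30 Tflop/s system in 2000, and a 100 Tflop/s system in 2004, all at similar cost.
  - the target is not only peak speed: sustained application performance of the whole system should reach 10^5 times the 1994 level
- Balanced scalable design
  - focus on high-end platforms for scientific computing rather than mass-market platforms and popular commercial applications;
  - use as many commercial off-the-shelf (COTS) hardware and software components as possible, and concentrate on key technologies that mainstream computer companies do not effectively supply;
  - use a massively parallel architecture, concentrating on scaling and integration technologies that combine thousands of COTS nodes into an efficient platform with a single system image

[Figure: ASCI platform performance roadmap]

Balanced Design Strategy
- End-to-end performance
- Balanced scalable hardware
  - one balanced-design rule: 1 Gflop/s of peak speed should be matched by 1 GB of main memory, 50 GB of disk, 10 TB of archival storage, 16 GB/s of cache bandwidth, 3 GB/s of main-memory bandwidth, 0.1 GB/s of disk I/O bandwidth, and 1 MB/s of archival-storage bandwidth;
- Balanced scalable software
  - ASCI expects new software development to improve performance by a factor of 10 to 100

Attribute | 1996 | 1997 | 1998 | 2003
Application performance (multiple) | 1 | 1000 | 100,000
Peak computing speed (Gflops) | 100 | 1000 | 10,000 | 100,000
Main memory capacity (TB) | 0.05 | 0.5 | 5 | 50
Disk capacity (TB) | 0.1 to 1 | 1 to 10 | 10 to 100 | 100 to 1000
Archival storage capacity (PB) | 0.13 | 1.3 | 13 | 130
I/O speed (GB/s) | 5 | 50 | 500 | 5000
Network speed (GB/s) | 0.13 | 1.3 | 13 | 130

Hardware Requirements
- ASCI specifies detailed requirements for the processors, memory architecture, and I/O subsystem of its supercomputers. For example, its memory requirements are shown in the table below.

Memory level | Effective latency (CPU cycles) | Read/write bandwidth* | Capacity**
On-chip cache, L1 | 2 to 3 | 16 to 32 B/cycle | 10^-4 B/flop/s
Off-chip cache, L2 | 5 to 6 | 16 B/cycle | 10^-2 B/flop/s
Local main memory | 30 to 80 (15 to 30) | 2 to 8 B/flop peak (2 to 8 B/flop sustained) | 1 B/flop/s
Nearby nodes | 300 to 500 (30 to 50) | 1 to 8 B/flop (8 B/flop) | 1 B/flop/s
Remote nodes | 1000 (100 to 200) | 1 B/flop | 1 B/flop/s
I/O (main memory to disk) | 10 ms | 0.01 to 0.1 B/flop | 10 to 100 B/flop/s
Archive (disk to tape) | seconds | 0.001 B/flop (0.01 to 0.1 B/flop) | 100 B/flop/s (10^4 B/flop/s)
User access | 0.1 s (1/60 s) | OC3/desktop (OC12 to OC48/desktop) | 100 users
Multiple sites | 0.1 s | unknown | unknown

Notes: entries in bold are requirements that industry could not meet in 1997; entries in regular weight are the opposite. Most targets were to be met in 1998; targets in parentheses were scheduled for 2000.
* Bandwidth per unit of workload or per CPU clock.
** Capacity per unit of speed (flop/s).

Software Requirements
- The software industry lags far behind the requirements.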

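- ASCI specifies its software requirements in detail:
  - Human/machine interface: visualization and Internet technology;
  - Application environment: mathematical algorithms, mesh generation, domain decomposition, and scientific data management;
  - Programming environment: programming models, libraries, compilers, debuggers, performance tools, and object technology;
  - Distributed operating software: I/O, file and storage systems, reliability, communication, system management, distributed resource management;
  - Diagnostic performance monitors: system health and monitoring

Software requirement | Security | Scalability | Functionality | Portability
Human/machine interface | ↑Δ | ↓Δ
Visualization | ↓Δ
Internet | ↑Δ | ↑●
Application environment | ↑● | ↓Δ | ↓Δ | ↑Δ
Programming environment | ↓Δ | ↓Δ | ↓Δ | ↓Δ
Distributed operating software | ↓Δ | ↓Δ | ↓Δ | ↓Δ
Diagnostic performance monitors | ↑● | ↓Δ | ↑● | ↓●

Notes: ↑ means industry can meet the requirement. ↓ means industry cannot meet the requirement. Δ means the requirement rises over time. ● means the requirement stays constant.

Contracted ASCI/MPP Platforms
- MPP systems such as Option Red, Blue Pacific, Blue Mountain, Option White, and ASCI Q have been installed at the three national laboratories
- Intel Option Red is a typical MPP system
- The SGI Blue Mountain system is a cluster of 48 nodes, each of which is a 128-processor Origin2000 CC-NUMA system. The interconnect within a node is a fat hypercube. The 48 Origin2000 systems are joined into a cluster by gigabit HiPPI-800 switches, with a bidirectional peak bandwidth of 1.6 Gb/s per link
- Both IBM systems are high-end SP systems
- HP ASCI Q
- ASCI Red Storm, ASCI Purple, IBM BlueGene/L/P

Comparison of Four ASCI Systems
Feature | Option Red | Option Blue: Blue Pacific | Option Blue: Blue Mountain | Option White
Manufacturer | Intel | IBM | SGI | IBM
Installation site | Sandia | Livermore | Los Alamos | Livermore
Completion date | June 1997 | December 1998 | December 1998 | December 2000
Cost (million USD) | 55 | 94 | <110 | 85
Processor | Pentium Pro, 200 MHz, 200 Mflop/s | PowerPC 604, 332 MHz, 664 Mflop/s | MIPS 10000, 250 MHz, 500 Mflop/s | POWER3, 311 MHz, 1244 Mflop/s
System architecture | NORMA-MPP | SMP cluster, 4 CPUs/node, 1464 nodes | CC-NUMA cluster, 128 CPUs/node, 48 nodes | SMP cluster, 16 CPUs/node, 512 nodes
Intra-node interconnect | bus | crossbar | fat hypercube | crossbar
Inter-node interconnect | split 2D mesh | Omega switch | gigabit switch | Omega switch
Number of processors | 9216 | 5856 | 6144 | 8192
Peak speed | 1.8 Tflop/s | 3.888 Tflop/s | 3.072 Tflop/s | 10.2 Tflop/s
Main memory capacity | 594 GB | 2.5 TB | 1.5 TB | 4 TB
Disk capacity | 1 TB | 75 TB | 75 TB | 150 TB

[Figures: ASCI Option Red; ASCI Blue-Pacific; ASCI Blue-Mountain; ASCI White]

Case Study 2: Intel/Sandia ASCI Option Red

Option Red Architecture
- 4608 nodes in total (each with two 200 MHz Pentium Pro processors) and 594 GB of main memory; peak speed is 1.8 Tflop/s and peak cross-section bandwidth is 51 GB/s.
- 4536 compute nodes, which run the parallel computation
- 32 service nodes, which support login, software development, and other interactive work
- 24 I/O nodes, for access to disks, tapes, networks (Ethernet, FDDI, ATM, etc.), and other I/O devices
- 2 system nodes, which support the system's RAS capability: the boot node handles initial system boot and provides services; the node station provides single-system-image support
- Spare (backup) nodes
- 1540 power supplies, 616 interconnect backplanes, and 640 disks (more than 1 TB of capacity)

Node Architecture
- Compute nodes and service nodes are implemented identically
- Two nodes sit on one board. The two SMP nodes are joined through network interface components (NICs), and only one NIC connects to the interconnect backplane.
- Each node's local I/O includes:
  - a serial port called the Node Maintenance Port, which connects to the system's internal Ethernet and is used for system boot, diagnostics, and RAS;
  - an expansion connector used for node testing;
  - boot support hardware, including a Flash ROM that holds the Node Confidence Test, the BIOS, and other code needed to diagnose node failures and load the operating system.
- I/O and system node boards carry only 2 processors (1 node), 1 local bus, and a single NIC.
- Main memory per node can range from 64 MB to 1 GB.
- The number of 133 MB/s PCI cards can rise to 3.
- Each I/O node board carries basic I/O devices such as RS232, Ethernet (10 Mbps), and Fast-Wide SCSI

[Figure: node structure]

System Interconnect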

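- Nodes are connected by an internal interconnect facility (ICF)
- The ICF uses a two-plane mesh topology.
- Each node board connects through its on-board NIC to a Mesh Routing Component (MRC).
- The MRC has six bidirectional ports, each able to transfer data at a peak of 400 MB/s in one direction, or 800 MB/s full duplex. Four ports serve the left, right, up, and down mesh links within a plane, and one more port serves the link between planes.
- A message from any node is sent by wormhole routing through either plane to another node, which lowers latency and improves system availability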
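To see why wormhole routing lowers latency, a first-order textbook model helps. The sketch below is an approximation, not a description of the MRC: the function names, the per-hop router delay, the hop count, and the message size are all hypothetical; only the 400 MB/s link rate comes from the text above.

```c
#include <assert.h>

/* Time to push `bytes` across one link of `mb_per_s` MB/s, in microseconds
   (1 MB/s is one byte per microsecond). */
static double link_time_us(double bytes, double mb_per_s) {
    return bytes / mb_per_s;
}

/* Store-and-forward: every hop receives the whole message before forwarding. */
static double store_and_forward_us(double bytes, double mb_per_s,
                                   int hops, double router_us) {
    return hops * (router_us + link_time_us(bytes, mb_per_s));
}

/* Wormhole: the header pays the router delay at each hop while the body
   pipelines behind it, so the transfer time is paid only once. */
static double wormhole_us(double bytes, double mb_per_s,
                          int hops, double router_us) {
    return hops * router_us + link_time_us(bytes, mb_per_s);
}
```

With an assumed 4000-byte message, 10 hops, and 0.05 µs per router, the model gives roughly 100.5 µs for store-and-forward and 10.5 µs for wormhole routing: the distance term nearly disappears.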

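Option Red System Software
- ASCI Option Red system software: the system, service, and I/O nodes all run the Paragon operating system, a distributed Unix system based on OSF.
- Compute nodes run a lightweight kernel (LWK) called Cougar.
- The interface between these two systems is also supported, including high-speed communication, a Unix programming interface, and a parallel file system

Lightweight Kernel
- Derived from the PUMA system
- The LWK design puts more emphasis on performance: it can efficiently support MPPs with up to several thousand nodes, and it provides only the functions that parallel computing needs rather than general operating-system services;
- Because the TFLOPS system has several thousand compute nodes, Cougar is designed to occupy less than 0.5 MB of main memory, so that the aggregate memory used by the LWK does not grow too fast;
- The design assumes that the communication network is trusted and controlled by the kernel, so protection checks and message authentication are not needed;
- The LWK provides an open architecture that allows efficient development of user-level library routines

LWK Components
- Process Control Thread (PCT): provides process management, naming services, and group protection.
- Quintessential Kernel (Q-Kernel): the only software that can directly access the address-mapping and communication hardware. It provides basic computation, communication, and address-space protection.
- Each node has some user processes, one PCT, and one Q-Kernel.

Message Passing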

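- The ASCI Option Red system supports MPI, NX, and message-passing portals. MPI is the system's standard library, and NX provides backward compatibility with the Paragon.
- The message-passing portal provides the most efficient low-level message-passing library. The portal concept was first proposed in the PUMA operating system, and using it reduces memory-copy overhead in message passing.
- Message passing through portals is not a user-level communication mechanism: it must still cross the kernel.
- A portal is a part of the destination process's address space that is opened to other processes for sending messages. To send a message, the sending process executes the following kernel routine: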

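    send_user_msg(
        void      *buf,     /* start of the send buffer */
        size_t     len,     /* size of the message to send */
        int        tag,     /* message tag */
        proc_id    dest,    /* destination process ID */
        portal_id  portal,  /* index of the destination portal */
        int       *flag     /* flag incremented when the message is sent */
    )

Performance Evaluation of Three Typical MPP Systems
- IBM SP2, Intel Paragon, Cray T3D
- Node architecture: of the three MPPs, the SP2 achieves the best speed and utilization, thanks to its 267 Mflop/s peak speed and the good optimizing compiler designed for the POWER2 microprocessor.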

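- The Alpha 21064 has a higher clock rate but lower ILP.
- Another advantage of the SP2 is that it allows a very large node memory, whereas the Paragon has only 16 MB. The kernel and servers use more than 6.5 MB of main memory and the NX message buffers take another 1 MB, leaving less than 8 MB for data.

Performance and Scalability of the Switching Network
- Communication in an MPP is quite expensive. Point-to-point message passing on the T3D gives the lowest latency, 2 μs; the SP2 and Paragon have similar latencies, below 40 μs.
- The Paragon's two-dimensional mesh shows the highest scalability. Next comes the three-dimensional torus, which scales to 1024 nodes.

Parallel I/O on the Three Platforms
- In the Paragon, file I/O is provided by I/O nodes, which are usually located in the outer columns of the two-dimensional mesh. Each I/O node is attached to a 4.8 GB RAID 3 disk array. Intel's Parallel File System (PFS) provides parallel access to files, and every disk access by a compute node requires a message exchange with an I/O node, so I/O performance depends heavily on network traffic.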

- Each SP2 node is attached to a local disk, so there is no need to distinguish I/O nodes from compute nodes. In the SP2, each node runs a complete IBM AIX operating system and disks attach directly to each node. I/O nodes are defined dynamically by software, and PFS lets users create files that span many SP2 nodes.
- In the T3D, disks attach only to the host, a Cray C90 or Cray Y-MP. I/O nodes connect to the host through I/O gateways. Each I/O gateway contains two nodes, and each of these holds a single Alpha processor, 4 Mwords of main memory (half that of a compute processor), and special communication hardware. One node handles I/O in one direction, for system calls and file access

MPP Summary
- Rapid development in the late 1980s and the early-to-mid 1990s
  - Thinking Machines' CM-5, Intel's Paragon, IBM's SP2, and Cray's T3D
  - Used mainly for scientific computing

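- In the late 1990s, as some companies that specialized in parallel machines went bankrupt or were acquired, MPP systems gradually withdrew from the mainstream parallel-processing market
- Because message-passing systems are easier to build than shared-memory systems, they remain an important way to achieve very large-scale parallel processing; but for reasons of price and application domain, developing message-passing MPP systems has increasingly become a government undertaking
- Most newly emerging high-performance computing systems will be clusters of symmetric multiprocessors (SMPs) built from commodity microprocessors and connected by scalable high-speed interconnection networks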

The State of High-Performance Computing as Seen from the Top500
- Fastest high-performance computer: 1.1 Pflops (IBM Roadrunner)
- Fastest high-performance computer built in China: 180 Tflops (Dawning 5000A)
- The most common high-performance computing
