Google Cluster Computing Faculty Training Workshop
Module V: Hadoop Technical Review
© Spinnaker Labs, Inc.

Overview
- Hadoop Technical Walkthrough
- HDFS
- Databases
- Using Hadoop in an Academic Environment
- Performance tips and other tools

You Say, “tomato…”
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              (nothing yet… but planned)

Some MapReduce Terminology
- Job – A “full program”: an execution of a Mapper and Reducer across a data set
- Task – An execution of a Mapper or a Reducer on a slice of data
  - a.k.a. Task-In-Progress (TIP)
- Task Attempt – A particular instance of an attempt to execute a task on a machine

Terminology Example
- Running “Word Count” across 20 files is one job
- 20 files to be mapped imply 20 map tasks + some number of reduce tasks
- At least 20 map task attempts will be performed… more if a machine crashes, etc.

Task Attempts
- A particular task will be attempted at least once, possibly more times if it crashes
  - If the same input causes crashes over and over, that input will eventually be abandoned
- Multiple attempts at one task may occur in parallel with speculative execution turned on
  - Task ID from TaskInProgress is not a unique identifier; don’t use it that way

MapReduce: High Level

Node-to-Node Communication
- Hadoop uses its own RPC protocol
- All communication begins in slave nodes
  - Prevents circular-wait deadlock
  - Slaves periodically poll for “status” message
- Classes must provide explicit serialization

Nodes, Trackers, Tasks
- Master node runs JobTracker instance, which accepts Job requests from clients
- TaskTracker instances run on slave nodes
- TaskTracker forks separate Java process for task instances

Job Distribution
- MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
- Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
- … Where’s the data distribution?

Data Distribution
- Implicit in design of MapReduce!
  - All mappers are equivalent; so map whatever data is local to a particular node in HDFS
- If lots of data does happen to pile up on the same node, nearby nodes will map instead
  - Data transfer is handled implicitly by HDFS

Configuring With JobConf
- MR Programs have many configurable options
- JobConf objects hold (key, value) components mapping String → 'a
  - e.g., “mapred.map.tasks” → 20
  - JobConf is serialized and distributed before running the job
- Objects implementing JobConfigurable can retrieve elements from a JobConf

What Happens In MapReduce?
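
As a rough, high-level answer to that question, here is a minimal single-process sketch of the map, shuffle, and reduce phases for Word Count. It uses plain Java collections rather than Hadoop classes, so it shows only the data flow, not job launch or distribution. The names `WordCountSketch`, `map`, `shuffle`, `reduce`, and `run` are illustrative assumptions and are not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the MapReduce data flow.
final class WordCountSketch {

    // Map phase: emit one (word, 1) pair for every token in an input record.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all values by key. A TreeMap keeps keys sorted,
    // so the reduce calls in run() happen in key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(run(Arrays.asList("the quick fox", "the lazy dog", "the end")));
    }
}
```

In a real job each call to `map` would run inside a map task on a separate node, and the grouping done by `shuffle` would move data across the network to the reduce tasks.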
Depth First

Job Launch Process: Client
- Client program creates a JobConf
  - Identify classes implementing Mapper and Reducer interfaces
    - JobConf.setMapperClass(), setReducerClass()
  - Specify inputs, outputs
    - JobConf.setInputPath(), setOutputPath()
  - Optionally, other options too:
    - JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…

Job Launch Process: JobClient
- Pass JobConf to JobClient.runJob() or submitJob()
  - runJob() blocks, submitJob() does not
- JobClient:
  - Determines proper division of input into InputSplits
  - Sends job data to master JobTracker server

Job Launch Process: JobTracker
- JobTracker:
  - Inserts jar and JobConf (serialized to XML) in shared location
  - Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker
- TaskTrackers running on slave nodes periodically query JobTracker for work
- Retrieve job-specific jar and config
- Launch task in separate instance of Java
  - main() is provided by Hadoop

Job Launch Process: Task
- TaskTracker.Child.main():
  - Sets up the child TaskInProgress attempt
  - Reads XML configuration
  - Connects back to necessary MapReduce components via RPC
  - Uses TaskRunner to launch user process

Job Launch Process: TaskRunner
- TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
  - Task knows ahead of time which InputSplits it should be mapping
  - Calls Mapper once for each record retrieved from the InputSplit
- Running the Reducer is much the same

Creating the Mapper
- You provide the instance of Mapper
  - Should extend MapReduceBase
- One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
  - Exists in separate process from all other instances of Mapper – no data sharing!

Mapper
    void map(WritableComparable key,
             Writable value,
             OutputCollector output,
             Reporter reporter)

What is Writable?
- Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
- All values are instances of Writable
- All keys are instances of WritableComparable

Writing For Cache Coherency
    while (more input exists) {
      myIntermediate = new intermediate(input);
      myIntermediate.process();
      export outputs;
    }

Writing For Cache Coherency
    myIntermediate = new intermediate(junk);
    while (more input exists) {
      myIntermediate.setupState(input);
      myIntermediate.process();
      export outputs;
    }

Writing For Cache Coherency
- Running the GC takes time
- Reusing locations allows better cache usage
- Speedup can be as much as two-fold
- All serializable types must be Writable anyway, so make use of the interface

Getting Data To The Mapper

Reading Data
- Data sets are specified by InputFormats
  - Defines input data (e.g., a directory)
  - Identifies partitions of the data that form an InputSplit
  - Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends
- TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
- KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
- SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
- SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())

Filtering File Inputs
- FileInputFormat will read all files out of a specified directory and send them to the mapper
- Delegates filtering this file list to a method subclasses may override
  - e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list

Record Readers
- Each InputFormat provides its own RecordReader implementation
  - Provides (unused?) capability multiplexing
- LineRecordReader – Reads a line from a text file
- KeyValueRecordReader – Used by KeyValueTextInputFormat

Input Split Size
- FileInputFormat will divide large files into chunks
  - Exact size controlled by mapred.min.split.size
- RecordReaders receive file, offset, and length of chunk
- Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”

Sending Data To Reducers
- Map function receives OutputCollector object
  - OutputCollector.collect() takes (k, v) elements
- Any (WritableComparable, Writable) can be used

WritableComparator
- Compares WritableComparable data
  - Will call WritableComparable.compareTo()
  - Can provide fast path for serialized data
- JobConf.setOutputValueGroupingComparator()

Sending Data To The Client
- Reporter object sent to Mapper allows simple asynchronous feedback
  - incrCounter(Enum key, long amount)
  - setStatus(String msg)
- Allows self-identification of input
  - InputSplit getInputSplit()

Partition And Shuffle

Partitioner
- int getPartition(key, val, numPartitions)
  - Outputs the partition number for a given key
  - One partition == values sent to one Reduce task
- HashPartitioner used by default
  - Uses key.hashCode() to return partition num
- JobConf sets Partitioner implementation

Reduction
    reduce(WritableComparable key,
           Iterator values,
           OutputCollector output,
           Reporter reporter)
- Keys & values sent to one partition all go to the same reduce task
- Calls are sorted by key – “earlier” keys are reduced and output before “later” keys

Finally: Writing The Output

OutputFormat
- Analogous to InputFormat
- TextOutputFormat – Writes “key val\n” strings to output file
- SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
- NullOutputFormat – Discards output

HDFS

HDFS Limitations
- “Almost” GFS
  - No file update options (record append, etc); all files are write-once
- Does not implement demand replication
- Designed for streaming
  - Random seeks devastate performance

NameNode
- “Head” interface to HDFS cluster
- Records all global metadata

Secondary NameNode
- Not a failover NameNode!
- Records metadata snapshots from “real” NameNode
  - Can merge update logs in flight
  - Can upload snapshot back to primary

NameNode Death
- No new requests can be served while NameNode is down
  - Secondary will not fail over as new primary
- So why have a secondary at all?

NameNode Death, cont’d
- If NameNode dies from software glitch, just reboot
- But if machine is hosed, metadata for cluster is irretrievable!

Bringing the Cluster Back
- If original NameNode can be restored, secondary can re-establish the most current metadata snapshot
- If not, create a new NameNode, use secondary to copy metadata to new primary, restart whole cluster
- Is there another way…?

Keeping the Cluster Up
- Problem: DataNodes “fix” the address of the NameNode in memory, can’t switch in flight
- Solution: Bring new NameNode up, but use DNS to make cluster believe it’s the original one
  - Secondary can be the “new” one

Further Reliability Measures
- NameNode can output multiple copies of metadata files to different directories
  - Including an NFS mounted one
  - May degrade performance; watch for NFS locks

Databases

Life After GFS
- Straight GFS files are not the only storage option
- HBase (on top of HDFS) provides column-oriented storage
- MySQL and other db engines still relevant

HBase
- Can interface directly with Hadoop
- Provides its own Input- and OutputFormat classes; sends rows directly to mapper, receives new rows from reducer
- … But might not be ready for classroom use (least stable component)

MySQL Clustering
- MySQL database can be sharded on multiple servers
  - For fast IO, use same machines as Hadoop
- Tables can be split across machines by row key range
  - Multiple replicas can serve same table

Sharding & Hadoop Partitioners
- For best performance, Reducer should go straight to local MySQL instance
  - Get all data in the right machine in one copy
- Implement custom Partitioner to ensure particular key range goes to MySQL-aware Reducer

Academic Hadoop Requirements

Server Profile
- UW cluster: 40 nodes, 80 processors total
  - 2 GB RAM / processor
  - 24 TB raw storage space (8 TB replicated)
- One node reserved for JobTracker/NameNode
- Two more wouldn’t cooperate
- … But still vastly overpowered

Setup & Maintenance
- Took about two days to set up and configure
  - Mostly hardware-related issues
  - Hadoop setup was only a couple of hours
- Maintenance: only a few hours/week
  - Mostly rebooting the cluster when jobs got stuck

Total Usage
- About 15,000 CPU-hours consumed by 20 students
  - … Out of 130,000 available over quarter
- Average load is about 12%

Analyzing student usage patterns

Not Quite the Whole Story
- Realistically, students did most work very close to deadline
  - Cluster sat unused for a few days, followed by overloading for two days straight

Analyzing student usage patterns
- Lesson: Resource demands are NOT constant!

Hadoop Job Scheduling
- FIFO queue matches incoming jobs to available nodes
  - No notion of fairness
  - Never switches out running job
- Run-away tasks could starve other student jobs

Hadoop Security
- But on the bright (?) side:
  - No security system for jobs
  - Anyone can start a job; but they can also cancel other jobs
- Realistically, students did not cancel other student jobs, even when they should

Hadoop Security: The Dark Side
- No permissions in HDFS either
  - Just now added in 0.16
- One student deleted the common data set for a project
  - Email subject: “Oops…”
  - No students could test their code until data set restored from backup

Job Scheduling Lessons
- Getting students to “play nice” is hard
  - No incentive
  - Just plain bad/buggy code
- Cluster contention caused problems at deadlines
  - Work in groups
  - Stagger deadlines

Another Possibility
- Amazon EC2 provides on-demand servers
- May be able to have students use these for jobs
  - “Lab fee” would be ~$150/student
- Simple web-based interfaces exist
- Hadoop On Demand (HOD) coming soon
  - Injects new nodes into live clusters

More Performance & Scalability

Number of Tasks
- Mappers = 10 * nodes (or 3/2 * cores)
- Reducers = 2 * nodes (or 1.05 * cores)
- Two degrees of freedom in mapper run time: number of tasks/node, and size of InputSplits
- See /lucene-hadoop/HowManyMapsAndReduces

More Performance Tweaks
- Hadoop defaults to heap cap of 200 MB
  - Set: mapred.child.java.opts = -Xmx512m
  - 1024 MB / process may also be appropriate
- DFS block size is 64 MB
  - For huge files, set dfs.block.size = 134217728
- mapred.reduce.parallel.copies
  - Set to 15 to 50; more data => more copies

Dead Tasks
- Student jobs would “run away”, admin restart needed
- Very often stuck in huge shuffle process
  - Students did not know about Partitioner class, may have had non-uniform distribution…
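
The stuck-shuffle problem described above often comes down to how keys are assigned to partitions. The sketch below counts how many records each reduce partition would receive under hash partitioning, using the same arithmetic as Hadoop's default HashPartitioner: `(key.hashCode() & Integer.MAX_VALUE) % numPartitions`. It is plain Java with no Hadoop dependency; the class name `PartitionSkewSketch`, the helper `recordsPerPartition`, and the sample keys are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for inspecting how evenly keys spread over reducers.
final class PartitionSkewSketch {

    // Same arithmetic as hash partitioning: mask off the sign bit so that
    // negative hash codes still yield a partition in [0, numPartitions).
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts how many records (one per key occurrence) land in each partition.
    static int[] recordsPerPartition(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[getPartition(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // One very frequent key ("the") dominates this sample input.
        List<String> keys = Arrays.asList("the", "the", "the", "the", "fox", "dog");
        // prints [1, 5]: five of the six records go to a single reduce task
        System.out.println(Arrays.toString(recordsPerPartition(keys, 2)));
    }
}
```

Every occurrence of a key goes to the same partition, so a heavily repeated key loads one reduce task no matter how many reducers exist. Running a count like this over a sample of the intermediate keys is one way to spot that kind of skew before a job runs; whether a custom Partitioner or a different key design is the right fix depends on the job.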