MapReduce Massive Data Parallel Processing, Ch. 04


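Ch. 4. Hadoop MapReduce Basic Architecture

MapReduce Massive Data Parallel Processing
Department of Computer Science and Technology, Nanjing University
Lecturer: Huang Yihua
Spring semester, 2011

Acknowledgment: this course is funded by the Excellent Course Program of the China University Relations department of Google (Beijing).

1. The Hadoop distributed file system HDFS
2. Basic working principles of Hadoop MapReduce
3. The distributed structured data table HBase

Basic features of HDFS

- Designed and implemented after Google GFS.
- Stores a very large amount of information (terabytes or petabytes), spreading the data across a large number of nodes; supports very large individual files.
- Provides high reliability and fault tolerance: when one or several nodes stop working, the system as a whole is not affected and the data remain available. Reliability and error recovery are achieved by keeping a certain number of replicas of the data.
- Provides fast access to data and good scalability: capacity can be expanded quickly by simply adding more servers, which also lets the system serve more clients.
- Like GFS, HDFS is the underlying data storage layer for MapReduce, and it lets data be accessed and computed on according to its locality wherever possible.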

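1. The Hadoop distributed file system HDFS

Basic features of HDFS (continued)

- HDFS is optimized for sequential reads and supports fast sequential reading of large amounts of data; the price is that random access is comparatively expensive.
- Data are written once and read many times; updating data that has already been written is not supported.
- Data are not cached locally (files are very large, and sequential reads have no locality).
- Files are stored in blocks; the default block size is 64 MB. Large blocks
    - reduce the amount of metadata;
    - favor sequential reads and writes (data are laid out sequentially on disk).
- Each block is stored as multiple replicas, with storage nodes chosen randomly per block; the default number of replicas is 3.

HDFS basic architecture

The NameNode is the counterpart of the GFS Master, and a DataNode is the counterpart of a GFS ChunkServer.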
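The block-based, replicated storage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the DataNode names dn1 to dn5 are hypothetical, and placement here is purely random, ignoring rack awareness.

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024   # default block size stated in the slides
REPLICATION = 3                 # default replica count stated in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def place_replicas(blocks, nodes, replication=REPLICATION, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return {i: rng.sample(nodes, replication) for i in range(len(blocks))}


blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"])
print(len(blocks), [length // 2**20 for _, length in blocks])
# prints: 4 [64, 64, 64, 8]
```

Only the last block is shorter than 64 MB, so the metadata for a 200 MB file is four block entries, each with three replica locations.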

[Figure: HDFS basic architecture. Labels: application, HDFS client, file name or block number, block number and block locations, HDFS NameNode, three DataNodes each holding data.]

HDFS basic implementation architecture

HDFS data distribution design

- Each block is stored as multiple replicas, with storage nodes chosen randomly per block.
- The default number of replicas is 3.

HDFS reliability and error recovery

- DataNode failure detection
    - Heartbeat: the NameNode continuously checks whether each DataNode is alive.
    - If a DataNode fails, new nodes are found to take its place and the failed node's data are redistributed.
- Cluster load balancing
- Data consistency: checksums
- Failure of the master node's metadata
    - Multiple FsImage and EditLog
    - Checkpoint

HDFS design points

- Namespace
- Replica placement: rack awareness
- Safe mode
    - Right after startup, the NameNode waits for every DataNode to report in.
    - Block replication starts only once safe mode has been left.
- The NameNode has its own FsImage and EditLog: the former holds the state of the file system, the latter holds the changes that have not yet been merged into it.

Installing and starting HDFS

- Download hadoop-0.20.1.tar.gz (or the latest version, 0.21).
- Run tar zxvf hadoop-0.20.1.tar.gz; after unpacking, the Hadoop system, including HDFS and all configuration files, is in the chosen directory.
- Do the necessary system configuration under Linux.
- Set the Java runtime environment variables that Hadoop needs.
- Start the Java virtual machine.
- Start Hadoop; Hadoop and the HDFS file system are then running.

HDFS file system commands

- Create your own user directory. User directories live under /user and have to be created.
- Use the -put command to copy data files between the Linux file system and HDFS.
- -put is equivalent to -copyFromLocal.

    someone@anynode:hadoop$ bin/hadoop dfs -ls
    someone@anynode:hadoop$
    someone@anynode:hadoop$ bin/hadoop dfs -ls /
    Found 2 items
    drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
    drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp
    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone
    someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

Uploading a whole directory with -put:

    someone@anynode:hadoop$ bin/hadoop dfs -put source-directory destination

Behavior of -put:

    Command:  bin/hadoop dfs -put foo bar
    Assuming: No file/directory named /user/$USER/bar exists in HDFS
    Outcome:  Uploads local file foo to a file named /user/$USER/bar

    Command:  bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is a directory
    Outcome:  Uploads local file foo to a file named /user/$USER/bar/foo

    Command:  bin/hadoop dfs -put foo somedir/somefile
    Assuming: /user/$USER/somedir does not exist in HDFS
    Outcome:  Uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory

    Command:  bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is already a file in HDFS
    Outcome:  No change in HDFS, and an error is returned to the user.

Command reference:

-ls path
    Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path
    Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path
    Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path
    Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest
    Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest
    Copies the file or directory identified by src to dest, within HDFS.
-rm path
    Removes the file or empty directory identified by path.
-rmr path
    Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest
    Copies the file or directory from the local file system identified by localSrc to dest within HDFS.
-copyFromLocal localSrc dest
    Identical to -put.
-moveFromLocal localSrc dest
    Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest
    Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl]
    Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename
    Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest
    Identical to -get.
-moveToLocal [-crc] src localDest
    Works like -get, but deletes the HDFS copy on success.
-mkdir path
    Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path
    Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
-touchz path
    Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path
    Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.
-stat [format] path
    Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file
    Shows the last 1 KB of file on stdout.
-chmod [-R] mode,mode,... path...
    Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...
    Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path...
    Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd
    Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

HDFS admin commands

Getting the overall state of HDFS:

    bin/hadoop dfsadmin -report
    bin/hadoop dfsadmin -metasave filename

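The -metasave command reports what the state of the NameNode's metadata is.

Safemode

Safemode is an HDFS state in which the file system is mounted read-only; no replication is performed, nor can files be created or deleted.

    bin/hadoop dfsadmin -safemode enter/leave/get/wait

HDFS admin commands (continued)

Changing HDFS membership

Upgrading the HDFS version:

    bin/start-dfs.sh -upgrade                      (used the first time the new version is run)
    bin/hadoop dfsadmin -upgradeProgress status
    bin/hadoop dfsadmin -upgradeProgress details
    bin/hadoop dfsadmin -upgradeProgress force     (at your own risk!)
    bin/start-dfs.sh -rollback                     (used after the old version has been reinstalled; at your own risk!)

Help:

    bin/hadoop dfsadmin -help

Load balancing

Steps for adding a new node:

- Configure the Hadoop program on the new node.
- Add the new slave node to the slaves file on the Master.
- Start the DataNode on the slave node; it contacts the NameNode automatically and joins the cluster.

The Balancer class performs load balancing; the default balancing threshold is within 10%.

    bin/start-balancer.sh -threshold 5
    bin/stop-balancer.sh                           (stops the balancing work at any time)

Using HDFS in MapReduce programs

Through a configuration option, a Hadoop MapReduce program can automatically obtain information about files from the NameNode.

HDFS interfaces include:

- the command-line interface;
- the implicit input of a Hadoop MapReduce job;
- direct manipulation from Java programs;
- libhdfs, for access from C/C++ programs.

HDFS permission control and security features

- Similar to POSIX.
- The security features are incomplete and mainly guard against operating mistakes.
- This is not a strong security model and cannot guarantee that operations are fully secure.
- Commands: bin/hadoop dfs -chmod, -chown, -chgrp
- User: the currently logged-in user name, i.e. HDFS uses the users and groups defined by Linux itself.
- Superuser: the username which was used to start the Hadoop process (i.e., the username who actually ran bin/start-all.sh or bin/start-dfs.sh) is acknowledged to be the superuser for HDFS. If this user interacts with HDFS, he does so with a special username superuser. If Hadoop is shut down and restarted under a different username, that username is then bound to the superuser account.
- Superuser group: set by the configuration parameter dfs.permissions.supergroup.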
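The POSIX-like permission model described above can be illustrated with a small Python check of a 3-digit octal mode. This is a sketch under assumptions, not HDFS code: the function name, the user names and the default superuser name "hadoop" are hypothetical, and the superuser and supergroup are simply treated as passing every check.

```python
def may_access(mode, owner, group, user, user_groups, want,
               superuser="hadoop", supergroup="supergroup"):
    """Check a POSIX-style octal mode such as 0o755; want is 'r', 'w' or 'x'."""
    # The superuser and members of the superuser group bypass the check.
    if user == superuser or supergroup in user_groups:
        return True
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        digit = (mode >> 6) & 7      # owner digit
    elif group in user_groups:
        digit = (mode >> 3) & 7      # group digit
    else:
        digit = mode & 7             # everyone else
    return bool(digit & bit)


# /user/alice has mode 755 and belongs to alice:staff
print(may_access(0o755, "alice", "staff", "bob", {"users"}, "r"))    # True
print(may_access(0o755, "alice", "staff", "bob", {"users"}, "w"))    # False
print(may_access(0o755, "alice", "staff", "hadoop", set(), "w"))     # True
```

Because the user name comes from the client's own Linux login, a check like this mainly prevents accidents, which matches the slides' remark that this is not a strong security model.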

2. Basic working principles of Hadoop MapReduce

Hadoop MapReduce basic architecture and workflow

Data storage and compute node architecture

[Figure: the job submission node runs the jobtracker, the counterpart of the Master in Google MapReduce; the namenode runs the namenode daemon; each slave node runs a tasktracker, the counterpart of a Worker in Google MapReduce, together with a datanode daemon on top of the Linux file system.]

Hadoop MapReduce basic workflow

Main components of Hadoop MapReduce

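File input format: InputFormat

InputFormat defines how data files are split and read. It provides the following functions:

- it selects the files or other objects that serve as input;
- it defines the InputSplits that break a file up into tasks;
- it provides a factory for the RecordReader objects that read the file.

FileInputFormat is an abstract class from which all file-based input format classes inherit their functionality and properties. When a Hadoop job is started, the directory containing the input files is passed to the FileInputFormat object. FileInputFormat reads all files in this directory and then divides them into one or more InputSplits.

The input format is selected by calling JobConf.setInputFormat on the JobConf object.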

Standard input formats:

    InputFormat:  TextInputFormat
    Description:  Default format; reads lines of text files
    Key:          The byte offset of the line
    Value:        The line contents

    InputFormat:  KeyValueTextInputFormat
    Description:  Parses lines into key-val pairs
    Key:          Everything up to the first tab character
    Value:        The remainder of the line

    InputFormat:  SequenceFileInputFormat
    Description:  A Hadoop-specific high-performance binary format
    Key:          user-defined
    Value:        user-defined
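To make the key and value columns concrete, here is a minimal Python imitation of how the two text formats turn the same bytes into records. The function names are hypothetical and this is not Hadoop code; it only mirrors the behavior listed in the table.

```python
def text_records(data: bytes):
    """TextInputFormat style: key = byte offset of the line, value = line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)          # offsets count bytes, newline included


def key_value_records(data: bytes):
    """KeyValueTextInputFormat style: split each line at the first tab."""
    for _, line in text_records(data):
        key, _, value = line.partition("\t")
        yield key, value             # value is "" when the line has no tab


data = b"apple\t3\nbanana\t5\n"
print(list(text_records(data)))       # [(0, 'apple\t3'), (8, 'banana\t5')]
print(list(key_value_records(data)))  # [('apple', '3'), ('banana', '5')]
```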

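Input data splits: InputSplits

- An InputSplit defines the input data for a single Map task.
- A MapReduce program as a whole is called a Job, and a Job may consist of hundreds of tasks.
- By default, files are divided into InputSplits of 64 MB.
- The mapred.min.split.size parameter in the configuration file hadoop-site.xml controls this size.
- mapred.tasktracker.map.tasks.maximum controls the maximum number of map tasks that run at the same time on one node.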
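The relationship between input files, splits and map tasks can be sketched as follows: one map task per split, with every split at most 64 MB. The function names and file names are hypothetical, and the sketch ignores refinements that a real FileInputFormat applies when sizing splits.

```python
MB = 1024 * 1024


def compute_splits(file_length, split_size=64 * MB):
    """Cut one file into (offset, length) splits of at most split_size."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


def plan_map_tasks(files, split_size=64 * MB):
    """One map task per split, over every file in the input directory."""
    tasks = []
    for name in sorted(files):
        for offset, length in compute_splits(files[name], split_size):
            tasks.append((name, offset, length))
    return tasks


tasks = plan_map_tasks({"a.log": 150 * MB, "b.log": 10 * MB})
print(len(tasks))   # 4: three splits for a.log, one for b.log
```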

Reading data records: RecordReader

- An InputSplit defines the size of a unit of work, but not how the data in it are read.
- The RecordReader defines in detail how the data are converted into (key, value) pairs, and it delivers those pairs to the Mapper class.
- TextInputFormat provides LineRecordReader.
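A split boundary rarely coincides with a line boundary, so a line-oriented reader has to decide which split owns a line that straddles two of them. The Python sketch below shows one common convention, which is the idea behind a line record reader: a line belongs to the split in which it starts. The function name is hypothetical and this is an illustration, not the actual LineRecordReader source.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (byte offset, line) for every line that STARTS inside the split."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the partial first line: it was read by the previous split.
        # Searching from start - 1 keeps a line that begins exactly at start.
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # The last line of a split may run past `end`; read it to the newline.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"alpha\nbeta\ngamma\ndelta\n"        # 23 bytes
for start in (0, 8, 16):                     # three 8-byte splits
    print(start, [line for _, line in read_split(data, start, 8)])
# 0 ['alpha', 'beta']
# 8 ['gamma']
# 16 ['delta']
```

Every line is read exactly once even though "beta" and "gamma" both cross a split boundary.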

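Mapper

- Each instance of a Mapper class runs in its own Java process and works on one InputSplit.
- The older interface passes two extra parameters, OutputCollector and Reporter: the former collects the intermediate results, the latter is used to obtain environment information and to set the current execution status.
- The current interface passes a Mapper.Context to every map function; it provides the functionality of both of the objects above.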
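The role of the context object can be illustrated with a word-count mapper in Python. The Context class here is a hypothetical stand-in that only imitates the two duties named above (collecting output, reporting status); it is not the Hadoop API.

```python
class Context:
    """Hypothetical stand-in for Mapper.Context."""

    def __init__(self):
        self.output = []     # collected intermediate (key, value) pairs
        self.status = ""     # last status reported by the task

    def write(self, key, value):
        self.output.append((key, value))

    def set_status(self, message):
        self.status = message


class WordCountMapper:
    def map(self, key, value, context):
        # key is the byte offset of the line and is ignored; value is the line
        for word in value.split():
            context.write(word, 1)


def run_mapper(mapper, records):
    """Feed every record of one split to the mapper."""
    context = Context()
    for key, value in records:
        mapper.map(key, value, context)
    context.set_status("done")
    return context


ctx = run_mapper(WordCountMapper(), [(0, "the cat"), (8, "the hat")])
print(ctx.output)   # [('the', 1), ('cat', 1), ('the', 1), ('hat', 1)]
```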

Combiner

- Merges key-value pairs that have the same key, which reduces the data transfer cost at partitioning time.
- Set with conf.setCombinerClass(Reduce.class);
- A Combiner is a Reducer that runs locally; it can be used only when certain conditions are met.
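A minimal Python sketch of what a combiner buys, using word count as the example. The function name is hypothetical. The condition assumed here is that the reduce operation is associative and commutative, as summation is; an operation such as averaging could not be reused as a combiner unchanged.

```python
from collections import defaultdict


def combine(pairs):
    """Locally merge pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
print(combined)                         # [('cat', 1), ('the', 3)]
print(len(map_output), len(combined))   # 4 records shrink to 2 before transfer
```

Running the same function again on the combined output of several mappers gives the same totals as running it on the raw pairs, which is why the Reduce class can double as the combiner in this case.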

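Partitioner & Shuffle

After the Map work is finished, each Map function sends its results to the node that hosts the corresponding Reducer. At this point the user can supply a Partitioner class that decides where a given (key, value) pair is sent.

Sort

The (key, value) pairs received by the Reduce functions on each node are sorted automatically by Hadoop (that is, the results produced by Map are sorted automatically when they are delivered to a node).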
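The partition and sort steps can be sketched in Python as hash partitioning followed by a per-partition sort. The function names are hypothetical. The hash imitates Java's String.hashCode only to get a deterministic value; real Hadoop key types compute their own hash codes, so actual partition numbers may differ.

```python
def string_hash(s):
    """Deterministic 32-bit hash in the style of Java's String.hashCode."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    """Hash partitioning: the same key always goes to the same reducer."""
    return (string_hash(key) & 0x7FFFFFFF) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Route every pair to its reducer, then sort each reducer's input by key."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


map_outputs = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
print(shuffle_and_sort(map_outputs, 2))
# [[('b', 1)], [('a', 1), ('a', 1), ('c', 1)]]
```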

Reducer

- Performs the user-defined Reduce operation.
- Receives an OutputCollector object to which it writes its output.
- The newest programming interface is Reducer.Context.
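A reduce call receives one key together with all of its values, which relies on the input already being sorted by key. The Python sketch below shows that grouping for a summing reducer; the Context class and function names are hypothetical stand-ins, not the Hadoop API.

```python
from itertools import groupby


class Context:
    """Hypothetical stand-in for Reducer.Context: collects the output."""

    def __init__(self):
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))


def sum_reducer(key, values, context):
    """User-defined reduce operation: add up all values of one key."""
    context.write(key, sum(values))


def run_reducer(sorted_pairs, reducer):
    """Group key-sorted pairs and call the reducer once per distinct key."""
    context = Context()
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        reducer(key, (value for _, value in group), context)
    return context.output


print(run_reducer([("cat", 1), ("the", 1), ("the", 2)], sum_reducer))
# [('cat', 1), ('the', 3)]
```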

File output format: OutputFormat

- Every OutputFormat that writes to HDFS inherits from FileOutputFormat.
- Each Reducer writes one file into a common output directory. The file name is part-nnnnn, where nnnnn is the number associated with that reducer (the partition id).
- Related calls: FileOutputFormat.setOutputPath(), JobConf.setOutputFormat()
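The output layout can be imitated on a local disk: one part-nnnnn file per reducer in a shared directory, with one tab-separated key and value per line as in the default text output. This Python sketch writes to a temporary local directory rather than HDFS, and the function name is hypothetical.

```python
import os
import tempfile


def write_outputs(reducer_outputs, out_dir):
    """Write each reducer's pairs to out_dir/part-nnnnn as key<TAB>value lines."""
    os.makedirs(out_dir, exist_ok=True)
    names = []
    for partition_id, pairs in enumerate(reducer_outputs):
        name = "part-%05d" % partition_id
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            for key, value in pairs:
                f.write("%s\t%s\n" % (key, value))
        names.append(name)
    return names


out_dir = tempfile.mkdtemp()
names = write_outputs([[("cat", 1)], [("hat", 1), ("the", 3)]], out_dir)
print(names)   # ['part-00000', 'part-00001']
```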

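RecordWriter

TextOutputFormat implements the default LineRecordWriter, which writes each result as one line of the form "key \t value".

    OutputFormat: TextOutputFormat
    Description:  Default; writes lines in "key \t value" form

    OutputFormat: SequenceFileOutputFormat
    Description:  Writes binary files suitable for reading into subsequent MapReduce jobs

    OutputFormat: NullOutputFormat
    Description:  Disregards its inputs

Fault tolerance and performance optimization

- Fault tolerance is handled by the Hadoop system itself.
- The main method is to execute failed tasks again.
- Each TaskTracker reports its status to the JobTracker, and the JobTracker finally decides which tasks to re-execute.
- To speed up execution, Hadoop also automatically runs duplicate copies of the same task and takes the result of whichever copy succeeds first (speculative execution).
- Related parameters: mapred.map.tasks.speculative.execution, mapred.red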
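The two mechanisms above, re-executing failed tasks and speculative execution, can be sketched in Python as follows. Everything here is illustrative: the function names, the attempt limit of 4 and the attempt names are assumptions, and durations are supplied by hand instead of being measured.

```python
def run_with_retries(task, max_attempts=4):
    """Re-run a failed task until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, task(attempt)
        except RuntimeError:
            continue                      # the scheduler schedules it again
    raise RuntimeError("task failed %d times" % max_attempts)


def speculative_winner(attempt_durations):
    """Of several copies of one task, the first to finish wins."""
    return min(attempt_durations, key=attempt_durations.get)


def flaky_task(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise RuntimeError("simulated node failure")
    return "ok"


print(run_with_retries(flaky_task))                                 # (3, 'ok')
print(speculative_winner({"attempt_0": 95.0, "attempt_1": 40.0}))   # attempt_1
```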
