Analyzing Distributed Systems

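Baidu Systems Department
Hadoop Distributed File System

What is Hadoop
• Open source, written in Java
• A subproject of Lucene (an open-source search engine) under the Apache open-source organization
• map-reduce engine + HDFS (+ HBase)
• Hadoop should not be regarded as merely a distributed file system; it is a complete infrastructure for distributed computation and storage.

What is HDFS
• HDFS (Hadoop Distributed Filesystem) is designed as the underlying framework for running distributed applications on large clusters built from commodity hardware, not as a distributed file system used purely for storage
• Suited to applications with large data sets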
• High reliability and high availability
• Supports the map-reduce programming model

Other GFS-like systems
• KFS (Kosmos Filesystem): an open-source project, written in C++, from Kosmix, a vertical search engine startup
• Only a file system, with no MapReduce layer
• Backing store for other open source projects:
    • Hadoop (provides a Map/Reduce implementation)
    • Hypertable (provides a Big-Table interface, Zvents Inc)

Disadvantage
• GFS supports rewrite (inefficiently) and concurrent append (efficiently), whereas HDFS does not yet support either rewrite or append.
• HDFS allows a file to be created only once. Data must be written when the file is created, and once creation completes the file can no longer be modified, in strict adherence to "one-writer-write-once & read-many".
• Many applications now need append, for example to keep appending log records to a single file in HDFS.

Our plan
• Implement single-client append and truncate: HDFS allows a file to be opened repeatedly for modification (append and truncate). Only one client may modify the file at a time, and multiple clients may read it concurrently while it is being modified.

Architecture
• Master/Slave architecture: a single namenode and multiple datanodes
• Namenode
    • executes file system namespace operations like opening, closing, and renaming files and directories
    • determines the mapping of blocks to Datanodes
• Datanodes
    • Datanodes are responsible for serving read and write requests from the file system's clients.
    • Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode.
• Namenode
    • Serves as both directory namespace manager and "inode table"
    • Filename -> block sequence (namespace): stored on disk and very precious
    • Block -> machine list ("inodes"): rebuilt every time the NameNode comes up

Namenode
Initialization (new FSNamesystem):
• Load FS Image
• Check and trigger safe mode if needed
• Set the total number of blocks in the system
• Record all blocks that are getting replicated
• Start monitors
• Start HTTP server
• Start RPC server
• Start Trash Emptier thread

Monitors
• SafeModeMonitor: periodically checks whether it is time to leave safe mode.
• PendingReplicationMonitor: a periodic thread that scans for blocks that never finished their replication request.
• HeartbeatMonitor: periodically checks if there are any expired heartbeats.
• LeaseMonitor: periodically checks for leases that have expired, and disposes of them.
• ReplicationMonitor: periodically looks at a few datanodes and computes any replication work that can be scheduled on them.
• DecommissionedMonitor: periodically checks if any of the nodes being decommissioned has finished moving all its datablocks to another replica.

Data Replication
• Stores each file as a sequence of blocks
• Blocks of a file are replicated for fault tolerance
• The replication factor can be specified at file creation time and can be changed later
• Files in HDFS are write-once and have strictly one writer at any time

Data Replication
• The Namenode makes all decisions regarding replication of blocks
• Namenode receives Heartbeat and Blockreport from datanodes
    • Heartbeat: "I'm alive!" (every 3 seconds)
    • Blockreport: all blocks on the datanode (every 1 hour)

HeartbeatMonitor
• Datanodes send heartbeats to the namenode (over TCP)
• If no heartbeat arrives within one interval, the datanode is considered dead
• Only one datanode may be marked dead per pass
• The number of blocks that need replication is updated
• The heartbeat response carries commands: it checks whether there is block replication work or block deletion work for the datanode to do

ReplicationMonitor
• Computes the blocks that need replication; if there is no replication work, computes the blocks that need deletion
• Runs every 3 seconds by default
• Each pass processes only 32% of the datanodes
• A datanode that already carries a heavy replication load is skipped and given no new work (by default a datanode can handle only 2 replications at a time)

SafeMode
• A special state of the namenode in which it accepts no operations on the namespace and makes no adjustments to replica counts.
• The namenode enters safe mode automatically at startup, receives heartbeats and block reports from the datanodes, and checks the list of blocks.
• A block is considered safe once its replica count reaches the configured minimum replication (dfs.replication.min).
• Once the system is found to have reached the configured ratio of safe blocks (dfs.safemode.threshold.pct), the namenode stays in safe mode for a further period (configured by dfs.safemode.extension) so that the remaining datanodes can check in, and then leaves safe mode automatically.
• Safe mode can also be entered or left manually through the setSafeMode command in DFSAdmin.
• Note: if the threshold is set to 0 or the namespace is empty, the namenode does not enter safe mode automatically at startup; if the threshold is greater than 1, the namenode can only leave safe mode manually.

SafemodeMonitor
• Checks whether the Namenode can leave safe mode
• Runs every 1 second by default
• If the Namenode can leave, it exits safe mode and stops this monitor

Lease
• Difference from a lock: a lease has a time limit
• When creating a file, a client must first request a lease from the namenode. The purpose is to keep failed clients from holding server resources indefinitely.
• If the namenode receives no lease renewal call from a client for some period, it assumes the client has "died" and must release the resources the client holds on that node.
• The namenode implements this mechanism with a class named leases. Each lease records the resource (file) it covers, the lease holder (client), and the time of the last lease renewal.
• A client shows the namenode that it is alive by calling renewLease periodically. If the namenode receives no such call from a client within a certain time, it considers the client dead.
• When a lease times out, the lease instance uses a thread to clean up resources; that thread terminates when the lease is closed.

LeaseMonitor
• Checks whether any leases currently exist; leases are sorted by creation time
• Runs every 2 seconds by default
• Each pass processes only the first lease
• If the lease has timed out (1 hour), it is deleted

Filesystem Management
Tracks several important tables:
• valid fsname -> blocklist (kept on disk, logged)
• Set of all valid blocks
• block -> machinelist (kept in memory, rebuilt dynamically from reports)
• machine -> blocklist
• LRU cache of updated-heartbeat machines

    abstract class INode implements Comparable {
        protected byte[] name;
        protected INodeDirectory parent;
        protected long modificationTime;
        // ...
    }

    public class INode {
        enum FileType { DIRECTORY, FILE }
        public static final FileType[] FILE_TYPES = { FileType.DIRECTORY, FileType.FILE };
        public static final INode DIRECTORY_INODE = new INode(FileType.DIRECTORY, null);
        private FileType fileType;
        private Block[] blocks;
        // ...
    }

    class INodeDirectory extends INode {
        protected static final int DEFAULT_FILES_PER_DIRECTORY = 5;
        final static String ROOT_NAME = "";
        private List children;
        // ...
    }

    class INodeFile extends INode {
        private BlockInfo[] blocks = null;
        protected short blockReplication;
        protected long preferredBlockSize;
        // ...
    }

    class LocatedBlock implements Writable {
        private Block b;
        private long offset;            // offset of the first byte of the block in the file
        private DatanodeInfo[] locs;
        // ...
    }

    public class DatanodeDescriptor extends DatanodeInfo {
        private volatile BlockInfo blockList = null;
        protected boolean isAlive = false;
        List replicateBlocks;
        List replicateTargetSets;
        List invalidateBlocks;
        // ...
    }

    static class DatanodeImage implements Comparable {
        DatanodeDescriptor node;
        // ...
    }

    class BlocksMap {
        static class BlockInfo extends Block {
            private INodeFile inode;
            private Object[] triplets;
            // ...
        }

        private static class NodeIterator implements Iterator {
            private BlockInfo blockInfo;
            private int nextIdx = 0;
            // ...
        }
    }

    ArrayList heartbeats = new ArrayList();
    private Map leases = new TreeMap();
    private SortedSet sortedLeases = new TreeSet();

Persistence of Filesystem Metadata
• EditLog
    • A transaction log that persistently records every change that occurs to file system metadata: OP_ADD, OP_RENAME, OP_DELETE, OP_MKDIR, OP_SET_REPLICATION, OP_DATANODE_ADD, OP_DATANODE_REMOVE (only part of the datanode information is persisted)
• FsImage
    • Stores the entire file system namespace, including the mapping of blocks to files and file system properties

Checkpoint
• Namenode startup
• Periodic checkpointing (secondary namenode, HTTP)

checkpoint
doCheckpoint():

    doSetup();               // Do the required initialization of the merge work node.
    rollEditLog();           // start logging transactions in a new edit file
    getFSImage();            // Fetch fsimage
    getFSEdits();            // Fetch edits
    doMerge();               // Do the merge
    putFSImage(token);       // Upload the new image into the NameNode
    namenode.rollFsImage();

    private void doMerge() throws IOException {
        fsImage.loadFSImage(srcImage);
        fsImage.getEditLog().loadFSEdits(editFile);
        fsImage.saveFSImage(destImage);
    }

loadFSEdits(File edits):

    case OP_ADD:              unprotectedAddFile
    case OP_SET_REPLICATION:  unprotectedSetReplication
    case OP_RENAME:           unprotectedRenameTo
    case OP_DELETE:           unprotectedDelete
    case OP_MKDIR:            unprotectedMkdir
    case OP_DATANODE_ADD
    case OP_DATANODE_REMOVE

Namenode
close:
• close namesystem
• stop PendingReplication daemon
• stop HTTP server
• Interrupt Heartbeat daemon
• Interrupt Replication daemon
• Inter
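
The merge step in doMerge() (load the image, replay the edit log, save the result) can be sketched in a few lines of Java. This is a toy model under stated assumptions, not Hadoop's code: the opcode names come from the EditLog slide, while CheckpointSketch, EditRecord, replay, and the use of a sorted set of paths as the "image" are hypothetical names introduced only for illustration. Renaming or deleting a path here does not touch its children, which a real namespace would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy checkpoint merge: replay an edit log onto a namespace image. Illustrative only. */
class CheckpointSketch {
    /** A few of the opcodes listed on the EditLog slide. */
    enum Op { OP_ADD, OP_MKDIR, OP_RENAME, OP_DELETE }

    /** Hypothetical edit record: an opcode plus one or two paths. */
    static final class EditRecord {
        final Op op;
        final String src;
        final String dst; // only used by OP_RENAME, otherwise null

        EditRecord(Op op, String src, String dst) {
            this.op = op;
            this.src = src;
            this.dst = dst;
        }
    }

    /** Applies the edits in log order to a copy of the image and returns the merged image. */
    static TreeSet<String> replay(TreeSet<String> image, List<EditRecord> edits) {
        TreeSet<String> merged = new TreeSet<>(image); // the input image is left untouched
        for (EditRecord e : edits) {
            switch (e.op) {
                case OP_ADD:
                case OP_MKDIR:
                    merged.add(e.src);
                    break;
                case OP_RENAME:
                    // Rename only if the source path exists in the image.
                    if (merged.remove(e.src)) {
                        merged.add(e.dst);
                    }
                    break;
                case OP_DELETE:
                    merged.remove(e.src);
                    break;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>();
        image.add("/logs");
        List<EditRecord> edits = new ArrayList<>();
        edits.add(new EditRecord(Op.OP_ADD, "/logs/a", null));
        edits.add(new EditRecord(Op.OP_RENAME, "/logs/a", "/logs/b"));
        edits.add(new EditRecord(Op.OP_MKDIR, "/tmp", null));
        edits.add(new EditRecord(Op.OP_DELETE, "/tmp", null));
        System.out.println(replay(image, edits)); // prints [/logs, /logs/b]
    }
}
```

Because replay works on a copy, the old image stays usable until the merged one is complete, which loosely mirrors uploading the new image before rollFsImage() is called.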

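The safe-mode exit rule described under SafeMode can be sketched as follows. The constructor arguments stand in for dfs.safemode.threshold.pct, dfs.safemode.extension, and dfs.replication.min; the class SafeModeSketch and its methods are hypothetical. The sketch assumes a block counts as safe once its reported replica count reaches the configured minimum.

```java
/** Toy model of the safe-mode exit rule. Illustrative only, not Hadoop's implementation. */
class SafeModeSketch {
    private final double thresholdPct;   // stands in for dfs.safemode.threshold.pct
    private final long extensionMillis;  // stands in for dfs.safemode.extension
    private final int minReplication;    // stands in for dfs.replication.min
    private final long totalBlocks;      // blocks known from the loaded image
    private long safeBlocks = 0;         // blocks that have reached minReplication
    private long reachedAtMillis = -1;   // when the threshold was first reached, -1 if not yet

    SafeModeSketch(double thresholdPct, long extensionMillis, int minReplication, long totalBlocks) {
        this.thresholdPct = thresholdPct;
        this.extensionMillis = extensionMillis;
        this.minReplication = minReplication;
        this.totalBlocks = totalBlocks;
    }

    /** A threshold of 0 or an empty namespace means safe mode is never needed. */
    boolean needsSafeMode() {
        return totalBlocks > 0 && thresholdPct > 0;
    }

    /** Call once per newly reported replica, passing the block's replica count after the report. */
    void onReplicaReported(int replicaCount) {
        if (replicaCount == minReplication) {
            safeBlocks++; // the block has just become safe; later replicas do not count again
        }
    }

    /** True once the safe-block ratio has met the threshold for at least the extension period. */
    boolean canLeave(long nowMillis) {
        if (!needsSafeMode()) {
            return true;
        }
        double ratio = (double) safeBlocks / totalBlocks;
        if (ratio < thresholdPct) {
            reachedAtMillis = -1; // a threshold above 1 can never be met automatically
            return false;
        }
        if (reachedAtMillis < 0) {
            reachedAtMillis = nowMillis; // start the extension period
        }
        return nowMillis - reachedAtMillis >= extensionMillis;
    }
}
```

With a threshold of 0.75, four blocks, and a 30-second extension, canLeave first returns true 30 seconds after the third block becomes safe. With a threshold above 1 it never returns true, which matches the note that such a namenode can only leave safe mode manually.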