File Systems
Department of Computer Science and Technology
2014.11.04
Special Topics in Operating Systems 2014

Outline
- Background
  - The Rising of Big Data
- File System Basis
  - Fundamentals
  - Key Issues
- File Systems Optimization in the Real World
  - Example: GFS/HDFS
  - Optimization Techniques

Data Growth (2010-2020)
- In 2010 the global digital universe reached the zettabyte scale for the first time, at 1.227 ZB
- In 2005 the figure was only 130 EB
- By 2020 the digital universe is projected to reach 40 ZB
- 40 ZB is equivalent to 57 times the number of grains of sand on all the beaches on Earth, or 5,247 GB of data for every person in the world

Qmee: Online in 60 Seconds

Data type distribution
- Compared with traditional structured data, unstructured data and content data are growing rapidly and hold enormous value

New development: Data-Intensive Computing as the 4th Paradigm
- Thousand years ago: Experimental Science
  - Description of natural phenomena
- Last few hundred years: Theoretical Science
  - Newton's Laws, Maxwell's Equations...
- Last few decades: Computational Science
  - Simulation of complex phenomena
- Today: Data-Intensive Science
  - Unify theory, experiment, and simulation

Some other views

Hype Cycle for Big Data

Big Data Opportunity Heat Map

File System Fundamentals
- File system: a layer of the OS that provides a friendly way for users to use block devices
- Components
  - Disk Management: collecting disk blocks into files
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability/Durability: keeping files durable despite crashes, media failures, attacks, etc.

  File abstraction         | Disk abstraction
  Byte-oriented            | Block-oriented
  Names                    | Block #s
  Access protection        | No protection
  Consistency guarantees   | No guarantees beyond block write

File & Directory
- File: user-visible group of blocks arranged sequentially in logical space
- Directory: user-visible index mapping names to files, or a relation used for naming
  - Just a table of (file name, unique ID) pairs
  - The ID can be used to look up other file information
  - Often stored in files
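
The directory described above can be sketched in a few lines of Python. This is only an illustration, assuming an in-memory inode table; the names `Inode`, `inode_table`, `directory`, and `lookup`, and the sample files, are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Per-file information reachable through the unique ID (fields are illustrative)."""
    owner: str
    size: int
    blocks: list = field(default_factory=list)

# Unique ID -> other file information.
inode_table = {
    7: Inode(owner="alice", size=2048, blocks=[12, 13]),
    9: Inode(owner="bob", size=512, blocks=[40]),
}

# The directory itself: just a table of (file name, unique ID) pairs.
# Two names may share one ID, which gives multiple names for the same file.
directory = {"notes.txt": 7, "todo.txt": 9, "notes-link.txt": 7}

def lookup(name):
    """Resolve a name to its ID, then use the ID to find the file information."""
    file_id = directory[name]
    return inode_table[file_id]
```

Calling `lookup("notes.txt")` and `lookup("notes-link.txt")` returns the same `Inode` object, because both names map to ID 7.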

What Gets Stored
- User data itself is the bulk of the file system's contents
- Also includes meta-data on a drive-wide and per-file basis:
  - Drive-wide: available space, formatting info, character set, ...
  - Per-file: name, owner, modification date, physical layout, ...

High-Level Organization
- Files are organized in a "tree" structure made of nested directories
- One directory acts as the "root"
- "Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
- Other file systems can be "mounted" and dropped in as sub-hierarchies (other drives, network shares)

Low-Level Organization (1/2)
- File data and meta-data are stored separately
- File descriptors + meta-data are stored in inodes
  - Large tree or table at a designated location on disk
  - Tells how to look up file contents
- Meta-data may be replicated to increase system reliability

Low-Level Organization (2/2)
- The "standard" read-write medium is a hard drive (other media: CD-ROM, tape, ...)
- Viewed as a sequential array of blocks
- Must address a ~1 KB chunk at a time
- Tree structure is "flattened" into blocks
- Overlapping reads/writes/deletes can cause fragmentation: files are often not stored in a linear layout
  - inodes store all block numbers related to a file

Fragmentation
  A B C (free space)
  A B C A (free space)
  A (free space) C A (free space)
  A D C A (free space)
  A D C A D (free)
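
The sequence of layouts above can be reproduced with a toy allocator. The following is a minimal sketch, assuming a 12-block disk, two blocks per write, and a policy that always takes the lowest-numbered free blocks; the functions `allocate`, `delete`, and `layout` are hypothetical helpers, not part of any real file system.

```python
def allocate(disk, name, nblocks):
    """Give `name` the lowest-numbered `nblocks` free blocks and return their numbers."""
    free = [i for i, owner in enumerate(disk) if owner is None]
    if len(free) < nblocks:
        raise RuntimeError("not enough free blocks")
    for i in free[:nblocks]:
        disk[i] = name
    return free[:nblocks]

def delete(disk, name):
    """Free every block owned by `name`."""
    for i, owner in enumerate(disk):
        if owner == name:
            disk[i] = None

def layout(disk):
    """Render the disk as a string, with '.' for a free block."""
    return "".join(owner if owner else "." for owner in disk)

disk = [None] * 12
history = []

for name in "ABC":                 # write files A, B, C
    allocate(disk, name, 2)
history.append(layout(disk))
allocate(disk, "A", 2)             # A grows: its new blocks land after C
history.append(layout(disk))
delete(disk, "B")                  # deleting B leaves a hole
history.append(layout(disk))
allocate(disk, "D", 2)             # D fills the hole ...
history.append(layout(disk))
allocate(disk, "D", 2)             # ... and then continues after A
history.append(layout(disk))

d_blocks = [i for i, owner in enumerate(disk) if owner == "D"]
```

At the end both A and D occupy two separate runs of blocks, which is why the inode has to record every block number rather than just a start and a length.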

File System Requirements
- Naming
  - Should be flexible, e.g., allow multiple names for the same files
  - Support hierarchy for ease of use
- Persistence
  - Want to be sure data has been written to disk in case a crash occurs
- Sharing/Protection
  - Want to restrict who has access to files
  - Want to share files with other users

File System Requirements (cont'd)
- Speed & Efficiency for different access patterns
  - Sequential access
  - Random access
  - Keyed access (not usually provided by OS)
- Minimum Space Overhead
  - Disk space needed to store metadata is lost for user data
- Twist: all metadata that is required to do translation must be stored on disk
  - Translation scheme should minimize the number of additional accesses for a given access pattern
  - Harder than, say, page tables, where we assumed the page tables themselves are not subject to paging!

Key Issues
- Where to store file metadata?
  - On disk for local file systems
  - On dedicated server(s) for distributed/parallel file systems
- How to store file data?
  - As a whole on one disk
  - Split and stored on multiple disks
- How to guarantee reliability and efficiency?
  - Reliability: replication, RAID, dedicated supervisor, ...
  - Efficiency: replication, cache, hardware-specific space allocation, ...
- How to set block size?
  - Source: Tanenbaum, Modern Operating Systems
  - Assumption: all files are 2 KB in size
  - Question: Why is the data rate corresponding to small block sizes low?
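
One way to approach the question above is to model the cost of reading a single block as seek time plus half a rotation plus transfer time. The sketch below assumes illustrative disk parameters (10 ms average seek, 8.33 ms per rotation, 131072 bytes per track); these numbers and the function names are assumptions made here for the example, not values stated on the slide.

```python
def block_read_ms(block_bytes, seek_ms=10.0, rotation_ms=8.33, track_bytes=131072):
    """Time to read one block: seek + average rotational delay + transfer."""
    transfer_ms = (block_bytes / track_bytes) * rotation_ms
    return seek_ms + rotation_ms / 2 + transfer_ms

def data_rate_kb_per_s(block_bytes, **disk):
    """Throughput when every block read pays the full positioning cost."""
    seconds = block_read_ms(block_bytes, **disk) / 1000
    return (block_bytes / 1024) / seconds

def space_utilization(block_bytes, file_bytes=2048):
    """Fraction of allocated space holding real data when all files are 2 KB."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return file_bytes / (blocks * block_bytes)

# Compare a range of block sizes under the assumptions above.
table = [
    (size, data_rate_kb_per_s(size), space_utilization(size))
    for size in (1024, 2048, 4096, 8192, 65536)
]
```

Under this model the seek and rotational delay are fixed per block, so with small blocks almost all of the time goes to positioning rather than transfer, and the data rate is low. Large blocks raise the data rate but waste space once the block is bigger than the 2 KB file.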
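
Distributed File Systems
- Support access to files on remote servers
  - Uniform view of files
- Must support concurrency
  - Make varying guarantees about locking, who "wins" with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Overview of Distributed File Systems
- Scalability: nodes must be able to join and leave in a hot-pluggable fashion
- Concurrency: every cloud component must be designed to be safe in a concurrent environment
- Reliability: every cloud component needs to know the ways in which the components it depends on can fail, and must be designed to handle each failure appropriately
- Efficiency: the algorithms by which users share data in the cloud system should avoid performance bottlenecks; frequently accessed data needs replicas so that users can fetch it from a nearby copy in the shortest time; the interface to cloud services should be as simple as possible
- Building blocks: naming service, metadata management, cache, replica, interface
- Examples: NFS, AFS, GFS/HDFS

Distributed File Systems: Naming Service and Metadata Management
- Naming service
  - Forms a mapping between physical objects and logical objects
  - Basic requirements
    - Location transparency: use a single file namespace
    - Location independence: a change of physical location requires no change to the logical file name
- Metadata management
  - Metadata: data about data
    - File name, file size, timestamps, control information, user, group, ...
  - Two management modes
    - In-band mode: metadata is stored together with the data; inefficient, and large-volume operations easily become a bottleneck
    - Out-of-band mode: a dedicated server stores the metadata

Distributed File Systems: Cache
- Purpose: performance, improving the efficiency of file access
- What is cached:
  - Metadata: increases concurrency
  - Data: reduces network traffic
- Where:
  - Memory: fast, but costly
  - Disk: supports large files and offline use
- Problem: cache consistency
  - Client-initiated solutions
  - Server-initiated solutions

Distributed File Systems: Replica
- Purpose
  - Ensure reliability
  - Ensure availability
  - Achieve load balancing
- Requirement
  - Replica location is transparent to users
- Problem: consistency
  - Strong consistency
  - Weak consistency

Distributed File Systems: Interface
- Stateless service
  - The server records no state information; every request is self-contained
  - Request messages are large, processing takes longer, and lock operations are not supported
- Stateful service
  - The server records session information for requests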
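
The difference between the two interface styles above can be shown with a toy read operation. This is a minimal sketch with everything held in memory; `FILES`, `stateless_read`, and `StatefulServer` are hypothetical names, and no real protocol such as NFS is being reproduced here.

```python
FILES = {"/a.txt": b"hello distributed world"}

def stateless_read(path, offset, length):
    """Stateless style: the request itself names the file and the byte range."""
    return FILES[path][offset:offset + length]

class StatefulServer:
    """Stateful style: the server remembers each session's file and position."""

    def __init__(self):
        self._sessions = {}   # session id -> [path, current position]
        self._next_id = 0

    def open(self, path):
        self._next_id += 1
        self._sessions[self._next_id] = [path, 0]
        return self._next_id

    def read(self, session_id, length):
        path, pos = self._sessions[session_id]
        data = FILES[path][pos:pos + length]
        self._sessions[session_id][1] = pos + len(data)  # state kept on the server
        return data
```

The stateless request carries more information each time, but the server can restart without losing anything. The stateful request is shorter, yet a server crash loses `_sessions` and with it every client's position.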
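
Choice of Architecture: Scale Up

Choice of Architecture: Scale Out

Scale up vs. Scale out

  Factor                     | Scale-out (SAN/NAS)                  | Scale-up (DAS/SAN/NAS)
  Hardware scaling           | Add hardware                         | Replace hardware
  Hardware limits            | No hardware limit                    | Limited by hardware
  Availability, reliability  | Higher                               | Lower
  Management complexity      | More resources, more management      | Fewer resources to manage
  Across geographic sites    | Yes                                  | No
  NAS available              | Yes, NAS mechanisms are very common  | Yes
  SAN available              | Yes, by adding switches              | Yes
  DAS available              | Limited                              | Yes
  Disruptiveness             | Less                                 | More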

Distributed File System Example: GFS/HDFS
- Product characteristics:
  - Built on low-cost PC servers + open-source Linux + gigabit Ethernet + in-house software
  - Highly scalable: a single cluster can reach more than ten thousand nodes, with storage capacity of several hundred PB
  - Combined with computation: computation is moved to the nodes that hold the data, which improves compute performance; mainly used for data analysis
  - Data reliability: multiple replicas protect the data, typically 3 replicas
- Files are split into fixed-size blocks (chunks)
- One primary master, multiple shadow masters
- Multiple chunkservers
- Multiple clients
- HDFS: the open-source implementation of GFS

Google File System
- Why not use an existing file system?
  - Google's problems are different from anyone else's
- Assumptions
  - High component failure rates
    - Inexpensive commodity components fail all the time
  - "Modest" number of HUGE files
    - Just a few million
    - Each is 100 MB or larger; multi-GB files typical
  - Files are write-once, mostly appended to
    - Perhaps concurrently
  - Large streaming reads
  - High sustained throughput favored over low latency

GFS Design Decisions
- Files stored in chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Familiar interface, but customized API
  - Simplify the problem; focus on apps
  - Add snapshot and record append operations
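
Because the chunk size is fixed, a byte offset in a file maps to a chunk by plain integer arithmetic. The sketch below illustrates only that arithmetic, using the 64 MB figure from the slide; the function names `locate` and `chunks_for_range` are hypothetical and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size (64 MB)

def locate(offset):
    """Translate a byte offset in a file into (chunk index, offset inside the chunk)."""
    return divmod(offset, CHUNK_SIZE)

def chunks_for_range(offset, length):
    """Indexes of all chunks touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary, such as two bytes starting at `CHUNK_SIZE - 1`, touches chunks 0 and 1 and may therefore involve two different sets of chunkservers.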

Optimization of Metadata Service
- Splitting the functions of a single master into
  - Multiple metadata servers
  - Multiple supervisors that are in charge of system monitoring, fault recovery, replica management, garbage collection

Metadata server implementation
- Basic principles:
  - Automatic failure recovery and failover of the metadata service after a node goes down must be implemented, so that the metadata service stays available as far as possible
  - To support diverse workloads, the metadata servers must be scalable
  - Minimize the number of interactions between metadata nodes and other nodes to reduce the load on metadata nodes
- Files are organized into a traditional tree
- Read-write locks
- De-duplicated control lists

Data server implementation
- Files are split into chunks of 32 MB; one chunk corresponds to one physical file in the Linux file system
- A 128-bit chunk ID is generated with a UUID algorithm
- The MD5 value of the chunk file data is recorded to check the integrity of the stored data
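
The chunk ID and integrity check described above can be sketched with Python's standard `uuid` and `hashlib` modules. The slide only says "a UUID algorithm", so the use of `uuid4` is an assumption; likewise the in-memory `store` dictionary and the helper names are hypothetical stand-ins for chunk files on a Linux file system.

```python
import hashlib
import uuid

def new_chunk_id():
    """Return a fresh 128-bit chunk ID (uuid4 is an assumed choice of UUID variant)."""
    return uuid.uuid4()

def md5_of(data):
    """Hex MD5 digest of the chunk contents."""
    return hashlib.md5(data).hexdigest()

def store_chunk(store, data):
    """Save chunk data together with the MD5 recorded at write time."""
    chunk_id = new_chunk_id()
    store[chunk_id] = (data, md5_of(data))
    return chunk_id

def verify_chunk(store, chunk_id):
    """Recompute the MD5 and compare it with the recorded value."""
    data, recorded = store[chunk_id]
    return md5_of(data) == recorded
```

If the stored bytes change after the digest was recorded, `verify_chunk` returns `False`, which is the signal a data server would use to treat the chunk as damaged.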

Supervisor Implementation

Small-File Optimization Based on Inlining and Hotness Statistics
- In a distributed file system that separates data from metadata, small-file access is limited mainly by network latency; a small-file optimization based on inlining and hotness statistics is proposed to improve small-file access performance
- Results:
  - With inline data, small-file access performance improves by about 2x
  - Data migration balances the performance gained from inline data against the overhead it adds to the metadata server
- File inlining
  - For small files, the data is stored inside the metadata
  - When a file is opened, the data is sent to the client together with the metadata, which removes the data-location computation and the communication with the object store
- Hotness-based inline data migration
  - Frequent writes to inline data may increase the load on the metadata server
  - The client automatically collects the write hotness of inline data
  - When to migrate inline data:
    - The file size exceeds the threshold
    - The hotness exceeds the defined threshold
- Figure: time (seconds) against number of files (1000, 2000, 3000), without inline data and with inline data
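
The inlining and migration rules above can be sketched as follows. This is a simplified model, not the system from the slides: the thresholds `SIZE_LIMIT` and `HOT_LIMIT`, the `Meta` record, and the `MetadataServer` class are all hypothetical. For brevity the write counter is kept next to the metadata, whereas the slide has the client collect the hotness statistics.

```python
from dataclasses import dataclass
from typing import Optional

SIZE_LIMIT = 4096  # hypothetical size threshold in bytes
HOT_LIMIT = 100    # hypothetical write-count threshold

@dataclass
class Meta:
    size: int = 0
    writes: int = 0
    inline: Optional[bytes] = b""  # None once the data has been migrated out

class MetadataServer:
    def __init__(self):
        self.meta = {}          # file name -> Meta
        self.object_store = {}  # stands in for the separate data servers

    def write(self, name, data):
        m = self.meta.setdefault(name, Meta())
        m.writes += 1
        m.size = len(data)
        migrated = m.inline is None
        if migrated or m.size > SIZE_LIMIT or m.writes > HOT_LIMIT:
            # Too large or too hot: keep the data outside the metadata.
            self.object_store[name] = data
            m.inline = None
        else:
            m.inline = data

    def open(self, name):
        """Return (metadata, inline data or None) in a single reply."""
        m = self.meta[name]
        return m, m.inline

def read(server, name):
    meta, data = server.open(name)
    if data is None:
        data = server.object_store[name]  # the extra round trip inlining avoids
    return data
```

A small, rarely written file is served entirely by `open`, while a large or frequently written file falls back to the object store so that its traffic no longer lands on the metadata server.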
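
Set-Model-Based Massive File Storage for Hundreds of Billions of Files
- Requirement: provide TB-scale data storage and fast operational support
- Ideas
  1. Propose the data Set model: the Set is the unit of data for deployment, capacity expansion, and management
  2. Separate the file index from the data: the file index and the disk data index together locate file data, and the disk data index is held entirely in memory for efficient IO
  3. Capacity-balancing scheduling across Sets: new capacity is scheduled according to Set state and space utilization so that capacity stays balanced
- Results: solves the storage of hundreds of billions of files for the photo album service; the album holds 500 billion+ photos, grows by 300 million+ per day, and stores 100 PB+
- Figure labels: storage access; master (file index); index server; Idx-master; update file index <file name, chid, fid>; store file data; read file data <chid, fid>; storage Set (fid -> offset); storage servers

SSD-Based High-Performance Distributed Storage
- Requirement: for hundreds of services and trillions of small records with no hot spots, provide low-cost storage with high concurrency and low latency
- Ideas
  1. Single-machine resource multiplexing model: the space and IO resources of one machine are divided into small fixed-size units, with fair IO scheduling among the units
  2. Hybrid indexing gives a local data index with low memory overhead and efficient IO: small records use a hash index to reduce the number of index entries, and large records are indexed individually to improve IO efficiency
  3. Application-level write optimization for SSD: a write cache gives low-latency responses, and dynamic indexing plus write merging turn highly concurrent random writes of small records into infrequent large-block writes
- Results: provides basic data services for SNS with high data reliability and dense reads and writes; 10+ TB stored, small records (<100 bytes), a request volume of 40+w/s (400,000+ per second), long-tail access with no hot spots
- Figure labels: storage access; SSD storage server; 2 GB storage units; shared-memory write cache; hybrid index; unit-fair read/write IO scheduling; Get/Set/Del; metadata management; incremental synchronization
  - Read path: 1. read pending updates; 2. read the index; 3. read SSD data according to the index
  - Write path: 1. read pending updates; 2. pack into fixed-size large data blocks and write them; 3. update the index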
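
The write-merging idea and the read and write paths listed above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the class `WriteCombiner`, its `block_size` default, and the list that stands in for the SSD are all hypothetical, and a block is flushed as soon as the pending bytes reach `block_size` rather than being padded to an exact size.

```python
class WriteCombiner:
    """Buffer small records and write them out as one large block."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.cache = {}    # write cache: key -> pending value
        self.pending = 0   # bytes currently waiting in the cache
        self.blocks = []   # stands in for the SSD: a list of large blocks
        self.index = {}    # key -> (block number, offset, length)

    def set(self, key, value):
        # Acknowledge from the write cache; no SSD write happens yet.
        self.pending += len(value) - len(self.cache.get(key, b""))
        self.cache[key] = value
        if self.pending >= self.block_size:
            self.flush()

    def flush(self):
        """Pack all pending records into one block, write it, then update the index."""
        if not self.cache:
            return
        block = bytearray()
        block_no = len(self.blocks)
        for key, value in self.cache.items():
            self.index[key] = (block_no, len(block), len(value))
            block += value
        self.blocks.append(bytes(block))
        self.cache.clear()
        self.pending = 0

    def get(self, key):
        if key in self.cache:                         # 1. pending updates first
            return self.cache[key]
        block_no, offset, length = self.index[key]    # 2. consult the index
        return self.blocks[block_no][offset:offset + length]  # 3. read the data
```

Many small `set` calls thus become a single large append, and a newer value for a key simply points the index at a later block, leaving the old bytes behind for some form of cleanup that this sketch does not model.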

IceFS: Separating Physical Structure
- Source: Physical Disentanglement in a Container-Based File System, OSDI 2014
- New abstraction: cube
  - Enables the grouping of files and directories inside a physically isolated container
- Benefits
  - Localized reaction to faults
  - Fast recovery
  - Concurrent file-system updates

Using New Media
- DRAM Management
  - LRU block replacement
- Flash Management
  - Segment = a set of blocks / erasing unit
  - Segment list (Free/Clean/Dirty)
  - Segment replacement (FIFO or LRU)
- Disk Management
  - Power management by spin up/down
- Source: FLASHCACHE [HCSS'94]
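
LRU block replacement, listed above for the DRAM layer, can be sketched with `collections.OrderedDict`. This is a generic LRU cache, not the FLASHCACHE implementation; the class name `LRUBlockCache` and the `read_block` callback, which stands in for the slower flash or disk layer, are hypothetical.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Keep the most recently used blocks in memory, evicting the least recently used."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # called on a miss to fetch from the slower layer
        self.blocks = OrderedDict()   # ordered from least to most recently used

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)  # hit: mark as most recently used
            return self.blocks[block_no]
        data = self.read_block(block_no)       # miss: go to the slower layer
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block
        return data
```

With a capacity of two, accessing blocks 1, 2, 1, 3 evicts block 2 rather than block 1, because block 1 was touched more recently.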

Using New Media
- To reduce the power consumption of disk
  - NVCache
    - To reduce disk power consumption by combining an adaptive disk spin-down algorithm
    - To extend spin-down periods by undertaking i