Building Cloud Computing with Hadoop: A Basic Tutorial

3 October 2013

Hadoop Tutorial: Part 1 - What is Hadoop? (an Overview)

Hadoop is an open-source software framework that supports data-intensive distributed applications and is licensed under the Apache License 2.0.

At least, this is what you will find as the first line of the definition of Hadoop on Wikipedia. So what are data-intensive distributed applications?

Well, "data-intensive" is nothing but Big Data (data that has outgrown in size), and distributed applications are applications that work over a network by communicating and coordinating with each other by passing messages (say, using RPC inter-process communication or a message queue).

Hence Hadoop works in a distributed environment and is built to store, handle, and process large data sets (petabytes, exabytes, and more). Now, just because I am saying that Hadoop stores petabytes of data, this doesn't mean that Hadoop is a database. Again, remember that it is a framework that handles large amounts of data for processing. You will get to know the difference between Hadoop and databases (or NoSQL databases, which is what we call Big Data's databases) as you go down the line in the coming tutorials.

Hadoop was derived from the research papers published by Google on the Google File System (GFS) and Google's MapReduce. So there are two integral parts of Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

Hadoop Distributed File System (HDFS)

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Well, let's get into the details of the statement above:

Very large files:

When we say very large files, we mean files whose size is in the range of gigabytes, terabytes, petabytes, or maybe more.

Streaming data access:

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware:

Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors), for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

Now, here we are talking about a file system, the Hadoop Distributed File System, and we all know a few other file systems, such as the Linux and Windows file systems. So the next question is...

What is the difference between a normal file system and the Hadoop Distributed File System?

The two major differences between HDFS and other file systems are:

Block size:

Every disk has a block size, which is the minimum amount of data that can be written to or read from the disk. A file system also consists of blocks, which are built out of these disk blocks. Normally, disk blocks are 512 bytes and file system blocks are a few kilobytes.

HDFS also has the concept of blocks, but here one block is 64 MB by default, and this can be configured to a larger size such as 128 MB, 256 MB, 512 MB, or even more, into the gigabytes. It all depends on the requirements and use cases.
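
To get a feel for the difference in scale, here is a small sketch that counts how many blocks one file needs at different block sizes. The 1 TB file size and the 4 KB "normal file system" block size are illustrative assumptions, not figures from any particular cluster.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed for file_size bytes (the last block may be partial)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division, so no floating-point rounding is involved.
    return (file_size + block_size - 1) // block_size


file_size = 1 * TB  # assumed file size, for illustration only

for label, block_size in [("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    print(f"{label:>6} blocks: {block_count(file_size, block_size):,}")
```

With small blocks the same file is described by hundreds of millions of block entries, while 64 MB blocks bring that down to a few thousand.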

So why is the block size so large for HDFS? Keep on reading and you will get it in the next few tutorials :)

Metadata storage:

In a normal file system, metadata is stored hierarchically. Let's say there is a folder ABC, inside that folder there is another folder DEF, and inside that there is a file hello.txt. The information about hello.txt (i.e. the metadata of hello.txt) is kept with DEF, and the metadata of DEF is in turn kept with ABC. This forms a hierarchy, and the hierarchy is maintained up to the root of the file system. In HDFS, however, the metadata is not spread across the file system as a hierarchy like this. All the metadata resides on a single machine known as the Namenode (or master node) of the cluster. This node holds all the information about the files and folders and lots of other information too, which we will learn about in the next few tutorials :)
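
As a rough mental model only, the Namenode can be pictured as one table that answers every metadata question for the whole cluster. The sketch below is a toy: the paths reuse the ABC/DEF/hello.txt example, while the field names and block IDs are hypothetical and do not reflect how the real Namenode stores its data.

```python
# Toy model: all metadata for the cluster lives in one table on one machine.
# Field names and block IDs are made up for illustration.
namenode_metadata = {
    "/ABC": {"type": "dir"},
    "/ABC/DEF": {"type": "dir"},
    "/ABC/DEF/hello.txt": {"type": "file", "blocks": ["blk_1", "blk_2"]},
}


def lookup(path):
    """A single lookup on the Namenode returns the metadata for any path."""
    return namenode_metadata[path]


def list_dir(path):
    """Direct children of a directory, derived from the same single table."""
    prefix = path.rstrip("/") + "/"
    return sorted(
        p for p in namenode_metadata
        if p.startswith(prefix) and "/" not in p[len(prefix):]
    )


print(lookup("/ABC/DEF/hello.txt"))
print(list_dir("/ABC"))
```

The point of the sketch is only that directory listings and file lookups are both served from the one place, rather than from metadata stored alongside each folder.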

Well, this was just an overview of Hadoop and the Hadoop Distributed File System. In the next part I will go into the depth of HDFS and thereafter MapReduce, and will continue from here...

Let me know in the comment section if you have any doubts about anything, and I will be really glad to answer them :)


Comments

Romain Rigaux said...

Nice summary!

October 03, 2013

pragya khare said...

I know I'm a beginner and this question might be a silly one, but can you please explain to me how PARALLELISM is achieved via map-reduce at the processor level? If I have a dual-core processor, is it that only 2 jobs will run at a time in parallel?

October 05, 2013

Anonymous said...

Hi, I am from a mainframe background with little knowledge of core Java... Do you think Java is needed for learning Hadoop in addition to Hive/Pig? I even want to learn Java for MapReduce, but couldn't find what all will be used in real time, and the definitive guide books seem tough for learning MapReduce with Java. Is there any option where I can learn it step by step?

Sorry for the long comment, but it would be helpful if you could guide me.

October 05, 2013

Deepak Kumar said...

@Pragya Khare...

First thing, always remember the popular saying: no questions are foolish :) And by the way, it is a very good question.

Actually there are two things: one is what the best practice is, and the other is what happens by default.

Well, by default the number of mappers and reducers is set to 2 for any TaskTracker, hence one sees a maximum of 2 maps and 2 reduces at a given instant on a TaskTracker (which is configurable). This doesn't depend only on the processor but on lots of other factors as well, like RAM, CPU, power, disk, and others.

/blog/best-practices-for-selecting-apache-hadoop-hardware/

As for the other factor, i.e. best practices, it depends on your use case. You can go through the 3rd point of the link below to understand it more conceptually.

/blog/2009/12/7-tips-for-improving-mapreduce-performance/

Well, I will explain all of this when I reach the advanced MapReduce tutorials. Till then, keep reading!! :)

October 05, 2013

Deepak Kumar said...

@Anonymous

As Hadoop is written in Java, most of its APIs are written in core Java. To learn about the Hadoop architecture you don't need Java, but to go to the API level and start programming in MapReduce you need to know core Java.

As for the Java requirement you asked about, you just need simple core Java concepts and programming for Hadoop and MapReduce. Hive and Pig are SQL-like data flow languages that are really easy to learn. And since you are from a programming background, it won't be very difficult to learn Java :) You can also go through the link below for further details :)

/2013/09/What-are-the-Pre-requsites-for-getting-started-with-Big-Data-Technologies.html

October 05, 2013


About the author

Deepak Kumar

Big Data/Hadoop Developer, Software Engineer, Thinker, Learner, Geek, Blogger, Coder

I love to play around with data. Big Data!


© 2013 All Rights Reserved, BigDataPlanet. All articles on this website by Deepak Kumar are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.


6 October 2013

Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)

In the last tutorial, What is Hadoop?, I gave you a brief idea about Hadoop. The two integral parts of Hadoop are Hadoop HDFS and Hadoop MapReduce.

Let's go deeper inside HDFS.

Hadoop Distributed File System (HDFS) Concepts:

First, take a look at the following two terms that will be used while describing HDFS.

Cluster: A Hadoop cluster is made up of many machines in a network. Each machine is termed a node, and these nodes talk to each other over the network.

Block size: This is the size of one block in a file system, i.e. the minimum unit in which data can be kept contiguously. The default size of a single block in HDFS is 64 MB.

In HDFS, data is kept by splitting it into chunks or parts. Let's say you have a text file of 200 MB and you want to keep this file in a Hadoop cluster. What happens is that the file is split into a number of chunks, where each chunk is equal to the block size that is set for the HDFS cluster (which is 64 MB by default). Hence a 200 MB file gets split into 4 parts, 3 parts of 64 MB and 1 part of 8 MB, and each part can be kept on a different machine. Which split is kept on which machine is decided by the Namenode, which we will discuss in detail below.
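
The arithmetic of the split can be sketched in a few lines. This is only an illustration of the sizes involved; the function name is made up, and it does not model how HDFS actually writes the blocks.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of file_size_mb would occupy."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left over.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(200))  # [64, 64, 64, 8]
```

Note that the last part is only 8 MB: a file that is not an exact multiple of the block size ends with a smaller final chunk.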

Now, in a Hadoop Distributed File System (HDFS) cluster there are two kinds of nodes: a master node and many worker nodes. These are known as the Namenode (master node) and the Datanodes (worker nodes).

Namenode:

The namenode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. So it contains the information about all the files and directories and their hierarchy in the cluster, in the form of a namespace image and edit logs. Along with the file system information, it also knows the datanodes on which all the blocks of a file are kept.

A client accesses the file system on behalf of the user by communicating with the namenode and datanodes. The client presents a file system interface similar to the Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanodes to function.

Datanode:

These are the workers that do the real work, and by real work we mean that the storage of the actual data is done by the datanodes. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
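
The periodic reporting can be pictured with a toy model like the one below. The class, method names, datanode names, and block IDs are all hypothetical; this is a sketch of the idea that the namenode learns block locations from what the datanodes report, not the real HDFS protocol.

```python
from collections import defaultdict


class ToyNamenode:
    """Keeps track of which datanodes hold which blocks, based on their reports."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> set of datanode ids

    def receive_block_report(self, datanode_id, block_ids):
        # Drop whatever this datanode reported earlier, then record its current list.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def datanodes_for(self, block_id):
        return sorted(self.block_locations.get(block_id, set()))


namenode = ToyNamenode()
namenode.receive_block_report("datanode-1", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode-2", ["blk_3", "blk_4"])
print(namenode.datanodes_for("blk_3"))

# A later report from datanode-2 no longer lists blk_3.
namenode.receive_block_report("datanode-2", ["blk_4"])
print(namenode.datanodes_for("blk_3"))
```

Because each report replaces the previous one for that datanode, the namenode's view follows whatever the datanodes currently say they are storing.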

One important thing to note here: in one cluster there will be only one Namenode, and there can be N datanodes.

The Namenode contains the metadata of all the files and directories and also knows on which datanode each split of a file is stored. So let's say the Namenode goes down; what do you think will happen?

Yes, if the Namenode is down we cannot access any of the files and directories in the cluster.

We will not even be able to connect to any of the datanodes to get any of the files.

Now think about it: we have stored our files by splitting them into different chunks, and we have kept those chunks on different datanodes. It is the Namenode that keeps track of all the file metadata, so only the Namenode knows how to reconstruct a file from all its splits. This is the reason that if the Namenode is down in a Hadoop cluster, everything is down.
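
A small sketch makes the dependency visible. Everything here is hypothetical (the datanode names, block IDs, and the `read_file` helper), and a missing namenode is modelled simply by passing `None`; it only illustrates that the datanodes hold anonymous blocks while the file-to-block mapping lives on the namenode.

```python
# Each datanode only holds opaque blocks; it does not know which file they belong to.
datanodes = {
    "datanode-1": {"blk_1": b"Hello, "},
    "datanode-2": {"blk_2": b"HDFS "},
    "datanode-3": {"blk_3": b"world!"},
}

# Only the namenode knows which blocks form a file, in which order, and where they live.
namenode = {
    "/ABC/DEF/hello.txt": [
        ("blk_1", "datanode-1"),
        ("blk_2", "datanode-2"),
        ("blk_3", "datanode-3"),
    ],
}


def read_file(path, namenode_metadata):
    """Reassemble a file by asking the namenode for its ordered block list."""
    if namenode_metadata is None:
        raise ConnectionError("Namenode is down: cannot map the file to its blocks")
    block_list = namenode_metadata[path]
    return b"".join(datanodes[node][block_id] for block_id, node in block_list)


print(read_file("/ABC/DEF/hello.txt", namenode))

try:
    read_file("/ABC/DEF/hello.txt", None)
except ConnectionError as err:
    print(err)
```

The blocks are all still sitting on the datanodes in the second call, but without the mapping there is no way to tell which of them make up hello.txt or in what order.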

This is also the reason why the Namenode is known as a single point of failure in Hadoop.

Since the Namenode is so important, we have to make it resilient to failure, and for that Hadoop provides us with two mechanisms.

The first way is to back up the files that make up the persistent state of the file system metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple file systems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.

The second way is to run a Secondary Namenode. Despite what the name suggests, it does not act like a Namenode. So if it doesn't act like a namenode, how does it protect against failure?

Well, the Secondary Namenode also holds a namespace image and edit logs, like the namenode. At regular intervals (one hour by default) it copies the namespace image from the namenode, merges this namespace image with the edit log, and copies the result back to the namenode, so that the namenode has a fresh copy of the namespace image. Now suppose that at some instant the namenode goes down and its metadata becomes corrupt. We can then start some other machine as the namenode using the namespace image and edit log held by the secondary namenode, and hence avoid a total failure. Because the secondary's copy lags behind the namenode, the most recent changes may still be lost.
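
The merge step can be sketched as replaying the edit log on top of the last namespace image. The record format below (tuples of operation, path, and block list) is a made-up simplification for illustration and has nothing to do with the real on-disk formats.

```python
def apply_edit(image, edit):
    """Apply one logged change to the namespace image (mutates image)."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = edit[2]      # edit[2] is the list of block ids for the new file
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(namespace_image, edit_log):
    """Merge the edit log into a copy of the image; return the new image and an empty log."""
    merged = dict(namespace_image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


old_image = {"/ABC/DEF/hello.txt": ["blk_1"]}
edits = [
    ("create", "/ABC/new.txt", ["blk_2"]),
    ("delete", "/ABC/DEF/hello.txt"),
]

new_image, new_log = checkpoint(old_image, edits)
print(new_image)  # {'/ABC/new.txt': ['blk_2']}
print(new_log)    # []
```

After the merge the image already contains every logged change, so the log can start again from empty, which is what keeps the edit log from growing without bound.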

The Secondary Namenode takes almost the same amount of memory and CPU as the Namenode, so it is also kept on a separate machine, like the namenode. Hence we see that in a single cluster we have one Namenode, one Secondary Namenode, and many Datanodes, and HDFS consists of these three elements.

This was again an overview of the Hadoop Distributed File System (HDFS). In the next part of the tutorial we will look at the working of the Namenode and Datanode in more detail, and at how reads and writes happen in HDFS.

Let me know in the comment section if you have any doubts about anything, and I will be really glad to answer your questions :)


Comments

vishwash said...

Very informative...

October 07, 2013

Tushar Karande said...

Thanks for such informative tutorials :)

Please keep posting. Waiting for more... :)

October 08, 2013

Anonymous said...

Nice information, but I have one doubt: what is the advantage of keeping the file as chunks on different datanodes? What kind of benefit are we getting here?

October 08, 2013

Deepak Kumar said...

@Anonymous: Well, there are lots of reasons... I will explain them in great detail in the next few articles.

But for now, let us understand this: since we have split the file into two, we can now use the power of two processors (parallel processing) on two different nodes to do our analysis (like search, calculation, prediction, and lots more). Again, let's say my file size is in the petabytes. You won't find one hard disk that big, and even if there were one, how do you think we are going to read and write on that hard disk? The latency to read and write would be really high, and it would take lots of time. There are more reasons for the same, and I will explain them in more technical ways in the coming tutorials. Till then, keep reading :)

October 08, 2013
