




版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
Hive&Performance
TorontoHadoopUserGroupJuly232013
Presenter:
AdamMuise–Hortonworksamuise@
Page1
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Agenda
?Hive–WhatIsItGoodFor?
?Hive’sArchitectureandSQLCompatibility
?TurningHivePerformanceto11
?GetDataInandOutofHive
?HiveSecurity
?ProjectStinger–MakingHive100xFaster
?ConnectingtoHiveFromPopularTools
Page2
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Hive–SQLAnalyticsForAnyDataSize
StoreandQueryallDatainHive
UseExis6ngSQLTools
andExis6ngSQLProcesses
Weblog
Opera1onal/MPP
MobileSensor
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page3
Hive’sFocus
?ScalableSQLprocessingoverdatainHadoop
?Scalesto100PB+
?StructuredandUnstructureddata
Page4
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
ComparingHivewithRDBMS
Hive
RDBMS
SQLInterface.
SQLInterface.
Focusonanaly1cs.
Mayfocusononlineoranaly1cs.
Notransac1ons.
Transac1onsusuallysupported.
Par11onadds,norandomINSERTs.
In-Placeupdatesnotna1velysupported(butarepossible).
RandomINSERTandUPDATEsupported.
Distributedprocessingviamap/reduce.
Distributedprocessingvariesbyvendor(ifavailable).
Scalestohundredsofnodes.
Seldomscalebeyond20nodes.
Builtforcommodityhardware.
OQenbuiltonproprietaryhardware(especiallywhenscalingout).
Lowcostperpetabyte.
What’sapetabyte?
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page5
Agenda
?Hive–WhatIsItGoodFor?
?Hive’sArchitectureandSQLCompatibility
?TurningHivePerformanceto11
?GetDataInandOutofHive
?HiveSecurity
?ProjectStinger–MakingHive100xFaster
?ConnectingtoHiveFromPopularTools
Page6
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Hive:TheSQLInterfacetoHadoop
Hive
Compiler
2
Hive
Hive
OSQL
CLI
HiveServer2
Optimizer
Executor
1
JDBC/ODBC
WebUI
3
4
1
NameNode
JobTracker
Task
Tracker
2
3
Hadoop
UserissuesSQLquery
Hiveparsesandplansquery
DataNode
Map/Reduce
4
QueryconvertedtoMap/ReduceMap/ReducerunbyHadoop
Page7
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Hive:ReliableSQLProcessingatScale
HDFS
HDFS
Node1
Node1
Node1
SQL
Node2
Node2
Node2
Hive
Node3
Node3
Node3
Map/Reduce
Node4
TaskTracker
Node4
Node4
>
HDFS
Time1:
Job=50%Complete
ProgressStoredinHDFS
Time2:
Node3Fails,
Job=85%Complete
Time3:
JobmovestoNode4Job=100%Complete
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page8
SQLCoverage:SQL92withExtensions
Legend
Available
NewinHive0.11Roadmap
SQLDatatypes
SQLSeman6cs
INT
SELECT,LOAD,INSERTfromquery
TINYINT/SMALLINT/BIGINT
ExpressionsinWHEREandHAVING
BOOLEAN
GROUPBY,ORDERBY,SORTBY
FLOAT
CLUSTERBY,DISTRIBUTEBY
DOUBLE
Sub-queriesinFROMclause
STRING
GROUPBY,ORDERBY
BINARY
ROLLUPandCUBE
TIMESTAMP
UNION
ARRAY,MAP,STRUCT,UNION
LEFT,RIGHTandFULLINNER/OUTERJOIN
DECIMAL
CROSSJOIN,LEFTSEMIJOIN
CHAR
Windowingfunc1ons(OVER,RANK,etc.)
VARCHAR
Sub-queriesforIN/NOTIN,HAVING
DATE
EXISTS/NOTEXISTS
INTERSECT,EXCEPT
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page9
Agenda
?Hive–WhatIsItGoodFor?
?Hive’sArchitectureandSQLCompatibility
?TurningHivePerformanceto11
?GetDataInandOutofHive
?HiveSecurity
?ProjectStinger–MakingHive100xFaster
?ConnectingtoHiveFromPopularTools
Page10
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
DataAbstractionsinHive
Op1onalPerTable
Database
Par11on
Par11on
Table
Par11on
UnskewedKeys
Table
SkewedKeys
Bucket
Bucket
Bucket
Par11ons,bucketsandskewsfacilitate
faster,moredirectdataaccess.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page11
“Iheardyoushouldavoidjoins…”
?“Joinsareevil”–CalHenderson
-Joinsshouldbeavoidedinonlinesystems.
?Joinsareunavoidableinanalytics.
-Makingjoinsfastisthekeydesignpoint.
Page12
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
QuickRefresheronJoins
customerorder
?rstlastidcidpricequan6ty
Nick
11911
4150
10.50
3
Toner
VeronaHollen119151191440.5010
Jessie
Simonds
11912
11914
12.25
27
Kasi
Lamers
11913
3491
5.99
5
Rodger
Clayton
11914
2934
39.99
22
SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;
Joinsmatchvaluesfromonetableagainstvaluesinanothertable.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page13
HiveJoinStrategies
Type
Approach
Pros
Cons
Shu?eJoin
Joinkeysareshu?edusingmap/reduceandjoins
performedjoinside.
Worksregardlessofdatasizeorlayout.
Mostresource-
intensiveand
slowestjointype.
BroadcastJoin
Smalltablesareloadedinto
memoryinallnodes,
mapperscansthroughthelargetableandjoins.
Veryfast,singlescan
throughlargesttable.
Allbutonetablemustbesmallenoughto?tinRAM.
Sort-
Merge-
Bucket
Join
Mapperstakeadvantageofco-loca1onofkeystodo
e?cientjoins.
Veryfastfortablesofanysize.
Datamustbesortedandbucketedaheadof1me.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page14
ShuffleJoinsinMapReduce
customerorder
?rstlastidcidpricequan6ty
Nick
11911
4150
10.50
3
Toner
VeronaHollen119151191440.5010
Jessie
Simonds
11912
11914
12.25
27
Kasi
Lamers
11913
3491
5.99
5
Rodger
Clayton
11914
2934
39.99
22
SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;
{id:11911,{?rst:Nick,last:Toner}}{id:11914,{?rst:Rodger,last:Clayton}}
…
{cid:4150,{price:10.50,quan1ty:3}}{cid:11914,{price:12.25,quan1ty:27}}
…
Iden1calkeysshu?edtothesamereducer.Joindonereduce-side.
Expensivefromanetworku1liza1onstandpoint.
{id:11914,{?rst:Rodger,last:Clayton}}{cid:11914,{price:12.25,quan1ty:27}}
{id:11911,{?rst:Nick,last:Toner}}
{cid:4150,{price:10.50,quan1ty:3}}
M
M
R
R
…
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page15
BroadcastJoin
?StarschemasusedimensiontablessmallenoughtofitinRAM.
?Smalltablesheldinmemorybyallnodes.
?Singlepassthroughthelargetable.
?Usedforstar-schematypejoinscommoninDW.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page16
Whenbotharetoolargeformemory:
customer
order
?rst
last
id
cid
price
quan6ty
Nick
Toner
11911
4150
10.50
3
Jessie
Simonds
11912
11914
12.25
27
Kasi
Lamers
11913
11914
40.50
10
Rodger
Clayton
11914
12337
39.99
22
Verona
Hollen
11915
15912
40.50
10
SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;
Clusterandsortbythemostcommonjoinkey.
CREATETABLEorder(cidint,pricefloat,quantityint)
CLUSTEREDBY(cid)SORTEDBY(cid)INTO32BUCKETS;
CREATETABLEcustomer(idint,firststring,laststring)
CLUSTEREDBY(id)SORTEDBY(id)INTO32BUCKETS;
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page17
Hive’sClusteringandSorting
customer
order
?rst
last
id
cid
price
quan6ty
Nick
Toner
11911
4150
10.50
3
Jessie
Simonds
11912
11914
12.25
27
Kasi
Lamers
11913
11914
40.50
10
Rodger
Clayton
11914
12337
39.99
22
Verona
Hollen
11915
15912
40.50
10
SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;
Observa1on1:
Sor1ngbythejoinkeymakesjoinseasy.
Allpossiblematchesresideinthesameareaondisk.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page18
Hive’sClusteringandSorting
customer
order
?rst
last
id
cid
price
quan6ty
Nick
Toner
11911
4150
10.50
3
Jessie
Simonds
11912
11914
12.25
27
Kasi
Lamers
11913
11914
40.50
10
Rodger
Clayton
11914
12337
39.99
22
Verona
Hollen
11915
15912
40.50
10
SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;
Observa1on2:
Hashbucke1ngajoinkeyensuresallmatchingvaluesresideonthesamenode.
Equi-joinscanthenrunwithnoshu?e.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page19
ControllingDataLocalitywithHive
?Bucketing:
-Hashpartitionvaluesintoaconfigurablenumberofbuckets.
-Usuallycoupledwithsorting.
?Skews:
-Splitvaluesoutintoseparatefiles.
-Usedwhencertainvaluesarefrequentlyseen.
?ReplicationFactor:
-Increasereplicationfactortoacceleratereads.
-ControlledattheHDFSlayer.
?Sorting:
-Sortthevalueswithingivencolumns.
-GreatlyacceleratesquerywhenusedwithORCFilefilterpushdown.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page20
GuidelinesforArchitectingHiveData
TableSize
DataPro?le
QueryPaNern
Recommenda6on
Small
Hotdata
Any
Increasereplica1onfactor
Any
Any
Veryprecise?lters
Sortoncolumnmost
frequentlyusedinprecisequeries
Large
Any
Joinedtoanotherlargetable
Sortandbucketboth
tablesalongthejoinkey
Large
Onevalue>25%ofcountwithinhighcardinalitycolumn
Any
Splitthefrequentvalueintoaseparateskew
Large
Any
Queriestendtohavea
naturalboundarysuchasdate
Par11onthedataalongthenaturalboundary
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page21
HivePersistenceFormats
?Built-inFormats:
-ORCFile
-RCFile
-Avro
-DelimitedText
-RegularExpression
-S3Logfile
-TypedBytes
?3rd-PartyAddons:
-JSON-XML
Page22
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Hiveallowsmixedformat.
?UseCase:
-Ingestdatainawrite-optimizedformatlikeJSONordelimited.
-Everynight,runabatchjobtoconverttoread-optimizedORCFile.
Page23
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
ORCFile–EfficientColumnarLayout
Columnarformat
arrangescolumnsadjacentwithinthefileforcompressionandfastaccess.
LargeblocksizewellsuitedforHDFS.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
ORCFileAdvantages
?HighCompression
-Manytricksusedout-of-the-boxtoensurehighcompressionrates.
-RLE,dictionaryencoding,etc.
?HighPerformance
-InlineindexesrecordvaluerangeswithinblocksofORCFiledata.
-Filterpushdownallowsefficientscanningduringprecisequeries.
?FlexibleDataModel
-AllHivetypesincludingmaps,structsandunions.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page25
HighCompressionwithORCFile
oDatasetfromTPC-DS
Page26
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
SomeORCFileSamples
sale
id
6mestamp
productsk
storesk
amount
state
10000
2013-06-13T09:03:05
16775
670
$70.50
CA
10001
2013-06-13T09:03:05
10739
359
$52.99
IL
10002
2013-06-13T09:03:06
4671
606
$67.12
MA
10003
2013-06-13T09:03:08
7224
174
$96.85
CA
10004
2013-06-13T09:03:12
9354
123
$67.76
CA
10005
2013-06-13T09:03:18
1192
497
$25.73
IL
CREATETABLEsale(
idint,timestamptimestamp,productskint,storeskint,
amountdecimal,statestring)STOREDASorc;
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page27
ORCFileOptionsandDefaults
Key
Default
Notes
press
ZLIB
Highlevelcompression(oneofNONE,ZLIB,SNAPPY)
press.size
262,144
(=256KiB)
Numberofbytesineachcompressionchunk
orc.stripe.size
268,435,456(=256MiB)
Numberofbytesineachstripe
orc.row.index.stride
10,000
Numberofrowsbetweenindexentries(mustbe>=1,000)
orc.create.index
true
Whethertocreaterowindexes
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page28
NoCompression:FasterbutLarger
sale
id
6mestamp
productsk
storesk
amount
state
10000
2013-06-13T09:03:05
16775
670
$70.50
CA
10001
2013-06-13T09:03:05
10739
359
$52.99
IL
10002
2013-06-13T09:03:06
4671
606
$67.12
MA
10003
2013-06-13T09:03:08
7224
174
$96.85
CA
10004
2013-06-13T09:03:12
9354
123
$67.76
CA
10005
2013-06-13T09:03:18
1192
497
$25.73
IL
CREATETABLEsale(
idint,timestamptimestamp,productskint,storeskint,amountdecimal,statestring
)STOREDASorctblproperties("press"="NONE");
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page29
ColumnSortingtoFacilitateSkipping
sale
id
6mestamp
productsk
storesk
amount
state
10005
2013-06-13T09:03:18
1192
497
$25.73
IL
10002
2013-06-13T09:03:06
4671
606
$67.12
MA
10003
2013-06-13T09:03:08
7224
174
$96.85
CA
10004
2013-06-13T09:03:12
9354
123
$67.76
CA
10001
2013-06-13T09:03:05
10739
359
$52.99
IL
10000
2013-06-13T09:03:05
16775
670
$70.50
CA
CREATETABLEsale(
idint,timestamptimestamp,productskint,storeskint,
amountdecimal,statestring)STOREDASorc;
INSERTINTOsaleASSELECT*FROMstagingSORTBYproductsk;
ORCFileskippingspeedsquerieslike
WHEREproductsk=X,productskIN(Y,Z);etc.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page30
NotYourTraditionalDatabase
?TraditionalsolutiontoallRDBMSproblems:
-Putanindexonit!
YourOracleDBA
?DoingthisinHadoop=#fail
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page31
GoingFastinHadoop
?Hadoop:
-Reallygoodatcoordinatedsequentialscans.
-NorandomI/O.Traditionalindexprettymuchuseless.
?KeystospeedinHadoop:
-Sortingandskippingtaketheplaceofindexing.
-Minimizingdatashuffletheotherkeyconsideration.
?Skippingdata:
-Dividedataamongdifferentfileswhichcanbeprunedout.
-Partitions,bucketsandskews.
-Skiprecordsduringscansusingsmallembeddedindexes.
-AutomaticwhenyouuseORCFileformat.
-Sortdataaheadoftime.
-Simplifiesjoinsandskippingbecomesmoreeffective.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page32
DataLayoutConsiderationsforFastHive
SkipReads
MinimizeShuffle
ReduceLatency
PartitionTablesand/orUseSkew
SortandBucketonCommonJoinKeys
IncreaseReplicationFactorForHotData
SortSecondary
ColumnswhenUsing
ORCFile
UseBroadcastJoins
whenJoiningSmall
Tables
EnableShort-CircuitRead
TakeAdvantageofTez+TezService(Future)
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page33
PartitioningandVirtualColumns
?Partitioningmakesqueriesgofast.
?Youwillalmostalwaysusesomesortofpartitioning.
?Whenpartitioningyouwilluse1ormorevirtualcolumns.
#Noticehowxdateandstatearenot“real”columnnames.CREATETABLEsale(
idint,amountdecimal,...
)partitionedby(xdatestring,statestring);
?VirtualcolumnscausedirectoriestobecreatedinHDFS.
-Filesforthatpartitionarestoredwithinthatsubdirectory.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page34
LoadingDatawithVirtualColumns
?Bydefaultatleastonevirtualcolumnmustbehard-coded.
INSERTINTOsale(xdate=‘2013-03-01’,state=‘CA’)SELECT*FROMstaging_table
WHERExdate=‘2013-03-01’ANDstate=‘CA’;
?Youcanloadallpartitionsinoneshot:
-sethive.exec.dynamic.partition.mode=nonstrict;
-Warning:Youcaneasilyoverwhelmyourclusterthisway.
sethive.exec.dynamic.partition.mode=nonstrict;
INSERTINTOsale(xdate,state)SELECT*FROMstaging_table;
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page35
YouMayNeedtoRe-OrderColumns
?Virtualcolumnsmustbelastwithintheinserteddataset.
?YoucanusetheSELECTstatementtore-order.
INSERTINTOsale(xdate,state=‘CA’)SELECT
id,amount,other_stuff,
xdate,state
FROMstaging_tableWHEREstate=‘CA’;
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page36
TheMostEssentialHiveQueryTunings
Page37
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
TuneSplitSize–Always
?mapred.max.split.sizeandmapred.min.split.size
?Hiveprocessesdatainchunkssubjecttothesebounds.
?mintoolarge->Toofewmappers.
?maxtoosmall->Toomanymappers.
?Tunevariablesun6lmappersoccupy:
-Allmapslotsifyouownthecluster.
-Reasonablenumberofmapslotsifyoudon’t.
?Example:
-setmapred.max.split.size=100000000;
-setmapred.min.split.size=1000000;
?Manualtoday,automa6cinfutureversionofHive.
?Youwillneedtosettheseformostqueries.
Page38
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Tuneio.sort.mb–Sometimes
?HiveandMap/Reducemaintainsomeseparatebu?ers.
?IfHivemapsneedlotsoflocalmemoryyoumayneedtoshrinkmap/reducebu?ers.
?Ifyourmapsspill,tryitout.
?Example:
-setio.sort.mb=100;
Page39
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
OtherSettingsYouNeed
?Allthe6me:
-sethive.op1mize.mapjoin.mapreduce=true;
-sethive.op1mize.bucketmapjoin=true;
-sethive.op1mize.bucketmapjoin.sortedmerge=true;
-sethive.auto.convert.join=true;
-sethive.auto.convert.sortmerge.join=true;
-sethive.auto.convert.sortmerge.join.nocondi1onaltask=true;
?Whenbucke6ngdata:
-sethive.enforce.bucke1ng=true;
-sethive.enforce.sor1ng=true;
?TheseandmorearesetbydefaultinHDP1.3.
-Checkfortheminhive-site.xml
-Ifnotpresent,settheminyourqueryscript
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page40
CheckYourSettings
?InHiveshell:
-Seeallsettingswith“set;”
-Seeaparticularsettingbyaddingit.
-Changeasetting.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page41
ExampleWorkflow:CreateaStagingtable
CREATEEXTERNALTABLEpos_staging(
txnidSTRING,
txntimeSTRING,
givennameSTRING,
lastnameSTRING,
postalcodeSTRING,
storeidSTRING,
ind1STRING,
productidSTRING,
purchaseamountFLOAT,
creditcardSTRING
)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'LOCATION'/user/hdfs/staging_data/pos_staging';
Therawdataistheresultofinitialloadingortheoutputofa
mapreduceorpigjob.Wecreateanexternaltableovertheresultsofthatjobasweonlyintendtouseittoloadanoptimizedtable.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page42
ExampleWorkflow:Chooseapartitionscheme
hive>selectdistinctconcat(year(txntime),month(txntime))aspart_dt
frompos_staging;
…
OK
20121
201210
201211
201212
20122
20123
20124
20125
20126
20127
20128
20129
Timetaken:21.823seconds,Fetched:12row(s)
Executeaquerytodetermineifthepartitionchoicereturnsareasonableresult.Wewillusethisprojectiontocreatepartitionsforourdataset.YouwanttokeepyourpartitionslargeenoughtobeusefulinpartitionpruningandefficientforHDFSstorage.Hivehasconfigurableboundstoensureyoudonotexceedpernodeandtotalpartitioncounts(defaultsshown):
hive.exec.max.dynamic.partitions=1000
hive.exec.max.dynamic.partitions.pernode=100
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page43
ExampleWorkflow:Defineoptimizedtable
CREATETABLEfact_pos
(
txnidSTRING,
txntimeSTRING,
givennameSTRING,
lastnameSTRING,
postalcodeSTRING,
storeidSTRING,
ind1STRING,
productidSTRING,
purchaseamountFLOAT,
creditcardSTRING
)PARTITIONEDBY(part_dtSTRING)
CLUSTEREDBY(txnid)
SORTEDBY(txnid)
INTO24BUCKETS
STOREDASORCtblproperties("press"="SNAPPY");
Thepart_dtfieldisdefinedinthepartitionbyclauseandcannotbethesamenameasanyotherfields.Inthiscase,wewillbeperformingamodificationoftxntimetogenerateapartitionkey.Theclusterandsortedclausescontaintheonlykeyweintendtojointhetableon.WehavestoredasORCFilewithSnappycompression.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page44
ExampleWorkflow:LoadDataIntoOptimizedTable
sethive.enforce.sorting=true;
sethive.enforce.bucketing=true;
sethive.exec.dynamic.partition=true;
sethive.exec.dynamic.partition.mode=nonstrict;
setmapreduce.reduce.input.limit=-1;
FROMpos_staging
INSERTOVERWRITETABLEfact_pos
PARTITION(part_dt)
SELECT
txnid,
txntime,
givenname,
lastname,
postalcode,
storeid,
ind1,
productid,
purchaseamount,
creditcard,
concat(year(txntime),month(txntime))aspart_dt
SORTBYproductid;
Weusethiscommendtoloaddatafromourstagingtableintoour
optimizedORCFileformat.Notethatweareusingdynamicpartitioningwiththeprojectionofthetxntimefield.ThisresultsinaMapReducejobthatwillcopythestagingdataintoORCFileformatHivemanagedtable.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page45
ExampleWorkflow:Increasereplicationfactor
hadoopfs-setrep-R–w5/apps/hive/warehouse/fact_pos
Increasethereplicationfactorforthehighperformancetable.Thisincreasesthechancefordatalocality.Inthiscase,theincreaseinreplicationfactorisnotforadditionalresiliency.
Thisisatrade-offofstorageforperformance.
Infact,toconservespace,youmaychoosetoreducethe
replicationfactorforolderdatasetsorevendeletethem
altogether.Withtherawdatainplaceanduntouched,youcanalwaysrecreatetheORCFilehighperformancetables.MostusersplacethestepsinthisexampleworkflowintoanOoziejobtoautomatethework.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page46
ExampleWorkflow:EnablingShortCircuitRead
Inhdfs-site.xml(oryourcustomAmbarisettingsforHDFS,
restartserviceafter):
dfs.block.local-path-access.user=hdfs
dfs.client.read.shortcircuit=true
dfs.client.read.shortcircuit.skip.checksum=false
ShortCircuitreadsallowthemapperstobypasstheoverheadofopeningaporttothedatanodeifthedatais
local.Thepermissionsforthelocal
blockfilesneedtopermithdfstoreadthem(shouldbebydefaultalready)SeeHDFS-2246formoredetails.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page47
ExampleWorkflow:Executeyourquery
sethive.mapred.reduce.tasks.speculative.execution=false;
setio.sort.mb=300;
setmapreduce.reduce.input.limit=-1;
selectproductid,ROUND(SUM(purchaseamount),2)astotal
fromfact_pos
wherepart_dtbetween‘201210’and‘201212’
groupbyproductid
orderbytotaldesc
limit100;
…
OK
20535
39079
28970
45594
…
15649
47704
8140
3026.87
2959.69
2869.87
2821.15
2242.05
2241.22
2238.61
Timetaken:40.087seconds,Fetched:100row(s)
Inthecaseabove,wehaveasimplequeryexecutedtotestoutourtable.Wehavesomeexampleparameterssetbeforeourquery.ThegoodnewsisthatmostoftheparametersregardingjoinandengineoptimizationsarealreadysetforyouinHive0.11(HDP).The
io.sort.mbispresentedasanexampleofoneofthetunableparametersyoumaywanttochangeforthisparticularSQL(notethisvalueassumes2-3GBJVMsformappers).Wearealsopartitionpruningfortheholidayshoppingseason,OcttoDec.
DeepDivecontentbyHortonworks,Inc.islicensedundera
CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.
Page48
ExampleWorkflow:CheckExecutionPathinAmbari
YoucanchecktheexecutionpathinAmbari’sJobviewer.Thisgivesahighleveloverviewofthestagesandparticularnumberofmapandreducetasks.WithTez,italsoshowstask
numberandexecutionorder.Thecountsherearesmallasthisis
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 第五章 實(shí)驗(yàn)活動(dòng)5 不同價(jià)態(tài)含硫物質(zhì)的轉(zhuǎn)化 教學(xué)設(shè)計(jì) -2024-2025學(xué)年高一下學(xué)期化學(xué)人教版(2019)必修第二冊(cè)
- 《影子的秘密》(教學(xué)設(shè)計(jì))2023-2024學(xué)年教科版三年級(jí)下冊(cè)科學(xué)
- 《第四章 第4節(jié) 光的折射》教學(xué)設(shè)計(jì)-2023-2024學(xué)年初中物理人教版八年級(jí)上冊(cè)
- 2025至2030年中國(guó)水平葉片過(guò)濾機(jī)數(shù)據(jù)監(jiān)測(cè)研究報(bào)告
- 2025至2030年中國(guó)橡膠絕緣套數(shù)據(jù)監(jiān)測(cè)研究報(bào)告
- 2025年廣東江門(mén)中醫(yī)藥職業(yè)學(xué)院?jiǎn)握新殬I(yè)傾向性測(cè)試題庫(kù)學(xué)生專(zhuān)用
- 山東省泰安新泰市某校2023-2024學(xué)年高三上學(xué)期期末仿真模擬地理試題(解析版)
- 湖北省荊楚高中聯(lián)盟2024-2025學(xué)年高一上學(xué)期12月聯(lián)考地理試題(解析版)
- 湖北省新高考聯(lián)考協(xié)作體2024-2025學(xué)年高一上學(xué)期12月月考地理試題(解析版)
- 黑龍江省齊齊哈爾市2023-2024學(xué)年高二上學(xué)期1月期末考試地理試題(解析版)
- 規(guī)?;i場(chǎng)生物安全
- 維修基金使用合同范例
- c語(yǔ)言課件教學(xué)下載
- 2024購(gòu)房合同購(gòu)房定金合同
- 2024年全國(guó)中學(xué)生生物學(xué)聯(lián)賽試題含答案
- 高速公路施工現(xiàn)場(chǎng)安全管理制度
- 5.3應(yīng)用二元一次方程組-雞兔同籠教學(xué)設(shè)計(jì)-北師大版八年級(jí)數(shù)學(xué)上冊(cè)
- 加油站防雷、防靜電自查自糾方案
- 2024年中國(guó)解剖臺(tái)市場(chǎng)調(diào)查研究報(bào)告
- 第四單元平行與相交(單元測(cè)試)-2024-2025學(xué)年四年級(jí)上冊(cè)數(shù)學(xué)青島版
- 2024年密碼行業(yè)職業(yè)技能競(jìng)賽參考試題庫(kù)500題(含答案)
評(píng)論
0/150
提交評(píng)論