hive調(diào)優(yōu)-大數(shù)據(jù)文檔資料_第1頁(yè)
hive調(diào)優(yōu)-大數(shù)據(jù)文檔資料_第2頁(yè)
hive調(diào)優(yōu)-大數(shù)據(jù)文檔資料_第3頁(yè)
hive調(diào)優(yōu)-大數(shù)據(jù)文檔資料_第4頁(yè)
hive調(diào)優(yōu)-大數(shù)據(jù)文檔資料_第5頁(yè)
已閱讀5頁(yè),還剩177頁(yè)未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

Hive&Performance

TorontoHadoopUserGroupJuly232013

Presenter:

AdamMuise–Hortonworksamuise@

Page1

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Agenda

?Hive–WhatIsItGoodFor?

?Hive’sArchitectureandSQLCompatibility

?TurningHivePerformanceto11

?GetDataInandOutofHive

?HiveSecurity

?ProjectStinger–MakingHive100xFaster

?ConnectingtoHiveFromPopularTools

Page2

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Hive–SQLAnalyticsForAnyDataSize

StoreandQueryallDatainHive

UseExis6ngSQLTools

andExis6ngSQLProcesses

Weblog

Opera1onal/MPP

MobileSensor

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page3

Hive’sFocus

?ScalableSQLprocessingoverdatainHadoop

?Scalesto100PB+

?StructuredandUnstructureddata

Page4

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

ComparingHivewithRDBMS

Hive

RDBMS

SQLInterface.

SQLInterface.

Focusonanaly1cs.

Mayfocusononlineoranaly1cs.

Notransac1ons.

Transac1onsusuallysupported.

Par11onadds,norandomINSERTs.

In-Placeupdatesnotna1velysupported(butarepossible).

RandomINSERTandUPDATEsupported.

Distributedprocessingviamap/reduce.

Distributedprocessingvariesbyvendor(ifavailable).

Scalestohundredsofnodes.

Seldomscalebeyond20nodes.

Builtforcommodityhardware.

OQenbuiltonproprietaryhardware(especiallywhenscalingout).

Lowcostperpetabyte.

What’sapetabyte?

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page5

Agenda

?Hive–WhatIsItGoodFor?

?Hive’sArchitectureandSQLCompatibility

?TurningHivePerformanceto11

?GetDataInandOutofHive

?HiveSecurity

?ProjectStinger–MakingHive100xFaster

?ConnectingtoHiveFromPopularTools

Page6

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Hive:TheSQLInterfacetoHadoop

Hive

Compiler

2

Hive

Hive

OSQL

CLI

HiveServer2

Optimizer

Executor

1

JDBC/ODBC

WebUI

3

4

1

NameNode

JobTracker

Task

Tracker

2

3

Hadoop

UserissuesSQLquery

Hiveparsesandplansquery

DataNode

Map/Reduce

4

QueryconvertedtoMap/ReduceMap/ReducerunbyHadoop

Page7

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Hive:ReliableSQLProcessingatScale

HDFS

HDFS

Node1

Node1

Node1

SQL

Node2

Node2

Node2

Hive

Node3

Node3

Node3

Map/Reduce

Node4

TaskTracker

Node4

Node4

>

HDFS

Time1:

Job=50%Complete

ProgressStoredinHDFS

Time2:

Node3Fails,

Job=85%Complete

Time3:

JobmovestoNode4Job=100%Complete

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page8

SQLCoverage:SQL92withExtensions

Legend

Available

NewinHive0.11Roadmap

SQLDatatypes

SQLSeman6cs

INT

SELECT,LOAD,INSERTfromquery

TINYINT/SMALLINT/BIGINT

ExpressionsinWHEREandHAVING

BOOLEAN

GROUPBY,ORDERBY,SORTBY

FLOAT

CLUSTERBY,DISTRIBUTEBY

DOUBLE

Sub-queriesinFROMclause

STRING

GROUPBY,ORDERBY

BINARY

ROLLUPandCUBE

TIMESTAMP

UNION

ARRAY,MAP,STRUCT,UNION

LEFT,RIGHTandFULLINNER/OUTERJOIN

DECIMAL

CROSSJOIN,LEFTSEMIJOIN

CHAR

Windowingfunc1ons(OVER,RANK,etc.)

VARCHAR

Sub-queriesforIN/NOTIN,HAVING

DATE

EXISTS/NOTEXISTS

INTERSECT,EXCEPT

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page9

Agenda

?Hive–WhatIsItGoodFor?

?Hive’sArchitectureandSQLCompatibility

?TurningHivePerformanceto11

?GetDataInandOutofHive

?HiveSecurity

?ProjectStinger–MakingHive100xFaster

?ConnectingtoHiveFromPopularTools

Page10

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

DataAbstractionsinHive

Op1onalPerTable

Database

Par11on

Par11on

Table

Par11on

UnskewedKeys

Table

SkewedKeys

Bucket

Bucket

Bucket

Par11ons,bucketsandskewsfacilitate

faster,moredirectdataaccess.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page11

“Iheardyoushouldavoidjoins…”

?“Joinsareevil”–CalHenderson

-Joinsshouldbeavoidedinonlinesystems.

?Joinsareunavoidableinanalytics.

-Makingjoinsfastisthekeydesignpoint.

Page12

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

QuickRefresheronJoins

customerorder

?rstlastidcidpricequan6ty

Nick

11911

4150

10.50

3

Toner

VeronaHollen119151191440.5010

Jessie

Simonds

11912

11914

12.25

27

Kasi

Lamers

11913

3491

5.99

5

Rodger

Clayton

11914

2934

39.99

22

SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;

Joinsmatchvaluesfromonetableagainstvaluesinanothertable.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page13

HiveJoinStrategies

Type

Approach

Pros

Cons

Shu?eJoin

Joinkeysareshu?edusingmap/reduceandjoins

performedjoinside.

Worksregardlessofdatasizeorlayout.

Mostresource-

intensiveand

slowestjointype.

BroadcastJoin

Smalltablesareloadedinto

memoryinallnodes,

mapperscansthroughthelargetableandjoins.

Veryfast,singlescan

throughlargesttable.

Allbutonetablemustbesmallenoughto?tinRAM.

Sort-

Merge-

Bucket

Join

Mapperstakeadvantageofco-loca1onofkeystodo

e?cientjoins.

Veryfastfortablesofanysize.

Datamustbesortedandbucketedaheadof1me.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page14

ShuffleJoinsinMapReduce

customerorder

?rstlastidcidpricequan6ty

Nick

11911

4150

10.50

3

Toner

VeronaHollen119151191440.5010

Jessie

Simonds

11912

11914

12.25

27

Kasi

Lamers

11913

3491

5.99

5

Rodger

Clayton

11914

2934

39.99

22

SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;

{id:11911,{?rst:Nick,last:Toner}}{id:11914,{?rst:Rodger,last:Clayton}}

{cid:4150,{price:10.50,quan1ty:3}}{cid:11914,{price:12.25,quan1ty:27}}

Iden1calkeysshu?edtothesamereducer.Joindonereduce-side.

Expensivefromanetworku1liza1onstandpoint.

{id:11914,{?rst:Rodger,last:Clayton}}{cid:11914,{price:12.25,quan1ty:27}}

{id:11911,{?rst:Nick,last:Toner}}

{cid:4150,{price:10.50,quan1ty:3}}

M

M

R

R

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page15

BroadcastJoin

?StarschemasusedimensiontablessmallenoughtofitinRAM.

?Smalltablesheldinmemorybyallnodes.

?Singlepassthroughthelargetable.

?Usedforstar-schematypejoinscommoninDW.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page16

Whenbotharetoolargeformemory:

customer

order

?rst

last

id

cid

price

quan6ty

Nick

Toner

11911

4150

10.50

3

Jessie

Simonds

11912

11914

12.25

27

Kasi

Lamers

11913

11914

40.50

10

Rodger

Clayton

11914

12337

39.99

22

Verona

Hollen

11915

15912

40.50

10

SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;

Clusterandsortbythemostcommonjoinkey.

CREATETABLEorder(cidint,pricefloat,quantityint)

CLUSTEREDBY(cid)SORTEDBY(cid)INTO32BUCKETS;

CREATETABLEcustomer(idint,firststring,laststring)

CLUSTEREDBY(id)SORTEDBY(id)INTO32BUCKETS;

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page17

Hive’sClusteringandSorting

customer

order

?rst

last

id

cid

price

quan6ty

Nick

Toner

11911

4150

10.50

3

Jessie

Simonds

11912

11914

12.25

27

Kasi

Lamers

11913

11914

40.50

10

Rodger

Clayton

11914

12337

39.99

22

Verona

Hollen

11915

15912

40.50

10

SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;

Observa1on1:

Sor1ngbythejoinkeymakesjoinseasy.

Allpossiblematchesresideinthesameareaondisk.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page18

Hive’sClusteringandSorting

customer

order

?rst

last

id

cid

price

quan6ty

Nick

Toner

11911

4150

10.50

3

Jessie

Simonds

11912

11914

12.25

27

Kasi

Lamers

11913

11914

40.50

10

Rodger

Clayton

11914

12337

39.99

22

Verona

Hollen

11915

15912

40.50

10

SELECT*FROMcustomerjoinorderONcustomer.id=order.cid;

Observa1on2:

Hashbucke1ngajoinkeyensuresallmatchingvaluesresideonthesamenode.

Equi-joinscanthenrunwithnoshu?e.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page19

ControllingDataLocalitywithHive

?Bucketing:

-Hashpartitionvaluesintoaconfigurablenumberofbuckets.

-Usuallycoupledwithsorting.

?Skews:

-Splitvaluesoutintoseparatefiles.

-Usedwhencertainvaluesarefrequentlyseen.

?ReplicationFactor:

-Increasereplicationfactortoacceleratereads.

-ControlledattheHDFSlayer.

?Sorting:

-Sortthevalueswithingivencolumns.

-GreatlyacceleratesquerywhenusedwithORCFilefilterpushdown.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page20

GuidelinesforArchitectingHiveData

TableSize

DataPro?le

QueryPaNern

Recommenda6on

Small

Hotdata

Any

Increasereplica1onfactor

Any

Any

Veryprecise?lters

Sortoncolumnmost

frequentlyusedinprecisequeries

Large

Any

Joinedtoanotherlargetable

Sortandbucketboth

tablesalongthejoinkey

Large

Onevalue>25%ofcountwithinhighcardinalitycolumn

Any

Splitthefrequentvalueintoaseparateskew

Large

Any

Queriestendtohavea

naturalboundarysuchasdate

Par11onthedataalongthenaturalboundary

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page21

HivePersistenceFormats

?Built-inFormats:

-ORCFile

-RCFile

-Avro

-DelimitedText

-RegularExpression

-S3Logfile

-TypedBytes

?3rd-PartyAddons:

-JSON-XML

Page22

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Hiveallowsmixedformat.

?UseCase:

-Ingestdatainawrite-optimizedformatlikeJSONordelimited.

-Everynight,runabatchjobtoconverttoread-optimizedORCFile.

Page23

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

ORCFile–EfficientColumnarLayout

Columnarformat

arrangescolumnsadjacentwithinthefileforcompressionandfastaccess.

LargeblocksizewellsuitedforHDFS.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

ORCFileAdvantages

?HighCompression

-Manytricksusedout-of-the-boxtoensurehighcompressionrates.

-RLE,dictionaryencoding,etc.

?HighPerformance

-InlineindexesrecordvaluerangeswithinblocksofORCFiledata.

-Filterpushdownallowsefficientscanningduringprecisequeries.

?FlexibleDataModel

-AllHivetypesincludingmaps,structsandunions.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page25

HighCompressionwithORCFile

oDatasetfromTPC-DS

Page26

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

SomeORCFileSamples

sale

id

6mestamp

productsk

storesk

amount

state

10000

2013-06-13T09:03:05

16775

670

$70.50

CA

10001

2013-06-13T09:03:05

10739

359

$52.99

IL

10002

2013-06-13T09:03:06

4671

606

$67.12

MA

10003

2013-06-13T09:03:08

7224

174

$96.85

CA

10004

2013-06-13T09:03:12

9354

123

$67.76

CA

10005

2013-06-13T09:03:18

1192

497

$25.73

IL

CREATETABLEsale(

idint,timestamptimestamp,productskint,storeskint,

amountdecimal,statestring)STOREDASorc;

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page27

ORCFileOptionsandDefaults

Key

Default

Notes

press

ZLIB

Highlevelcompression(oneofNONE,ZLIB,SNAPPY)

press.size

262,144

(=256KiB)

Numberofbytesineachcompressionchunk

orc.stripe.size

268,435,456(=256MiB)

Numberofbytesineachstripe

orc.row.index.stride

10,000

Numberofrowsbetweenindexentries(mustbe>=1,000)

orc.create.index

true

Whethertocreaterowindexes

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page28

NoCompression:FasterbutLarger

sale

id

6mestamp

productsk

storesk

amount

state

10000

2013-06-13T09:03:05

16775

670

$70.50

CA

10001

2013-06-13T09:03:05

10739

359

$52.99

IL

10002

2013-06-13T09:03:06

4671

606

$67.12

MA

10003

2013-06-13T09:03:08

7224

174

$96.85

CA

10004

2013-06-13T09:03:12

9354

123

$67.76

CA

10005

2013-06-13T09:03:18

1192

497

$25.73

IL

CREATETABLEsale(

idint,timestamptimestamp,productskint,storeskint,amountdecimal,statestring

)STOREDASorctblproperties("press"="NONE");

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page29

ColumnSortingtoFacilitateSkipping

sale

id

6mestamp

productsk

storesk

amount

state

10005

2013-06-13T09:03:18

1192

497

$25.73

IL

10002

2013-06-13T09:03:06

4671

606

$67.12

MA

10003

2013-06-13T09:03:08

7224

174

$96.85

CA

10004

2013-06-13T09:03:12

9354

123

$67.76

CA

10001

2013-06-13T09:03:05

10739

359

$52.99

IL

10000

2013-06-13T09:03:05

16775

670

$70.50

CA

CREATETABLEsale(

idint,timestamptimestamp,productskint,storeskint,

amountdecimal,statestring)STOREDASorc;

INSERTINTOsaleASSELECT*FROMstagingSORTBYproductsk;

ORCFileskippingspeedsquerieslike

WHEREproductsk=X,productskIN(Y,Z);etc.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page30

NotYourTraditionalDatabase

?TraditionalsolutiontoallRDBMSproblems:

-Putanindexonit!

YourOracleDBA

?DoingthisinHadoop=#fail

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page31

GoingFastinHadoop

?Hadoop:

-Reallygoodatcoordinatedsequentialscans.

-NorandomI/O.Traditionalindexprettymuchuseless.

?KeystospeedinHadoop:

-Sortingandskippingtaketheplaceofindexing.

-Minimizingdatashuffletheotherkeyconsideration.

?Skippingdata:

-Dividedataamongdifferentfileswhichcanbeprunedout.

-Partitions,bucketsandskews.

-Skiprecordsduringscansusingsmallembeddedindexes.

-AutomaticwhenyouuseORCFileformat.

-Sortdataaheadoftime.

-Simplifiesjoinsandskippingbecomesmoreeffective.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page32

DataLayoutConsiderationsforFastHive

SkipReads

MinimizeShuffle

ReduceLatency

PartitionTablesand/orUseSkew

SortandBucketonCommonJoinKeys

IncreaseReplicationFactorForHotData

SortSecondary

ColumnswhenUsing

ORCFile

UseBroadcastJoins

whenJoiningSmall

Tables

EnableShort-CircuitRead

TakeAdvantageofTez+TezService(Future)

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page33

PartitioningandVirtualColumns

?Partitioningmakesqueriesgofast.

?Youwillalmostalwaysusesomesortofpartitioning.

?Whenpartitioningyouwilluse1ormorevirtualcolumns.

#Noticehowxdateandstatearenot“real”columnnames.CREATETABLEsale(

idint,amountdecimal,...

)partitionedby(xdatestring,statestring);

?VirtualcolumnscausedirectoriestobecreatedinHDFS.

-Filesforthatpartitionarestoredwithinthatsubdirectory.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page34

LoadingDatawithVirtualColumns

?Bydefaultatleastonevirtualcolumnmustbehard-coded.

INSERTINTOsale(xdate=‘2013-03-01’,state=‘CA’)SELECT*FROMstaging_table

WHERExdate=‘2013-03-01’ANDstate=‘CA’;

?Youcanloadallpartitionsinoneshot:

-sethive.exec.dynamic.partition.mode=nonstrict;

-Warning:Youcaneasilyoverwhelmyourclusterthisway.

sethive.exec.dynamic.partition.mode=nonstrict;

INSERTINTOsale(xdate,state)SELECT*FROMstaging_table;

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page35

YouMayNeedtoRe-OrderColumns

?Virtualcolumnsmustbelastwithintheinserteddataset.

?YoucanusetheSELECTstatementtore-order.

INSERTINTOsale(xdate,state=‘CA’)SELECT

id,amount,other_stuff,

xdate,state

FROMstaging_tableWHEREstate=‘CA’;

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page36

TheMostEssentialHiveQueryTunings

Page37

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

TuneSplitSize–Always

?mapred.max.split.sizeandmapred.min.split.size

?Hiveprocessesdatainchunkssubjecttothesebounds.

?mintoolarge->Toofewmappers.

?maxtoosmall->Toomanymappers.

?Tunevariablesun6lmappersoccupy:

-Allmapslotsifyouownthecluster.

-Reasonablenumberofmapslotsifyoudon’t.

?Example:

-setmapred.max.split.size=100000000;

-setmapred.min.split.size=1000000;

?Manualtoday,automa6cinfutureversionofHive.

?Youwillneedtosettheseformostqueries.

Page38

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Tuneio.sort.mb–Sometimes

?HiveandMap/Reducemaintainsomeseparatebu?ers.

?IfHivemapsneedlotsoflocalmemoryyoumayneedtoshrinkmap/reducebu?ers.

?Ifyourmapsspill,tryitout.

?Example:

-setio.sort.mb=100;

Page39

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

OtherSettingsYouNeed

?Allthe6me:

-sethive.op1mize.mapjoin.mapreduce=true;

-sethive.op1mize.bucketmapjoin=true;

-sethive.op1mize.bucketmapjoin.sortedmerge=true;

-sethive.auto.convert.join=true;

-sethive.auto.convert.sortmerge.join=true;

-sethive.auto.convert.sortmerge.join.nocondi1onaltask=true;

?Whenbucke6ngdata:

-sethive.enforce.bucke1ng=true;

-sethive.enforce.sor1ng=true;

?TheseandmorearesetbydefaultinHDP1.3.

-Checkfortheminhive-site.xml

-Ifnotpresent,settheminyourqueryscript

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page40

CheckYourSettings

?InHiveshell:

-Seeallsettingswith“set;”

-Seeaparticularsettingbyaddingit.

-Changeasetting.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page41

ExampleWorkflow:CreateaStagingtable

CREATEEXTERNALTABLEpos_staging(

txnidSTRING,

txntimeSTRING,

givennameSTRING,

lastnameSTRING,

postalcodeSTRING,

storeidSTRING,

ind1STRING,

productidSTRING,

purchaseamountFLOAT,

creditcardSTRING

)ROWFORMATDELIMITEDFIELDSTERMINATEDBY'|'LOCATION'/user/hdfs/staging_data/pos_staging';

Therawdataistheresultofinitialloadingortheoutputofa

mapreduceorpigjob.Wecreateanexternaltableovertheresultsofthatjobasweonlyintendtouseittoloadanoptimizedtable.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page42

ExampleWorkflow:Chooseapartitionscheme

hive>selectdistinctconcat(year(txntime),month(txntime))aspart_dt

frompos_staging;

OK

20121

201210

201211

201212

20122

20123

20124

20125

20126

20127

20128

20129

Timetaken:21.823seconds,Fetched:12row(s)

Executeaquerytodetermineifthepartitionchoicereturnsareasonableresult.Wewillusethisprojectiontocreatepartitionsforourdataset.YouwanttokeepyourpartitionslargeenoughtobeusefulinpartitionpruningandefficientforHDFSstorage.Hivehasconfigurableboundstoensureyoudonotexceedpernodeandtotalpartitioncounts(defaultsshown):

hive.exec.max.dynamic.partitions=1000

hive.exec.max.dynamic.partitions.pernode=100

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page43

ExampleWorkflow:Defineoptimizedtable

CREATETABLEfact_pos

(

txnidSTRING,

txntimeSTRING,

givennameSTRING,

lastnameSTRING,

postalcodeSTRING,

storeidSTRING,

ind1STRING,

productidSTRING,

purchaseamountFLOAT,

creditcardSTRING

)PARTITIONEDBY(part_dtSTRING)

CLUSTEREDBY(txnid)

SORTEDBY(txnid)

INTO24BUCKETS

STOREDASORCtblproperties("press"="SNAPPY");

Thepart_dtfieldisdefinedinthepartitionbyclauseandcannotbethesamenameasanyotherfields.Inthiscase,wewillbeperformingamodificationoftxntimetogenerateapartitionkey.Theclusterandsortedclausescontaintheonlykeyweintendtojointhetableon.WehavestoredasORCFilewithSnappycompression.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page44

ExampleWorkflow:LoadDataIntoOptimizedTable

sethive.enforce.sorting=true;

sethive.enforce.bucketing=true;

sethive.exec.dynamic.partition=true;

sethive.exec.dynamic.partition.mode=nonstrict;

setmapreduce.reduce.input.limit=-1;

FROMpos_staging

INSERTOVERWRITETABLEfact_pos

PARTITION(part_dt)

SELECT

txnid,

txntime,

givenname,

lastname,

postalcode,

storeid,

ind1,

productid,

purchaseamount,

creditcard,

concat(year(txntime),month(txntime))aspart_dt

SORTBYproductid;

Weusethiscommendtoloaddatafromourstagingtableintoour

optimizedORCFileformat.Notethatweareusingdynamicpartitioningwiththeprojectionofthetxntimefield.ThisresultsinaMapReducejobthatwillcopythestagingdataintoORCFileformatHivemanagedtable.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page45

ExampleWorkflow:Increasereplicationfactor

hadoopfs-setrep-R–w5/apps/hive/warehouse/fact_pos

Increasethereplicationfactorforthehighperformancetable.Thisincreasesthechancefordatalocality.Inthiscase,theincreaseinreplicationfactorisnotforadditionalresiliency.

Thisisatrade-offofstorageforperformance.

Infact,toconservespace,youmaychoosetoreducethe

replicationfactorforolderdatasetsorevendeletethem

altogether.Withtherawdatainplaceanduntouched,youcanalwaysrecreatetheORCFilehighperformancetables.MostusersplacethestepsinthisexampleworkflowintoanOoziejobtoautomatethework.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page46

ExampleWorkflow:EnablingShortCircuitRead

Inhdfs-site.xml(oryourcustomAmbarisettingsforHDFS,

restartserviceafter):

dfs.block.local-path-access.user=hdfs

dfs.client.read.shortcircuit=true

dfs.client.read.shortcircuit.skip.checksum=false

ShortCircuitreadsallowthemapperstobypasstheoverheadofopeningaporttothedatanodeifthedatais

local.Thepermissionsforthelocal

blockfilesneedtopermithdfstoreadthem(shouldbebydefaultalready)SeeHDFS-2246formoredetails.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page47

ExampleWorkflow:Executeyourquery

sethive.mapred.reduce.tasks.speculative.execution=false;

setio.sort.mb=300;

setmapreduce.reduce.input.limit=-1;

selectproductid,ROUND(SUM(purchaseamount),2)astotal

fromfact_pos

wherepart_dtbetween‘201210’and‘201212’

groupbyproductid

orderbytotaldesc

limit100;

OK

20535

39079

28970

45594

15649

47704

8140

3026.87

2959.69

2869.87

2821.15

2242.05

2241.22

2238.61

Timetaken:40.087seconds,Fetched:100row(s)

Inthecaseabove,wehaveasimplequeryexecutedtotestoutourtable.Wehavesomeexampleparameterssetbeforeourquery.ThegoodnewsisthatmostoftheparametersregardingjoinandengineoptimizationsarealreadysetforyouinHive0.11(HDP).The

io.sort.mbispresentedasanexampleofoneofthetunableparametersyoumaywanttochangeforthisparticularSQL(notethisvalueassumes2-3GBJVMsformappers).Wearealsopartitionpruningfortheholidayshoppingseason,OcttoDec.

DeepDivecontentbyHortonworks,Inc.islicensedundera

CreativeCommonsAttribution-ShareAlike3.0UnportedLicense.

Page48

ExampleWorkflow:CheckExecutionPathinAmbari

YoucanchecktheexecutionpathinAmbari’sJobviewer.Thisgivesahighleveloverviewofthestagesandparticularnumberofmapandreducetasks.WithTez,italsoshowstask

numberandexecutionorder.Thecountsherearesmallasthisis

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論