Agenda

What Cloudera does for the Spark ecosystem

Advanced Analytics with Spark

Spark Engineering in Cloudera

Cloudera embraced Spark in early 2014

Engineering with Intel to broaden the Spark ecosystem

Hive-on-Spark

Pig-on-Spark

Spark-over-YARN

Spark Streaming Reliability

General Spark Optimization

Hive on Spark

Technology

Hive: “standard” SQL tool in Hadoop

Spark: next-gen distributed processing framework

Hive + Spark

Performance

Minimum feature gap

Industry

A lot of customers are heavily invested in Hive

Want to leverage the Spark engine

Design Principles

No or limited impact on Hive’s existing code path

Maximize code reuse

Minimum feature customization

Low future maintenance cost

Class Hierarchy

TaskCompiler generates Task; Task is described by Work

MapReduce: MapRedCompiler, MapRedTask, MapRedWork

Tez: TezCompiler, TezTask, TezWork

Spark: SparkCompiler, SparkTask, SparkWork

Work – Metadata for Task

MapReduceWork contains one MapWork and a possible ReduceWork

SparkWork contains a graph of MapWorks and ReduceWorks

Query: select name, sum(value) as v from dec group by name order by v;

MapReduce: MR Job 1 (MapWork 1, ReduceWork 1), then MR Job 2 (MapWork 2, ReduceWork 2)

Spark: one Spark job (MapWork 1, ReduceWork 1, ReduceWork 2)
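The difference between the two work layouts can be sketched in a few lines of Python. This is an illustration only, not Hive's Java classes: the `Work`, `MapReduceWork`, and `SparkWork` dataclasses below are hypothetical stand-ins that borrow the slide's names, and the graph is ordered with a plain topological sort.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Work:
    """Hypothetical stand-in for a unit of work such as 'MapWork 1'."""
    name: str


@dataclass
class MapReduceWork:
    """One MapWork plus an optional ReduceWork: exactly one MR job."""
    map_work: Work
    reduce_work: Optional[Work] = None


@dataclass
class SparkWork:
    """A directed graph of works that runs as a single job."""
    edges: dict = field(default_factory=dict)  # parent -> list of children

    def connect(self, parent: Work, child: Work) -> None:
        self.edges.setdefault(parent, []).append(child)
        self.edges.setdefault(child, [])

    def topological_order(self) -> list:
        # Kahn's algorithm: repeatedly emit works with no pending parents.
        indegree = {w: 0 for w in self.edges}
        for children in self.edges.values():
            for child in children:
                indegree[child] += 1
        ready = [w for w, d in indegree.items() if d == 0]
        order = []
        while ready:
            work = ready.pop(0)
            order.append(work)
            for child in self.edges[work]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return order


# MapReduce: group-by followed by order-by needs two chained jobs.
mr_jobs = [
    MapReduceWork(Work("MapWork 1"), Work("ReduceWork 1")),
    MapReduceWork(Work("MapWork 2"), Work("ReduceWork 2")),
]

# Spark: the same query as one graph, with no second map stage.
m1, r1, r2 = Work("MapWork 1"), Work("ReduceWork 1"), Work("ReduceWork 2")
spark_work = SparkWork()
spark_work.connect(m1, r1)
spark_work.connect(r1, r2)
spark_order = [w.name for w in spark_work.topological_order()]
```

In this sketch the second map stage of the MapReduce layout disappears because the second reduce can consume the first reduce's output directly.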

Spark Client and SparkContext

Spark Client

Talks to the Spark cluster

Supports local, local-cluster, standalone, yarn-cluster, yarn-client

Job submission, monitoring, error reporting, statistics, metrics, counters

SparkContext

Core of the Spark client

Heavyweight, not thread-safe

Designed for a single-user application

Doesn’t work in a multi-session environment

Remote Spark Context

Created and lives outside HiveServer2

In yarn-cluster mode, the Spark context lives in the application master (AM)

Otherwise, the Spark context lives in a separate process (other than HS2)

[Figure: HiveServer2 with two sessions (Session 1 for User 1, Session 2 for User 2) and a YARN cluster of three nodes (Node 1, Node 2, Node 3); an AM (RSC) runs on Node 1 and on Node 2.]

Data Processing via Spark

Treat Table as HadoopRDD (input RDD)

Apply the function that wraps MR’s map-side processing

Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)

Apply the function that wraps MR’s reduce-side processing

Spark Plan

MapInput – encapsulates a table

MapTran – map-side processing

ShuffleTran – shuffling

ReduceTran – reduce-side processing

Query: select name, sum(value) as v from dec group by name order by v;
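To make the plan stages concrete, here is a minimal sketch that runs the query through the four kinds of stage in plain Python. It is not Hive or Spark code: the functions are hypothetical stand-ins named after the slide's stages, the grouping and sorting steps only imitate what groupByKey and sortByKey do, and the rows of `dec` are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows of table `dec` as (name, value) pairs.
dec = [("a", 3), ("b", 1), ("a", 4), ("c", 10), ("b", 1)]


def map_input(table):
    """MapInput: expose the table as an iterable of rows."""
    return iter(table)


def map_tran(rows):
    """MapTran: map-side processing, emitting (group key, value)."""
    for name, value in rows:
        yield name, value


def shuffle_group(pairs):
    """ShuffleTran: group values by key (stand-in for groupByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def reduce_sum(groups):
    """ReduceTran: aggregate, then re-key by the sum for ordering."""
    for name, values in groups:
        yield sum(values), name


def shuffle_sort(pairs):
    """ShuffleTran: order by key (stand-in for sortByKey)."""
    return sorted(pairs, key=lambda kv: kv[0])


def reduce_emit(pairs):
    """ReduceTran: final projection back to (name, v)."""
    return [(name, v) for v, name in pairs]


grouped = reduce_sum(shuffle_group(map_tran(map_input(dec))))
result = reduce_emit(shuffle_sort(grouped))
print(result)  # [('b', 2), ('a', 7), ('c', 10)]
```

The chain has one map stage and two shuffle/reduce pairs, which matches the single Spark job with one MapWork and two ReduceWorks described for this query.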

Current Status

All functionality in Hive is implemented

First round of optimization is completed

Map join, SMB

Split generation and grouping

CBO, vectorization

More optimization and benchmarking coming

Beta in CDH

/cloudera-labs/hive-on-spark/

/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf

Advanced Analytics with Spark

Written by the Cloudera data science team

First-ever book bridging ML with the Hadoop ecosystem

Focuses on use cases and examples rather than serving as a manual

Targeted at data scientists solving real-world analysis problems

Generally available in May 2015

Analyzing Big Data

Build a model to detect credit card fraud using thousands of features and billions of transactions

Intelligently recommend millions of products to millions of users

Estimate financial risk through simulations of portfolios including millions of instruments

Easily manipulate data from thousands of human genomes to detect genetic associations with disease

Challenges of Data Science

Data preprocessing

Varied, fast-moving data from multiple sources requires a powerful data pipeline

Iteration

Fundamental part of data science

Speeding up data loading from disk helps a great deal

From lab to production

Make data useful to non-data scientists

Models become part of the production service and may need to be rebuilt periodically or even in real time.

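Value at Risk

VaR (Value at Risk)

The maximum potential loss that a position, portfolio, or institution may suffer when market risk factors such as interest rates and exchange rates change, over a given holding period and at a given confidence level.

For example, with a holding period of 1 day and a confidence level of 99%, a computed VaR of RMB 100,000 means there is a 99% probability that the bank's portfolio will lose no more than RMB 100,000 in one day.

Introduced by Harry Markowitz in 1952, Nobel Prize in Economics in 1990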
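The definition can be read as a quantile of the loss distribution. The sketch below is an assumption-laden illustration, not a method taken from the slides: `value_at_risk` is a hypothetical helper that takes a list of profit-and-loss outcomes and returns the smallest loss level that at least the given share of outcomes does not exceed. The P&L figures are invented.

```python
import math


def value_at_risk(pnl, confidence):
    """Smallest loss L such that at least `confidence` of outcomes lose <= L.

    `pnl` holds profit-and-loss outcomes for the holding period
    (positive = gain, negative = loss).
    """
    if not pnl or not 0 < confidence < 1:
        raise ValueError("need outcomes and a confidence strictly between 0 and 1")
    losses = sorted(-x for x in pnl)  # positive numbers are losses
    # 1-based rank of the quantile; the small epsilon guards against
    # floating-point noise in confidence * len(losses).
    rank = max(1, math.ceil(confidence * len(losses) - 1e-9))
    return losses[rank - 1]


# Hypothetical one-day P&L outcomes, in thousands of RMB: 50, 49, ..., -49.
pnl = [50 - i for i in range(100)]
var_99 = value_at_risk(pnl, 0.99)
```

With these 100 invented outcomes, 99 of them lose no more than 48 (thousand RMB), so the helper reports a 1-day 99% VaR of 48. Simulation-based approaches end with this same quantile step, applied to simulated rather than observed outcomes.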

Illustration for VaR

Methods for Calculating VaR

Variance-Covariance

Historical Simulation

Monte Carlo Simulation
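Of the three methods, variance-covariance is the only one with a closed form, so it is the easiest to show in a few lines. The sketch below assumes normally distributed returns with zero mean and scales volatility by the square root of the horizon; these are common textbook assumptions, not statements from the slides. `parametric_var` and all positions, volatilities, and the correlation are hypothetical.

```python
import math
from statistics import NormalDist


def parametric_var(positions, cov, confidence, horizon_days=1):
    """Variance-covariance VaR: z * sqrt(w' * Cov * w * horizon).

    positions: currency amount held in each asset
    cov: covariance matrix of one-day asset returns
    Assumes zero-mean, normally distributed returns.
    """
    n = len(positions)
    variance = sum(
        positions[i] * cov[i][j] * positions[j]
        for i in range(n)
        for j in range(n)
    )
    z = NormalDist().inv_cdf(confidence)  # e.g. about 2.33 at 99%
    return z * math.sqrt(variance * horizon_days)


# Hypothetical two-asset portfolio, amounts in RMB.
positions = [600_000.0, 400_000.0]
vol = [0.01, 0.02]  # assumed daily return volatilities
rho = 0.5           # assumed correlation between the two assets
cov = [
    [vol[0] ** 2, rho * vol[0] * vol[1]],
    [rho * vol[0] * vol[1], vol[1] ** 2],
]

var_99 = parametric_var(positions, cov, confidence=0.99)
```

Historical simulation replaces the normality assumption with observed returns, and Monte Carlo simulation replaces it with sampled scenarios; both trade the closed form for more computation.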

Estimating through Monte Carlo Simulation

[Figure: instruments and market factors (oil, bonds, S&P) as input, passing through Normalize, Modeling, Sampling, and Monte Carlo Simulation.]

Monte Carlo Simulation with Spark

Normalize data

Fill in missing values

Transform the historical data into two-week returns

Modeling

Define factor features

Use a regression model to compute factor weights

Sampling

Take the correlation between the factors into account

If the S&P is down, the Dow is likely to be down as well

Simulation

Broadcast instruments to each node

Parallelize trial computation across workers

Compu
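The sampling and simulation steps can be sketched without a cluster. The code below is plain Python, not Spark: the loop over seeds stands in for parallelizing trials across workers, and the module-level instrument list stands in for a broadcast value. The two factors, their volatilities and correlation, the linear factor model, and every instrument weight are hypothetical; in practice the weights would come from the regression step.

```python
import math
import random

# Hypothetical two-week volatilities of two market factors and their
# correlation (think of two stock indices that tend to move together).
FACTOR_VOL = (0.04, 0.035)
RHO = 0.9

# Hypothetical instruments as (intercept, weight on factor 1, weight on
# factor 2), standing in for the output of the regression step.
INSTRUMENTS = [(0.001, 0.8, 0.1), (0.000, 0.2, 0.9), (0.002, -0.3, 0.4)]


def sample_factors(rng):
    """Draw correlated factor returns using a 2x2 Cholesky factor."""
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
    z1 = e1
    z2 = RHO * e1 + math.sqrt(1 - RHO ** 2) * e2
    return FACTOR_VOL[0] * z1, FACTOR_VOL[1] * z2


def trial_return(factors, instruments):
    """Total return across instruments for one trial (linear factor model)."""
    f1, f2 = factors
    return sum(c + w1 * f1 + w2 * f2 for c, w1, w2 in instruments)


def run_trials(seed, n_trials, instruments):
    """Work for one worker: independent trials driven by its own seed."""
    rng = random.Random(seed)
    return [trial_return(sample_factors(rng), instruments) for _ in range(n_trials)]


def monte_carlo_var(seeds, trials_per_seed, instruments, confidence=0.95):
    # Stand-in for distributing the seeds and collecting trial results.
    returns = [
        r for seed in seeds for r in run_trials(seed, trials_per_seed, instruments)
    ]
    returns.sort()
    k = int((1 - confidence) * len(returns))  # index of the lower-tail quantile
    return -returns[k]  # report VaR as a positive loss


var_95 = monte_carlo_var(range(8), trials_per_seed=5000, instruments=INSTRUMENTS)
```

Giving each worker its own seed keeps the trials independent and the run reproducible. Sampling the second factor from the first plus independent noise is what makes a drop in one factor likely to coincide with a drop in the other.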
