Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo, Department of Computing, Imperial College London

Model: Mathematical Representation of a Simplified Physical World
Modelling is an essential and inseparable part of all scientific activity. A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

To understand the world or an object (called a target T), a model M is built as a simplified mathematical representation of it. A model is the result of abstraction from the observations made, and it is used to give predictions.

[Diagram: observation by human / sensor; abstraction by human / machine; prediction by human / machine]

No Model Is Perfect:
- Inherent Uncertainty: These targets consist of a set of continuous phenomena (in both time and space), and they typically produce rich signals. Because of the continuity of the target in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which brings in the "uncertainty".
- Overfitting or Underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as the order K. Given that the information from observations is partial, it is hard to make a perfect choice of K. Such imperfection causes model error: underfitting (small K) or overfitting (large K).
- Simplification: From observations, we project the multi-dimensional world onto a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.

Nonlinear regression example: a K-order polynomial (sketched below).
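A minimal sketch (not from the slides) of how the polynomial order K drives underfitting and overfitting; the target function, noise level, and orders tried are illustrative assumptions.

```python
import numpy as np

# Illustrative target: a smooth nonlinear function observed with noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)

for K in (1, 3, 9):                                  # small K underfits, large K overfits
    coeffs = np.polyfit(x_train, y_train, deg=K)     # least-squares K-order polynomial
    y_pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    print(f"K={K}  RMSE against the true curve: {rmse:.3f}")
```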
George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. (1980)

Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them." (2008)
Chris Anderson (Wired): "There is now a better way. Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." (The Data Deluge Makes the Scientific Method Obsolete) (2012)

So, Why Model? The Argument:
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.

For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, it can translate Klingon into Farsi as easily as it can translate French into German), and why it can match ads to content without any knowledge or assumptions about the ads or the content.
Model Free Sensor Informatics: Query Driven

A sensor network writes raw data into a database table, e.g.:

time   id   temp
10am   1    20
10am   2    21
…      …    …
10am   7    29

The query-driven workflow:
1. Extract all readings into a file.
2. Run MATLAB/R/other data processing tools.
3. Write the output to a file / back to the database.
4. Write data processing tools to process/aggregate the output (maybe using the DB).
5. Decide what new data to acquire; repeat.
Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.
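A minimal sketch of this "sensing as querying" view, using Python's built-in sqlite3 and an illustrative raw_data table (the schema and readings are assumptions, not from the slides):

```python
import sqlite3

# Raw sensor readings land in a table; analysis is just SQL over that table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (time TEXT, id INTEGER, temp REAL)")
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?)",
    [("10am", 1, 20.0), ("10am", 2, 21.0), ("10am", 7, 29.0)],
)

# "Sensing" a derived quantity is a query against the stored readings.
for sensor_id, avg_temp in conn.execute(
    "SELECT id, AVG(temp) FROM raw_data GROUP BY id ORDER BY id"
):
    print(f"sensor {sensor_id}: average temperature {avg_temp:.1f}")
```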
WikiSensing: A Model Free Sensor Informatics System Based on a Big Data Architecture

Model Free Sensing is Super Inefficient:
- Data misrepresentation without a model
- Latent information missing without a model
- High demand on computation/storage without a model
- Requires too much interoperability between sensors and analytics
Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!

Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).
If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.
However, suppose a is the statement "Britain sweeps the board at the 2012 London Olympics, winning more than 30 Gold Medals!", made before the 28th of July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement depending on their specific knowledge of factors that might affect its likelihood.
Beliefs in the model were changing daily, based on the performance data available each day. By the 10th of August, most people's belief in this model should be almost 80%.
Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a.
The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference
- Bayes' rule: the interaction between data and model
- Learning as a sequence of interactions (sketched below)

P(θ | Y) = p(Y | θ) p(θ) / p(Y)
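A minimal sketch of learning as a sequence of Bayesian interactions, in the spirit of the Olympics example above; the Beta-Bernoulli model, the meaning of θ, and the daily observations are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

# theta = probability that a given day's results keep Britain "on track" for 30+ golds;
# each day's observation y in {0, 1} revises the belief via Bayes' rule.
a, b = 1.0, 1.0                                      # uniform prior Beta(1, 1) over theta
daily_observations = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical daily data

for day, y in enumerate(daily_observations, start=1):
    a, b = a + y, b + (1 - y)                        # conjugate posterior update
    print(f"day {day:2d}: posterior mean belief = {beta(a, b).mean():.2f}")
```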
Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
- We need models: a model is the representation of our knowledge so far
- Data: the observations which may revise our belief in the models we have
- Analysis: assessing our belief and updating our models to make them more believable
- Sensing: acquiring the data needed to update (enrich) models
- Models are learned from data (observations) by scientists (theoretical abstraction) or by machines (machine learning)
- Models are hypotheses (when making new observations)
- Models are knowledge (when belief is established)
Sensor Informatics:
- Sensing management. Managing the "neediness": when and where to sense?
- Sensing analytics. Managing model updating: how to enrich models with observations?
- Reasoning. Decision making based on the integration of trusted models.

P(M | D) = P(D | M) P(M) / P(D)
Surprising Event: When an Observation Does Not Fit a Known Model
- When the posterior and the prior (P(M|D) vs. P(M)) show great variance -> surprise!
- How great is "great variance"? A surprise threshold α.
- Kullback-Leibler divergence: Surprise = KL( P(M|D) || P(M) ) = ∫ P(M|D) log [ P(M|D) / P(M) ] dM (a sketch follows this list)
- Other methods: significance level, Chebyshev's Theorem, …
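A minimal sketch of the KL-based surprise test; the Gaussian prior/posterior, the rainfall numbers, and the threshold α are illustrative assumptions.

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    # KL( N(mu_p, var_p) || N(mu_q, var_q) ), closed form for 1-D Gaussians, in nats.
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Prior belief about rainfall at location A given B, and two candidate posteriors.
prior_mu, prior_var = 100.0, 15.0 ** 2
alpha = 0.5                                          # assumed surprise threshold

for post_mu, post_var in [(105.0, 12.0 ** 2), (160.0, 12.0 ** 2)]:
    surprise = kl_gaussian(post_mu, post_var, prior_mu, prior_var)
    status = "surprise!" if surprise > alpha else "model consistent"
    print(f"posterior mean {post_mu:5.1f} mm: KL = {surprise:6.2f} nats -> {status}")
```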
From the model, we get C(A, B) (e.g. a multivariate Gaussian distribution):
- A: 100 mm, B: 50 mm -> model consistent
- A: 100 mm, B: 500 mm -> surprise!
Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information.
Why sense so much data and then throw it away? Why not sense the information directly?
Using Compressive Sensing Technology to Optimize Observations
- Compressive sensing: take advantage of sparseness to solve for under-determined signals with only a small number of measurements.
- Unobserved behaviour (behaviour not captured by the current model) is typically sparse.
- Reconstruction methods: L1-min, Bayesian CS (a sketch follows this list).
- Sensed data is enough when we can recover the needed information through compressive sensing.
- Ψ: CS matrix built from the model; Φ: placement matrix.
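A minimal sketch of recovery by L1-minimization, using iterative soft-thresholding (ISTA) as one possible solver; the random Gaussian matrix standing in for ΦΨ, the sparsity level, and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 200, 60, 5                         # signal length, measurements, nonzeros

x_true = np.zeros(n)                         # sparse signal in the model basis Psi
x_true[rng.choice(n, k, replace=False)] = rng.normal(0, 1, k)
A = rng.normal(0, 1.0 / np.sqrt(m), (m, n))  # stands in for the combined matrix Phi @ Psi
y = A @ x_true                               # m << n measurements

# ISTA for min 0.5*||y - A x||_2^2 + lam*||x||_1
lam, step = 0.01, 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(2000):
    x = x - step * (A.T @ (A @ x - y))                        # gradient step
    x = np.sign(x) * np.maximum(np.abs(x) - lam * step, 0.0)  # soft threshold

print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```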
How to Update the Model – Parameter Estimation

[Figure: ANSYS nodal temperature solution (TEMP, AVG), STEP = 360, TIME = 1800, SMN = 131.03, SMX = 646.41]

Estimating the parameter θ to maximize the likelihood of the data given the model:

θ* = argmax_θ P(D | M, θ)
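A minimal sketch of maximum-likelihood parameter estimation by minimizing the negative log-likelihood; the Gaussian observation model and the synthetic readings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
data = rng.normal(loc=21.5, scale=2.0, size=500)   # hypothetical temperature readings D

def neg_log_likelihood(theta):
    # -log P(D | M, theta) for a Gaussian model M with theta = (mu, log_sigma).
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                      # parameterise sigma > 0
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((data - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"theta*: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```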
Model: An Example in the Digital City

Modelling City Life via Causality:
- C(eA, eB) is used to predict the current value at a location A when the value at another location B is given.
- Location: physical / logical locations with causality (through the sensory cortex) (city areas A, B)
- Relationship: topology (geo-topology between A and B: diffusion structure)
- Event: events, i.e. the dynamics of the observable signal S = f(E) (e.g. heavy rainfall)
Ontologies are adopted to represent locations L, relationships R, events E, and signals S.

Diffusion: an event e1 ∈ E in node n1 causes another event e2 ∈ E in node n2 when the two nodes n1, n2 in G are linked.
Digital City Model: looking into the details
- System T = (L, R, E)
- Model M(T) = (G, ?, B)
- Training for the causality ?: use a Bayesian network to represent the conditional independencies between cause and target variables:
  1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM) (a sketch follows below)
  2. Gaussian Processes with Bayesian inference
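A minimal sketch of option 1, fitting a GMM by EM to the joint behaviour of two locations; scikit-learn's GaussianMixture, the two-regime synthetic data, and the component count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical joint samples of (event magnitude at B, event magnitude at A), e.g. rainfall.
regime1 = rng.multivariate_normal([50, 100], [[40, 30], [30, 60]], size=300)
regime2 = rng.multivariate_normal([10, 20], [[10, 5], [5, 15]], size=300)
samples = np.vstack([regime1, regime2])

# Fit the joint density behind C(eB, eA) as a 2-component GMM via EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(samples)
print("component means:\n", gmm.means_)
print("log-likelihood of a consistent pair (50, 100):", gmm.score_samples([[50, 100]])[0])
print("log-likelihood of a surprising pair (50, 500):", gmm.score_samples([[50, 500]])[0])
```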
When the surprise > the surprise threshold:
- Diversity is detected: identify the incorrect causality C(el, ep), which is sparse -> a compressive sensing approach.
- New observation -> a measurement that could revise the model, within the model space, to maximize the likelihood of the observations.
- Placement: focus on the diversity.
Model Updating / Model Driven Sensing: No Surprise!
- The dynamics of model update: Surprise -> Sensing -> Model Updating
- The goal of sensing: capturing surprise
- The goal of analysis: revising the model
- A model cannot overfit / underfit: whenever there is diversity, it can be updated -> consistent with the universe (the target)
Model Update

It's Bayesian: P(M, θ | D) = P(D | M, θ) P(M, θ) / P(D), where T is the target, M the model, and θ the top-down parameter.

* When θ is fixed: P(M | D) = P(D | M) P(M) / P(D)
-> The variance between posterior and prior is the "surprise" -> bottom-up attention -> model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle (a sketch of this cycle follows below).

* When θ is updated: P(M, θ) = P(M | θ) P(θ)
-> top-down attention (alertness) -> model update
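A minimal sketch of the forecast–analysis cycle described above, using a scalar Kalman-style Bayesian update; the dynamics, noise levels, and initial values are illustrative assumptions, not the slides' model.

```python
import numpy as np

rng = np.random.default_rng(4)

def advance(state):                  # hypothetical model step: produces the "forecast"
    return 0.95 * state + 1.0

truth, estimate, variance = 20.0, 10.0, 25.0
obs_var, model_var = 4.0, 1.0

for cycle in range(5):
    # Forecast: advance the model and inflate its uncertainty.
    truth = advance(truth) + rng.normal(0, np.sqrt(model_var))
    estimate, variance = advance(estimate), 0.95**2 * variance + model_var

    # Analysis: combine the forecast with a new observation (Bayes / Kalman gain).
    obs = truth + rng.normal(0, np.sqrt(obs_var))
    gain = variance / (variance + obs_var)
    estimate, variance = estimate + gain * (obs - estimate), (1 - gain) * variance

    print(f"cycle {cycle}: truth={truth:6.2f}  analysis={estimate:6.2f}  var={variance:5.2f}")
```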
Adaptive Observation: Sensing and Numerical Modelling
- CityGML Ontology -> GIS -> Geometry mesh

Building an Initial Model and Making Predictions by Simulation
- Setting up boundary conditions, numerical schemas, model parameters, etc.
- Simulation of a 24-building case (fine mesh, 600,000 nodes) on 20 processors
- Simulation of moving vehicles and scalar dispersion in street canyons
Using Sensors to Verify the Prediction Results of the Model

Sensing: acquiring data to obtain the posterior of the model, in order to validate (confirm consistency with) or update the model:

P(M | D) = P(D | M) P(M) / P(D)

[Diagram: sensing produces data; the model is validated or updated; new data is then acquired]
New WikiSensing: An Elastic Sensing Environment for Large Scale Sensor Informatics
- Elastic sensing theory based on Bayesian inference
- Big Data architecture for large scale sensory data management
- Ontology for background knowledge management
- Model driven adaptive observation support
- Digital City and digital life applications

The architecture of the New WikiSensing System

Ontology Used to Organise the Complex Knowledge Management
- Using ontology to represent the targets, signals, sensing methods, measurements, etc.
- Ontology to support flexible resolution
- Upper ontology for unified operation (OntoSensor)

Conclusion
- Big data offers a great opportunity for building smart models
- Big data provides a new methodology for model research
- New informatics comes from the closely coupled integration of the data and model worlds
- Bayesian theory provides a natural foundation for such an integration
- Sensor informatics is a good example of such a paradigm
- A new uniform framework for sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
- We are developing the WikiSensing system to realise this paradigm

Thank you
Understanding Big Data
Haixun Wang

Data Explosion
- MB = 10^6 bytes: a typical book in text format
- GB = 10^9 bytes: a one-hour video is about 1 GB; the data produced by a biology experiment in one day
- TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)
The Arecibo Telescope
- World's largest radio telescope
- Diameter: 305 m (1,000 ft); area: 18 acres
- Location: Arecibo, Puerto Rico
- The P-ALFA surveys: 800 terabytes in 5 years

Software Driven Telescope
- From few, large, expensive, directional dishes to many, small, cheap, omni-directional antennae
- A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area 340 km in diameter)
Challenge 1: It's the data, stupid!

[Chart: data size vs. data complexity, spanning key/value stores, column stores, document stores, and graph systems]

- Big data drives tomorrow's economy.
- The value of big data lies in its degree of connectedness.
- Existing systems cannot handle the rich connectedness of big data.
RDBMS and Rich Relationships
- The performance of multi-way joins is very poor in an RDBMS.
- Managing data with rich connectedness requires multi-way joins in an RDBMS.
Trinity
- A general-purpose, distributed, in-memory graph system
- Online graph query processing
- Offline graph analytics

Trinity Performance Highlights
- Online query processing: visiting 2.2 million users (a 3-hop neighborhood, sketched below) on Facebook in <= 100 ms; the foundation for graph-based services, e.g. entity search
- Offline graph analytics: one iteration on a 1-billion-node graph in <= 60 sec; the foundation for analytics, e.g. social analytics

People Search Demo

Multi-way Join vs. Graph Traversal
[Diagram: Company, Incident, and Problem tables chained through ID columns by multi-way joins in an RDBMS, versus direct graph traversal in Trinity]
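A minimal sketch of the traversal side of this comparison: a k-hop neighborhood computed by breadth-first search over an in-memory adjacency list. The Python dict and node names are illustrative assumptions, not Trinity's API; the relational equivalent would need k self-joins on an edge table.

```python
from collections import deque

# Illustrative in-memory adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

def k_hop_neighborhood(graph, start, k):
    """All nodes reachable from `start` within k hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(k_hop_neighborhood(graph, "alice", 3))   # {'bob', 'carol', 'dave', 'erin', 'frank'}
```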
Challenge 2: Interpretation of Big Data
- IBM Watson runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power.
- A human brain runs on a tuna fish sandwich and a glass of water.

[Chart: answering the question, from simple calculation over a domain-specific language (calculator, SQL) to inferencing & reasoning over unconstrained natural language (Google/Bing, Wolfram Alpha, Watson, SIRI, and ultimately Human / the Turing Test); understanding the question is the Eternal Quest]
Turning the Web into a Database

What you see when you look at my homepage …
Haixun Wang, Microsoft Research Asia
Email: haixunw@microsoft.com
Tel: +86-10-58963289
Tel: +1-914-902-0749
I joined Microsoft Research Asia in 2009. I was with IBM T. J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from the University of California, Los Angeles in June 2000.
What a machine sees when it looks at my homepage …
- a JPEG image / a jpeg file
- text in a big, bold font
- 4 lines of text
- another dozen lines of text with two embedded URLs
Semantic Web?
- "Number 1 trend in 2008" – Richard MacManus
- "The infrastructure to power the Semantic Web is already here." – Tim Berners-Lee
- "Unstructured information will give way to structured information – paving the road to intelligent computing." – Alex Iskold
More data beats better algorithms (Banko and Brill, 2001)

[Chart: English-Spanish translation quality on Microsoft technical texts, 2001-2007; mean translation quality (1 = incomprehensible, 4 = perfect) rising from roughly 2.5 towards 3.5; Systran shown as the rule-based baseline; annotation: "Improve algorithms, scale system, and add data!"; rule-based system with expensive custo…]