Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo, Department of Computing, Imperial College London

Model: Mathematical Representation of a Simplified Physical World
Modelling is an essential and inseparable part of all scientific activity. A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

To understand the world or an object (called a target T), a model M is built as a simplified mathematical representation of it. A model is the result of abstraction from the observations made, and it is used to give predictions.

[Diagram: observation by human / sensor; abstraction by human / machine; prediction by human / machine]

No Model Is Perfect:
- Inherent Uncertainty: These targets consist of a set of continuous phenomena (in both time and space), and they typically produce rich signals. Because of the continuity of the target in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which brings in the "uncertainty".
- Overfitting or Underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as the order K. Given that the information from observations is partial, it is hard to make a perfect choice of K. Such imperfection causes model error: underfitting (small K) or overfitting (large K).
- Simplification: From observations, we project the multi-dimensional world onto a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.

Nonlinear regression example: a K-order polynomial (sketched below).
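A minimal sketch (not from the slides) of how the polynomial order K drives underfitting and overfitting; the target function, noise level, and orders tried are illustrative assumptions.

```python
import numpy as np

# Illustrative target: a smooth nonlinear function observed with noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)

for K in (1, 3, 9):                                  # small K underfits, large K overfits
    coeffs = np.polyfit(x_train, y_train, deg=K)     # least-squares K-order polynomial
    y_pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    print(f"K={K}  RMSE against the true curve: {rmse:.3f}")
```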
George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. (1980)

Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them." (2008)
Chris Anderson (Wired): "There is now a better way. Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." (The Data Deluge Makes the Scientific Method Obsolete) (2012)

So, Why Model? The Argument:
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.

For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, it can translate Klingon into Farsi as easily as it can translate French into German), and why it can match ads to content without any knowledge or assumptions about the ads or the content.
Model Free Sensor Informatics: Query Driven

A sensor network writes raw data into a database table, e.g.:

time   id   temp
10am   1    20
10am   2    21
…      …    …
10am   7    29

The query-driven workflow:
1. Extract all readings into a file.
2. Run MATLAB/R/other data processing tools.
3. Write the output to a file / back to the database.
4. Write data processing tools to process/aggregate the output (maybe using the DB).
5. Decide what new data to acquire; repeat.
Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.
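A minimal sketch of this "sensing as querying" view, using Python's built-in sqlite3 and an illustrative raw_data table (the schema and readings are assumptions, not from the slides):

```python
import sqlite3

# Raw sensor readings land in a table; analysis is just SQL over that table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (time TEXT, id INTEGER, temp REAL)")
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?)",
    [("10am", 1, 20.0), ("10am", 2, 21.0), ("10am", 7, 29.0)],
)

# "Sensing" a derived quantity is a query against the stored readings.
for sensor_id, avg_temp in conn.execute(
    "SELECT id, AVG(temp) FROM raw_data GROUP BY id ORDER BY id"
):
    print(f"sensor {sensor_id}: average temperature {avg_temp:.1f}")
```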
WikiSensing: A Model Free Sensor Informatics System Based on a Big Data Architecture

Model Free Sensing is Super Inefficient:
- Data misrepresentation without a model
- Latent information missing without a model
- High demand on computation/storage without a model
- Requires too much interoperability between sensors and analytics
Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!

Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).
If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.
However, suppose a is the statement "Britain sweeps the board at the 2012 London Olympics, winning more than 30 Gold Medals!", made before the 28th of July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement depending on their specific knowledge of factors that might affect its likelihood.
Beliefs in the model were changing daily, based on the performance data available each day. By the 10th of August, most people's belief in this model should be almost 80%.
Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a.
The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference
- Bayes' rule: the interaction between data and model
- Learning as a sequence of interactions (sketched below)

P(θ | Y) = p(Y | θ) p(θ) / p(Y)
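A minimal sketch of learning as a sequence of Bayesian interactions, in the spirit of the Olympics example above; the Beta-Bernoulli model, the meaning of θ, and the daily observations are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

# theta = probability that a given day's results keep Britain "on track" for 30+ golds;
# each day's observation y in {0, 1} revises the belief via Bayes' rule.
a, b = 1.0, 1.0                                      # uniform prior Beta(1, 1) over theta
daily_observations = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical daily data

for day, y in enumerate(daily_observations, start=1):
    a, b = a + y, b + (1 - y)                        # conjugate posterior update
    print(f"day {day:2d}: posterior mean belief = {beta(a, b).mean():.2f}")
```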
Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
- We need models: a model is the representation of our knowledge so far
- Data: the observations which may revise our belief in the models we have
- Analysis: assessing our belief and updating our models to make them more believable
- Sensing: acquiring the data needed to update (enrich) models
- Models are learned from data (observations) by scientists (theoretical abstraction) or by machines (machine learning)
- Models are hypotheses (when making new observations)
- Models are knowledge (when belief is established)
Sensor Informatics:
- Sensing management. Managing the "neediness": when and where to sense?
- Sensing analytics. Managing model updating: how to enrich models with observations?
- Reasoning. Decision making based on the integration of trusted models.

P(M | D) = P(D | M) P(M) / P(D)
Surprising Event: When an Observation Does Not Fit a Known Model
- When the posterior and the prior (P(M|D) vs. P(M)) show great variance -> surprise!
- How great is "great variance"? A surprise threshold α.
- Kullback-Leibler divergence: Surprise = KL( P(M|D) || P(M) ) = ∫ P(M|D) log [ P(M|D) / P(M) ] dM (a sketch follows this list)
- Other methods: significance level, Chebyshev's Theorem, …
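A minimal sketch of the KL-based surprise test; the Gaussian prior/posterior, the rainfall numbers, and the threshold α are illustrative assumptions.

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    # KL( N(mu_p, var_p) || N(mu_q, var_q) ), closed form for 1-D Gaussians, in nats.
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Prior belief about rainfall at location A given B, and two candidate posteriors.
prior_mu, prior_var = 100.0, 15.0 ** 2
alpha = 0.5                                          # assumed surprise threshold

for post_mu, post_var in [(105.0, 12.0 ** 2), (160.0, 12.0 ** 2)]:
    surprise = kl_gaussian(post_mu, post_var, prior_mu, prior_var)
    status = "surprise!" if surprise > alpha else "model consistent"
    print(f"posterior mean {post_mu:5.1f} mm: KL = {surprise:6.2f} nats -> {status}")
```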
From the model, we get C(A, B) (e.g. a multivariate Gaussian distribution):
- A: 100 mm, B: 50 mm -> model consistent
- A: 100 mm, B: 500 mm -> surprise!
Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information.
Why sense so much data and then throw it away? Why not sense the information directly?
Using Compressive Sensing Technology to Optimize Observations
- Compressive sensing: take advantage of sparseness to solve for under-determined signals with only a small number of measurements.
- Unobserved behaviour (behaviour not captured by the current model) is typically sparse.
- Reconstruction methods: L1-min, Bayesian CS (a sketch follows this list).
- Sensed data is enough when we can recover the needed information through compressive sensing.
- Ψ: CS matrix built from the model; Φ: placement matrix.
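A minimal sketch of recovery by L1-minimization, using iterative soft-thresholding (ISTA) as one possible solver; the random Gaussian matrix standing in for ΦΨ, the sparsity level, and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 200, 60, 5                         # signal length, measurements, nonzeros

x_true = np.zeros(n)                         # sparse signal in the model basis Psi
x_true[rng.choice(n, k, replace=False)] = rng.normal(0, 1, k)
A = rng.normal(0, 1.0 / np.sqrt(m), (m, n))  # stands in for the combined matrix Phi @ Psi
y = A @ x_true                               # m << n measurements

# ISTA for min 0.5*||y - A x||_2^2 + lam*||x||_1
lam, step = 0.01, 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(2000):
    x = x - step * (A.T @ (A @ x - y))                        # gradient step
    x = np.sign(x) * np.maximum(np.abs(x) - lam * step, 0.0)  # soft threshold

print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```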
How to Update the Model – Parameter Estimation

[Figure: ANSYS nodal temperature solution (TEMP, AVG), STEP = 360, TIME = 1800, SMN = 131.03, SMX = 646.41]

Estimating the parameter θ to maximize the likelihood of the data given the model:

θ* = argmax_θ P(D | M, θ)
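A minimal sketch of maximum-likelihood parameter estimation by minimizing the negative log-likelihood; the Gaussian observation model and the synthetic readings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
data = rng.normal(loc=21.5, scale=2.0, size=500)   # hypothetical temperature readings D

def neg_log_likelihood(theta):
    # -log P(D | M, theta) for a Gaussian model M with theta = (mu, log_sigma).
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                      # parameterise sigma > 0
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((data - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"theta*: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```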
Model: An Example in the Digital City

Modelling City Life via Causality:
- C(eA, eB) is used to predict the current value at a location A when the value at another location B is given.
- Location: physical / logical locations with causality (through the sensory cortex) (city areas A, B)
- Relationship: topology (geo-topology between A and B: diffusion structure)
- Event: events, i.e. the dynamics of the observable signal S = f(E) (e.g. heavy rainfall)
Ontologies are adopted to represent locations L, relationships R, events E, and signals S.

Diffusion: an event e1 ∈ E in node n1 causes another event e2 ∈ E in node n2 when the two nodes n1, n2 in G are linked.
Digital City Model: looking into the details
- System T = (L, R, E)
- Model M(T) = (G, ?, B)
- Training for the causality ?: use a Bayesian network to represent the conditional independencies between cause and target variables:
  1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM) (a sketch follows below)
  2. Gaussian Processes with Bayesian inference
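A minimal sketch of option 1, fitting a GMM by EM to the joint behaviour of two locations; scikit-learn's GaussianMixture, the two-regime synthetic data, and the component count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical joint samples of (event magnitude at B, event magnitude at A), e.g. rainfall.
regime1 = rng.multivariate_normal([50, 100], [[40, 30], [30, 60]], size=300)
regime2 = rng.multivariate_normal([10, 20], [[10, 5], [5, 15]], size=300)
samples = np.vstack([regime1, regime2])

# Fit the joint density behind C(eB, eA) as a 2-component GMM via EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(samples)
print("component means:\n", gmm.means_)
print("log-likelihood of a consistent pair (50, 100):", gmm.score_samples([[50, 100]])[0])
print("log-likelihood of a surprising pair (50, 500):", gmm.score_samples([[50, 500]])[0])
```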
When the surprise > the surprise threshold:
- Diversity is detected: identify the incorrect causality C(el, ep), which is sparse -> a compressive sensing approach.
- New observation -> a measurement that could revise the model, within the model space, to maximize the likelihood of the observations.
- Placement: focus on the diversity.
Model Updating / Model Driven Sensing: No Surprise!
- The dynamics of model update: Surprise -> Sensing -> Model Updating
- The goal of sensing: capturing surprise
- The goal of analysis: revising the model
- A model cannot overfit / underfit: whenever there is diversity, it can be updated -> consistent with the universe (the target)
Model Update

It's Bayesian: P(M, θ | D) = P(D | M, θ) P(M, θ) / P(D), where T is the target, M the model, and θ the top-down parameter.

* When θ is fixed: P(M | D) = P(D | M) P(M) / P(D)
-> The variance between posterior and prior is the "surprise" -> bottom-up attention -> model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle (a sketch of this cycle follows below).

* When θ is updated: P(M, θ) = P(M | θ) P(θ)
-> top-down attention (alertness) -> model update
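A minimal sketch of the forecast–analysis cycle described above, using a scalar Kalman-style Bayesian update; the dynamics, noise levels, and initial values are illustrative assumptions, not the slides' model.

```python
import numpy as np

rng = np.random.default_rng(4)

def advance(state):                  # hypothetical model step: produces the "forecast"
    return 0.95 * state + 1.0

truth, estimate, variance = 20.0, 10.0, 25.0
obs_var, model_var = 4.0, 1.0

for cycle in range(5):
    # Forecast: advance the model and inflate its uncertainty.
    truth = advance(truth) + rng.normal(0, np.sqrt(model_var))
    estimate, variance = advance(estimate), 0.95**2 * variance + model_var

    # Analysis: combine the forecast with a new observation (Bayes / Kalman gain).
    obs = truth + rng.normal(0, np.sqrt(obs_var))
    gain = variance / (variance + obs_var)
    estimate, variance = estimate + gain * (obs - estimate), (1 - gain) * variance

    print(f"cycle {cycle}: truth={truth:6.2f}  analysis={estimate:6.2f}  var={variance:5.2f}")
```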
Adaptive Observation: Sensing and Numerical Modelling
- CityGML Ontology -> GIS -> Geometry mesh

Building an Initial Model and Making Predictions by Simulation
- Setting up boundary conditions, numerical schemas, model parameters, etc.
- Simulation of a 24-building case (fine mesh, 600,000 nodes) on 20 processors
- Simulation of moving vehicles and scalar dispersion in street canyons
Using Sensors to Verify the Prediction Results of the Model

Sensing: acquiring data to obtain the posterior of the model, in order to validate (confirm consistency with) or update the model:

P(M | D) = P(D | M) P(M) / P(D)

[Diagram: sensing produces data; the model is validated or updated; new data is then acquired]
New WikiSensing: An Elastic Sensing Environment for Large Scale Sensor Informatics
- Elastic sensing theory based on Bayesian inference
- Big Data architecture for large scale sensory data management
- Ontology for background knowledge management
- Model driven adaptive observation support
- Digital City and digital life applications

The architecture of the New WikiSensing System

Ontology Used to Organise the Complex Knowledge Management
- Using ontology to represent the targets, signals, sensing methods, measurements, etc.
- Ontology to support flexible resolution
- Upper ontology for unified operation (OntoSensor)

Conclusion
- Big data offers a great opportunity for building smart models
- Big data provides a new methodology for model research
- New informatics comes from the closely coupled integration of the data and model worlds
- Bayesian theory provides a natural foundation for such an integration
- Sensor informatics is a good example of such a paradigm
- A new uniform framework for sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
- We are developing the WikiSensing system to realise this paradigm

Thank you
Understanding Big Data
Haixun Wang

Data Explosion
- MB = 10^6 bytes: a typical book in text format
- GB = 10^9 bytes: a one-hour video is about 1 GB; the data produced by a biology experiment in one day
- TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)
The Arecibo Telescope
- World's largest radio telescope
- Diameter: 305 m (1,000 ft); area: 18 acres
- Location: Arecibo, Puerto Rico
- The P-ALFA surveys: 800 terabytes in 5 years

Software Driven Telescope
- From few, large, expensive, directional dishes to many, small, cheap, omni-directional antennae
- A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area 340 km in diameter)
Challenge 1: It's the data, stupid!

[Chart: data size vs. data complexity, spanning key/value stores, column stores, document stores, and graph systems]

- Big data drives tomorrow's economy.
- The value of big data lies in its degree of connectedness.
- Existing systems cannot handle the rich connectedness of big data.
RDBMS and Rich Relationships
- The performance of multi-way joins is very poor in an RDBMS.
- Managing data with rich connectedness requires multi-way joins in an RDBMS.
Trinity
- A general-purpose, distributed, in-memory graph system
- Online graph query processing
- Offline graph analytics

Trinity Performance Highlights
- Online query processing: visiting 2.2 million users (a 3-hop neighborhood, sketched below) on Facebook in <= 100 ms; the foundation for graph-based services, e.g. entity search
- Offline graph analytics: one iteration on a 1-billion-node graph in <= 60 sec; the foundation for analytics, e.g. social analytics

People Search Demo

Multi-way Join vs. Graph Traversal
[Diagram: Company, Incident, and Problem tables chained through ID columns by multi-way joins in an RDBMS, versus direct graph traversal in Trinity]
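A minimal sketch of the traversal side of this comparison: a k-hop neighborhood computed by breadth-first search over an in-memory adjacency list. The Python dict and node names are illustrative assumptions, not Trinity's API; the relational equivalent would need k self-joins on an edge table.

```python
from collections import deque

# Illustrative in-memory adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

def k_hop_neighborhood(graph, start, k):
    """All nodes reachable from `start` within k hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(k_hop_neighborhood(graph, "alice", 3))   # {'bob', 'carol', 'dave', 'erin', 'frank'}
```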
Challenge 2: Interpretation of Big Data
- IBM Watson runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power.
- A human brain runs on a tuna fish sandwich and a glass of water.

[Chart: answering the question, from simple calculation over a domain-specific language (calculator, SQL) to inferencing & reasoning over unconstrained natural language (Google/Bing, Wolfram Alpha, Watson, SIRI, and ultimately Human / the Turing Test); understanding the question is the Eternal Quest]
Turning the Web into a Database

What you see when you look at my homepage …
Haixun Wang, Microsoft Research Asia
Email: haixunw@microsoft.com
Tel: +86-10-58963289
Tel: +1-914-902-0749
I joined Microsoft Research Asia in 2009. I was with IBM T. J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from the University of California, Los Angeles in June 2000.
What a machine sees when it looks at my homepage …
- a JPEG image / a jpeg file
- text in a big, bold font
- 4 lines of text
- another dozen lines of text with two embedded URLs
Semantic Web?
- "Number 1 trend in 2008" – Richard MacManus
- "The infrastructure to power the Semantic Web is already here." – Tim Berners-Lee
- "Unstructured information will give way to structured information – paving the road to intelligent computing." – Alex Iskold
More data beats better algorithms (Banko and Brill, 2001)

[Chart: English-Spanish translation quality on Microsoft technical texts, 2001-2007; mean translation quality (1 = incomprehensible, 4 = perfect) rising from roughly 2.5 towards 3.5; Systran shown as the rule-based baseline; annotation: "Improve algorithms, scale system, and add data!"; rule-based system with expensive custo…]