
Integrating senses: How AI is learning to see, hear, and interact

Roland Memisevic
Senior Director of Engineering at Qualcomm AI Research

Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others

September 24, 2024

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Agenda

? Key concept: streaming architecture
? Importance of datasets for end-to-end training
? Efficient human-AI interaction and video-based reasoning
? Improving streaming video LLMs using auxiliary tasks
? Q&A

Generative AI capabilities continue to increase

MODALITY AND USE CASE / CAPABILITY AND KPI:

? Longer context window: allows in-depth conversations
? Voice UI: voice is a natural and intuitive interface for conversation
? Large multimodal models: utilizing more sensing input modalities to better understand the world
? Personalization: fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
? Higher resolution: process higher-fidelity images for better accuracy
? Video & 3D: generating content for a richer and more realistic experience
? Agents: execute multi-step tasks with reasoning autonomously to achieve a goal

LoRA: low-rank adaptation
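Since the slide calls out LoRA as the personalization mechanism, here is a minimal sketch of the idea, assuming a PyTorch setting; the rank and alpha values are illustrative defaults, not values from this deck:

```python
# Minimal LoRA sketch: the pretrained weight is frozen and adapted by a
# trainable low-rank update B @ A, so a per-user or per-enterprise adapter
# is tiny compared with the base model.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero initial update
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```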

Full-stack AI optimization for LVMs

? Runs completely on the device
? Significantly reduces runtime latency and power consumption
? Continuously improves the Qualcomm? AI Stack

LVM: language vision model

Designing an efficient diffusion model through knowledge distillation for high accuracy: knowledge distillation is used for pruning and removing attention blocks, resulting in an accurate model with improved performance and power efficiency (sketched below).
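As a rough illustration of the distillation objective described above (not Qualcomm's actual training code; `teacher`, `student`, and the argument shapes are placeholders), a pruned student can be trained to match the full teacher's prediction:

```python
# Knowledge-distillation step for an efficient diffusion model: the pruned
# student (e.g., with some attention blocks removed) regresses onto the
# teacher's output, preserving accuracy at lower compute.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, latents, timesteps, cond):
    with torch.no_grad():
        target = teacher(latents, timesteps, cond)  # teacher's noise estimate
    pred = student(latents, timesteps, cond)        # smaller, faster student
    loss = F.mse_loss(pred, target)                 # match the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```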

? Qualcomm? AI Engine Direct for improved performance and minimized memory spillage
? AI acceleration on the Qualcomm? Hexagon? NPU of the Snapdragon? 8 Gen 3 Mobile Processor

Hybrid AI

Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences.

? Central cloud: ease of development & deployment | training | very large models | aggregation | absolute performance
? Edge cloud (on-prem or nearby): immediacy | reliability | personalization | privacy | security | fine-tuning | aggregation
? On device: immediacy | reliability | personalization | privacy | security | cost | energy

To scale, the center of gravity of AI processing is moving to the edge.

World's first large multimodal model (LMM) on an Android phone

LLMs can now see:

? 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
? Multi-turn intuitive conversations about an image at a responsive token rate
? Full-stack AI optimization to achieve high performance at low power
? Enhanced privacy, reliability, personalization, and cost with on-device processing

LLM: large language model; LLaVA: Large Language and Vision Assistant


Goal: Training AI models to see and interact with humans

SMART HOME | MOBILE | ROBOTICS

Situated vision-language models

? Process a live video stream in real time and dynamically interact with users
? Determine what to say and when to say it
? Enable the path to humanoids

[Diagram: a visually-grounded LLM in which vision and action recognition feed an orchestrator that coordinates an LLM, a frontend, and TTS.]

Open-ended, asynchronous interaction with situated agents is an open challenge:

? Limited to turn-based interactions about offline documents or images
? Limited to capturing momentary snapshots of reality in a VQA-style dialogue

Researching visually-grounded LLMs with the ability to reason and interact with the environment.

What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions (2024); OpenEQA: Embodied Question Answering in the Era of Foundation Models (2024); VQA: visual question answering

2010 | 2012 | 2014

SPEECH TO TEXT: audio → pipeline / neural network → text
OBJECT RECOGNITION: pixels → pipeline / neural network → objects
LANGUAGE TRANSLATION: English → pipeline / neural network → French

Neural networks have replaced increasingly complex computational pipelines.

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM

End-to-end backprop for agents

Key concept: multi-modal streaming architecture

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM, trained end-to-end
EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS

? An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user.
? Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge.

End-to-end learning requires a multi-modal streaming architecture.

End-to-end learning requires a multi-modal streaming architecture

[Diagram: EXTERNAL INPUT (e.g., camera) feeds an AUTO-REGRESSIVE LLM that emits LANGUAGE OR ACTIONS; its context window interleaves vision and text entries as FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN.]

? Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common.
? There are multiple different ways to combine visual information with language model tokens, e.g.:
  ? Cross-attention (e.g., Flamingo)
  ? Dedicated vision tokens (e.g., LLaVA)

These are good for applications like captioning and visual question answering. However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input.

? Challenges:
  ? Freely interleaved vision frames and language tokens (see the sketch after this list)
  ? Dependencies between vision frame rate and token rate
  ? Training data allowing a model to learn what to say and when
? Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, presented in the next slides.

"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al., 2022; "Visual Instruction Tuning", Liu et al., 2023
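A minimal sketch of the first challenge above, interleaving frame embeddings and text-token embeddings into one auto-regressive context (PyTorch; all module and variable names are illustrative, not from the talk):

```python
# Interleaving vision-frame embeddings with language-token embeddings in a
# single causal sequence, mirroring the FRAME TOKEN TOKEN TOKEN ... layout
# of the context window on the previous slide.
import torch
import torch.nn as nn

class InterleavedContext(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, frame_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Projects a frame feature vector into the LLM embedding space,
        # so each frame occupies one slot in the context window.
        self.frame_proj = nn.Linear(frame_dim, d_model)

    def forward(self, events):
        # `events` is a time-ordered list of ("frame", feature_tensor) or
        # ("tokens", token_id_tensor) entries.
        slots = []
        for kind, payload in events:
            if kind == "frame":
                slots.append(self.frame_proj(payload).unsqueeze(0))
            else:
                slots.append(self.token_emb(payload))
        # One sequence the auto-regressive LLM attends over causally.
        return torch.cat(slots, dim=0)  # (seq_len, d_model)
```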

Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants

Key requirement for end-to-end training: an aligned video feed (frames) + the assistant's comments (tokens).

? "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al., 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
? "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al., 2023: 1st-person videos showing preparation of cupcakes
? "Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al., 2024: 3rd-person videos showing fitness exercises and their corrections

FIT-Coach benchmark and dataset

Fitness questions dataset:
? 148 exercises
? 300k short-clip videos
? 470+ hours
? 1,900 unique participants
? 1.1M+ high-level question-answer pairs
? 400k+ fine-grained question-answer pairs

Fitness feedback dataset:
? 9+ hours of fitness coaching sessions
? 148 exercise sessions
? ~3.5-minute sessions with 5 to 6 exercises each
? 21 unique participants

A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction, aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain.

"Live Fitness Coaching as a Testbed for Situated Interaction", Panchal, Bhattacharyya, et al., 2024


Fitness assistant dataset and benchmark

Short video clips show the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of ~5-10 seconds each). Long-range videos show the user exercising, along with aligned comments by the coach (~200 sessions of 5-6 exercises each).

                                 | SHORT CLIPS           | LONG-RANGE
                                 | Train     | Test      | Train      | Test
Number of videos                 | 290,775   | 16,429    | 153        | 69
Unique participants              | 1,800+    | 100       | 21         | 7
Average duration (s)             | 5.6±1.1   | 5.6±1.2   | 213.4±3.1  | 213.7±3.3
Exercises per video              | 1         | 1         | 5-6        | 5-6
Total number of exercises        | 148       | 148       | 23         | 23
Total classes                    | 1,866     | 1,690     | -          | -

Fitness questions
Total high-level questions       | 1,193,056 | 78,390    | -          | -
Total fine-grained questions     | 404,082   | 80,694    | -          | -

Fitness feedbacks
Average feedbacks per exercise   | 2.0±10.1  | 2.4±6.9   | 5.0±1.3    | 5.0±1.2
Average silence period (s)       | n/a       | n/a       | 5.2±1.4    | 5.3±1.2
Average feedback length (words)  | 9.0±6.1   | 9.1±5.0   | 6.3±3.8    | 6.6±4.0

Fitness assistant dataset and benchmark

Long fitness sessions dataset | Short fitness clips dataset

Our dataset meets all the needs of interactive AI assistants

DATASET                              | DOMAIN        | LENGTH (h)
Action recognition datasets:
  NTU RGB+D                          | Fitness       | -
  FineGym                            | Fitness       | 708
Procedural activity datasets:
  YouCook2                           | Cooking       | 176
  Epic-Kitchens                      | Cooking       | 100
  HowTo100M                          | Daily life    | 134k
  Ego-4D                             | Daily life    | 3,670
  Ego-Exo4D                          | Daily life    | 1,422
  Assembly-101                       | Toy assembly  | 513
Interactive AI assistant datasets:
  WTAG                               | Cooking       | 10
  HoloAssist                         | Obj. manip.   | 166
  QEVD (ours)                        | Fitness       | 474

Each prior dataset lacks one or more of the required properties: human actions, interactivity, mistakes, corrective feedbacks, and domain expertise. QEVD (ours) covers all of them.

Efficient human-AI interaction and video-based reasoning

Detailed architecture: learning what to say and when to say it

[Diagram: an auto-regressive LLM whose language backbone interleaves self-attention blocks with cross-attention into features from a 3D CNN over the visual stream (external input, e.g., a camera). Given a prompt, the model emits a <next> token while it has nothing to say and a <feedback> token followed by feedback text (e.g., "smooth") when it speaks, producing language or actions.]

Steppable causal 3D convolutions enable efficient streaming motion perception

? Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor. This makes them unsuitable for tasks such as fitness coaching, which require an understanding of human behaviors and motion patterns.
? We instead use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al., 2023).
? Efficient visual streaming at inference time can be enabled using steppable, causal convolutions (standard conv vs. causal conv vs. steppable conv over time steps); a minimal sketch follows below.

Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense

"Is end-to-end learning enough for fitness activity recognition?", Mercier et al., 2023
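A minimal sketch of a steppable causal convolution in the temporal dimension (PyTorch; a 1-D stand-in for the full 3D-CNN case, with illustrative names): at inference the layer consumes one frame at a time and caches the last kernel_size - 1 inputs, so each new camera frame costs one convolution step while reproducing the offline causal output.

```python
# Steppable causal temporal convolution with a state buffer: streaming
# inference processes one frame per call instead of re-running the whole clip.
import torch
import torch.nn as nn

class SteppableCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # No padding here; causality comes from the buffer of past frames.
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.buffer = None  # holds the last (kernel_size - 1) inputs

    def reset(self):
        self.buffer = None

    def step(self, frame):
        # frame: (batch, channels) features for the newest time step.
        x = frame.unsqueeze(-1)  # (batch, channels, 1)
        if self.buffer is None:
            # Left-pad with zeros, like a causal convolution at t = 0.
            pad = x.new_zeros(x.shape[0], x.shape[1], self.kernel_size - 1)
            self.buffer = pad
        window = torch.cat([self.buffer, x], dim=-1)   # last k inputs
        self.buffer = window[:, :, 1:]                 # slide the window
        return self.conv(window).squeeze(-1)           # (batch, channels)
```

Stacking such layers (and their 3-D analogues over spatial feature maps) gives the streaming behavior described above: each incoming frame updates the features with constant work per step, matching the corresponding offline causal convolution.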

Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task in itself; it also helps a model acquire a degree of "common sense". We use a language decoder to provide surrogate tasks to the model at training time.

Pre-training a model on a difficult captioning task (Something-Something, Goyal et al., 2017*) allows us to improve prediction accuracy on a separate home cooking task ("On the effectiveness of task granularity for transfer learning", Mahdisoltani et al., 2018).

[Chart: accuracy on the home cooking task for different pre-training objectives: generating complex textual descriptions, generating simple textual descriptions, classification on 178 action classes, classification on 40 action groups, baseline classification on images, and training from scratch; reported values range from 7.7 to 62.8.]

* "The something-something video database for learning and evaluating visual common sense" (Goyal et al., 2017)


A vision-language model can learn low-level visual skills by encoding visual information as language

Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc. The use of these visual skills at inference time is akin to performing chain-of-thought reasoning for visual inference tasks.

"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al., 2024
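For illustration, low-level visual observations might be serialized like this (a hypothetical tag format, not the paper's exact scheme):

```python
# Serializing detector output (hypothetical) into text so a vision-language
# model can be supervised on surrogate tasks such as identification and
# localization, and can "think out loud" about them before answering.
def detections_to_text(detections):
    """detections: list of (label, (x1, y1, x2, y2)) in pixel coordinates."""
    parts = []
    for label, (x1, y1, x2, y2) in detections:
        parts.append(f"<obj>{label}</obj><box>{x1},{y1},{x2},{y2}</box>")
    return " ".join(parts)

# At inference, emitting these low-level facts first plays the role of
# chain-of-thought reasoning for a visual question.
print(detections_to_text([("ball", (12, 40, 60, 88)), ("cup", (100, 30, 140, 75))]))
```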


Example: CATER (Girdhar et al., 2020):

Method                      | Static camera       | Moving camera
                            | Top-1    | Top-5    | Top-1    | Top-5
ALOE (Ding et al.)          | 74.0     | 94.0     | 59.7     | 90.1
TFC V3D (Zhang et al.)      | 79.7     | 95.5     | -        | -
LRR (w/o surrogate tasks)   | 68.5     | 88.7     | 62.7     | 86.7
LRR (fine-tuned)            | 84.1     | 97.2     | 80.4     | 96.7
LRR (joint)                 | 81.0     | 97.3     | 73.7     | 95.6

Example: Something-Else (Materzynska et al., 2020):

Method                                        | Base               | Compositional
                                              | Top-1    | Top-5   | Top-1    | Top-5
STIN+OIE+NL (Materzynska et al., 2020, MIT)   | 78.1     | 94.5    | 56.2     | 81.3
Video-ChatGPT (Maaz et al., 2023)             | 52.6     | 75.8    | 38.6     | 67.8
LRR (w/o surrogate tasks)                     | 52.6     | 75.8    | 50.1     | 70.8
LRR (fine-tuned)                              | 80.2     | 96.1    | 62.0     | 86.3
LRR (joint)                                   | -        | -       | 61.1     | 85.4

Stochastic probing allows us to distill visual skills into the model

? Encoding the extracted low-level information as tokens grows the context window, which can be inefficient.
? Relying on explicit representations of low-level computer-vision features (such as bounding-box positions) may also lead to brittleness.
? We therefore propose to distill low-level visual skills into the model using a process we refer to as stochastic probing: during training, prompt the model at random time steps to perform low-level visual tasks (see the sketch after this list).
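A minimal sketch of what such probing could look like in a training-data pipeline (the names and probing rate are assumptions for illustration, not the paper's implementation):

```python
# Stochastic probing: at randomly chosen time steps during training, insert
# a low-level visual probe (prompt + answer) that is supervised like
# ordinary text, distilling the skill into the weights without inflating
# the context window at inference time.
import random

PROBE_PROB = 0.1  # assumed probing rate, not from the source

def build_training_sequence(frames, main_targets, probe_bank):
    """Interleave main-task targets with random low-level probes.

    frames:       list of per-time-step frame features
    main_targets: list of per-time-step target token sequences
    probe_bank:   callable t -> (probe_prompt_tokens, probe_answer_tokens)
    """
    sequence = []
    for t, (frame, target) in enumerate(zip(frames, main_targets)):
        sequence.append(("frame", frame))
        if random.random() < PROBE_PROB:
            prompt, answer = probe_bank(t)
            sequence.append(("supervised_tokens", prompt + answer))
        sequence.append(("supervised_tokens", target))
    return sequence
```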

Example: ACRE:

Method                     | Compositional | Systematic | Inference speed* (sec)
ALOE (Ding et al.)         | 91.7          | 93.9       | -
LRR                        | 99.3          | 99.5       | 1.415
LRR (stochastic probing)   | 99.2          | 98.2       | 0.061

* timing on an A100 GPU

Stochastic probing boosts efficiency at inference time. Training on visual skills can boost performance over classic approaches.

A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023

End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time.

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Question: Provide an appropriate feedback for the user.

? Video-LLaMA: "We see a young man standing in a kitchen, wearing a red shirt and white shorts."
? Video-ChatGPT: "The user has successfully demonstrated the ability to perform a balancing act on a pair of stools."
? Coach-LLaMA: "This is awesome. Let's keep the intensity high!"

[Video comparison: ground truth vs. Stream-VLM, LLaMA-VID, and LLaVA-Next]

Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

METHOD         | METEOR | ROUGE-L | BERT  | LLM-Acc.
InstructBLIP   | 0.047  | 0.040   | 0.839 | 1.64
Video-LLaVA    | 0.057  | 0.025   | 0.847 | 1.82
Video-ChatGPT  | 0.098  | 0.078   | 0.850 | 2.27
Video-LLaMA    | 0.101  | 0.077   | 0.859 | 2.28
LLaMA-VID      | 0.100  | 0.079   | 0.859 | 2.33
LLaVA-Next     | 0.104  | 0.078   | 0.858 | 2.39

Fine-tuning results:

METHOD                          | METEOR | ROUGE-L | BERT  | LLM-Acc. | T-F-Score
Socratic-Llama-2-7B             | 0.094  | 0.071   | 0.860 | 2.39     | 0.50
Video-ChatGPT*                  | 0.108  | 0.093   | 0.863 | 2.42     | 0.50
LLaMA-VID*                      | 0.106  | 0.090   | 0.860 | 2.40     | 0.50
STREAM-VLM                      | 0.125  | 0.116   | 0.863 | 2.56     | 0.59
STREAM-VLM (w/o 3D CNN)         | 0.090  | 0.083   | 0.857 | 2.17     | 0.51
STREAM-VLM (w/o action tokens)  | 0.125  | 0.110   | 0.861 | 2.56     | 0.50
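For reference, the caption-style metrics in these tables can be computed with common open-source packages; this is an illustrative recipe (rouge-score, bert-score, and nltk with WordNet data, where the BERT column is assumed to mean BERTScore), not necessarily the exact scorers behind the numbers above:

```python
# Computing METEOR, ROUGE-L, and BERTScore for a feedback hypothesis
# against a reference, using standard open-source implementations.
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score  # needs nltk wordnet data

ref = "Keep your back straight during the squat"
hyp = "Keep the back straight while squatting"

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(ref, hyp)["rougeL"].fmeasure
meteor = meteor_score([ref.split()], hyp.split())
_, _, bert_f1 = bert_score([hyp], [ref], lang="en")

print(f"ROUGE-L={rouge_l:.3f}  METEOR={meteor:.3f}  BERT-F1={bert_f1.item():.3f}")
```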

Outlook: CLEVRskills dataset for robotics foundation models

DATASET / SIMULATOR  | #TASKS | LANGUAGE | MULTIMODAL PROMPTS | ACTION GRANULARITY     | COMPOSITIONALITY | #DEMONSTRATIONS
Real:
  RoboTurk           | 3      | x        | x                  | Action deltas          | x                | 111 hrs
  BridgeData         | 71     | x        | x                  | Action deltas          | x                | 7.2k
  Open-X             | -      | √        | x                  | Action deltas          | x                | 1M
  RH20T              | -      | √        | x                  | Action deltas          | x                | 100k
  FMB                | 7      | x        | x                  | Action deltas          | √                | 22.5k
Simulated:
  CALVIN             | 34     | √        | x                  | Action deltas          | √                | -
  Behaviour-1K       | 1000   | x        | x                  | Action deltas          | x                | -
  Maniskill2         | 20     | x        | x                  | Action deltas          | x                | ≈70k
  VIMA               | 17     | √        | √                  | Poses                  | x                | 650k
  ClevrSkills (ours) | 36     | √        | √                  | Action deltas + poses  | √                | 330k

Running AI on device saves memory
