Integrating senses: How AI is learning to see, hear, and interact

Roland Memisevic
Senior Director of Engineering at Qualcomm AI Research
Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others
September 24, 2024

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Agenda
• Key concept: streaming architecture
• Importance of datasets for end-to-end training
• Efficient human-AI interaction and video-based reasoning
• Improving streaming video LLMs using auxiliary tasks
• Q&A
Generative AI capabilities continue to increase

MODALITY AND USE CASE: CAPABILITY AND KPI
• Longer context window: allows in-depth conversations
• Voice UI: voice is a natural and intuitive interface for conversation
• Large multimodal models: utilizing more sensing input modalities to better understand the world
• Personalization: fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA; see the sketch below)
• Higher resolution: process higher-fidelity images for better accuracy
• Video & 3D: generating content for a richer and more realistic experience
• Agents: execute multi-step tasks with reasoning autonomously to achieve a goal

LoRA: low-rank adaptation
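Since LoRA is called out above as the mechanism behind personalization, here is a minimal PyTorch sketch of what low-rank adaptation looks like in general; the class name, rank, and layer sizes are illustrative assumptions, not Qualcomm's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha/r) * B(A x), with the base weight W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: adapt a single projection layer; only the two small low-rank matrices are trained.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))
```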
Full-stack AI optimization for LMs
• Runs completely on the device
• Significantly reduces runtime latency and power consumption
• Continuously improves the Qualcomm® AI Stack

LM: language vision model

Designing an efficient diffusion model through knowledge distillation for high accuracy
• Knowledge distillation for pruning and removal of attention blocks, resulting in an accurate model with improved performance and power efficiency (see the generic sketch below)
• Qualcomm® AI Engine Direct for improved performance and minimized memory spillage
• AI acceleration on the Qualcomm® Hexagon™ NPU of the Snapdragon® 8 Gen 3 Mobile Processor
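The distillation recipe is only described at a high level on this slide. As a generic illustration of the idea (not the actual pipeline used for the diffusion model), a distillation step trains the pruned student to reproduce a frozen teacher's outputs:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, w_feat=1.0):
    """One generic distillation step: the pruned student mimics the teacher's output.
    `student` and `teacher` are any modules mapping x -> features; purely illustrative."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(x)                   # teacher prediction, no gradients
    pred = student(x)                         # student, e.g., with attention blocks removed
    loss = w_feat * F.mse_loss(pred, target)  # match outputs (or intermediate features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```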
Hybrid AI
Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences.
• Central cloud: Ease of development & deployment | Training | Very large models | Aggregation | Absolute performance
• Edge cloud (on-prem or nearby): Immediacy | Reliability | Personalization | Privacy | Security | Fine-tuning | Aggregation
• On device: Immediacy | Reliability | Personalization | Privacy | Security | Cost | Energy

To scale, the center of gravity of AI processing is moving to the edge.
World's first large multimodal model (LMM) on an Android phone

LLMs can now see
• 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
• Multi-turn intuitive conversations about an image at a responsive token rate
• Full-stack AI optimization to achieve high performance at low power
• Enhanced privacy, reliability, personalization, and cost with on-device processing

LLM: large language model; LLaVA: Large Language and Vision Assistant
Goal: Training AI models to see and interact with humans
SMART HOME | MOBILE | ROBOTICS
Visually-grounded LLM

[System diagram: Vision, Action recognition, Orchestrator, LLM, Frontend, TTS]

Situated vision-language models
• Process a live video stream in real time and dynamically interact with users
• Determine what to say and when to say it
• Enable the path to humanoids

Open-ended, asynchronous interaction with situated agents is an open challenge; existing models are:
• Limited to turn-based interactions about offline documents or images
• Limited to capturing momentary snapshots of reality in a VQA-style dialogue

We are researching visually-grounded LLMs with the ability to reason and interact with the environment.

"What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions" (2024); "OpenEQA: Embodied Question Answering in the Era of Foundation Models" (2024); VQA: visual question answering
[Timeline 2010, 2012, 2014:]
• SPEECH TO TEXT: Audio → Pipeline / Neural network → Text
• OBJECT RECOGNITION: Pixels → Pipeline / Neural network → Objects
• LANGUAGE TRANSLATION: English → Pipeline / Neural network → French

Neural networks have replaced increasingly complex computational pipelines.
End-to-end backprop for agents
[Diagram: INPUT STREAM → (auto-regressive) NEURAL NETWORK → BEHAVIOR STREAM]
Key concept: multi-modal streaming architecture

[Diagram: EXTERNAL INPUT (e.g., camera) → INPUT STREAM → auto-regressive LLM → BEHAVIOR STREAM → LANGUAGE OR ACTIONS, trained end-to-end]

• An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user.
• Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge.

End-to-end learning requires a multi-modal streaming architecture.
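To make the streaming concept concrete, the following is a rough sketch of an inference-time loop for such an agent; `camera.read`, `model.step`, and `model.silent` are assumed placeholder interfaces rather than a real API.

```python
import time

def streaming_agent(camera, model, tokenizer, fps=4):
    """Illustrative streaming loop: ingest frames continuously, speak only when needed.
    All object interfaces here are assumptions used for illustration."""
    period = 1.0 / fps
    while True:
        frame = camera.read()                     # latest RGB frame
        token_id = model.step(frame=frame)        # advance the auto-regressive state by one step
        while not model.silent(token_id):         # keep decoding while the model wants to talk
            print(tokenizer.decode([token_id]), end="", flush=True)
            token_id = model.step(frame=None)     # no new frame between language tokens
        time.sleep(period)                        # wait for the next frame at the input frame rate
```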
End-to-end learning requires a multi-modal streaming architecture

[Diagram: EXTERNAL INPUT (e.g., camera) → auto-regressive LLM → LANGUAGE OR ACTIONS; the context window holds interleaved FRAME and TOKEN entries, e.g. FRAME, TOKEN, TOKEN, TOKEN, FRAME, TOKEN, TOKEN, TOKEN, ...]
• Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common.
• There are multiple ways to combine visual information with language model tokens (see the sketch below), e.g.:
  • Cross-attention (e.g., Flamingo)
  • Dedicated vision tokens (e.g., LLaVA)
…good for applications like captioning and visual question answering.

However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input.
• Challenges:
  • Freely interleaved vision frames and language tokens
  • Dependencies between vision frame rate and token rate
  • Training data allowing a model to learn what to say and when
• Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, presented in the next slides.

"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al. 2022; "Visual Instruction Tuning", Liu et al. 2023
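The two fusion strategies referenced above can be contrasted in a short PyTorch sketch; it is schematic only (dimensions and module names are assumptions), not the actual Flamingo or LLaVA code.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_vision = 512, 8, 1024   # sizes chosen only for illustration

# Option 1: dedicated vision tokens (LLaVA-style). Frame features are projected into the
# token-embedding space and spliced into the sequence that the language backbone consumes.
project = nn.Linear(d_vision, d_model)

def splice_vision_tokens(text_emb, frame_feats):
    vision_tokens = project(frame_feats)                 # (B, n_frames, d_model)
    return torch.cat([vision_tokens, text_emb], dim=1)   # one combined sequence

# Option 2: cross-attention (Flamingo-style). The text sequence keeps its length and each
# block attends to the projected frame features as keys/values.
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def cross_attend(text_hidden, vision_feats):
    attended, _ = cross_attn(text_hidden, vision_feats, vision_feats)
    return text_hidden + attended                        # residual fusion

text_emb = torch.randn(1, 16, d_model)                   # 16 text-token embeddings
frames = torch.randn(1, 8, d_vision)                     # 8 frame feature vectors
combined = splice_vision_tokens(text_emb, frames)        # (1, 24, d_model)
fused = cross_attend(text_emb, project(frames))          # (1, 16, d_model)
```

One practical difference in a streaming setting: dedicated vision tokens grow the context window with every incoming frame, while cross-attention keeps the text sequence length fixed; either way, the model still needs training data that teaches it when to speak.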
Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants
Key requirement for end-to-end training: aligned video feed (frames) + assistant's comments (tokens). (A sketch of how such aligned data can be turned into a training sequence follows below.)

• "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al. 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
• "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al. 2023: 1st-person videos showing preparation of cupcakes
• "Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al. 2024: 3rd-person videos showing fitness exercises and their corrections
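To make the "aligned frames + comments" requirement concrete, here is a hedged sketch of turning a timestamped coaching transcript into a single training sequence; the function and field names are invented for illustration and do not come from any of the cited datasets' toolkits.

```python
def build_training_sequence(frame_times, comments, tokenizer, frame_token_id, ignore_index=-100):
    """Interleave frame placeholders with time-aligned coach comments.
    `comments` is a list of (timestamp_seconds, text); all names are illustrative."""
    input_ids, labels = [], []
    pending = sorted(comments)                       # comments sorted by time
    for t in frame_times:                            # one placeholder per sampled frame
        input_ids.append(frame_token_id)
        labels.append(ignore_index)                  # no language loss on frame positions
        while pending and pending[0][0] <= t:        # comments uttered up to this frame
            _, text = pending.pop(0)
            ids = tokenizer.encode(text)
            input_ids.extend(ids)
            labels.extend(ids)                       # supervise what to say *and* when
    return input_ids, labels
```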
FIT-Coach benchmark and dataset

Fitness questions dataset:
• 148 exercises
• 300k short-clip videos
• 470+ hours
• 1,900 unique participants
• 1.1M+ high-level question-answer pairs
• 400k+ fine-grained question-answer pairs

Fitness feedback dataset:
• 9+ hours of fitness coaching sessions
• 148 exercise sessions
• ~3.5-minute-long sessions with 5 to 6 exercises
• 21 unique participants

A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction, aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain.

"Live Fitness Coaching as a Testbed for Situated Interaction", Panchal, Bhattacharyya, et al. 2024
Fitness assistant dataset and benchmark
• Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of ~5-10 seconds each)
• Long-range videos showing the user exercising, along with aligned comments by the coach (~200 sessions, 5-6 exercises each)
Dataset statistics | Short clips (train) | Short clips (test) | Long-range (train) | Long-range (test)
Number of videos | 290,775 | 16,429 | 153 | 69
Unique participants | 1,800+ | 100 | 21 | 7
Average duration (s) | 5.6±1.1 | 5.6±1.2 | 213.4±3.1 | 213.7±3.3
Exercises per video | 1 | 1 | 5-6 | 5-6
Total number of exercises | 148 | 148 | 23 | 23
Total classes | 1,866 | 1,690 | — | —
Fitness questions:
Total high-level questions | 1,193,056 | 78,390 | — | —
Total fine-grained questions | 404,082 | 80,694 | — | —
Fitness feedbacks:
Average feedbacks per exercise | 2.0±10.1 | 2.4±6.9 | 5.0±1.3 | 5.0±1.2
Average silence period (s) | n/a | n/a | 5.2±1.4 | 5.3±1.2
Average feedback length (words) | 9.0±6.1 | 9.1±5.0 | 6.3±3.8 | 6.6±4.0
Fitness assistant dataset and benchmark
[Example videos from the long fitness sessions dataset and the short fitness clips dataset]
Our dataset meets all the needs of interactive AI assistants

DATASET | DOMAIN | HUMAN ACTIONS | INTERACTIVE | MISTAKES | CORRECTIVE FEEDBACKS | DOMAIN EXPERTISE | LENGTH (hrs)
Action recognition datasets:
NTU RGB+D | Fitness | √ | x | x | x | √ |
FineGym | Fitness | √ | x | x | x | √ | 708
Procedural activity datasets:
YouCook2 | Cooking | x | x | x | x | x | 176
Epic-Kitchens | Cooking | x | x | x | x | x | 100
HowTo100M | Daily-life | √ | x | x | x | x | 134k
Ego-4D | Daily-life | x | x | x | x | x | 3,670
Ego-Exo4D | Daily-life | x | x | √ | x | x | 1,422
Assembly-101 | Toy assm. | x | x | √ | x | x | 513
Interactive AI assistant datasets:
WTAG | Cooking | x | x | √ | √ | x | 10
HoloAssist | Obj. manip. | x | x | √ | √ | x | 166
QEVD (ours) | Fitness | √ | √ | √ | √ | √ | 474
Efficient human-AI interaction and video-based reasoning
Detailed architecture: learning what to say and when to say it

[Architecture diagram: external input (e.g., camera) provides a visual stream that is encoded by a 3D CNN; the resulting features are fed, together with a prompt, into an auto-regressive LLM whose language backbone interleaves self-attention and cross-attention blocks. At each step the model outputs special tokens such as <next> and <feedback> to control when to stay silent and when to produce language or actions.]
Steppable causal 3D convolutions enable efficient streaming motion perception

• Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor. This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns.
• We use a 3D CNN as the feature extractor, which we have shown to be well suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al. 2023).
• Efficient visual streaming at inference time can be enabled using steppable, causal convolutions (see the sketch below).

[Diagram: standard convolution vs. causal convolution vs. steppable convolution applied over time steps (previous and new frames)]

Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense

"Is end-to-end learning enough for fitness activity recognition?", Mercier et al. 2023
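As referenced above, here is a sketch of a "steppable" causal convolution, simplified to one temporal dimension; the real feature extractor is a 3D CNN, and this class is an illustrative stand-in rather than the /quic/sense implementation. The layer is causal because each output depends only on current and past frames, and steppable because it caches the last kernel_size - 1 inputs so a new frame can be processed without recomputing the whole clip.

```python
import torch
import torch.nn as nn

class SteppableCausalConv1d(nn.Module):
    """Causal temporal convolution with internal state, so it can be advanced frame by frame.
    A steppable 3D CNN would do the same thing along the time axis of a Conv3d."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)   # no padding: we pad via the cache
        self.register_buffer("cache", torch.zeros(1, channels, kernel_size - 1))

    def step(self, frame_feat):
        """frame_feat: (1, channels) features of the newest frame."""
        x = torch.cat([self.cache, frame_feat.unsqueeze(-1)], dim=-1)  # (1, C, kernel_size)
        self.cache = x[:, :, 1:].detach()                              # keep the last k-1 inputs
        return self.conv(x).squeeze(-1)                                # (1, channels)

layer = SteppableCausalConv1d(channels=64)
for _ in range(5):                       # stream five frames, one at a time
    out = layer.step(torch.randn(1, 64))
```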
Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task, but it also helps a model acquire a degree of "common sense".
Using a language decoder to provide surrogate tasks to the model at training time:
Pre-training a model on a difficult captioning task (Something-Something, by Goyal et al. 2017*)… allows us to improve prediction accuracy on a separate Home Cooking task ("On the effectiveness of task granularity for transfer learning", Mahdisoltani et al. 2018). A sketch of this transfer recipe follows below.

[Bar chart: transfer accuracy on the Home Cooking task for different pre-training conditions: generating complex textual descriptions, generating simple textual descriptions, classification on 178 action classes, classification on 40 action groups, baseline classification on images, and training from scratch. Reported values include 7.7, 34.3, 47.1, 54.4, 55.8, 59.7, and 62.8 percent.]

* "The something-something video database for learning and evaluating visual common sense" (Goyal et al. 2017)
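The transfer recipe behind the chart can be summarized in a short sketch (an assumed workflow, not the exact experimental setup of Mahdisoltani et al. 2018): pre-train a video encoder with a caption decoder as a surrogate task, then swap in a classification head for the target task.

```python
import torch.nn as nn

# Stage 1: pre-train a video encoder together with a caption decoder on the
# Something-Something captioning task (token-level cross-entropy on the descriptions).
video_encoder = nn.Sequential(nn.LazyLinear(512), nn.ReLU())   # stand-in for a 3D CNN backbone
caption_decoder = nn.GRU(input_size=512, hidden_size=512)      # stand-in language decoder

# Stage 2: drop the caption decoder, attach a small classification head, and fine-tune
# on the target task; the chart compares this against training the same model from scratch.
classifier_head = nn.Linear(512, 178)                          # e.g., 178 action classes

def target_task_logits(clip_features):
    return classifier_head(video_encoder(clip_features))
```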
A vision-language model can learn low-level visual skills by encoding visual information as language
• Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc. (see the toy example below).
• The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks.

"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al. 2024
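As a toy illustration of "encoding visual information as language" (the exact prompt format is an assumption, not the one used in the LRR paper), a grounded target sequence can spell out low-level visual facts as text before the final answer:

```python
def grounded_target(question, detections, answer):
    """detections: list of (label, (x1, y1, x2, y2)) for relevant objects; purely illustrative."""
    grounding = "; ".join(
        f"{label} at ({x1},{y1},{x2},{y2})" for label, (x1, y1, x2, y2) in detections
    )
    # Low-level visual facts are emitted as ordinary text before the answer,
    # so the same next-token loss teaches both the skill and its use.
    return f"Question: {question}\nObserved: {grounding}\nAnswer: {answer}"

print(grounded_target(
    "Which object did the hand pick up?",
    [("hand", (40, 60, 90, 120)), ("cup", (88, 100, 130, 150))],
    "the cup",
))
```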
Example: CATER (Girdhar et al., 2020):

Method | Static camera Top-1 | Static camera Top-5 | Moving camera Top-1 | Moving camera Top-5
ALOE (Ding et al.) | 74.0 | 94.0 | 59.7 | 90.1
TFC V3D (Zhang et al.) | 79.7 | 95.5 | - | -
LRR (w/o surrogate tasks) | 68.5 | 88.7 | 62.7 | 86.7
LRR (fine-tuned) | 84.1 | 97.2 | 80.4 | 96.7
LRR (joint) | 81.0 | 97.3 | 73.7 | 95.6

Example: Something-Else (Materzynska et al., 2020):

Method | Base Top-1 | Base Top-5 | Compositional Top-1 | Compositional Top-5
STIN+OIE+NL (Materzynska et al., 2020) | 78.1 | 94.5 | 56.2 | 81.3
Video-ChatGPT (Maaz et al., 2023) | 52.6 | 75.8 | 38.6 | 67.8
LRR (w/o surrogate tasks) | 52.6 | 75.8 | 50.1 | 70.8
LRR (fine-tuned) | 80.2 | 96.1 | 62.0 | 86.3
LRR (joint) | - | - | 61.1 | 85.4
Stochastic probing allows us to distill visual skills into the model
• Encoding the extracted low-level information as tokens grows the context window and can be inefficient.
• Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness.
• We therefore propose to distill low-level visual skills into the model using a process we refer to as stochastic probing (sketched below):
  Stochastic probing: during training, prompt the model at random time steps to perform low-level visual tasks.

Example: ACRE:

Method | Compositional | Systematic | Inference speed* (sec)
ALOE (Ding et al.) | 91.7 | 93.9 | -
LRR | 99.3 | 99.5 | 1.415
LRR (stochastic probing) | 99.2 | 98.2 | 0.061

* timing on an A100 GPU

Stochastic probing boosts efficiency at inference time. Training on visual skills can boost performance over classic approaches.

A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023
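A minimal sketch of the stochastic probing idea as stated on this slide; the probe contents, insertion probability, and data layout are assumptions used only to illustrate the mechanism.

```python
import random

def add_stochastic_probes(frame_steps, probes, p=0.1):
    """frame_steps: list of per-frame token chunks for the main task.
    probes: dict mapping time step -> (probe_prompt_tokens, probe_answer_tokens).
    With probability p, a low-level probe is inserted after a frame during training only."""
    sequence = []
    for t, chunk in enumerate(frame_steps):
        sequence.extend(chunk)                      # main-task tokens for this time step
        if t in probes and random.random() < p:
            prompt_tokens, answer_tokens = probes[t]
            sequence.extend(prompt_tokens)          # e.g., "where is the small cylinder?"
            sequence.extend(answer_tokens)          # supervised answer, trained with the usual loss
    return sequence                                 # at inference time, no probes are inserted
```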
End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time.

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Question: Provide an appropriate feedback for the user.
• Video-LLaMA: "We see a young man standing in a kitchen, wearing a red shirt and white shorts."
• Video-ChatGPT: "The user has successfully demonstrated the ability to perform a balancing act on a pair of stools."
• Coach-LLaMA: "This is awesome. Let's keep the intensity high!"

[Qualitative comparison of live feedback: ground truth vs. Stream-VLM, LLaMA-VID, and LLaVA-Next]
Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc.
InstructBLIP | 0.047 | 0.040 | 0.839 | 1.64
Video-LLaVA | 0.057 | 0.025 | 0.847 | 1.82
Video-ChatGPT | 0.098 | 0.078 | 0.850 | 2.27
Video-LLaMA | 0.101 | 0.077 | 0.859 | 2.28
LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.33
LLaVA-Next | 0.104 | 0.078 | 0.858 | 2.39

Fine-tuning results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score
Socratic-Llama-2-7B | 0.094 | 0.071 | 0.860 | 2.39 | 0.50†
Video-ChatGPT* | 0.108 | 0.093 | 0.863 | 2.42 | 0.50†
LLaMA-VID* | 0.106 | 0.090 | 0.860 | 2.40 | 0.50†
STREAM-VLM | 0.125 | 0.116 | 0.863 | 2.56 | 0.59
STREAM-VLM (w/o 3D CNN) | 0.090 | 0.083 | 0.857 | 2.17 | 0.51
STREAM-VLM (w/o action tokens) | 0.125 | 0.110 | 0.861 | 2.56 | 0.50†
Outlook: CLEVRSkills dataset for robotics foundation models

DATASET / SIMULATOR | #TASKS | LANGUAGE | MULTIMODAL PROMPTS | ACTION GRANULARITY | COMPOSITIONALITY | #DEMONSTRATIONS
Real:
RoboTurk | 3 | x | x | Action deltas | x | 111 hrs
BridgeData | 71 | x | x | Action deltas | x | 7.2k
Open-X | | √ | x | Action deltas | x | 1M
RH20T | | √ | x | Action deltas | x | 100k
FMB | 7 | x | x | Action deltas | √ | 22.5k
Simulated:
CALVIN | 34 | √ | x | Action deltas | √† | —
Behaviour-1K | 1000 | x | x | Action deltas | x | —
Maniskill2 | 20 | x | x | Action deltas | x | ≈70k
VIMA | 17 | √ | √ | Poses | x | 650k
CLEVRSkills (ours) | 36 | √ | √ | Action deltas + poses | √ | 330k
Running AI on device saves memory