
Deep Multimodal Data Fusion

FEI ZHAO, The University of Alabama at Birmingham, Birmingham, AL, USA
CHENGCUI ZHANG, The University of Alabama at Birmingham, Birmingham, AL, USA
BAOCHENG GENG, The University of Alabama at Birmingham, Birmingham, AL, USA

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become more and more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into one single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which the fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion are focused only on one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.

CCS Concepts: • Computing methodologies → Artificial intelligence; Natural language processing; Computer vision; Machine learning;

Additional Key Words and Phrases: Data fusion, neural networks, multimodal deep learning

ACM Reference Format:
Fei Zhao, Chengcui Zhang, and Baocheng Geng. 2024. Deep Multimodal Data Fusion. ACM Comput. Surv. 56, 9, Article 216 (April 2024), 36 pages. https://doi.org/10.1145/3649447

1 INTRODUCTION

Data, without a doubt, is an extremely important catalyst in technological development, especially in the Artificial Intelligence (AI) field. The data generated in the last 20 years accounts for about 90% of all data available in the world, and the rate of data growth is still accelerating. The explosion of data provides an unprecedented chance for AI to thrive.

With the advancement of sensor technologies, not only have the amount and quality of data increased, but the diversity of data is also skyrocketing. The data captured from different sensors provide people with distinct "views" or "perspectives" of the same objects, activities, or phenomena. In other words, people are able to observe the same objects, activities, or phenomena in different "dimensions" or "domains" by using different sensors. These new "views" help people obtain a better understanding of the world. For example, 100 years ago, in the medical field, it was extremely difficult for physicians to diagnose whether a patient had a lung tumor because of the limited ways of observing organs. After the invention of the first computerized tomography (CT) scanner based on X-ray technology, the data captured from the machine provided much richer information about the lungs, enabling physicians to make diagnoses based on CT images alone. With the advancement of technology, magnetic resonance imaging (MRI), a medical imaging technique that uses strong magnetic fields and radio waves, has been used to detect tumors as well. Nowadays, physicians are able to access multimodal data including CT, MRI, blood test data, and so on. The accuracy of diagnosis based on the combination of these data is much higher compared with that based on a single modality alone, e.g., CT or MRI only. This is because the complementary and redundant information among CT, MRI, and blood test data can help physicians build a more comprehensive view of an observed object, activity, or phenomenon. The evolution of AI also follows a similar path. In its infancy, AI focused only on solving problems using a single modality. Nowadays, AI tools have become increasingly capable of solving real-world problems by using multimodality.

Fig. 1. The world has been projected into multiple dimensions/domains.

What is multimodality? In reality, when we experience the world, we see objects, hear sounds, feel textures, smell odors, and taste flavors [11]. The world is represented by information in different mediums, e.g., vision, sounds, and textures. A visualization is shown in Figure 1. Our receptors, such as eyes and ears, help us capture the information. Then, our brain will be able to fuse the information from different receptors to form a prediction or a decision. The information obtained from each source/medium can be viewed as one modality. When the number of modalities is greater than one, we call it multimodality. However, instead of using eyes and ears, machines highly depend on sensors such as RGB cameras, microphones, or other types of sensors, as shown in Figure 2. Each sensor can map the observed objects/activities into its own dimension. In other words, the observed objects/activities can be projected into the dimension of each sensor. Then, machines or robots can collect the data from each sensor and make a prediction or decision based on them. In industry, there are numerous applications taking advantage of multimodality. For example, the autonomous vehicle, one of the hottest topics since the 2020s, is a typical application relying on multimodality. Such a system requires multiple types of data from different sensors, e.g., LiDAR sensors, Radar sensors, cameras, and GPS. The model fuses these data to make real-time predictions. In the medical field, more and more applications rely on the fusion of medical imaging and electronic health records to enable models to analyze imaging findings in the clinical context, e.g., CT and MRI fusion.



Fig. 2. The world has been projected into multiple dimensions/domains by different types of sensors.

Fig. 3. The kid is striking a drum. Even if the drum is not visible, based on the vision and audio information, we can still recognize the activity correctly.

Why do we need multimodality? In general, multimodal data refer to the data collected from different sensors, e.g., CT images, MRI images, and blood test data for cancer diagnosis, RGB data and LiDAR data for autonomous driving systems, and RGB data and infrared data for the skeleton detection of Kinect [28]. For the same observed object or activity, the data from different modalities can have distinct expressions and perspectives. Although the characteristics of these data can be independent and distinct, they often overlap semantically. This phenomenon is called information redundancy. Furthermore, information from different modalities can be complementary. Humans can unconsciously fuse multimodal data, obtain knowledge, and make predictions. The complementary and redundant information extracted from multiple modalities can help humans form a comprehensive understanding of the world. As the example in Figure 3 shows, when a kid is drumming, even if we cannot see the drum, we are still able to recognize that a drum is being struck based on the sounds. In this process, we unconsciously fuse the vision and acoustic data, and extract their complementary information, to make a correct prediction. If there is only one modality available, e.g., the vision modality with the drum object out of sight, we can only tell that a kid is waving two sticks. With only the sound available, we would only be able to tell that a drum is being struck without knowing who is drumming. Therefore, in general, an independent interpretation based on an individual modality presents only partial information about the observed activity. However, a multimodality-based interpretation can deliver the "fuller picture" of the observed activity, which can be more robust and reliable than single-modality-based models. For instance, autonomous vehicles containing multiple sensors such as RGB cameras and LiDAR sensors need to detect objects on the road in extreme weather conditions where visibility is near zero, e.g., dense fog or heavy rain. A multimodal-based model can still detect objects while pure-vision-based models cannot. However, it is extremely hard for machines to understand and figure out how to fuse and take advantage of the complementary nature of multimodal data to improve prediction/classification accuracy.



How to fuse multimodal data? In the 1990s, as traditional Machine Learning (ML), a subclass of AI, flourished, ML-based models for addressing multimodal problems began to thrive. It became common for machines to extract knowledge from multimodal data and make decisions. However, back then most of the work was focused on feature engineering, e.g., how to obtain a better representation for each modality. During that time, many modality-specific hand-crafted feature extractors were proposed, which rely heavily on prior knowledge of the specific tasks and the corresponding data. Since these feature extractors work independently, they can hardly capture the complementary and redundant nature of multiple modalities. Therefore, such a feature engineering process inevitably results in a loss of information before the features are sent to the ML-based model. This has a negative impact on the performance of traditional ML-based models. Although traditional ML-based models have the ability to analyze multimodal information, there is a long way to go to achieve the ultimate goal of AI, which is to mimic or even surpass human performance. Therefore, how to fuse the data in a way that can automatically learn the complementary and redundant information and minimize manual interference remains a problem in the traditional ML field.

Deep learning is a sub-field of ML. It allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [88]. Its key advantage is that the hierarchical representations can be learned in an automated way, which does not require domain knowledge or human effort. For example, given the data X = {x1, x2, ..., xN} and Y = {y1, y2, ..., yN}, a two-layer neural network can be defined as the combination of weight matrices W and a non-linear function σ(·), as shown in Equation (1). After the training process, we can find W for which ?i is close to yi for all i ≤ N. As the depth of the model continues to increase, so does its capacity for feature representation.

?i(xi) = W2 · σ(W1 xi).    (1)
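To make Equation (1) concrete, here is a minimal sketch of such a two-layer network in PyTorch. This is our own illustration rather than code from the paper; the layer sizes and the choice of ReLU for σ(·) are assumptions made only for the example.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """A two-layer network implementing ?i = W2 · σ(W1 xi) from Equation (1)."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)   # weight matrix W1
        self.W2 = nn.Linear(hidden_dim, out_dim, bias=False)  # weight matrix W2
        self.sigma = nn.ReLU()                                 # non-linear function σ(·), assumed to be ReLU here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W2(self.sigma(self.W1(x)))

# Training searches for W1 and W2 such that ?i is close to yi for all i ≤ N.
model = TwoLayerNet(in_dim=16, hidden_dim=32, out_dim=1)
x = torch.randn(8, 16)   # a batch of 8 samples xi
y_hat = model(x)         # predictions ?i
```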

Since 2010, multimodal data fusion has entered the stage of deep learning in an all-around way. Deep learning-based multimodal data fusion methods have demonstrated outstanding results in various applications. For video-audio-based multimodal data fusion, the works from [35, 37, 51, 163] address the emotion recognition problem by using deep learning techniques, including convolutional neural networks, long short-term memory (LSTM) networks, attention mechanisms, and so on. Also, for video-text multimodal data fusion, the works from [41, 56, 68, 107, 123, 124, 195] address the text-to-video retrieval task by using Transformers, BERT, attention mechanisms, adversarial learning, and combinations of them. There are various other multimodal tasks, e.g., visual question answering (VQA) (text-image: [154, 220], text-video: [82, 223]), RGB-depth object segmentation [31, 39], medical data analysis [181, 185], and image captioning [216, 237]. Compared to traditional ML-based methods, deep neural network (DNN)-based methods show superior performance on representation learning and modality fusion if the amount of training data is large enough. Furthermore, a DNN is able to execute feature engineering by itself, which means a hierarchical representation can be automatically learned from data, instead of manually designing or handcrafting modality-specific features. Traditionally, the methods of multimodal data fusion are classified into four categories, based on the conventional fusion taxonomy shown in Figure 4, including early fusion, intermediate fusion, late fusion, and hybrid fusion: (1) early fusion: the raw data or pre-processed data obtained from each modality are fused before being sent to the model; (2) intermediate fusion: the features extracted from different modalities are fused together and sent to the model for decision making; (3) late fusion (also known as "decision fusion"): the individual decisions obtained from each modality are fused to form the final prediction, e.g., by majority vote, weighted average, or a meta ML model on top of the individual decisions; (4) hybrid fusion: a combination of early, intermediate, and late fusion.

Fig. 4. The conventional taxonomy categorizes fusion methods into three classes.

With large amounts of multimodal data available, the need for more advanced methods (vs. hand-picked ways of fusion) to fuse them has grown unprecedentedly. However, this conventional fusion taxonomy can only provide basic guidance for multimodal data fusion. In order to extract richer representations from multimodal data, the architecture of DNNs has become more and more sophisticated, and no longer extracts features from each modality separately and independently. Instead, representation learning, modality fusing, and decision making are interlaced in most cases. Therefore, there is no need to specify exactly in which part of the network the multimodal data fusion occurs. The method of fusing multimodal data has changed from traditional explicit ways, e.g., early fusion, intermediate fusion, and late fusion, to more implicit ways. To force the DNN to learn how to extract the complementary and redundant information of multimodal data, researchers have invented various constraints on DNNs, including specifically designed network architectures, regularizations on loss functions, and so on. Therefore, the development of deep learning has significantly reshaped the landscape of multimodal data fusion, revealing the inadequacies of the traditional taxonomy of fusion methods. The inherent complexity of deep learning architectures often interlaces representation learning, modality fusing, and decision-making, defying the simplistic categorizations of the past. Furthermore, the shift from explicit to more implicit fusion methods, exemplified by attention mechanisms, has challenged the static nature of traditional fusion strategies. Techniques such as graph neural networks (GNNs) and generative neural networks (GenNNs) introduce novel ways of handling and fusing data that are not aligned with the early-to-late fusion framework. Additionally, the dynamic and adaptive fusion capabilities of deep models, coupled with the challenges posed by large-scale data, necessitate more sophisticated fusion methods than the conventional categories can encapsulate. Recognizing these complexities and the rapid evolution, it becomes imperative to introduce a taxonomy that delves deeper, capturing the subtleties of contemporary fusion methods.
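To ground the conventional categories before moving on to the proposed taxonomy, the sketch below contrasts early, intermediate, and late fusion for two generic feature-vector modalities. It is our own minimal illustration, not code from any surveyed model; the input dimensions, the small MLP encoders, and the 10-class classification heads are assumptions made only for the example.

```python
import torch
import torch.nn as nn

# Two toy modalities, e.g., an audio feature vector and a text feature vector.
x_a, x_b = torch.randn(4, 32), torch.randn(4, 48)

# Per-modality encoders used by the intermediate- and late-fusion variants.
enc_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
enc_b = nn.Sequential(nn.Linear(48, 64), nn.ReLU())

# (1) Early fusion: concatenate the (pre-processed) inputs, then use a single model.
early_model = nn.Sequential(nn.Linear(32 + 48, 64), nn.ReLU(), nn.Linear(64, 10))
logits_early = early_model(torch.cat([x_a, x_b], dim=-1))

# (2) Intermediate fusion: extract per-modality features, fuse them, then decide.
fused_features = torch.cat([enc_a(x_a), enc_b(x_b)], dim=-1)
logits_intermediate = nn.Linear(128, 10)(fused_features)

# (3) Late (decision-level) fusion: each modality makes its own prediction,
#     and the individual decisions are combined, e.g., by a weighted average.
head_a = nn.Sequential(enc_a, nn.Linear(64, 10))
head_b = nn.Sequential(enc_b, nn.Linear(64, 10))
logits_late = 0.5 * head_a(x_a) + 0.5 * head_b(x_b)
```

Hybrid fusion would simply combine two or more of these patterns within one model.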

For multimodal data fusion, there are several recent surveys available in the research community. Gao et al. [46] provide a review on multimodal neural networks and SOTA architectures. However, the review is focused only on a narrow research area: the object recognition task for RGB-depth images. Moreover, this survey is limited to convolutional neural networks. Zhang et al. [235] present a survey on deep multimodal fusion. However, the authors categorize the models using the conventional taxonomy: early fusion, late fusion, and hybrid fusion. Furthermore, this survey is focused on the image segmentation task only. Abdu et al. [2] provide a literature review of multimodal sentiment analysis using deep learning approaches. It categorizes the deep learning-based approaches into three classes: early fusion, late fusion, and temporal-based fusion. However, similar to the above surveys, this review is narrowly focused on sentiment analysis. Gao et al. [45] provide a survey on multimodal data fusion. It introduces the basic concepts of deep learning and several architectures of deep multimodal models, including stacked autoencoder-based methods, recurrent neural network-based methods, convolutional neural network-based methods, and so on. However, it does not include the SOTA large pre-trained models and GNN-based methods, e.g., the BERT model. Meng et al. [121] present a review of ML for data fusion. It emphasizes traditional ML techniques instead of deep learning techniques. Also, the authors classify the methods into three different categories: signal-level fusion, feature-level fusion, and decision-level fusion. This way of categorizing the fusion methods is similar to that of the conventional taxonomy (early fusion, intermediate fusion, and late fusion), which is not new to the community. There are several other reviews [4, 128, 227] in the field of multimodality, most of which focus on a specific combination of modalities, e.g., RGB-depth images.

Fig. 5. The diagram of our proposed fine-grained taxonomy of deep multimodal data fusion models.

Therefore, in this article, we provide a comprehensive survey and categorization of deep multimodal data fusion. The contributions of this review are three-fold:

—We provide a novel fine-grained taxonomy of deep multimodal data fusion models, diverging from existing surveys that categorize fusion methods according to conventional taxonomies such as early, intermediate, late, and hybrid fusion. In this survey, we explore the latest advances and group the SOTA fusion methods into five categories: Encoder-Decoder Methods, Attention Mechanism Methods, GNN Methods, GenNN Methods, and other Constraint-based Methods, as shown in Figure 5.

—We provide a comprehensive review of deep multimodal data fusion covering various modalities, including Vision + Language, Vision + Other Sensors, and so on. Compared to the existing surveys [2, 4, 45, 46, 121, 128, 227, 235, 243] that usually focus on one single task (such as multimodal object recognition) with one specific combination of two modalities (such as RGB + depth data), this survey has a broader scope covering various modalities and their corresponding tasks, including multimodal object segmentation, multimodal sentiment analysis, VQA, video captioning, and so on.

—We explore the new trends of deep multimodal data fusion, and compare and contrast SOTA models. Some outdated methods, such as deep belief networks, are excluded from this review. However, the large pre-trained models, which are rising stars of deep learning, are included in the review, e.g., Transformer-based pre-trained models.

The rest of this article is organized as follows. Section 2 introduces Encoder-Decoder-based fusion methods, in which the methods are grouped into three sub-classes. Section 3 presents the SOTA attention mechanisms used in multimodal data fusion; in this section, the large pre-trained models are introduced. In Section 4, we introduce GNN-based methods. In Section 5, we introduce GenNN-based methods, in which two main roles of GenNN-based methods in multimodal tasks are presented. Section 6 presents the other constraints adopted in SOTA deep multimodal models, such as Tensor-based Fusion. In Section 7, the current notable tasks, applications, and datasets in multimodal data fusion are introduced. Sections 8 and 9 discuss the future directions of multimodal data fusion and the conclusion of this survey.

Fig. 6. The general structure of the Encoder-Decoder method to fuse multimodal data. The input data of each encoder can be the raw data of each modality or the features of each modality. The encoders can be independent or share weights. The decoder can contain upsampling or downsampling operations, depending on the specific task.

2 ENCODER-DECODER-BASED FUSION

The Encoder-Decoder architecture has been successfully adopted in single-modal tasks such as image segmentation, language translation, data reduction, and denoising. In such an architecture, the entire network can be divided into two major parts: the encoder part and the decoder part. The encoder part usually works as a high-level feature extractor, which projects the input data into a latent space with relatively lower dimensions compared to the original input data. In other words, the input data are transformed into their latent representation by the encoder. During this process, the important semantic information of the input data is preserved, while the noise in the input data is removed. After the encoding process, the decoder generates a "prediction" from the latent representation of the input data. For example, in a semantic segmentation task, the expected output of the decoder can be a semantic segmentation map with the same resolution as the input data. In a seq-2-seq language translation task, the output can be the expected sequence in the target language. In data denoising tasks, most works use a decoder to reconstruct the raw input data.

Owing to the strong representation learning ability and good flexibility of the network architecture of Encoder-Decoder models, the Encoder-Decoder has been adopted in more and more deep multimodal data fusion models in recent years. Based on the differences in terms of modalities and tasks, the architectures of multimodal data fusion models vary from each other widely. In this survey, we summarize the general idea of the Encoder-Decoder fusion methods and discard some of the task-specific fusion strategies that cannot be generalized. The general structure of the Encoder-Decoder fusion is shown in Figure 6. As we can see, the high-level features obtained from different individual modalities are projected into a latent space. Then, the task-specific decoder generates the prediction from the learned latent representation of the input multimodal data. In real scenarios, there exist plenty of variations of this structure. We categorize them into three sub-classes: raw-data-level fusion, hierarchical feature fusion, and decision-level fusion.
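A minimal sketch of this general structure is given below (our own illustration, not a specific model from the literature): two independent encoders project their modalities into a latent space, the latent codes are fused, and a task-specific decoder produces the prediction. The feature dimensions and the simple concatenation-based fusion are assumptions made only for the example.

```python
import torch
import torch.nn as nn

class EncoderDecoderFusion(nn.Module):
    """Two modality-specific encoders, a fused latent representation, and one task-specific decoder."""
    def __init__(self, dim_a: int = 128, dim_b: int = 64, latent: int = 32, out_dim: int = 10):
        super().__init__()
        # One encoder per modality; they could also share weights.
        self.enc_a = nn.Sequential(nn.Linear(dim_a, latent), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, latent), nn.ReLU())
        # Task-specific decoder operating on the fused latent representation.
        self.dec = nn.Sequential(nn.Linear(2 * latent, latent), nn.ReLU(),
                                 nn.Linear(latent, out_dim))

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1)  # fused latent code
        return self.dec(z)                                          # task-specific prediction

model = EncoderDecoderFusion()
pred = model(torch.randn(4, 128), torch.randn(4, 64))
```

Depending on the task, the decoder could instead contain upsampling layers (e.g., to output a segmentation map) or a sequence head (e.g., to output text), as described above.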



Fig. 7. Visualizations of different methods.

2.1 Raw-data-level Fusion

In this fusion, the raw data of each modality, or the data obtained from the independent pre-processing of each modality, are integrated at the input level. Then, the formed input vector of the multiple modalities is sent to one encoder for extracting high-level features. The data from individual modalities are fused at a low level (e.g., the input level), and only one encoder is applied to extract the high-level features of the multimodal data. For example, for the image segmentation task, Couprie et al. [27] propose the first deep learning-based multimodal fusion model. In this work, the authors fuse the multimodal data via a concatenation operation, in which the RGB image and the depth image are concatenated along the channel axis. Similarly, Liu et al. [109] concatenate the RGB image and the depth image together. The authors utilize depth information to assist color information in detecting salient objects with a lower computational cost compared to a double-stream network, which consists of two separate sub-networks dealing with RGB data and depth data, respectively. The key advantages of this fusion are that (1) it can maximally preserve the original information of each modality, and (2) the desi
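The channel-wise concatenation described above can be sketched as follows. This is our own illustration in the spirit of the RGB-depth concatenation used in [27, 109], not their exact architectures; the single-encoder CNN and its layer sizes are assumptions made only for the example.

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 224, 224)    # RGB image: 3 channels
depth = torch.randn(1, 1, 224, 224)  # depth map: 1 channel

# Raw-data-level fusion: concatenate along the channel axis (an RGB-D input with 4 channels).
rgbd = torch.cat([rgb, depth], dim=1)

# A single encoder then extracts high-level features from the fused input.
encoder = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
features = encoder(rgbd)  # shape: (1, 64, 56, 56)
```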
