Deep Multimodal Data Fusion
FEI ZHAO, The University of Alabama at Birmingham, Birmingham, AL, USA
CHENGCUI ZHANG, The University of Alabama at Birmingham, Birmingham, AL, USA
BAOCHENG GENG, The University of Alabama at Birmingham, Birmingham, AL, USA
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become more and more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into one single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion are only focused on one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision+Language (e.g., videos, texts), Vision+Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
CCS Concepts: • Computing methodologies → Artificial intelligence; Natural language processing; Computer vision; Machine learning;
Additional Key Words and Phrases: Data fusion, neural networks, multimodal deep learning
ACM Reference Format:
Fei Zhao, Chengcui Zhang, and Baocheng Geng. 2024. Deep Multimodal Data Fusion. ACM Comput. Surv. 56, 9, Article 216 (April 2024), 36 pages. https://doi.org/10.1145/3649447
1 INTRODUCTION
Data, without a doubt, is an extremely important catalyst of technological development, especially in the field of Artificial Intelligence (AI). The data generated in the last 20 years accounts for about 90% of all data available in the world, and the rate of data growth is still accelerating. This explosion of data provides an unprecedented chance for AI to thrive.
With the advancement of sensor technologies, not only have the amount and quality of data increased, but the diversity of data is also skyrocketing. The data captured from different sensors provide people with distinct "views" or "perspectives" of the same objects, activities, or phenomena.
Authors' addresses: F. Zhao, The University of Alabama at Birmingham, University Hall 4105, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: larry5@; C. Zhang, The University of Alabama at Birmingham, University Hall 4143, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: czhang02@; B. Geng, The University of Alabama at Birmingham, University Hall 4147, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: bgeng@.
Fig. 1. The world has been projected into multiple dimensions/domains.
In other words, people are able to observe the same objects, activities, or phenomena in different "dimensions" or "domains" by using different sensors. These new "views" help people obtain a better understanding of the world. For example, 100 years ago, in the medical field, it was extremely difficult for physicians to diagnose whether a patient had a lung tumor due to the limited ways of observing organs. After the invention of the first computerized tomography (CT) scanner based on X-ray technology, the data captured from the machine provided much richer information about the lungs, enabling physicians to make diagnoses based on CT images alone. With the advancement of technology, magnetic resonance imaging (MRI), a medical imaging technique that uses strong magnetic fields and radio waves, has been used to detect tumors as well. Nowadays, physicians are able to access multimodal data including CT, MRI, blood test data, and so on. The accuracy of diagnosis based on the combination of these data is much higher, compared with that based on a single modality alone, e.g., CT or MRI only. This is because the complementary and redundant information among CT, MRI, and blood test data can help physicians build a more comprehensive view of an observed object, activity, or phenomenon. The evolution of AI also follows a similar path. In its infancy, AI focused only on solving problems using a single modality. Nowadays, AI tools have become increasingly capable of solving real-world problems by using multimodality.
What is multimodality? In reality, when we experience the world, we see objects, hear sounds, feel textures, smell odors, and taste flavors [11]. The world is represented by information in different mediums, e.g., vision, sounds, and textures. A visualization is shown in Figure 1. Our receptors, such as eyes and ears, help us capture this information. Then, our brain is able to fuse the information from the different receptors to form a prediction or a decision. The information obtained from each source/medium can be viewed as one modality. When the number of modalities is greater than one, we call it multimodality. However, instead of using eyes and ears, machines highly depend on sensors such as RGB cameras, microphones, or other types of sensors, as shown in Figure 2. Each sensor can map the observed objects/activities into its own dimension. In other words, the observed objects/activities can be projected into the dimension of each sensor. Then, machines or robots can collect the data from each sensor and make a prediction or decision based on them. In industry, there are numerous applications taking advantage of multimodality. For example, the autonomous vehicle, one of the hottest topics since the 2020s, is a typical application relying on multimodality. Such a system requires multiple types of data from different sensors, e.g., LiDAR sensors, Radar sensors, cameras, and GPS. The model fuses these data to make real-time predictions. In the medical field, more and more applications rely on the fusion of medical imaging and electronic health records to enable models to analyze imaging findings in the clinical context, e.g., CT and MRI fusion.
Fig. 2. The world has been projected into multiple dimensions/domains by different types of sensors.
Fig. 3. The kid is striking a drum. Even if the drum is not visible, based on the vision and audio information, we can still recognize the activity correctly.
Why do we need multimodality? In general, multimodal data refer to the data collected from different sensors, e.g., CT images, MRI images, and blood test data for cancer diagnosis, RGB data and LiDAR data for autonomous driving systems, and RGB data and infrared data for the skeleton detection of Kinect [28]. For the same observed object or activity, the data from different modalities can have distinct expressions and perspectives. Although the characteristics of these data can be independent and distinct, they often overlap semantically. This phenomenon is called information redundancy. Furthermore, information from different modalities can be complementary. Humans can unconsciously fuse multimodal data, obtain knowledge, and make predictions. The complementary and redundant information extracted from multiple modalities can help humans form a comprehensive understanding of the world. As the example in Figure 3 shows, when a kid is drumming, even if we cannot see the drum, we are still able to recognize that a drum is being struck based on the sounds. In this process, we unconsciously fuse the vision and acoustic data, and extract their complementary information, to make a correct prediction. If there is only one modality available, e.g., the vision modality with the drum object out of sight, we can only tell that a kid is waving two sticks. With only the sound available, we would only be able to tell that a drum is being struck without knowing who is drumming. Therefore, in general, an independent interpretation based on an individual modality only presents partial information of the observed activity. However, a multimodality-based interpretation can deliver the "fuller picture" of the observed activity, and can be more robust and reliable than single-modality-based models. For instance, autonomous vehicles containing multiple sensors, such as RGB cameras and LiDAR sensors, need to detect objects on the road in extreme weather conditions where visibility is near zero, e.g., dense fog or heavy rain. A multimodal-based model can still detect objects in such conditions while pure-vision-based models cannot. However, it is extremely hard for machines to understand and figure out how to fuse and take advantage of the complementary nature of multimodal data to improve prediction/classification accuracy.
How to fuse multimodal data? In the 1990s, as traditional Machine Learning (ML), a subclass of AI, flourished, ML-based models for addressing multimodal problems began to thrive. It became common for machines to extract knowledge from multimodal data and make decisions. However, back then most of the works were focused on feature engineering, e.g., how to obtain a better representation for each modality. During that time, many modality-specific hand-crafted feature extractors were proposed, which greatly rely on prior knowledge of the specific tasks and the corresponding data. Since these feature extractors work independently, they can hardly capture the complementary and redundant nature of multiple modalities. Therefore, such a feature engineering process inevitably results in a loss of information before the features are sent to the ML-based model. This has a negative impact on the performance of traditional ML-based models. Although traditional ML-based models have the ability to analyze multimodal information, there is a long way to go to achieve the ultimate goal of AI, which is to mimic or even surpass human performance. Therefore, how to fuse the data in a way that can automatically learn the complementary and redundant information and minimize manual interference remains a problem in the traditional ML field.
Deep learning is a sub-field of ML. It allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [88]. Its key advantage is that the hierarchical representations can be learned in an automated way, which does not require domain knowledge or human effort. For example, given the data X = {x1, x2, ..., xN} and Y = {y1, y2, ..., yN}, a two-layer neural network can be defined as the combination of weight matrices W and a non-linear function σ(·), as shown in Equation (1). After the training process, we can find W for which ŷ(xi) is close to yi for all i ≤ N. As the depth of the model continues to increase, so does its ability of feature representation.

ŷ(xi) = W2 · σ(W1 xi).    (1)
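As a concrete illustration, the following is a minimal NumPy sketch of Equation (1). The ReLU choice for σ(·), the function names, and all dimensions are our own illustrative assumptions, not prescribed by the survey:

```python
import numpy as np

def relu(z):
    # A common choice for the non-linearity sigma(.) in Equation (1).
    return np.maximum(0.0, z)

def two_layer_net(x, W1, W2):
    # y_hat(x) = W2 . sigma(W1 x), as in Equation (1).
    return W2 @ relu(W1 @ x)

# Toy dimensions: x in R^4, a hidden layer of size 8, scalar output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))
x_i = rng.normal(size=4)
print(two_layer_net(x_i, W1, W2))  # y_hat(x_i), trained to be close to y_i
```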
Since 2010, multimodal data fusion has entered the stage of deep learning in an all-around way. Deep learning-based multimodal data fusion methods have demonstrated outstanding results in various applications. For video-audio-based multimodal data fusion, the works from [35, 37, 51, 163] address the emotion recognition problem by using deep learning techniques, including convolutional neural networks, long short-term memory (LSTM) networks, attention mechanisms, and so on. Also, for video-text multimodal data fusion, the works from [41, 56, 68, 107, 123, 124, 195] address the text-to-video retrieval task by using the Transformer, BERT, attention mechanisms, adversarial learning, and combinations of them. There are various other multimodal tasks, e.g., visual question answering (VQA) (text-image: [154, 220], text-video: [82, 223]), RGB-depth object segmentation [31, 39], medical data analysis [181, 185], and image captioning [216, 237]. Compared to traditional ML-based methods, deep neural network (DNN)-based methods show superior performance on representation learning and modality fusion if the amount of training data is large enough. Furthermore, a DNN is able to execute feature engineering by itself, which means a hierarchical representation can be automatically learned from data, instead of manually designing or handcrafting modality-specific features. Traditionally, the methods of multimodal data fusion are classified into four categories, based on the conventional fusion taxonomy shown in Figure 4: (1) early fusion: the raw data or pre-processed data obtained from each modality are fused before being sent to the model; (2) intermediate fusion: the features extracted from different modalities are fused together and sent to the model for decision making; (3) late fusion (also known as "decision fusion"): the individual decisions obtained from each modality are fused to form the final prediction, e.g., by majority vote, weighted average, or a meta ML model on top of individual decisions; (4) hybrid fusion: a combination of early, intermediate, and late fusion.
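To make the first and third of these conventional categories concrete, here is a minimal PyTorch sketch contrasting early fusion (concatenating inputs) with late fusion (averaging per-modality decisions). All module names, dimensions, and the equal decision weights are our own illustrative assumptions, not taken from any surveyed work:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate the two modality inputs, then apply one shared model."""
    def __init__(self, dim_a, dim_b, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_a + dim_b, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))

    def forward(self, x_a, x_b):
        return self.net(torch.cat([x_a, x_b], dim=-1))

class LateFusion(nn.Module):
    """Independent per-modality models; fuse their decisions afterwards."""
    def __init__(self, dim_a, dim_b, num_classes):
        super().__init__()
        self.net_a = nn.Linear(dim_a, num_classes)
        self.net_b = nn.Linear(dim_b, num_classes)

    def forward(self, x_a, x_b):
        # Weighted average of the two decisions (here: equal weights).
        return 0.5 * self.net_a(x_a) + 0.5 * self.net_b(x_b)

x_a, x_b = torch.randn(2, 10), torch.randn(2, 6)  # two modalities, batch of 2
print(EarlyFusion(10, 6, 3)(x_a, x_b).shape)      # torch.Size([2, 3])
print(LateFusion(10, 6, 3)(x_a, x_b).shape)       # torch.Size([2, 3])
```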
Fig. 4. The conventional taxonomy categorizes fusion methods into three classes.
With large amounts of multimodal data available, the need for more advanced fusion methods (vs. handpicked ways of fusion) has grown unprecedentedly. However, this conventional fusion taxonomy can only provide basic guidance for multimodal data fusion. In order to extract richer representations from multimodal data, the architectures of DNNs have become more and more sophisticated and no longer extract features from each modality separately and independently. Instead, representation learning, modality fusion, and decision making are interlaced in most cases. Therefore, there is no need to specify exactly in which part of the network the multimodal data fusion occurs. The method of fusing multimodal data has changed from traditional explicit ways, e.g., early fusion, intermediate fusion, and late fusion, to more implicit ways. To force the DNN to learn how to extract the complementary and redundant information of multimodal data, researchers have invented various constraints on DNNs, including specifically designed network architectures, regularizations on loss functions, and so on. Therefore, the development of deep learning has significantly reshaped the landscape of multimodal data fusion, revealing the inadequacies of the traditional taxonomy of fusion methods. The inherent complexity of deep learning architectures often interlaces representation learning, modality fusion, and decision-making, defying the simplistic categorizations of the past. Furthermore, the shift from explicit to more implicit fusion methods, exemplified by attention mechanisms, has challenged the static nature of traditional fusion strategies. Techniques such as graph neural networks (GNNs) and generative neural networks (GenNNs) introduce novel ways of handling and fusing data that are not aligned with the early-to-late fusion framework. Additionally, the dynamic and adaptive fusion capabilities of deep models, coupled with the challenges posed by large-scale data, necessitate more sophisticated fusion methods than the conventional categories can encapsulate. Recognizing these complexities and the rapid evolution of the field, it becomes imperative to introduce a taxonomy that delves deeper, capturing the subtleties of contemporary fusion methods.
For multimodal data fusion, there are several recent surveys available in the science community. Gao et al. [46] provide a review on multimodal neural networks and SOTA architectures. However, the review is only focused on a narrow research area: the object recognition task for RGB-depth images. Moreover, this survey is limited to convolutional neural networks. Zhang et al. [235] present a survey on deep multimodal fusion. However, the authors categorize the models using the conventional taxonomy: early fusion, late fusion, and hybrid fusion. Furthermore, this survey is focused on the image segmentation task only.
Fig. 5. The diagram of our proposed fine-grained taxonomy of deep multimodal data fusion models.
Abdu et al. [2] provide a literature review of multimodal sentiment analysis using deep learning approaches. It categorizes the deep learning-based approaches into three classes: early fusion, late fusion, and temporal-based fusion. However, similar to the above surveys, this review is narrowly focused on sentiment analysis. Gao et al. [45] provide a survey on multimodal data fusion. It introduces the basic concepts of deep learning and several architectures of deep multimodal models, including stacked autoencoder-based methods, recurrent neural network-based methods, convolutional neural network-based methods, and so on. However, it does not include the SOTA large pre-trained models (e.g., the BERT model) and GNN-based methods. Meng et al. [121] present a review of ML for data fusion. It emphasizes traditional ML techniques instead of deep learning techniques. Also, the authors classify the methods into three different categories: signal-level fusion, feature-level fusion, and decision-level fusion. This way of categorizing the fusion methods is similar to that of the conventional taxonomy (early fusion, intermediate fusion, and late fusion), which is not new to the community. There are several other reviews [4, 128, 227] in the field of multimodality, most of which focus on a specific combination of modalities, e.g., RGB-depth images.
Therefore, in this article, we provide a comprehensive survey and categorization of deep multimodal data fusion. The contributions of this review are three-fold:
— We provide a novel fine-grained taxonomy of the deep multimodal data fusion models, diverging from existing surveys that categorize fusion methods according to conventional taxonomies such as early, intermediate, late, and hybrid fusion. In this survey, we explore the latest advances and group the SOTA fusion methods into five categories: Encoder-Decoder Methods, Attention Mechanism Methods, GNN Methods, GenNN Methods, and other Constraint-based Methods, as shown in Figure 5.
— We provide a comprehensive review of deep multimodal data fusion covering various modalities, including Vision+Language, Vision+Other Sensors, and so on. Compared to the existing surveys [2, 4, 45, 46, 121, 128, 227, 235, 243] that usually focus on one single task (such as multimodal object recognition) with one specific combination of two modalities (such as RGB+depth data), this survey has a broader scope covering various modalities and their corresponding tasks, including multimodal object segmentation, multimodal sentiment analysis, VQA, video captioning, and so on.
— We explore the new trends of deep multimodal data fusion, and compare and contrast SOTA models. Some outdated methods, such as deep belief networks, are excluded from this review. However, the large pre-trained models, which are rising stars of deep learning, are included in the review, e.g., Transformer-based pre-trained models.
The rest of this article is organized as follows. Section 2 introduces Encoder-Decoder-based fusion methods, in which the methods are grouped into three sub-classes. Section 3 presents the SOTA attention mechanisms used in multimodal data fusion; in this section, the large pre-trained
Fig. 6. The general structure of the Encoder-Decoder method to fuse multimodal data. The input data of each encoder can be the raw data of each modality or the features of each modality. The encoders can be independent or share weights. The decoder can contain upsampling or downsampling operations, depending on the specific task.
models are introduced. In Section 4, we introduce GNN-based methods. In Section 5, we introduce GenNN-based methods, in which two main roles of GenNN-based methods in multimodal tasks are presented. Section 6 presents the other constraints adopted in SOTA deep multimodal models, such as Tensor-based Fusion. In Section 7, the current notable tasks, applications, and datasets in multimodal data fusion are introduced. Sections 8 and 9 discuss the future directions of multimodal data fusion and conclude this survey.
2 ENCODER-DECODER-BASED FUSION
The Encoder-Decoder architecture has been successfully adopted in single-modal tasks such as image segmentation, language translation, data reduction, and denoising. In such an architecture, the entire network can be divided into two major parts: the encoder part and the decoder part. The encoder part usually works as a high-level feature extractor, which projects the input data into a latent space with relatively lower dimensions compared to the original input data. In other words, the input data is transformed into its latent representation by the encoder. During this process, the important semantic information of the input data is preserved, while the noise in the input data is removed. After the encoding process, the decoder will generate a "prediction" from the latent representation of the input data. For example, in a semantic segmentation task, the expected output of the decoder can be a semantic segmentation map with the same resolution as the input data. In a seq-2-seq language translation task, the output can be the expected sequence in the target language. In data denoising tasks, most works use a decoder to reconstruct the raw input data.
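As a minimal single-modality illustration of this encoder/decoder split, consider the following denoising-style autoencoder sketch in PyTorch; the layer sizes, the noise model, and the MSE objective are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

# Encoder: projects the input into a lower-dimensional latent space.
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
# Decoder: generates a "prediction" -- here, a reconstruction of the raw input.
decoder = nn.Linear(32, 784)

x = torch.randn(8, 784)                    # e.g., flattened 28x28 images
x_noisy = x + 0.1 * torch.randn_like(x)    # corrupted input for denoising
z = encoder(x_noisy)                       # latent representation
x_hat = decoder(z)                         # reconstruction
loss = nn.functional.mse_loss(x_hat, x)    # train to reconstruct the raw input
print(z.shape, x_hat.shape, loss.item())
```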
Owing to the strong representation learning ability and the good flexibility of the network architecture of Encoder-Decoder models, the Encoder-Decoder has been adopted in more and more deep multimodal data fusion models in recent years. Based on the differences in terms of modalities and tasks, the architectures of multimodal data fusion models vary from each other widely. In this survey, we summarize the general idea of the Encoder-Decoder fusion methods and discard some of the task-specific fusion strategies that cannot be generalized. The general structure of the Encoder-Decoder fusion is shown in Figure 6. As we can see, the high-level features obtained from different individual modalities are projected into a latent space. Then, the task-specific decoder generates the prediction from the learned latent representation of the input multimodal data. In real scenarios, there exist plenty of variations of this structure. We categorize them into three sub-classes: raw-data-level fusion, hierarchical feature fusion, and decision-level fusion.
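As a concrete rendering of Figure 6, here is a minimal PyTorch sketch of the general structure: independent per-modality encoders, a fused latent representation, and a task-specific decoder. All module names and dimensions are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderDecoderFusion(nn.Module):
    def __init__(self, dim_a, dim_b, latent_dim, out_dim):
        super().__init__()
        # Independent encoders; they could also share weights (see Figure 6).
        self.enc_a = nn.Sequential(nn.Linear(dim_a, latent_dim), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, latent_dim), nn.ReLU())
        # Task-specific decoder operating on the fused latent representation.
        self.dec = nn.Linear(2 * latent_dim, out_dim)

    def forward(self, x_a, x_b):
        # Project each modality into the latent space, then fuse.
        z = torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1)
        return self.dec(z)

model = EncoderDecoderFusion(dim_a=128, dim_b=32, latent_dim=16, out_dim=10)
y = model(torch.randn(4, 128), torch.randn(4, 32))
print(y.shape)  # torch.Size([4, 10])
```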
Fig. 7. Visualizations of different methods.
2.1 Raw-data-level Fusion
In this fusion, the raw data of each modality, or the data obtained from the independent pre-processing of each modality, are integrated at the input level. Then, the formed input vector of the multiple modalities is sent to one encoder for extracting high-level features. In other words, the data from individual modalities are fused at a low level (e.g., the input level), and only one encoder is applied to extract the high-level features of the multimodal data. For example, for the image segmentation task, Couprie et al. [27] propose the first deep learning-based multimodal fusion model. In this work, the authors fuse the multimodal data via a concatenation operation, in which the RGB image and the depth image are concatenated along the channel axis. Similarly, Liu et al. [109] concatenate the RGB image and the depth image together. The authors utilize depth information to assist color information in detecting salient objects with a lower computational cost compared to the double-stream network, which consists of two separate sub-networks dealing with RGB data and depth data, respectively. The key advantages of this fusion are that (1) it can maximally preserve the original information of each modality, and (2) the desi
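A minimal sketch of the raw-data-level, channel-axis concatenation described above for an RGB image and an aligned depth map, in the spirit of [27, 109]; the tensor shapes and the single shared convolutional encoder are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

# Batch of RGB images (3 channels) and aligned depth maps (1 channel).
rgb = torch.randn(4, 3, 64, 64)
depth = torch.randn(4, 1, 64, 64)

# Raw-data-level fusion: concatenate along the channel axis -> 4-channel input.
x = torch.cat([rgb, depth], dim=1)  # shape: (4, 4, 64, 64)

# A single shared encoder consumes the fused input.
encoder = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)
features = encoder(x)
print(features.shape)  # torch.Size([4, 32, 64, 64])
```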