Alejandro Jaimes, Nicu Sebe, Multimodal human-computer interaction: A survey, Computer Vision and Image Understanding, 2007.

A Survey of Multimodal Human-Computer Interaction

Abstract: This paper summarizes the main approaches to multimodal human-computer interaction (MMHCI) and gives an overview of the field from a computer vision perspective. We place particular emphasis on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in speech), discuss user and task modeling and multimodal fusion, and highlight challenges, open issues, and emerging applications in MMHCI research.

1. Introduction

Multimodal human-computer interaction (MMHCI) lies at the intersection of several research fields, including computer vision, psychology, and artificial intelligence. We study MMHCI to make computer technology more usable by people, which invariably requires understanding at least three aspects: the user who interacts with the computer, the system (the computer technology and its usability), and the interaction between user and system. Considering these aspects, it is clear that MMHCI is a multidisciplinary subject, since an interactive-system designer needs a wide range of relevant knowledge: psychology and cognitive science to understand the user's perceptual, cognitive, and problem-solving skills; sociology to understand the wider interaction context; ergonomics to understand the user's physical capabilities; graphic design to produce effective interface presentations; computer science and engineering to build the necessary technology; and so on. The multidisciplinary nature of MMHCI motivates this survey. Rather than focusing only on the computer vision techniques used in MMHCI, we give an overall picture of the field and discuss the main approaches and topics in MMHCI from a computer vision perspective.

Motivation. Human-human communication inherently involves interpreting a mixture of speech and visual signals. Researchers in many fields have recognized this, and progress in unimodal techniques (speech and audio processing, computer vision, etc.) and in hardware technologies (inexpensive cameras and other sensors) has enabled significant advances in MMHCI research. Unlike traditional HCI applications (a single user facing a computer and interacting with it via a mouse or keyboard), in new applications (e.g., intelligent homes [105], remote collaboration, arts, etc.) interactions are not always explicit commands and often involve multiple users. This is partly because, over the last few years, processor speed, memory, and storage capacity have improved dramatically, matched by the availability of many novel input and output devices that are making ubiquitous computing [185,67,66] a reality; such devices include phones, embedded systems, PDAs, laptops, wall-size displays, and so on. The wide availability of devices with differing computational power and input/output capabilities means that the future of computing will include novel ways of interacting, among them gestures [136], speech [143], haptics [9], eye blinks [58], and others. Glove-mounted devices [19], graspable user interfaces [48], and tangible user interfaces now seem ripe for exploration, and pointing devices with haptic feedback, eye-gaze tracking, and blink detection [69] have also appeared. However, just as in human-human communication, effective communication arises when different input devices are used in combination.

Multimodal interfaces offer many advantages [34]: they can prevent errors, bring robustness to the interface, help users correct errors or recover from them more easily, provide greater communication bandwidth, and add alternative communication methods for different situations and environments. In many systems, using multimodal input to disambiguate error-prone modalities is an important motivation for multimodal applications: as Oviatt [123] notes, error-prone technologies can compensate for each other rather than adding redundancy to the interface, reducing the need for error correction. It must be noted, however, that multiple modalities alone do not bring benefits to an interface; their use can be ineffective or even disadvantageous. Accordingly, Oviatt [124] has identified common misconceptions (myths) about multimodal interfaces, most of which concern the use of speech as an input modality.

In this paper, we survey the research areas we consider essential to MMHCI, summarize the state of the art, and, based on our findings, identify major trends and open issues. We group vision techniques according to the human body (see Fig. 1). Large-scale body movement, gesture, and gaze analysis are used for tasks such as expression recognition in affective interaction and in a variety of other applications. We discuss affective computer interaction, issues in multimodal fusion, modeling, and data collection, and several emerging MMHCI applications. Since MMHCI is a very dynamic and broad research area, we do not attempt an exhaustive overview; the main contribution of this paper is to summarize the main computer vision techniques used in MMHCI while surveying the field's major research areas, techniques, applications, and open issues.

Fig. 1. Overview of human-centered multimodal interaction.

Related surveys. Extensive surveys have already been published in several areas, such as face detection [190,63], face recognition [196], facial expression analysis [47,131], vocal emotion [119,109],
gesture recognition [96,174,136], human motion analysis [65,182,56,3,46,107], audio-visual automatic speech recognition [143], and eye tracking [41,36]. Surveys of vision-based HCI are presented in [142] and [73], with emphasis on head tracking, face and facial expression recognition, eye tracking, and gesture recognition. Adaptive and intelligent HCI is discussed in [40], mainly as a survey of computer vision for human body motion analysis together with a discussion of techniques for lower-arm movement detection, face processing, and gaze analysis; multimodal interfaces are discussed in [125-128,144,158,135,171]. Real-time vision techniques for HCI, including human pose, object tracking, hand gestures, gaze, and face pose, are discussed in [84] and [77]. Here we do not revisit work covered in those surveys; instead we add areas they do not cover (e.g., [84,40,142,126,115]) and discuss new applications in emerging areas, highlighting the main research issues. Related conferences and workshops include ACM CHI, IFIP Interact, IEEE CVPR, IEEE ICCV, ACM Multimedia, the International Workshop on Human-Centered Multimedia (HCM) in conjunction with ACM Multimedia, the International Workshops on Human-Computer Interaction in conjunction with ICCV and ECCV, the Intelligent User Interfaces (IUI) conference, and the International Conference on Multimodal Interfaces (ICMI).

2. Overview of multimodal interaction

The term "multimodal" has been used in many contexts and with many meanings (see, e.g., the interpretations of modality in [10-12]). For our purposes, a multimodal HCI system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). We take a human-centered approach: by "modality" we mean a mode of communication corresponding to a human sense, or a computer input device activated by a person or measuring a human quantity (e.g., a blood pressure monitor), as depicted in Fig. 1. The human senses are sight, touch, hearing, smell, and taste. The input modalities of many computer input devices correspond to human senses: cameras (sight), haptic sensors (touch) [9], microphones (hearing), olfactory devices (smell), and even taste devices [92]. Many other human-activated computer input devices, however, correspond to a combination of human senses or to none at all: keyboard, mouse, writing tablet, motion input (e.g., devices that are themselves moved for interaction), galvanic skin response, and other biometric sensors.

In our definition the word "input" is the most important, since in practice most interaction with computers already occurs through multiple modalities. For example, when we type we touch keys on the keyboard to enter data into the computer, but some of us also use sight at the same time to read what we are typing or to locate the keys to press. It is therefore important to keep in mind the difference between what the human is doing during interaction and what the system is actually receiving as input. For example, a computer equipped with a microphone might understand several languages, or merely different types of sounds (e.g., a humming interface for music retrieval). Although the term "multimodal" has often been used for such situations (e.g., multilingual input is considered multimodal in [13]), in this paper we call multimodal only those systems that combine different modalities (i.e., communication channels), as shown in Fig. 1. For example, a system that responds to facial expressions and hand gestures using only cameras is not multimodal, even if the input signals come from multiple cameras; by the same reasoning, a system with multiple keys is not multimodal, whereas one with mouse and keyboard input is. Although multimodal interaction using several devices such as mouse and keyboard, or keyboard and pen, has been studied, this paper covers only human-computer interaction techniques in which visual (camera) input is combined with other types of input.
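To make the distinction concrete, the following is a minimal Python sketch of the rule just stated: a system is multimodal only if its inputs span more than one communication channel, not merely more than one device. The device-to-modality table is a hypothetical illustration, not something the survey defines.

```python
# Hypothetical device-to-modality table, used only to illustrate the definition above.
DEVICE_MODALITY = {
    "camera": "vision",
    "microphone": "audio",
    "haptic_sensor": "touch",
    "keyboard": "keyboard",
    "mouse": "mouse",
    "pen": "pen",
}

def is_multimodal(devices):
    """Multimodal = the inputs span more than one modality (communication
    channel), not merely more than one device of the same kind."""
    return len({DEVICE_MODALITY[d] for d in devices}) > 1

assert not is_multimodal(["camera", "camera"])   # two cameras: still one channel
assert is_multimodal(["keyboard", "mouse"])      # two channels, as in the text
assert is_multimodal(["camera", "microphone"])   # vision + audio
```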
In HCI, multimodal techniques can be used to construct many different types of interfaces (Fig. 1). We are particularly interested in perceptual, attentive, and enactive interfaces. As defined in [177], perceptual interfaces [176] are highly interactive, multimodal interfaces that enable rich, natural, and efficient interaction with computers; they seek to leverage sensing (input) and rendering (output) technologies to provide interactions not feasible with standard interfaces and common I/O devices such as the keyboard, mouse, and monitor [177], and they make computer vision a central component in many cases. Attentive interfaces [180] are context-aware interfaces [160] that rely on a person's attention as the primary input; that is, attentive interfaces [120] use gathered information to estimate the best time and approach for communicating with the user. Since attention is expressed mainly through eye contact [160] and gestures (although measures such as mouse movement can also be indicative), computer vision plays a major role in attentive interfaces. Enactive interfaces are those that help users communicate a form of knowledge based on the active use of the hands or body for apprehension tasks. Enactive knowledge is not simply multisensory mediated knowledge; it is stored in the form of motor responses and acquired by the act of "doing." Typical examples are the competences required by tasks such as typing, driving a car, dancing, playing a musical instrument, and modeling clay, all of which would be difficult to describe in iconic or symbolic form.

3. Human-centered vision

We classify vision techniques for MMHCI using a human-centered approach and divide them according to the human body: (1) large-scale body movements, (2) hand gestures, and (3) gaze. We make a distinction between command interfaces (actions used to explicitly execute commands: selecting menus, etc.) and non-command interfaces (actions or events used to indirectly tune the system to the user's needs) [111,23]. In general, vision-based human motion analysis systems used for MMHCI can be thought of as having four main stages: (1) motion segmentation, (2) object classification, (3) tracking, and (4) interpretation. While some approaches use geometric primitives to model different components (e.g., cylinders to model limbs, head, and torso for body movements, or hand and fingers for gesture recognition), others use feature representations based on appearance (appearance-based methods). In the first approach, external markers are often used to estimate body posture and the relevant parameters. While markers can be accurate, they place restrictions on clothing and require calibration, so they are not desirable in many applications. Moreover, fitting geometric shapes to body parts can be computationally expensive, and these methods are often not suitable for real-time processing. Appearance-based methods, on the other hand, do not require markers, but they do require training (e.g., with machine learning or probabilistic approaches). Since they do not require markers, they place fewer constraints on the user and are therefore more desirable. Next, we briefly discuss specific techniques for body, gesture, and gaze. The motion analysis steps are similar, so there is some inevitable overlap in the discussions; some of the issues for gesture recognition, for instance, also apply to body movements and gaze detection.

3.1. Large-scale body movements

Tracking of large-scale body movements (head, arms, torso, and legs) is necessary to interpret pose and motion in many MMHCI applications. However, since extensive surveys have been published in this area [182,56,107,183], we discuss the topic only briefly. There are three important issues in articulated motion analysis [188]: representation (joint angles or motion of all the sub-parts), computational paradigms (deterministic or probabilistic), and computation reduction. Body posture analysis is important in many MMHCI applications. For example, in [172] the authors use a stereo and thermal infrared video system to estimate the driver's posture for deployment of smart airbags. The authors of [148] propose a method for recovering articulated body pose without initialization and tracking (using learning). The authors of [8] use pose and velocity vectors to recognize body parts and detect different activities, while the authors of [17] use temporal templates. In some emerging MMHCI applications, group and non-command actions play an important role. In [102], visual features are extracted from head and hand/forearm blobs: the head blob is represented by the vertical position of its centroid, and hand blobs are represented by eccentricity and angle with respect to the horizontal. These features, together with audio features (e.g., energy, pitch, and speaking rate, among others), are used to segment meeting videos according to actions such as monologue, presentation, white-board, discussion, and note taking. The authors of [60] use only computer vision, but make a distinction between body movements, events, and behaviors within a rule-based system framework. Important issues for large-scale body tracking include whether the approach uses 2D or 3D, the desired accuracy, speed, occlusion, and other constraints. Some of the issues pertaining to gesture recognition, discussed next, also apply to body tracking.
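As a concrete illustration of the blob descriptors used in [102] (vertical centroid position for the head blob, eccentricity and orientation for the hand/forearm blobs), the sketch below computes them from a binary silhouette mask using image moments; the exact feature definitions in [102] may differ, so treat this as an assumption-laden example rather than a reimplementation.

```python
import numpy as np

def blob_features(mask):
    """Descriptors for one non-empty binary blob mask: vertical centroid
    position (head blob in [102]) plus eccentricity and orientation with
    respect to the horizontal (hand/forearm blobs)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                    # centroid
    mu20 = ((xs - cx) ** 2).mean()                   # central second-order moments
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    spread = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam1 = (mu20 + mu02 + spread) / 2                # major-axis variance
    lam2 = (mu20 + mu02 - spread) / 2                # minor-axis variance
    eccentricity = np.sqrt(1 - lam2 / lam1) if lam1 > 0 else 0.0
    angle = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)  # radians from horizontal
    return {"centroid_y": cy, "eccentricity": eccentricity, "angle": angle}
```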
3.2. Hand gesture recognition

Although in human-human communication gestures are often performed with a variety of body parts (e.g., arms, eyebrows, legs, the entire body), most researchers in computer vision use the term gesture recognition to refer exclusively to hand gestures; we use the term accordingly and focus on hand gesture recognition in this section. Psycholinguistic studies of human-to-human communication [103] describe gestures as the critical link between our conceptualizing capacities and our linguistic abilities. Humans use a very wide variety of gestures, ranging from simple actions such as pointing at objects to more complex actions that express feelings and allow communication with others. Gestures should therefore play an essential role in MMHCI [83,186,52], as they seem intrinsic to natural interaction between the human and the computer-controlled interface in many applications, ranging from virtual environments [82] and smart surveillance [174] to remote collaboration [52].

Several important issues should be considered when designing a gesture recognition system [136]. The first phase of a recognition task is choosing a mathematical model that may consider both the spatial and the temporal characteristics of the hand and of hand gestures. The modeling approach plays a crucial role in the nature and performance of gesture interpretation. Typically, features are extracted from the images or video, and once these features are extracted, model parameters are estimated from subsets of them until a correct match is found. For example, the system might detect n points and attempt to determine whether these n points (or a subset of them) match the characteristics of points extracted from a hand in a particular pose or performing a particular action. The parameters of the model are then a description of the hand pose or trajectory and depend on the modeling approach used. Among the important problems involved in the analysis are hand localization [187], hand tracking [194], and the selection of suitable features [83]. After the parameters are computed, the gestures they represent need to be classified and interpreted according to the accepted model and to grammar rules that reflect the internal syntax of gestural commands. The grammar may also encode the interaction of gestures with other communication modes such as speech, gaze, or facial expressions. As an alternative to modeling, some authors have explored combinations of simple 2D motion-based detectors for gesture recognition [71].

In any case, to fully exploit the potential of gestures in an MMHCI application, the class of recognizable gestures should be as broad as possible, and ideally any gesture performed by the user should be unambiguously interpretable by the interface. However, most gesture-based HCI systems allow only symbolic commands based on hand posture or 3D pointing, because of the complexity of gesture analysis and the desire to build real-time interfaces. Also, most systems accommodate only single-hand gestures, yet human gestures, especially communicative ones, naturally employ both hands. If two-hand gestures are allowed, however, several ambiguous situations can arise (e.g., occlusion of the hands, intentional vs. unintentional gestures) and the processing time is likely to increase. Another aspect that is increasingly considered is the use of other modalities (e.g., speech) to augment the MMHCI system [127,162]; such multimodal approaches can reduce complexity and increase the naturalness of the interface [126].
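The model-fitting step described above (matching n detected points against stored hand poses) can be illustrated with a deliberately simple nearest-template sketch. The fixed point ordering and the template set are hypothetical, and real systems use far richer spatial and temporal models; this only shows the shape of the matching step.

```python
import numpy as np

def normalize(points):
    """Remove translation and scale so the match does not depend on where the
    hand is in the image or how far it is from the camera."""
    centered = points - points.mean(axis=0)
    norm = np.linalg.norm(centered)
    return centered / norm if norm > 0 else centered

def classify_pose(points, templates):
    """Return the stored pose label whose point configuration is closest to
    the detected points (a crude stand-in for the model-fitting step above)."""
    p = normalize(points)
    return min(templates, key=lambda name: np.linalg.norm(p - normalize(templates[name])))

# Usage sketch: `points` and every template are (n, 2) arrays of hand feature
# points in a fixed, known order (e.g., fingertip positions); both the point
# ordering and the template set are assumptions made for illustration.
```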
3.3. Gaze detection

Gaze, defined as the direction to which the eyes are pointing in space, is a strong indicator of attention; it has been studied extensively since as early as 1879 in psychology, and more recently in neuroscience and in computing applications [41]. While early eye-tracking research focused only on systems for in-lab experiments, many commercial and experimental systems are available today for a wide range of applications. Eye-tracking systems can be grouped into wearable or non-wearable, and infrared-based or appearance-based. In infrared-based systems, a light shining on the subject whose gaze is to be tracked creates a "red-eye effect": the difference in reflection between the cornea and the pupil is used to determine the direction of sight. In appearance-based systems, computer vision techniques are used to find the eyes in the image and then determine their orientation. While wearable systems are the most accurate (approximate error rates below 1.4 degrees, vs. errors below 1.7 degrees for non-wearable infrared systems), they are also the most intrusive. Infrared systems are more accurate than appearance-based ones, but there are concerns over the safety of prolonged exposure to infrared light. In addition, most non-wearable systems require (often cumbersome) calibration for each individual [108,121]. Appearance-based systems usually capture both eyes with two cameras to predict gaze direction; because of the computational cost of processing two streams simultaneously, the image resolution for each eye is often small. This makes such systems less accurate, although increasing computational power and lower costs mean that more computationally intensive algorithms can now run in real time. As an alternative, the authors of [181] propose using a single high-resolution image of one eye to improve accuracy. Infrared-based systems, on the other hand, usually use only one camera, but the use of two cameras has been proposed to further increase accuracy [152].

Although most research on non-wearable systems has focused on desktop users, the ubiquity of computing devices has enabled applications in other domains in which the user is stationary (e.g., [168,152]). For example, the authors of [168] monitor driver visual attention using a single non-wearable camera placed on a car's dashboard to track facial features and detect gaze. Wearable eye trackers have also been investigated, mostly for desktop applications (or for users who do not walk while wearing the device). Advances in hardware (e.g., reductions in size and weight) and lower costs have also allowed researchers to investigate novel applications. For example, in [193] eye-tracking data are combined with video from the user's perspective, head directions, and hand motions to learn words from natural interactions with users; the authors of [137] use a wearable eye tracker to study hand-eye coordination in natural tasks, and the authors of [38] use a wearable eye tracker to detect eye contact and record video for blogging.

The main issues in developing gaze-tracking systems are intrusiveness, speed, robustness, and accuracy. The hardware and algorithms required, however, depend strongly on the level of analysis desired. Gaze analysis can be performed at three different levels [23]: (a) highly detailed low-level micro-events, (b) low-level intentional events, and (c) coarse-level goal-based events. Micro-events include micro-saccades, jitter, nystagmus, and brief fixations, which are studied for their physiological and psychological relevance by vision scientists and psychologists. Low-level intentional events are the smallest coherent units of movement that the user is aware of during visual activity, including sustained fixations and revisits. Although most work in HCI has focused on coarse-level goal-based events (e.g., using gaze as a pointer [165]), it is easy to foresee the importance of analysis at lower levels, particularly for inferring the user's cognitive state in affective interfaces (e.g., [62]). In this context, an important and often overlooked issue is how to interpret eye-tracking data: as the user moves his or her eyes during interaction, the system must decide what the movements mean in order to react accordingly. We move our eyes 2-3 times per second, so a system may have to process large amounts of data within a short time, a task that is not trivial even if processing does not occur in real time. One way to interpret eye-tracking data is to cluster fixation points and assume, for instance, that clusters correspond to areas of interest. Clustering of fixation points is only one option, however, and as the authors of [154] discuss, it can be difficult to determine the clustering algorithm parameters. Other options include obtaining statistics on measures such as the number of eye movements, saccades, distances between fixations, the order of fixations, and so on.
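As a sketch of the fixation-clustering idea just described, the following groups fixation coordinates into candidate areas of interest with DBSCAN; the neighborhood radius and minimum cluster size are exactly the kind of parameters that, as noted above, are hard to choose in a principled way, and the values here are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def areas_of_interest(fixations, radius_px=50.0, min_fixations=3):
    """Group an (N, 2) array of fixation coordinates (screen pixels) into
    candidate areas of interest; DBSCAN labels outlier fixations as -1."""
    labels = DBSCAN(eps=radius_px, min_samples=min_fixations).fit_predict(fixations)
    centers = {int(label): fixations[labels == label].mean(axis=0)
               for label in set(labels) if label != -1}
    return labels, centers
```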
4. Affective human-computer interaction

Most current MMHCI systems do not account for the fact that human-human communication is always socially situated and that we use emotion to enhance our communication. Since emotion is often expressed in a multimodal way, however, it is an important area for MMHCI, and we discuss it in some detail. HCI systems that can sense a person's affective states (e.g., stress, inattention, anger, boredom) and are capable of adapting and responding to them are likely to be perceived as more natural, efficacious, and trustworthy. In her book, Picard [140] suggested several applications in which it is beneficial for computers to recognize human emotions. For example, knowing the user's emotions, the computer can become a more effective tutor. Synthetic speech with emotion in the voice sounds more pleasing than a monotonous voice. Computer agents could learn the user's preferences through the user's emotions. Another application is to help human users monitor their stress level. In clinical settings, recognizing a person's inability to produce certain facial expressions may help diagnose early psychological disorders. The research area concerned with the machine analysis and use of human emotion to build more natural and flexible HCI systems is known by the general name of affective computing [140], and there is a vast body of literature on affective computing and emotion recognition [67,132,140,133]. Emotion is intricately linked to other functions such as attention, perception, memory, decision-making, and learning [43], which suggests that it may be beneficial for computers to recognize the user's emotions and other related cognitive states and expressions. Addressing the problem of affective communication, Bianchi-Berthouze and Lisetti [14] identified three key points to be considered when developing systems that capture affective information: embodiment (experiencing physical reality), dynamics (mapping the experience and the emotional state onto a temporal process and a particular label), and adaptive interaction (conveying emotive response and responding to a recognized emotional state).

Researchers use mainly two different methods to analyze emotions [133]. One approach is to classify emotions into discrete categories such as joy, fear, love, surprise, and sadness, using different modalities as inputs. The problem is that the stimuli may contain blended emotions, and the choice of categories may be too restrictive or culturally dependent. The other approach is to describe emotions along multiple dimensions or scales; two common scales are valence and arousal [61]. Valence describes the pleasantness of the stimulus, with positive or pleasant (e.g., happiness) on one end and negative or unpleasant (e.g., disgust) on the other. The other dimension is arousal or activation: sadness, for example, has low arousal, whereas surprise has a high arousal level. The different emotional labels can be plotted at various positions on the 2D plane spanned by these two axes to construct a 2D emotion model [88,60].
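A minimal sketch of such a 2D emotion model follows; the numeric valence-arousal coordinates are illustrative assumptions (the survey gives no values), and mapping a continuous estimate to the nearest discrete label is just one common way such a model is used.

```python
# Illustrative valence-arousal coordinates; not values from the survey.
EMOTION_COORDS = {
    "happiness": ( 0.8,  0.5),
    "surprise":  ( 0.2,  0.9),   # high arousal, as noted above
    "sadness":   (-0.6, -0.4),   # low arousal
    "disgust":   (-0.7,  0.2),
    "fear":      (-0.6,  0.7),
    "anger":     (-0.5,  0.8),
}

def nearest_label(valence, arousal):
    """Map a continuous (valence, arousal) estimate to the closest discrete label."""
    return min(EMOTION_COORDS,
               key=lambda e: (EMOTION_COORDS[e][0] - valence) ** 2
                           + (EMOTION_COORDS[e][1] - arousal) ** 2)
```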
Facial expressions and vocal emotions are particularly important in this context, so we discuss them in more detail below.

4.1. Facial expression recognition

Most facial expression recognition research (see [131] and [47] for two comprehensive reviews) has been inspired by the work of Ekman [43] on coding facial expressions based on the basic movements of facial features, called action units (AUs). To offer a comprehensive description of the visible muscle movement in the face, Ekman proposed the Facial Action Coding System (FACS). In this system, a facial expression is a high-level description of facial motions represented by regions or feature points called action units; each AU has a muscular basis, and a given facial expression may be described by a combination of AUs. Some methods follow a feature-based approach, in which specific features such as the corners of the mouth or the eyebrows are detected and tracked. Other methods use a region-based approach, in which facial motions are measured in certain regions of the face such as the eye/eyebrow region and the mouth. In addition, we can distinguish two types of classification schemes: static and dynamic. Static classifiers (e.g., Bayesian networks) assign each frame of a video to one of the facial expression categories based on the results for that particular frame. Dynamic classifiers (e.g., HMMs) use several video frames and perform classification by analyzing the temporal patterns of the analyzed regions or extracted features. Dynamic classifiers are very sensitive to appearance changes in the facial expressions of different individuals, so they are better suited to person-dependent experiments [32]. Static classifiers, on the other hand, are easier to train and in general need less training data, but when used on a continuous video sequence they can be unreliable, especially for frames that are not at the peak of an expression.

Mase [99] was one of the first to use image processing techniques (optical flow) to recognize facial expressions. Lanitis et al. [90] used a flexible shape and appearance model for image coding, person identification, pose recovery, gender recognition, and facial expression recognition. Black and Yacoob [15] used local parameterized models of image motion to recover non-rigid motion; once recovered, these parameters are fed to a rule-based classifier to recognize the six basic facial expressions. Yacoob and Davis [189] computed optical flow and used similar rules to classify the six facial expressions. Rosenblum et al. [149] also computed optical flow of regions on the face, then applied a radial basis function network to classify expressions. Essa and Pentland [45] likewise used an optical-flow, region-based method to recognize expressions. Otsuka and Ohya [117] first compute optical flow and then 2D Fourier transform coefficients, which are used as feature vectors for a hidden Markov model (HMM) to classify expressions; the trained system was able to recognize one of the six expressions in near real time (about 10 Hz), and the tracked motions were also used to control the facial expression of an animated Kabuki character [118]. A similar approach, using different features, was taken by Lien [93]. Nefian and Hayes [110] proposed an embedded HMM approach for face recognition that uses an efficient set of observation vectors based on DCT coefficients. Martinez [98] introduced an indexing approach based on the identification of frontal face images under different illumination conditions, facial expressions, and occlusions; a Bayesian approach was used to find the best match between the local observations and the learned local feature model, and an HMM was employed to achieve good recognition even when the new conditions did not correspond to those encountered during the learning phase. Oliver et al. [116] used lower-face tracking to extract mouth shape features and used them as inputs to an HMM-based facial expression recognition system (recognizing neutral, happy, sad, and an open mouth). Chen [28] used a suite of static classifiers to recognize facial expressions, reporting both person-dependent and person-independent results.

In spite of the variety of approaches to facial affect analysis, the majority suffer from the following limitations [132]:
- they handle only a small set of posed, prototypical facial expressions of six basic emotions, from portraits or nearly frontal views of faces without facial hair or glasses, recorded under constant illumination;
- they do not perform a context-dependent interpretation of the shown facial behavior;
- they do not analyze the extracted facial information on different time scales (only short videos are handled); consequently, current facial affect analyzers cannot make inferences about the expressed mood and attitude, which involve larger time scales.
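Returning to the FACS description at the start of this subsection, in which an expression is a combination of AUs, the sketch below matches a set of detected AUs against a few commonly cited prototypic combinations. The AU sets are illustrative background knowledge, not taken from the survey, and real AU-based recognizers model far more combinations and intensities.

```python
# Commonly cited AU combinations for a few prototypic expressions (illustrative only).
PROTOTYPIC_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
}

def match_expression(detected_aus):
    """Return the first prototypic expression whose AU combination is contained
    in the set of detected AUs, or None if nothing matches."""
    for label, aus in PROTOTYPIC_AUS.items():
        if aus <= detected_aus:
            return label
    return None
```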
4.2. Emotion in audio

The vocal aspect of a communicative message carries various kinds of information. If we disregard the manner in which a message is spoken and consider only the textual content, we are likely to miss important aspects of the utterance, and we might even completely misunderstand its meaning. Nevertheless, in contrast to spoken language processing, which has recently witnessed significant advances, the processing of emotional speech has not been widely explored. Starting in the 1930s, quantitative studies of vocal emotion have had a longer history than quantitative studies of facial expressions. Traditional as well as most recent studies of emotional content in speech (see [119,109,72,155]) use "prosodic" information, that is, information on intonation, rhythm, lexical stress, and other speech features, extracted using measures such as the pitch, duration, and intensity of the utterance. Recent studies use "Ekman's six" basic emotions, although others in the past have used many more categories. The reasons for using these basic categories are often not justified, since it is not clear whether there exist "universal" emotional characteristics in the voice for these six categories [27].

The limitations of existing vocal affect analyzers are [132]:
- they perform singular classification of input audio signals into a few emotion categories such as anger, irony, happiness, sadness/grief, fear, disgust, surprise, and affection;
- they do not perform a context-sensitive analysis (environment-, user-, and task-dependent analysis) of the input audio signal;
- they do not analyze the extracted vocal expression information on different time scales (the proposed inter-audio-frame analyses are used either for the detection of supra-segmental features, such as pitch and intensity over the duration of a syllable or word, or for the detection of phonetic features); inferences about moods and attitudes (longer time scales) are therefore difficult to make with current vocal affect analyzers;
- they adopt strong assumptions (e.g., that the recordings are noise-free and the recorded sentences are short and delimited by pauses) and use small test data sets (one or more words or short sentences spoken by few subjects) containing exaggerated vocal expressions of affective states.
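A minimal sketch of the kind of prosodic measurements mentioned above (pitch, intensity, and duration summarized into utterance-level statistics) is given below. The autocorrelation pitch estimator is deliberately crude and assumes a mono float signal at least one frame long; production systems use more robust estimators, and the survey does not prescribe any particular one.

```python
import numpy as np

def prosodic_features(signal, sr, frame_len=1024, hop=256):
    """Frame-level intensity (RMS energy) and a crude autocorrelation pitch
    estimate, summarized into the utterance-level statistics typically fed
    to a vocal-emotion classifier."""
    pitches, energies = [], []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        energies.append(np.sqrt(np.mean(frame ** 2)))       # intensity
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 75                         # search 75-400 Hz
        lag = lo + np.argmax(ac[lo:hi])
        pitches.append(sr / lag if ac[lag] > 0 else 0.0)     # 0.0 = treated as unvoiced
    voiced = [p for p in pitches if p > 0] or [0.0]
    return {
        "mean_pitch": float(np.mean(voiced)),
        "pitch_range": float(np.ptp(voiced)),
        "mean_energy": float(np.mean(energies)),
        "duration_s": len(signal) / sr,
    }
```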
4.3. Multimodal approaches to emotion recognition

The most surprising issue regarding the multimodal affect recognition problem is that, although recent advances in video and audio processing could make the multimodal analysis of human affective state tractable, only a few research efforts [80,159,195,157] have tried to implement a multimodal affective analyzer. Although studies in psychology on the accuracy of predictions from observations of expressive behavior suggest that combined face and body approaches are the most informative [4,59], with the exception of a tentative attempt by Balomenos et al. [7] there is virtually no reported work on automatic human affect analysis from combined face and body gestures. In the same way, studies in facial expression recognition and vocal affect recognition have been carried out largely independently of each other: most work on facial expression recognition uses still photographs or video sequences without speech, and work on vocal emotion detection often uses only audio information. A legitimate question for MMHCI is how much information the face, as compared to speech and body movement, contributes to natural interaction. Most experimenters suggest that the face is more accurately judged, produces higher agreement, or correlates better with judgments based on full audiovisual input than the voice does [104,195]. Examples of existing work combining different modalities into a single system for human affective state analysis are those of Chen [27], Yoshitomi et al. [192], De Silva and Ng [166], Go et al. [57], and Song et al. [169], who investigated the effects of combined detection of facial and vocal expressions of affective states. In brief, these works achieve an accuracy of 72-85% when detecting one or more basic emotions from clean audiovisual input (e.g., noise-free recordings, a closely placed microphone, non-occluded portraits) of an actor speaking a single word and showing exaggerated facial displays of a basic emotion. Although the audio and image processing techniques in these systems are relevant to the discussion of the state of the art in affective computing, the systems themselves have most of the drawbacks of unimodal affect analyzers. Many improvements are needed if those systems are to be used for multimodal HCI, where clean input from a known actor/announcer cannot be expected and context-independent, separate processing and interpretation of audio and visual data does not suffice.

5. Modeling, fusion, and data collection

Multimodal interface design [146] is important because the principles and techniques used in tradit