Cross-Language Information Retrieval Technology
11 March 2024

RoadMap
- Cross-lingual IR
  - Motivation
  - Definition
- General issues with CLIR
- Basic approaches to CLIR
- CLIR evaluation
- CLIR applications

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.

Growth in online language use by continent, 2000-2010 (data updated 30 June 2010)

The Internet Big Picture: World Internet Users and 2015 Population Stats

World Regions       Population      Internet Users   Penetration       Users %     Growth
                                                     (% population)    of Table    2000-2015
Africa              1,158,355,663     313,257,074    27.0%               9.6%      6,839%
Asia                4,032,466,882   1,563,208,143    38.8%              47.8%      1,268%
Europe                821,555,904     604,122,380    73.5%              18.5%        475%
Middle East           236,137,235     115,823,882    49.0%               3.5%      3,426%
North America         357,172,209     313,862,863    87.9%               9.6%        191%
Latin America         617,776,105     333,115,908    53.9%              10.2%      1,743%
Oceania/Australia      37,157,120      27,100,334    72.9%               0.8%        256%
World Total         7,260,621,118   3,270,490,584    45%               100%          806%

Usage of content languages for websites

2002                2015
English     72%     English      54.5%
German       7%     Russian       5.9%
Japanese     6%     German        5.7%
Spanish      3%     Japanese      5.0%
French       3%     Spanish       4.7%
Italian      2%     French        4.1%
Dutch        2%     Portuguese    2.6%
Chinese      2%     Chinese       2.2%
Korean       1%     Italian       2.1%
Russian      1%     Polish        1.9%
Portuguese   1%     Turkish       1.6%

Source: /technologies/overview/content_language/all
        /research/activities/wcp/stats/intnl.html

Cross Language IR

Motivation
- Information is unavailable in some languages
- Language barrier

Definition
- Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
- Example: a user may ask a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- The aim is to find, retrieve, and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  (1) Cognate matching
- Translation
  (2) Query translation
  (3) Document translation
  (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related linguistically (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.

Query translation
[Diagram labels: Chinese query, translation system, French query, search engine, French document collection, results, French documents, select, browse]
Process:
1. Translate the Chinese query into French
2. Search the French document collection
3. Translate the retrieved results into Chinese

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is computationally cheaper than translating a large set of documents.
- Challenge: term ambiguity. "Queries are often short and short queries provide little context for disambiguation."
- Term disambiguation is discussed later.

Pros and cons of query translation
- Pros
  - Simple and easy to operate
  - Flexible
  - Saves time and space; efficient
- Cons
  - Lacks context
  - For short queries, translation ambiguity is high

Document translation
[Diagram labels: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select, browse]
Process:
1. Translate the entire French document collection into Chinese
2. Search the Chinese documents directly

Document translation
- Document translation has the opposite advantages and disadvantages to query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, because the more varied context available within each document can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a German-to-English CLIR experiment.
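
The query translation and document translation strategies described above differ mainly in where the translation step sits relative to the index. The following is a minimal sketch of that difference, not an implementation from the slides: the word lists, the documents, and every function name are hypothetical, translation is naive word-by-word replacement, and English and French are used so that whitespace tokenization is enough.

```python
from collections import defaultdict

# Toy bilingual term lists and documents: hypothetical data for illustration only.
EN_TO_FR = {"oil": ["pétrole", "huile"], "survey": ["exploration", "enquête"]}
FR_TO_EN = {"pétrole": ["oil"], "huile": ["oil"],
            "exploration": ["survey"], "enquête": ["survey"]}
FRENCH_DOCS = {
    "d1": "exploration pétrole marine",
    "d2": "enquête opinion publique",
}


def translate(tokens, term_list):
    """Word-by-word translation: replace each token with all listed equivalents."""
    out = []
    for tok in tokens:
        # Terms missing from the list pass through unchanged.
        out.extend(term_list.get(tok, [tok]))
    return out


def build_index(docs):
    """Build an inverted index mapping each term to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.split():
            index[tok].add(doc_id)
    return index


def search(index, query_tokens):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for tok in set(query_tokens):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


def query_translation_search(query):
    # The index stays in French; only the short query is translated.
    index = build_index(FRENCH_DOCS)
    return search(index, translate(query.split(), EN_TO_FR))


def document_translation_search(query):
    # Every document is translated before indexing; the query is used as-is.
    translated = {doc_id: " ".join(translate(text.split(), FR_TO_EN))
                  for doc_id, text in FRENCH_DOCS.items()}
    index = build_index(translated)
    return search(index, query.split())


if __name__ == "__main__":
    print(query_translation_search("oil survey"))
    print(document_translation_search("oil survey"))
```

In the query translation path the translated query carries every dictionary equivalent, which is where the term ambiguity noted above enters. In the document translation path the cost moves to translating the whole collection before indexing.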
Pros and cons of document translation
- Pros
  - Documents are translated only once
  - Documents provide richer context
  - Documents can be translated offline in advance
- Cons
  - Translation is slow
  - Consumes a great deal of space and time; inefficient
  - Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language resources available
- Query translation is generally more widely used
- Both approaches pose "interaction" challenges

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
- One type of interlingual approach uses the "synsets" provided by WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Because a synset number (label) representing a concept corresponds to a set of concrete words in each supported language (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques
- Dictionary-based methods
- Parallel corpora-based methods
- Use of WWW resources

Dictionary-based methods
- These methods use a bilingual machine-readable dictionary (MRD).
- Most retrieval systems are still based on so-called "bag-of-words" architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through indexing.
- A query can therefore be translated easily by replacing each query term with its translation equivalents from a bilingual dictionary or a bilingual term list.

Bilingual dictionary
- Manually built bilingual dictionaries
  - Printed: Merriam-Webster's Dictionaries, Longman Dictionaries
  - Electronic: Freedict at /, Travlang at /
- Problems
  - They have to be processed to be readable by machine
  - Limited vocabulary
  - Dictionary translations are inherently ambiguous and add extraneous information
- Dictionaries built automatically by machine are called machine-readable dictionaries (MRD)

Term translation
[Figure: example query with candidate translations]
- oil / petroleum; probe / survey / take samples: which translation should be chosen?
- No translation available!
- restrain / cymbidium goeringii: word segmentation error

Some issues in term translation
- Compound words, e.g. German: decomposition
- No boundaries between words, e.g. Chinese: segmentation
- Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
- Compound decomposition
- Chinese word segmentation: 新西蘭花
  - 新西蘭 | 花: "New Zealand flowers"
  - 新 | 西蘭花: "fresh broccoli"

Corpora-based method
- Parallel or comparable corpora are useful resources from which information beneficial for CLIR can be extracted.
- For example, to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved with an English query (the source query).

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset.
- Very high accuracy

[Figure labels: hieroglyphs, ancient Egyptian script, Greek]

The Rosetta Stone
- The Rosetta Stone is 1.14 m high and 0.73 m wide. It is a granodiorite stele made in 196 BC and carries a decree of the Egyptian king Ptolemy V.
- The same content is inscribed in Greek, in ancient Egyptian hieroglyphs, and in the Demotic script of the time.
- Because the stone carries the same text in three versions, modern scholars were able to compare them and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. This made the stone an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News
  - DE-News (German-English)
  - Hong Kong News, Xinhua News (Chinese-English)
- Government documents
  - Canadian Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, ...)
- Bible (many, many languages)

Examples
- English: Diverging opinions about planned tax reform
  German: Unterschiedliche Meinungen zur geplanten Steuerreform
- English: The discussion around the envisaged major tax reform continues.
  German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
- English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
  German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages that come from the same domain and cover the same topics.
- Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
- Some researchers extract phrase pairs from comparable corpora using a classifier approach.

Example
[Figure not recoverable]

Use of WWW resources
- The WWW can provide rich and ubiquitous machine-readable resources, from which information useful for CLIR may be extracted automatically.
- For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques (translation ambiguity)
- Disambiguation among multiple alternative term translations: how should one choose among several translations? e.g., "Apple", "Bank"
- Use of part-of-speech (POS) tags
- Use of a parallel corpus
- Use of co-occurrence statistics in the target corpus
- Use of the query expansion technique

Use of part-of-speech tags
- The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
- This method requires POS tagging software for both languages.

Parallel corpus-based disambiguation
- Davis (1997, 1998) used a parallel corpus to determine the "best" translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
[Figure: multiple translations and their translation probabilities]
- Source term: survey
- Candidate translations: 探測, 試探, 樣品, 測量
- Translation probabilities, in the order shown on the slide: p=0.4, p=0.3, p=0.25, p=0.05

Disambiguation based on co-occurrence statistics
- The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
- In one such method, the two most related terms in the query were first determined from co-occurrence statistics in the source-language corpus, and the "best" translations were then selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
- Note that these two corpora do not have to be parallel or comparable.

Query expansion for disambiguation
- Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing information retrieval performance.
- PRF also works effectively for CLIR tasks.
- In CLIR, two kinds of PRF are feasible: pre-translation feedback and post-translation feedback.

Pre-translation feedback
- If a corpus in the source language is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
- Pre-translation feedback may contribute to improved precision.
- This is because PRF is basically done using the entire query, not each source term separately. That is, synonyms or related terms corresponding to the "correct" meaning of each source term within the context of the query are expected to be added automatically through the PRF process.

Post-translation feedback
- After translation, standard PRF can be applied using the target document collection (post-translation feedback).
- Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
- In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select "good" terms: the Rocchio method and the probabilistic method.

Bi-directional translation
- Boughanem et al. (2002) explored a "bi-directional translation" technique in which a form of backward translation is used for ranking translation candidates.
- Suppose that we need to translate English query terms into French.
- First, a set of French equivalents for an English term is found in an English-French dictionary.
- Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms.
- Basically, if that set includes the original source term, the French equivalent is chosen as a preferred translation.

CLIR evaluation
- Information retrieval evaluation: given a search topic, a document collection, and a set of documents manually judged relevant, the results returned by the system are assessed.
- TREC CLIR (96-02): English to other languages
- CLEF (00-): between European languages
- NTCIR (99-): Asian languages and English

CLIR evaluation model
[Figure not recoverable]

Applications of CLIR

2.1 Cross-language Search Engine
- April 25, 2006: European search engine "Quaero". The French President announced 90 million euros of support.
- May 16, 2007: Google Translate. Provides CLIR for 12 languages. Goal: take "all the Web & translate into multiple langs."
- May 5, 2008: Yahoo Babel Fish. Provides CLIR between 12 languages. It was AltaVista's project, later bought by Yahoo.

Google Translate
[Screenshots]
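
A cross-language search engine of the kind listed above has to settle on translations for each query term before it searches. The bi-directional translation slide earlier in the deck describes one way to rank the candidates: translate forward, translate each candidate back, and prefer candidates whose back-translations contain the original term. Below is a minimal sketch of that check. The two toy dictionaries and the function name `rank_by_back_translation` are hypothetical, and the sketch is not claimed to reproduce the procedure of Boughanem et al. in detail.

```python
# Toy dictionaries: hypothetical entries for illustration only.
EN_FR = {"bank": ["banque", "rive", "banc"]}
FR_EN = {"banque": ["bank"], "rive": ["shore", "bank"], "banc": ["bench"]}


def rank_by_back_translation(term, forward, backward):
    """Split forward translations of `term` into preferred and other candidates.

    A candidate is preferred when translating it back with the reverse
    dictionary yields a set of terms that includes the original source term.
    """
    candidates = forward.get(term, [])
    preferred = [c for c in candidates if term in backward.get(c, [])]
    others = [c for c in candidates if c not in preferred]
    return preferred, others


if __name__ == "__main__":
    preferred, others = rank_by_back_translation("bank", EN_FR, FR_EN)
    print("preferred:", preferred)
    print("others:", others)
```

The check only filters candidates; it does not by itself choose between two preferred translations with different senses, so it would normally be combined with one of the other disambiguation techniques from the slides.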

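Yahoo Babel Fish
[Screenshots]

Question
Compare the cross-language search engines of Google and Yahoo!, and analyze the strengths and weaknesses of each.
- Google: one step (translate & search); the search results are translated back into the source language.
  - Pros: fast; results are easy for the user to understand.
  - Cons: the user cannot modify the translation.
- Yahoo!: two steps (translate + search); the search results are not translated.
  - Pros: there is an intermediate step, so the user can modify the translation.
  - Cons: more complex; the user cannot read the results.

2.2 Cross-language retrieval in digital libraries
- WorldWideScience, a cross-language retrieval platform for world science, was released on June 11, 2010 at the summer meeting of ICSTI (International Council for Scientific and Technical Information) held in Helsinki, the capital of Finland.

WorldWideScience
/multilingual
- The members of the alliance are all professional library and information institutions or leading organizations in scientific and technical information, such as the Office of Scientific and Technical Information (OSTI) of the US Department of Energy, the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
- The platform can also perform cross-language, cross-database searches automatically.

2.3 Cross-language patent retrieval
- According to the World Intellectual Property Organization (WIPO), patent documents contain 90% to 95% of the world's research results, while other technical documents (papers, journals, etc.) contain only 5% to 10%.
- Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.

PATENTSCOPE
/patentscope/search/en/clir/clir.jsp
- In May 2010, WIPO released a beta version of its cross-language patent search system PATENTSCOPE, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
- The system supports cross-language patent search only among five languages: English, French, German, Japanese, and Spanish.

2.4 Cross-language image retrieval
- The representative cross-language image retrieval system that has reached practical use is PanImages (/), a cross-language image search engine developed at the University of Washington.
- PanImages provides translation for more than 100 languages.
- The user enters a keyword and selects the language it belongs to; machine translation converts the keyword into other languages; the translated keywords are then searched in Google image search and Flickr image search.

PanImages
/

2.5 Applications in e-commerce
- CINDOR is currently one of the more successful commercial cross-language information retrieval systems.
- The CINDOR system has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
- CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.

CINDOR
/home.html

Reference
- Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005 (41), pp. 433-455.
- Ge Yundong (葛運東). Research on query translation techniques for cross-language information retrieval [D]. Soochow University, 2010.
- Wang Xuwen (王序文). Research on cross-language information retrieval techniques based on topic pseudo-relevance feedback [D]. Beijing University of Posts and Telecommunications, 2014.
- Peng Lin (彭琳). Research on measuring semantic similarity of Chinese words and its application in cross-language information retrieval [D]. Fudan University, 2010.

Challenges for "interaction"
- CLIR poses some unique challenges for interaction
- How do you help users select translated query terms?
- How do you help users se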
