Information Retrieval & Web Search
Dong Shoubin (董守斌)

IR Retrieval

Typical IR Task

Given:
- A corpus of textual natural-language documents.
- A user query in the form of a text string.

Find:
- A ranked set of documents that are relevant to the query.

Definitions of Information Retrieval (IR)

Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests."

Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."

What is Information Retrieval?

- Information Retrieval (IR) is the study of retrieval from unstructured data: usually text, but also audio and video.
- Data is considered unstructured when the structure is unknown and the semantics of each component are unknown, e.g., "bank", "nut", "sun", "white house".
- IR systems exploit statistical regularities in the data without trying to "understand" what the documents mean.
- Contrast with NLP (Natural Language Processing): NLP systems try to find the meaning (semantics) in unstructured text.

Information Retrieval vs. Data Retrieval

Data retrieval:
- Retrieve all objects that satisfy clearly defined conditions, expressed in a regular expression or a relational algebra expression.
- Data has a well-defined structure and semantics.
- The solution offered to the user of a database system.

Information retrieval:
- Information need: retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.

[Figure: simple flow of retrieval.]

Keys to Information Retrieval

- Representation of the information need (the query).
- Representation of the objects (the documents).
- Object retrieval according to relevance, based on the representations.
- Evaluation.

Relevance

Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Satisfying the information need: the goals of the user and his/her intended use of the information.

The notion of relevance:
- Simplest notion: the query string (keyword) appears verbatim in the document.
- Slightly less strict notion: the keywords appear frequently in the document, in any order (bag of words).

IR Retrieval: The Vector Space Model

Key idea: everything (documents, queries, terms) is a vector of terms in a high-dimensional space:

  d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})

- Each term k_i, in a document or query j, is given a non-binary weight w_{ij}.
- These term weights are used to compute a degree of similarity between a query and each document.
- A ranked set of documents provides for better matching and partial matching.
- Dimension = t = |vocabulary|.
- The Vector Space Model (VSM) was developed in the SMART system (Salton, c. 1970) and is standardly used by TREC participants and web search engines.

Graphic Representation

Example in three dimensions:

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

Is D1 or D2 more similar to Q? How do we measure the degree of similarity: distance, angle, projection?

Document Collection

- A collection of n documents can be represented in the vector space model by a term-document matrix.
- An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document, or simply doesn't exist in the document.

[Figure: term-document matrix with entries w_{ij}, terms T1..Tt on one axis and documents D1..Dn on the other.]

How to Calculate a Term's Weight?

Term frequency: the number of times a term occurs in a document.

[Figure: example term-document count matrix.]

Is raw counting enough?
- Terms tell us about documents: if "rabbit" appears a lot, the document may be about rabbits.
- Documents tell us about terms: "the" is in every document, so it is not discriminative.
- Documents are most likely described well by rare terms that occur in them.
- Higher "term frequency" is stronger evidence; low "collection frequency" makes it stronger still.

Term Weighting

- tf factor (term frequency): the raw frequency of a term k_i inside a document d_j.
- idf factor (inverse document frequency): the inverse of the frequency of a term k_i among the documents in the collection.

A small counting sketch follows.
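To make the tf and df factors concrete, here is a minimal sketch in Python, using a hypothetical toy corpus, that counts raw term frequencies per document and document frequencies across the collection:

```python
from collections import Counter

# Hypothetical toy corpus: one string per document.
docs = [
    "the rabbit eats the carrot",
    "the dog chases the rabbit",
    "the sun rises",
]

# tf factor: raw frequency of each term inside each document.
tf = [Counter(d.split()) for d in docs]

# df: number of documents in which each term occurs at least once.
df = Counter(term for d_tf in tf for term in d_tf)

print(tf[0]["the"])   # 2 -> frequent locally, but...
print(df["the"])      # 3 -> ...occurs in every document: not discriminative
print(df["rabbit"])   # 2 -> rarer terms describe their documents better
```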

TF-IDF Term Weighting

A typical weighting is tf-idf:

  w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log \frac{N}{df_i}

where N is the total number of documents in the collection and df_i is the number of documents containing term k_i.

- A term occurring frequently in the document but rarely in the rest of the collection is given high weight: high term frequency is evidence of meaning, and high IDF is evidence of term importance.
- A very good weighting heuristic in practice; it was also theoretically shown to work well (Papineni, "Why inverse document frequency?", NAACL 2001).

Documents as Vectors

[Figure: sparse tf-idf term-document matrix; most entries are zero.]

- So we have a vector space: terms are the axes, and documents live in this space.
- Very high dimensionality: even with stemming, there may be 20,000+ axes.

The Query Vector

- The query vector is typically treated as a (very short) document and is also tf-idf weighted.
- An alternative is for the user to supply weights for the given query terms.

Query term weight (Salton and Buckley):

  w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \cdot \log \frac{N}{n_i}

where freq_{i,q} is the raw frequency of the term k_i in the query q, and n_i is the number of documents containing k_i.

Similarity Measure

- We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
- A similarity measure is a function that computes the degree of similarity between two vectors.
- Using a similarity measure between the query and each document:
  - it is possible to rank the retrieved documents in the order of presumed relevance;
  - it is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Cosine Similarity

The closeness of a document vector d_j and a query vector q is captured by the cosine of the angle between them:

  sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}
              = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \; \sqrt{\sum_{i=1}^{t} w_{iq}^2}}

The cosine measure is known as the normalized inner product. Cosine normalization divides term weights by the length of the document vector, so longer documents are not unfairly favored. A small ranking sketch follows.
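A minimal end-to-end sketch of VSM ranking, assuming the simple tf-idf variant above (raw tf times log(N/df)) and cosine similarity; the corpus and query are hypothetical:

```python
import math
from collections import Counter

docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
]
N = len(docs)

tf = [Counter(d.split()) for d in docs]
df = Counter(t for d_tf in tf for t in d_tf)
idf = {t: math.log(N / df[t]) for t in df}

def tfidf_vector(counts):
    # w_ij = tf_ij * idf_i; terms unseen in the corpus get weight 0.
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# The query is treated as a (very short) document, also tf-idf weighted.
q = tfidf_vector(Counter("home sales july".split()))
d_vecs = [tfidf_vector(c) for c in tf]

# Rank documents by cosine similarity to the query (descending).
ranking = sorted(range(N), key=lambda j: cosine(d_vecs[j], q), reverse=True)
print(ranking)
```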

Retrieval Example

[Worked example: a term-document weight matrix over four documents; for the query "contaminated retrieval", similarity to each document is computed by the inner product, giving the ranked list 2, 3, 1, 4.]

Vector Space Model: Advantages

- Simple, mathematically based approach: map everything (documents, queries, sentences, terms) to a vector.
- Ability to incorporate term weights: any type of term weight can be used.
- TF-IDF weighting considers both local (tf) and global (idf) word occurrence frequencies.
- Provides partial matching and ranked results.
- Tends to work quite well in practice despite obvious weaknesses.
- The vector space model is the most popular retrieval model.

Vector Space Model: Disadvantages

- Missing semantic information (e.g. word sense).
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence (e.g. ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document): given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.

IR Retrieval: How to Implement It?

Online text scanning:
- Naïve approach: scan the text for each query.
- How long will it take to find a document? Is there any work we can do in advance? If so, how long will that take?

Indexed retrieval:
- Build data structures over the text to speed up the search.
- Suited to semi-static collections: updated at reasonably regular intervals.

[Figure: an original document converted into a term incidence bit vector.]

Retrieval with an Inverted Index

- Tokens that are not in both the query and the document do not affect cosine similarity: the product of their token weights is zero and does not contribute to the dot product.
- Usually the query is fairly short, and therefore its vector is extremely sparse.
- Use an inverted index to find the limited set of documents that contain at least one of the query words.

Inverted Files

- A word-oriented mechanism for indexing text collections to speed up searching.
- Each document is assigned a list of keywords or attributes; each keyword (attribute) is associated with operational relevance weights.
- An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.

[Figure: index file mapping each keyword to a postings list of documents, e.g. D1, D5, D7, ...]

[Figure: producing postings from text ("Friends, Romans, ..."): tokenizer, linguistic modules, then the indexer emitting sorted postings, shown alongside a term-document incidence matrix.]
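A minimal sketch of building a positional inverted index in memory (document IDs plus word offsets, using a hypothetical two-document corpus). Real indexers add tokenization, linguistic normalization, and on-disk postings, but the core structure is this:

```python
from collections import defaultdict

docs = {
    1: "i did enact julius caesar",
    2: "so let it be with caesar",
}

# term -> {doc_id: [positions of the term in that doc]}, docs visited in sorted order
index = defaultdict(dict)
for doc_id, text in sorted(docs.items()):
    for pos, token in enumerate(text.split()):
        index[token].setdefault(doc_id, []).append(pos)

print(index["caesar"])  # {1: [4], 2: [5]}
```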

What Goes in a Postings File?

- Boolean retrieval: just the document numbers, e.g. D2, D4, ...
- Ranked retrieval: document number and term frequency or term weight (TF*IDF, ...), e.g. D2:1, D4:3, ...
- Proximity operators: word offsets for each occurrence of the term, e.g. D2(44), D4(17, 36), ...

Boolean Queries

Keywords combined with Boolean operators:
- OR: (e1 OR e2).
- AND: (e1 AND e2).
- BUT: (e1 BUT e2), i.e. satisfy e1 but not e2.

Negation is only allowed via BUT, which permits efficient use of the inverted index: it filters another efficiently retrievable set. Naïve users have trouble with Boolean logic.

Boolean Retrieval with an Inverted Index

- Primitive keyword: retrieve the containing documents using the inverted index.
- OR: recursively retrieve e1 and e2 and take the union of the results.
- AND: recursively retrieve e1 and e2 and take the intersection of the results.
- BUT: recursively retrieve e1 and e2 and take the set difference of the results.

Boolean Query Processing

[Example: intersecting the postings lists 2 → 4 → 8 → ... and 1 → 2 → 3 → 5 → 8 → ... by walking both lists in parallel.]

If the list lengths are x and y, the merge takes O(x + y) operations. (Crucial: postings are sorted by docID.) A sketch of the merge is shown below.
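A sketch of the AND merge over two docID-sorted postings lists (OR takes the union and BUT the set difference by walking the lists the same way):

```python
def intersect(p1, p2):
    # Merge two docID-sorted postings lists in O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```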
Phrasal Queries

- Retrieve documents containing a specific phrase: an ordered list of contiguous words, e.g. "information retrieval".
- May allow intervening stop words and/or stemming: "buy camera" matches "buy a camera" and "buying the cameras".

Phrasal Retrieval with Inverted Indices

- Requires an inverted index that also stores the positions of each keyword in each document.
- Retrieve documents and positions for each individual word, intersect the documents, and then finally check for ordered contiguity of the keyword positions.
- Best to start the contiguity check with the least common word in the phrase.

Phrasal Search Algorithm

1. Find the set of documents D in which all keywords k1...km of the phrase occur (using AND query processing).
2. Initialize an empty set R of retrieved documents.
3. For each document d in D:
   - Get the array P_i of positions of occurrences for each k_i in d.
   - Find the shortest array P_s among the P_i.
   - For each position p of keyword k_s in P_s:
     - For each keyword k_i except k_s, use binary search to look for the position (p - s + i) in the array P_i.
     - If the correct position is found for every keyword, add d to R.

A sketch of this algorithm follows.
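A sketch under the assumption that the index has the shape of the positional index built earlier (term -> {doc_id: [positions]}):

```python
from bisect import bisect_left

def contains(arr, x):
    # Binary search: is x present in sorted list arr?
    k = bisect_left(arr, x)
    return k < len(arr) and arr[k] == x

def phrase_search(index, keywords):
    # Documents containing every keyword (AND processing via set intersection).
    candidates = set.intersection(*(set(index[k]) for k in keywords))
    results = set()
    for d in candidates:
        positions = [sorted(index[k][d]) for k in keywords]
        # Scan the shortest positions array, checking the others by binary search.
        s = min(range(len(keywords)), key=lambda i: len(positions[i]))
        for p in positions[s]:
            if all(i == s or contains(positions[i], p - s + i)
                   for i in range(len(keywords))):
                results.add(d)
                break
    return results

# With the positional index built in the earlier sketch:
# phrase_search(index, ["julius", "caesar"]) -> {1}
```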
Indexing Time and Query Time

Indexing:
- Walk the inverted file, splitting postings if necessary, and insert into the postings file in sorted order.
- Hours or days for large collections.

Query processing:
- Walk the inverted file, read the postings, and manipulate the postings based on the query.
- Seconds, even for enormous collections.

IR Retrieval: Why System Evaluation?

- There are many retrieval models/algorithms/systems; which one is the best?
- What is the best component for:
  - the ranking function (dot-product, cosine, ...);
  - term selection (stopword removal, stemming, ...);
  - term weighting (TF, TF-IDF, ...)?
- How far down the ranked list will a user need to look to find some/all relevant documents?

Difficulties in Evaluating IR Systems

- Effectiveness is related to the relevancy of retrieved items.
- Relevancy is not typically binary but continuous; even if relevancy is binary, it can be a difficult judgment to make.
- Relevancy, from a human standpoint, is:
  - subjective: depends upon a specific user's judgment;
  - situational: relates to the user's current needs;
  - cognitive: depends on human perception and behavior;
  - dynamic: changes over time.

[Figure: automatic evaluation setup; measures for ranked and set-based retrieval.]

Measures of Set-Based Effectiveness: Precision and Recall

Within the entire document collection, compare the retrieved set against the relevant set:

                   Relevant                 Irrelevant
  Retrieved        relevant retrieved       irrelevant retrieved
  Not retrieved    relevant not retrieved   irrelevant not retrieved

  Precision = |relevant retrieved| / |retrieved|
  Recall    = |relevant retrieved| / |relevant|

Precision and Recall

Precision:
- How much of what was found is relevant?
- The ability to retrieve top-ranked documents that are mostly relevant.

Recall:
- How much of what is relevant was found?
- The ability of the search to find all of the relevant items in the corpus.
- The total number of relevant items is sometimes not available.

Trade-off between Recall and Precision

[Figure: precision versus recall. A system at the upper left returns only relevant documents but misses many useful ones too; a system at the lower right returns most relevant documents but includes lots of junk; the ideal sits at the upper right corner.]
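A small sketch of the set-based measures, with hypothetical judged and retrieved sets:

```python
relevant = {1, 2, 4, 6, 9, 11}     # gold-standard relevant doc IDs
retrieved = {1, 2, 3, 4, 5, 6}     # what the system returned

hits = relevant & retrieved
precision = len(hits) / len(retrieved)  # 4/6: how much of what was found is relevant
recall = len(hits) / len(relevant)      # 4/6: how much of what is relevant was found
print(precision, recall)
```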

Computing Recall/Precision Points

- For a given query, produce the ranked list of retrievals.
- Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures.
- Mark each document in the ranked list that is relevant according to the gold standard.
- Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Example: a ranked list of ten documents in which the documents at ranks 1, 2, 4, 6, and 9 are relevant (marked x).

Let the total number of relevant documents be 6, and check precision at each new recall point:

  R = 1/6 = 0.167;  P = 1/1 = 1.0
  R = 2/6 = 0.333;  P = 2/2 = 1.0
  R = 3/6 = 0.5;    P = 3/4 = 0.75
  R = 4/6 = 0.667;  P = 4/6 = 0.667
  R = 5/6 = 0.833;  P = 5/9 = 0.556

One relevant document is never retrieved, so the list never reaches 100% recall.

Interpolating a Recall/Precision Curve

- Interpolate at 11 standard recall levels (11-point interpolation): r_j ∈ {0.0, 0.1, 0.2, ..., 1.0}, i.e. r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0.
- The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level at or above the j-th level:

  P(r_j) = \max_{r \ge r_j} P(r)

The sketch below reproduces this example.
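The following sketch computes an uninterpolated recall/precision pair at each relevant rank, then the 11-point interpolated precisions:

```python
def rp_points(ranked_rel, total_relevant):
    # ranked_rel[i] is True if the document at rank i+1 is relevant.
    points, hits = [], 0
    for rank, is_rel in enumerate(ranked_rel, start=1):
        if is_rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return points

def interpolate_11pt(points):
    # Interpolated precision at r_j = max precision at any recall >= r_j.
    return [max((p for r, p in points if r >= j / 10), default=0.0)
            for j in range(11)]

# Ranks 1, 2, 4, 6, 9 are relevant; 6 relevant documents exist in total.
ranked_rel = [True, True, False, True, False, True, False, False, True, False]
pts = rp_points(ranked_rel, 6)
print(pts)                   # [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), ...]
print(interpolate_11pt(pts)) # 1.0 down to 0.556, then 0.0 past the last recall point
```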

[Figure: interpolating the example above at the 11 standard recall levels.]

[Figure: the resulting interpolated recall/precision curve.]

Comparing Two or More Systems

[Figure: two precision/recall curves.] The curve (RP curve) closest to the upper right-hand corner of the graph indicates the best performance.

Mean Average Precision

- How to compare two systems by P/R curves? We need to average over many queries.
- Two main types of averaging:
  - micro-average: average over all queries (each relevant document is a point in the average);
  - macro-average: average of within-query precision/recall (each query is a point in the average).
- Mean Average Precision (MAP): the mean of the average precision over many queries.
  - Interpolated: average the precision at recall = 0.0, 0.1, ..., 1.0.
  - Uninterpolated: average the precision at each relevant document.
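A sketch of uninterpolated MAP, built on the rp_points helper above, with hypothetical judgments for two queries:

```python
def average_precision(ranked_rel, total_relevant):
    # Uninterpolated AP: mean precision at each relevant document.
    # Relevant documents never retrieved contribute precision 0.
    precs = [p for _, p in rp_points(ranked_rel, total_relevant)]
    return sum(precs) / total_relevant

# MAP: mean of per-query average precision (a macro-average over queries).
queries = [
    ([True, True, False, True, False, True, False, False, True, False], 6),
    ([False, True, True, False, True], 3),
]
print(sum(average_precision(r, n) for r, n in queries) / len(queries))
```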

F-Measure

One measure of performance that takes both recall and precision into account: the harmonic mean of recall and precision:

  F = \frac{2PR}{P + R} = \frac{2}{\frac{1}{R} + \frac{1}{P}}

Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.

E-Measure (Parameterized F-Measure)

A variant of the F-measure that allows weighting emphasis on precision over recall:

  E = \frac{(1 + b^2) P R}{b^2 P + R} = \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}

The value of b controls the trade-off:
- b = 1: equally weight precision and recall (E = F).
- b > 1: weight recall more.
- b < 1: weight precision more.

Focusing on Top Documents

- Users tend to look at only the top part of the ranked result list to find relevant documents.
- Some search tasks have only one relevant document, e.g. navigational search, question answering.
- Recall is not appropriate here; instead we need to measure how well the search engine does at retrieving relevant documents at very high ranks: Precision@N, R-Precision, Mean Reciprocal Rank (MRR).

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

[Example: for the ranked list above, R = 6 relevant documents exist and 4 of the top 6 results are relevant, so R-Precision = 4/6 = 0.67.]

Single-Valued Measures

- Mean precision at a fixed number of retrieved documents: Precision@10 and Precision@20 are often used for Web search evaluation.
- MRR (Mean Reciprocal Rank): the mean, over all topics, of the reciprocal rank, i.e. 1 divided by the rank of the first correct answer. Very sensitive to rank position.

Discounted Cumulative Gain (DCG)

- A popular measure for evaluating web search and related tasks.
- Two assumptions:
  - highly relevant documents are more useful than marginally relevant documents;
  - the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.
- Uses graded relevance as a measure of the usefulness, or gain, of examining a document. A sketch of these top-weighted measures follows.
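The DCG sketch below uses a common formulation (the gain at rank 1 plus rel_i / log2(i) for i >= 2); this exact formula is an assumption here, since the slide breaks off before giving one:

```python
import math

def precision_at(ranked_rel, n):
    # Fraction of the top n results that are relevant.
    return sum(ranked_rel[:n]) / n

def reciprocal_rank(ranked_rel):
    # 1 / rank of the first relevant result (0 if none); MRR averages this over topics.
    for rank, is_rel in enumerate(ranked_rel, start=1):
        if is_rel:
            return 1 / rank
    return 0.0

def dcg(gains, n):
    # Graded relevance: rank 1 counts fully, later ranks are log-discounted.
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:n], start=2))

ranked_rel = [True, True, False, True, False, True, False, False, True, False]
print(precision_at(ranked_rel, 6))   # 4/6 = R-Precision for this query (R = 6)
print(reciprocal_rank(ranked_rel))   # 1.0: first relevant document is at rank 1
print(dcg([3, 2, 3, 0, 1, 2], 6))    # hypothetical graded gains on a 0..3 scale
```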
