天網搜索平臺PARADISE_第1頁
天網搜索平臺PARADISE_第2頁
天網搜索平臺PARADISE_第3頁
天網搜索平臺PARADISE_第4頁
已閱讀5頁,還剩60頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

天網搜索平臺PARADISE閆宏飛北京大學計算機系網絡實驗室2009/4/24信息檢索前沿概述閆宏飛北京大學計算機系網絡實驗室2009/4/24345OutlineIssues:searchengineandwebminingGoal:FrameworkPrinciples:Relatedcomponents(照應任務各部分關聯)Implementation:Achievements(應用成果)Casestudy(具體到一事)6SearchEngineandWebMiningCrawlingFull-textindexingRetrievingWebarchivingandMining7WebGraph:bowtieteapotModel:bagofwordsInfomall,CDAL:archivehistraceEvaluation:manualautomaticPlatform:proprietaryopen89OutlineMotivationtoBuildPARADISEDesignandImplementPARADISEResearchIssuesofConcern10Corpus(1/3)Shakespeare’scollectedworksAgzippedtar-file[2039k]http://www.cs.su.oz.au/~matty/Shakespeare/shakespeare.tar.gzRCV1,ReutersCorpusVolume1.oneyearofReutersnewswire1996-08-20to1997-08-19containsabout810,000ReutersEnglishLanguageNewsstories.requiresabout2.5GBforstorageoftheuncompressedfiles./data/reuters/reuters.html

11Corpus(2/3)CWT100GB,ChineseWebTestcollection2004年6月搜集,有5,712,710個網頁,解壓容量為90GB。CWT200GB2005年11月搜集,有37,482,913個網頁,壓縮容量為197GB/12Corpus(3/3)2008/10/1513Hardwareassumptionssymbol statistic values averageseektime 5ms=5x10?3sb transfertimeperbyte 0.02μs=2x10?8s processor’sclockrate 10?9sp lowleveloperation 0.01μs=10?8s(e.g.,compare&swapaword) sizeofmainmemory severalGB sizeofdiskspace 1TBormore14ReutersRCV1statisticssymbol statistic valueN documents 800,000L avg.#tokensperdoc 200M terms(=wordtypes) 400,000avg.#bytespertoken 6(incl.spaces/punct.)avg.#bytespertoken 4.5(withoutspaces/punct.)avg.#bytesperterm 7.5tokens 100,000,0004.5bytesperwordtokenvs.7.5bytesperwordtype:why?15DocumentsareparsedtoextractwordsandthesearesavedwiththeDocumentID.IdidenactJuliusCaesarIwaskilledi'theCapitol;Brutuskilledme.Doc1SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc2Indexconstruction16KeystepAfteralldocumentshavebeenparsed,theinvertedfileissortedbyterms.Wefocusonthissortstep.Wehave100Mitemstosort.17ScalingindexconstructionIn-memoryindexconstructiondoesnotscale.Howcanweconstructanindexforverylargecollections?TakingintoaccountthehardwareconstraintsMemory,disk,speedetc.18Sort-basedIndexconstructionAswebuildtheindex,weparsedocsoneatatime.Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)Thefinalpostingsforanytermareincompleteuntiltheend.At12bytesperpostingsentry,demandsalotofspaceforlargecollections.T=100,000,000inthecaseofRCV1So…wecandothisinmemoryin2008,buttypicalcollectionsaremuchlarger.E.g.NewYorkTimesprovidesindexof>150yearsofnewswireThus:Weneedtostoreintermediateresultsondisk.19Usethesamealgorithmfordisk?Canweusethesameindexconstructionalgorithmforlargercollections,butbyusingdiskinsteadofmemory?No:SortingT=100,000,000recordsondiskistooslow–toomanydiskseeks.Weneedanexternalsortingalgorithm.20BottleneckParseandbuildpostingsentriesonedocatatimeNowsortpostingsentriesbyterm(thenbydocwithineachterm)Doingthiswithrandomdiskseekswouldbetooslow–mustsortT=100MrecordsIfeverycomparisontook2diskseeks,andNitemscouldbesortedwithNlog2Ncomparisons,howlongwouldthistake?2112-byte(4+4+4)records(term,doc,freq).Thesearegeneratedasweparsedocs.Mustnowsort100Msuch12-byterecordsbyterm.DefineaBlock~10MsuchrecordsCaneasilyfitacoupleintomemory.Willhave10suchblockstostartwith.Basicideaofalgorithm:Accumulatepostingsforeachblock,sort,writetodisk.Thenmergetheblocksintoonelongsortedorder.BSBI:Blockedsort-basedIndexing(Sortingwithfewerdiskseeks)22BSBI:Blockedsort-basedIndexing23Remainingproblemwithsort-basedalgorithmOurassumptionwas:wecankeepthedictionaryinmemory.Weneedthedictionary(whichgrowsdynamically)inordertoimplementatermtotermIDmapping.Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings......butthenintermediatefilesbecomeverylarge.(Wewouldendupwithascalable,butveryslowindexconstructionmethod.)24SPIMI:

Single-passin-memoryindexingKeyidea1:Generateseparatedictionariesforeachblock–noneedtomaintainterm-termIDmappingacrossblocks.Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.Theseseparateindexescanthenbemergedintoonebigindex.25

SPIMI-InvertMergingofblocksisanalogoustoBSBI.26DistributedindexingForweb-scaleindexing(don’ttrythisathome!):mustuseadistributedcomputingclusterIndividualmachinesarefault-proneCanunpredictablyslowdownorfailHowdoweexploitsuchapoolofmachines?27GoogledatacentersGoogledatacentersmainlycontaincommoditymachines.Datacentersaredistributedaroundtheworld.Estimate:atotalof1millionservers,3millionprocessors/cores(Gartner2007)Estimate:Googleinstalls100,000serverseachquarter.Basedonexpendituresof200–250milliondollarsperyearThiswouldbe10%ofthecomputingcapacityoftheworld!?!28GoogledatacentersIfinanon-fault-tolerantsystemwith1000nodes,eachnodehas99.9%uptime,whatistheuptimeofthesystem?Answer:37%Calculatethenumberofserversfailingperminuteforaninstallationof1millionservers.29DistributedindexingMaintainamastermachinedirectingtheindexingjob–considered“safe”.Breakupindexingintosetsof(parallel)tasks.Mastermachineassignseachtasktoanidlemachinefromapool.30ParalleltasksWewillusetwosetsofparalleltasksParsersInvertersBreaktheinputdocumentcorpusintosplitsEachsplitisasubsetofdocuments(correspondingtoblocksinBSBI/SPIMI)31ParsersMasterassignsasplittoanidleparsermachineParserreadsadocumentatatimeandemits(term,doc)pairsParserwritespairsintojpartitionsEachpartitionisforarangeofterms’firstletters(e.g.,a-f,g-p,q-z)–herej=3.Nowtocompletetheindexinversion32InvertersAninvertercollectsall(term,doc)pairs(=postings)foroneterm-partition.Sortsandwritestopostingslists33DataflowsplitsParserParserParserMastera-fg-pq-za-fg-pq-za-fg-pq-zInverterInverterInverterPostingsa-fg-pq-zassignassignMapphaseSegmentfilesReducephase34MapReduceTheindexconstructionalgorithmwejustdescribedisaninstanceofMapReduce.MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputingwithouthavingtowritecodeforthedistributionpart.TheydescribetheGoogleindexingsystem(ca.2002)asconsistingofanumberofphases,eachimplementedinMapReduce.35MapReduceIndexconstructionwasjustonephase.Anotherphase:transformingaterm-partitionedindexintodocument-partitionedindex.Term-partitioned:onemachinehandlesasubrangeoftermsDocument-partitioned:onemachinehandlesasubrangeofdocumentsmostsearchenginesuseadocument-partitionedindex…betterloadbalancing,etc.36SchemaforindexconstructioninMapReduceSchemaofmapandreducefunctionsmap:input→list(k,v)reduce:(k,list(v))→outputInstantiationoftheschemaforindexconstructionmap:webcollection→list(termID,docID)reduce:(<termID1,list(docID)>,<termID2,list(docID)>,…)→(postingslist1,postingslist2,…)Exampleforindexconstructionmap:d2:Cdied.d1:Ccame,Cc’ed.→(<C,d2>,<died,d2>,<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>reduce:(<C,(d2,d1,d1)>,<died,(d2)>,<came,(d1)>,<c’ed,(d1)>)→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:1)>,<c’ed,(d1:1)>)37有幾個問題需要解決很難找到合適的硬件設施,基于這些數據開展工作存儲,運行,維護,….缺少處理這些數據的有效的軟件工具分析,檢索,評測,….如何維護有stateofarts算法的軟件工具模塊替換,功能測試,性能測試,…38智能中文搜索引擎技術研究及平臺構建由一個平臺,四個任務組成的.達到平臺能夠展示四個任務.平臺是:一個開放式智能化中文搜索引擎平臺任務是:構建適用于中文互聯網的文本內容抽取系統(tǒng)及網頁文本內容結構化分析基于內容的多媒體信息檢索基于自然語言理解的查詢分析基于用戶屬性挖掘的個性化檢索/src/paradise/39PARADISEDesign

PlatformforApplying,ResearchingAndDevelopingIntelligentSearchEngineServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersBareMetalMachinesAutomationClusterManagementToolsDatanodesTianwangjobsandevaluationjobs40PARADISE索引設計介紹41當前設計中的基本功能Usingfastersingle-passindexingHighlycompressedindexdiskdatastructuresVariouscompressalgorithm(*)Variousinvertindexfileformat(*)indexposition,freqvalueorimpactvalue(*)Indexfieldinformation,suchastitle,dateCachesupport(*)42下一步設計的功能Indexvariousdocumentformat,suchasword,pdf使用MapReduce等技術的索引多種切詞的實現與處理增量索引與文檔刪除Termvector存儲多segment的合并策略43InvertedindexDocument-sortedindexThepointersineachinvertedlistarestoredinincreasingdocumentorderUsingd-gapFrequency-sortedindexInvertedlistarestoredindecreasingtermfrequencyscoreImpact-sortedindexThepointersareorderedaccordingtotheimpactvalueImpactvalueisasmallintegerrepresentingtheoverallcontributionoftermttothescoreofdocumentd,includethefactorusedtonormalizefordocumentlength44ProcessingquerymodeDocument-at-a-timeAllinvertedlistsaresimultaneouslyaccessed|q|-wayprocessingiscarriedouttohandleaqueryqof|q|termsTerm-at-a-timeOnlyoninvertedlistisaccessedatanygiventimeAsequenceof|q|-1binaryoperationsareperformedNotclearbetterthandocument-at-a-timeprocessingScore-at-a-timeAllofthetermlistsareopenforreadingProcessedindecreasingscorecontributionorderratherthaninincreasingdocumentnumberorderDynamicquerypruning45IndexlevelsDocumentnumbersonlyOnlysupportBooleanqueriesDocumentnumbersandScoreScorecanbeimpactscoresorfrequentorbothSupportBooleanandrankedqueriesDocumentnumbers,Score,PositionsSupportBoolean,ranked,phrase(proximityqueries)46TypesofindexandsupportquerymodestypedsposOrganizationProcessingmodesupportedDY--Document-sortTermordoc–at-a-timeDSYY-Document-sortImpactorfreq-sortTermordoc–at-a-timeDSPYYYDocument-sortedTermordoc-at-a-timescore-at-a-time47InterleavingPointerinterleavingEachdocumentnumberisimmediatelyfollowedbythecorrespondingworfvalueorposvalueTerminterleavingKeepvariousrelatedpartsofeachinvertedlistincontiguousblocsNon-interleavingUsingseparateinvertedfileEachstoringaparticulartypeofinformationFordistinctdiskpointersmaintainedineachentryinthedictionary48Block-interleavedindexesEachinvertedlistisbuiltupasasequenceofoneormoregroupsWithineachgrouptherearefourormorek-blocks49Cache是一個工具包,為一些模塊提供cache功能Cahce策略與存儲分開有多種cahce策略 當前實現MRU策略多種存儲策略基于STLmap采用模板實現,保證其通過性50DocumentThelogicalrepresentationofaDocumentforindexingandsearchingADocumentisacollectionofFieldablesAFieldableisalogicalrepresentationofauser'scontentthatneedstobeindexedorstoredFieldableshaveanumberofpropertiesthattelltheindexhowtotreatthecontent(likeindexed,tokenized,stored,etc.)51Compress工具包,提供整形數字的解壓縮實現多種解壓縮算法采用工廠方法模式52AnalysisAnanalyzerbuildsTokenStreams,whichanalyzetextItthusrepresentsapolicyforextractingindextermsfromtextTypicalimplementationsfirstbuildaTokenizer,whichbreaksthestreamofcharactersintorawTokensOneormoreTokenFiltersmaythenbeappliedtotheoutputoftheTokenizer53PARADISEImplementation

/src/paradise/

applicationPKUsearchpdfliteraturesearch2008Olympicssearchmodulecrawlanalysis&indexsearchutilitycompressBuffer...thirdpartycmake,boost,zlib,tidy,openssl,berkeleydbdataTianwangFormatquerylogIPLocator

tutorialCMakeandBoostreferenceBooksandpaperscourseInformationRetrievalCloudComputingDistributedSy

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論