Design and Implementation of a Lucene-Based Book Search Engine: Foreign Literature

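Binjiang College
Graduation Thesis (Design): Foreign Literature Translation

Title: Lucene-Based Book Search Engine
Student name:
Student ID:
Department: Department of Computer Science, Binjiang College
Major: Software Engineering
Advisor:
April 28, 2011

Understanding Lucene

Different people are fighting the same problem, information overload, using different approaches. Some have been working on novel user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into the code samples later in this chapter, we will give you a high-level picture of what Lucene is, what it is not, and how it came to be.

What Lucene is

Lucene is a high-performance, scalable information retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it is a member of the popular Apache Jakarta family of projects and is licensed under the liberal Apache Software License. As such, Lucene is currently, and has been for a few years, the most popular free Java IR library.

As you will soon discover, Lucene provides a simple yet powerful core API that requires only a minimal understanding of full-text indexing and searching. You need to learn only a handful of its classes to start integrating Lucene into an application. Because Lucene is a Java library, it makes no assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.

People new to Lucene often mistake it for a ready-to-use application such as a file-search program, a web crawler, or a web site search engine. That is not what Lucene is: Lucene is a software library, a toolkit, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with the business rules specific to its problem domain while hiding the complexity of the indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.

A number of full-featured search applications have been built on top of Lucene. If you are looking for something prebuilt, or for a framework for crawling, document handling, and searching, consult the "Powered by" page on the Lucene wiki (/jakarta-lucene/poweredby) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.

What Lucene can do for you

Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make searchable any data that can be converted to a textual format. As you can see in figure 1.5, Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.

Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases do not provide. Once you integrate Lucene, users of your applications can make searches such as +george +rice -eat -pudding, apple pie +tiger, animal:monkey AND food:banana, and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages, and the list goes on.

History of Lucene

Lucene was originally written by Doug Cutting; it was initially available for download from its home at the SourceForge web site. In September 2001 it joined the Apache Software Foundation's Jakarta family of high-quality open source Java products. With each release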
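since then, the project has enjoyed increased visibility, attracting more users and developers. As of July 2004, Lucene version 1.4 has been released, with a bug fix 1.4.2 release in early October. Table 1.1 shows Lucene's release history.

Table 1.1 Lucene's release history

Version   Release date    Milestones
0.01      March 2000      First open source release (SourceForge)
1.0       October 2000
1.01b     June 2001       Last SourceForge release
1.2       June 2002       First Apache Jakarta release
1.3       December 2003   Compound index format, query parser enhancements, remote searching, token positioning, extensible scoring API
1.4       July 2004       Sorting, span queries, term vectors
1.4.1     August 2004     Bug fix for sorting performance
1.4.2     October 2004    Index searcher optimization and misc. fixes
1.4.3     Winter 2004     Misc. fixes

Note: Lucene's creator, Doug Cutting, has significant theoretical and practical experience in the field of IR. He has published a number of research papers on various IR topics and has worked for companies such as Excite, Apple, and Grand Central. Most recently, worried about the decreasing number of web search engines and a potential monopoly in that realm, he created Nutch, the first open-source World-Wide Web search engine; it is designed to handle crawling, indexing, and searching of several billion frequently updated web pages. Not surprisingly, Lucene is at the core of Nutch; section 10.1 includes a case study of how Nutch leverages Lucene.

Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene's move under the Apache Jakarta umbrella. At the time of this writing, Lucene's core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.

Who uses Lucene

Who doesn't? In addition to the organizations mentioned on the Powered by Lucene page on Lucene's wiki, a number of other large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's OpenCourseWare and DSpace, Akamai's EdgeComputing platform, and so on. Your name will be on this list soon, too.

Lucene ports: Perl, Python, C++, .NET, Ruby

One way to judge the success of open source software is by the number of times it has been ported to other programming languages. Using this metric, Lucene is quite a success! Although the original Lucene is written in Java, as of this writing Lucene has been ported to Perl, Python, C++, and .NET, and some groundwork has been done to port it to Ruby. This is excellent news for developers who need to access Lucene indices from applications written in different languages. You can learn more about some of these ports in chapter 9.

Indexing and searching

At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.

What is indexing, and why is it important?

Suppose you needed to search a large number of files, and you wanted to be able to find the files that contained a certain word or phrase. How would you go about writing a program to do this? A naive approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it does not scale to larger file sets or to cases where the files are very large. This is where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.

You can think of an index as a data structure that allows fast random access to the words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate the pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. We cover the structure of index files in detail in appendix B, but for now just think of a Lucene index as a tool that allows quick word lookup.

What is searching?

Searching is the process of looking up words in an index to find the documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out irrelevant documents. However, you must consider a number of other factors when thinking about searching. We have already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting is also important, as is a friendly syntax for entering those queries. Lucene's powerful software library offers a number of search features, bells, and whistles, so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).

Understanding Lucene

Different people are fighting the same problem, information overload, using different approaches. Some have been working on novel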
user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into action with code samples later in this chapter, we'll give you a high-level picture of what Lucene is, what it is not, and how it came to be.

What Lucene is

Lucene is a high performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it's a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. As such, Lucene is currently, and has been for a few years, the most popular free Java IR library.

As you'll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn't make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.

People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.

A number of full-featured search applications have been built on top of Lucene. If you're looking for something prebuilt or a framework for crawling, document handling, and searching, consult the Lucene Wiki "Powered by" page (/jakarta-lucene/poweredby) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.

What Lucene can do for you

Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make searchable any data that can
be converted to a textual format. As you can see in figure 1.5, Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.

Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide. Once you integrate Lucene, users of your applications can make searches such as +george +rice -eat -pudding, apple pie +tiger, animal:monkey AND food:banana, and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your Wiki pages, and the list goes on.

History of Lucene

Lucene was originally written by Doug Cutting; it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of high-quality open source Java products in September 2001. With each release since then, the project has enjoyed increased visibility, attracting more users and developers. As of July 2004, Lucene version 1.4 has been released, with a bug fix 1.4.2 release in early October. Table 1.1 shows Lucene's release history.

Table 1.1 Lucene's release history

Version   Release date    Milestones
0.01      March 2000      First open source release (SourceForge)
1.0       October 2000
1.01b     June 2001       Last SourceForge release
1.2       June 2002       First Apache Jakarta release
1.3       December 2003   Compound index format, query parser enhancements, remote searching, token positioning, extensible scoring API
1.4       July 2004       Sorting, span queries, term vectors
1.4.1     August 2004     Bug fix for sorting performance
1.4.2     October 2004    Index searcher optimization and misc. fixes
1.4.3     Winter 2004     Misc. fixes

Note: Lucene's creator, Doug Cutting, has significant theoretical and practical experience in the field of IR. He's published a number of research papers on various IR topics and has worked for companies such as Excite, Apple, and Grand Central. Most recently, worried about the decreasing number of web search engines and a potential monopoly in that realm, he created Nutch, the first open-source World-Wide Web search engine; it's designed to handle crawling, indexing, and searching of several billion frequently updated web pages. Not surprisingly, Lucene is at the core of Nutch; section 10.1 includes a case study of how Nutch leverages Lucene.

Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene's move under the Apache Jakarta umbrella. At the time of this writing, Lucene's core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.

Who uses Lucene

Who doesn't? In addition to those organizations mentioned on the Powered by Lucene page on Lucene's Wiki, a number of other large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's OpenCourseWare and DSpace, Akamai's EdgeComputing platform, and so on. Your name will be on this list soon, too.

Lucene ports: Perl, Python, C++, .NET, Ruby

One way to judge the success of open source software is by the number of times it's been ported to other programming languages. Using this metric, Lucene is quite a success! Although the original Lucene is written in Java, as of this writing Lucene has been ported to Perl, Python, C++, and .NET, and some groundwork has been done to port it to Ruby. This is excellent news for developers who need to access Lucene indices from applications written in different languages. You can learn more about some of these ports in chapter 9.

Indexing and searching

At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.

What is indexing, and why is it important?

Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or a phrase. How would you go about writing a program to do this? A naive approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn't scale to larger file sets or cases where files are very large. This is where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.

You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics.
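
The difference between sequential scanning and looking a word up in an index can be illustrated with a minimal Java sketch. This is not Lucene's API or its index file format. The class InvertedIndexSketch and its methods add, search, and scan are hypothetical names introduced only for illustration, and the tokenizer (lowercasing, then splitting on anything that is not a letter or digit) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration of an inverted index; not Lucene's implementation. */
public class InvertedIndexSketch {
    // Original texts, kept only so the naive scan has something to read.
    private final List<String> documents = new ArrayList<>();
    // The index itself: each word maps to the sorted ids of the documents containing it.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexing: tokenize the text once and record the document id under each word. */
    public int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Searching with the index: a single map lookup, independent of document sizes. */
    public List<Integer> search(String word) {
        SortedSet<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }

    /** Naive baseline: re-read and re-tokenize every document on every query. */
    public List<Integer> scan(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            if (tokenize(documents.get(docId)).contains(needle)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    // Assumed tokenizer: lowercase, split on runs of characters that are not letters or digits.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a search library");
        index.add("A web crawler fetches pages");
        index.add("Search engines build an index");
        // Both calls find the same documents; only the amount of work differs.
        System.out.println(index.search("search"));
        System.out.println(index.scan("search"));
    }
}
```

Both methods return the same document ids. The scan repeats the tokenizing work for every query, while the index pays that cost once at indexing time, which is the trade the passage above describes.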
