




Proceedings of the National Conference on Challenges & Opportunities in Information Technology (COIT-2008), RIMT-IET, Mandi Gobindgarh, March 29, 2008

Discussion on Web Crawlers of Search Engine

M.P.S. Bhatia*, Divya Gupta**
*Netaji Subhas Institute of Technology, University of Delhi, India
**Guru Prem Sukh Memorial College of Engineering, GGSIP University, Delhi

Abstract: With the precipitous expansion of the Web, extracting knowledge from the Web is becoming gradually more important and popular. This is due to the Web's convenience and its richness of information. To find Web pages, one typically uses search engines that are based on the Web crawling framework. This paper describes the basic tasks performed by a search engine and gives an overview of how Web crawlers are related to search engines.

Keywords: Distributed Crawling, Focused Crawling, Web Crawlers
I. INTRODUCTION

The WWW is a service that resides on computers that are connected to the Internet and allows end users to access data stored on those computers using standard interface software. The World Wide Web is the universe of network-accessible information, an embodiment of human knowledge.

A search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet. A search engine builds its database by accepting listings sent in by authors who want exposure, or by getting the information from its "Web crawlers, spiders, or robots," programs that roam the Internet storing links to and information about each page they visit [6]. A Web crawler is a program which fetches information from the World Wide Web in an automated manner. Web crawling [32] is an important research issue. Crawlers are software components which visit portions of Web trees, according to certain strategies, and collect the retrieved objects in local repositories [7].

The rest of the paper is organized as follows: in Section 2 we explain the background details of Web crawlers; in Section 3 we discuss the types of crawlers; in Section 4 we explain the working of a Web crawler; in Section 5 we cover two advanced techniques of Web crawlers; and in Section 6 we discuss the problem of selecting more interesting pages.

II. SURVEY OF WEB CRAWLERS
Web crawlers are almost as old as the Web itself [23]. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several papers about Web crawling were presented at the first two World Wide Web conferences [29, 24, 25]. However, at the time, the Web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's Web.

Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.
The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URL resolver process read the link file, derelativized the URLs contained therein, and saved the absolute URLs to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines.
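The derelativization step performed by the URL resolver can be illustrated with a minimal sketch (illustrative Python using the standard urllib.parse module, not code from the original system; the URLs are placeholders):

```python
from urllib.parse import urljoin, urldefrag

def derelativize(base_url, hrefs):
    """Convert relative link targets found on base_url into absolute URLs.

    Fragments (#section) are stripped because they point inside a page,
    not to a separate document.
    """
    absolute = []
    for href in hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        absolute.append(url)
    return absolute

# Example with links extracted from a hypothetical page.
print(derelativize("http://example.com/docs/index.html",
                   ["intro.html", "../about.html", "#top"]))
# -> ['http://example.com/docs/intro.html',
#     'http://example.com/about.html',
#     'http://example.com/docs/index.html']
```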
Research on Web crawling continues at Stanford even after Google was transformed into a commercial effort. The Stanford WebBase project has implemented a high-performance distributed crawler, capable of downloading 50 to 100 documents per second [21]. Cho and others have also developed models of document update frequencies to inform the download schedule of incremental crawlers [23].

The Internet Archive also used multiple machines to crawl the Web [26, 22]. Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler. Each single-threaded crawler process read a list of seed URLs for its assigned sites from disk into per-site queues, and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it. If a link referred to the site of the page it was contained in, it was added to the appropriate site queue; otherwise it was logged to disk. Periodically, a batch process merged these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.
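The per-site queueing policy described above can be sketched roughly as follows. This is a hypothetical simplification, not the Internet Archive's actual code; the site names are placeholders:

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical state for one single-threaded crawler process.
assigned_sites = {"example.org", "example.net"}            # sites owned by this process
site_queues = {site: deque() for site in assigned_sites}   # per-site URL queues
cross_site_log = []                                        # merged later by a batch job

def enqueue_links(extracted_links):
    """Route discovered links: same-site links go back into this process's
    per-site queues, everything else is logged as a cross-site URL."""
    for link in extracted_links:
        host = urlsplit(link).hostname
        if host in assigned_sites:
            site_queues[host].append(link)
        else:
            cross_site_log.append(link)

enqueue_links(["http://example.org/b.html", "http://other.com/c.html"])
```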
The WebFountain crawler shares several of Mercator's characteristics: it is distributed, continuous (the authors use the term "incremental"), polite, and configurable [28]. Unfortunately, as of this writing, WebFountain is in the early stages of its development, and data about its performance is not yet available.

III. BASIC TYPES OF SEARCH ENGINE

A. Crawler Based Search Engines

Crawler based search engines create their listings automatically: computer programs ("spiders") build them, not human selection [31]. They are not organized by subject categories; a computer algorithm ranks all pages. Such search engines are huge and often retrieve a lot of information; for complex searches they allow searching within the results of a previous search and enable you to refine the search results. These types of search engines contain the full text of the Web pages they link to, so one can find pages by matching words in the pages one wants [15].

B. Human Powered Directories

These are built by human selection, i.e. they depend on humans to create listings. They are organized into subject categories, and pages are classified by subject. Human powered directories never contain the full text of the Web pages they link to. They are smaller than most search engines [16].

C. Hybrid Search Engine

A hybrid search engine differs from a traditional text-oriented search engine such as Google or a directory-based search engine such as Yahoo, in which each program operates by comparing a set of metadata (the primary corpus being the metadata derived from a Web crawler or taxonomic analysis of all Internet text) and a user search query. In contrast, a hybrid search engine may use these two bodies of metadata in addition to one or more sets of metadata that can, for example, include situational metadata derived from the client's network that would model the context awareness of the client.

IV. WORKING OF A WEB CRAWLER
Web crawlers are an essential component of search engines, and running a Web crawler is a challenging task. There are tricky performance and reliability issues, and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially if one is crawling a site from multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel.

Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. The following is the process by which Web crawlers work:

1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.

The Web crawler can be used for crawling through a whole site on the Internet or an intranet. You specify a start URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can be seen as a tree structure: the root is the start URL, all links in that root HTML page are direct sons of the root, and subsequent links are then sons of the previous sons.

A single URL server serves lists of URLs to a number of crawlers. A Web crawler starts by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. It then parses those pages for new links, and so on, recursively. Web crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve Web pages at a fast enough pace. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is automate the process of following links.

Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page, it extracts links to other Web pages. The crawler then puts these URLs at the end of a queue and continues crawling from a URL that it removes from the front of the queue [1].
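A minimal sketch of this queue-driven loop is given below. It is illustrative Python only, not part of the paper; it assumes the third-party requests and BeautifulSoup libraries, and the seed URL is a placeholder. It follows the three steps listed above: download a page, parse out its links, and repeat for each new URL taken from the front of the queue.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests                  # third-party: pip install requests
from bs4 import BeautifulSoup    # third-party: pip install beautifulsoup4

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: new URLs are appended to the back of the queue
    and crawling continues from the front, as described in the text."""
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(seen) <= max_pages:     # rough cap on crawl size
        url = queue.popleft()                   # take URL from the front
        try:
            response = requests.get(url, timeout=10)   # 1. download the page
        except requests.RequestException:
            continue                            # skip unreachable pages
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):    # 2. retrieve all links
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)              # 3. repeat for each new link
    return seen

# crawl("http://example.com/")   # hypothetical start URL
```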
A. Resource Constraints

Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.

B. Robot Protocol

The robots.txt file gives directives for excluding a portion of a Web site from being crawled. Analogously, a simple text file can furnish information about the freshness and popularity of published objects. This information permits a crawler to optimize its strategy for refreshing collected data as well as its object replacement policy.
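As a concrete illustration of the robot protocol (a sketch, not part of the paper), Python's standard urllib.robotparser module can be used to honor a site's robots.txt before fetching a URL; the site, user agent, and path below are placeholders.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (hypothetical site).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# A polite crawler checks every candidate URL against the published
# rules before downloading it.
if rp.can_fetch("MyCrawler/1.0", "http://example.com/private/report.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```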
C. Meta Search Engine

A meta-search engine is a kind of search engine that does not have its own database of Web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial.

V. CRAWLING TECHNIQUES

A. Focused Crawling

A general purpose Web crawler gathers as many pages as it can from a particular set of URLs, whereas a focused crawler is designed to gather only documents on a specific topic, thus reducing the amount of network traffic and downloads. The goal of the focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date.

The focused crawler has three main components: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler with dynamically reconfigurable priority controls, which is governed by the classifier and distiller. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered out of the crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead [17].
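The harvest ratio can be made concrete with a small sketch (hypothetical Python; a deliberately simple keyword-overlap score stands in for the trained classifier a real focused crawler would use).

```python
def relevance(page_text, topic_terms):
    """Toy relevance score: fraction of topic terms occurring in the page.
    A real focused crawler would use a trained text classifier instead."""
    words = set(page_text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def harvest_ratio(crawled_pages, topic_terms, threshold=0.5):
    """Rate at which relevant pages are acquired: relevant / total crawled."""
    relevant = sum(relevance(text, topic_terms) >= threshold
                   for text in crawled_pages)
    return relevant / len(crawled_pages) if crawled_pages else 0.0

pages = ["web crawler architecture and url frontier",
         "recipe for tomato soup"]
print(harvest_ratio(pages, ["crawler", "url"]))   # -> 0.5
```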
B. Distributed Crawling

Indexing the Web is a challenge due to its growing and dynamic nature. As the size of the Web grows, it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time. A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly. When a single centralized crawler is used, all the fetched data passes through a single physical link. Distributing the crawling activity via multiple processes can help build a scalable and easily configurable system, which is also fault tolerant. Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion, that is, no central coordinator exists [3].
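One common way to split the load without a central coordinator is to partition the URL space among crawler processes, for example by hashing the host name. The sketch below is an assumption about how such a rule might look, not the scheme of any system cited in the paper; the process count and URLs are placeholders.

```python
import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 4   # hypothetical number of crawler processes

def owner(url, num_crawlers=NUM_CRAWLERS):
    """Deterministically map a URL's host to one crawler process, so every
    process can decide locally whether a discovered URL is its own work."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

for u in ["http://example.com/a", "http://example.org/b"]:
    print(u, "-> crawler", owner(u))
```

Hashing the host rather than the full URL keeps every page of a site on a single process, which also simplifies politeness control.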
VI. PROBLEM OF SELECTING MORE "INTERESTING" OBJECTS

A search engine is aware of hot topics because it collects user queries. The crawling process prioritizes URLs according to an importance metric such as similarity (to a driving query), back-link count, PageRank, or their combinations/variations [8], [9]. Recently Najork et al. showed that breadth-first search collects high-quality pages first and suggested a variant of PageRank [10]. However, at the moment, search strategies are unable to exactly select the "best" paths because their knowledge is only partial. Due to the enormous amount of information available on the Internet, a total crawl is at the moment impossible; thus, pruning strategies must be applied. Focused crawling [11, 12] and intelligent crawling [13] are techniques for discovering Web pages relevant to a specific topic or set of topics [14].
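As a worked illustration of one of these importance metrics, the following is a minimal power-iteration PageRank over a tiny hypothetical link graph (a sketch of the standard algorithm, not the exact variant used by any of the cited systems).

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, out_links in graph.items():
            if not out_links:                # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in out_links:
                    new_rank[target] += damping * rank[page] / len(out_links)
        rank = new_rank
    return rank

# Tiny hypothetical Web: A and B link to each other, C links to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```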
CONCLUSION

In this paper we conclude that complete Web crawling coverage cannot be achieved, due to the vast size of the whole WWW and to resource availability. Usually a kind of threshold is set up (number of visited URLs, level in the website tree, compliance with a topic, etc.) to limit the crawling process over a selected website. This information is used by search engines to store and refresh the most relevant and updated Web pages, thus improving the quality of retrieved contents while reducing stale content and missing pages.

REFERENCES

1. Garcia-Molina, Hector. Searching the Web. August 2001.
2. Grossan, B. "Search Engines: What they are, how they work, and practical suggestions for getting the most out of them," February 1997.
3. http://www.W
4. Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, 2003.
5. Pant, Gautam, Padmini Srinivasan and Filippo Menczer. Crawling the Web, 2003.
6. ~pant/Papers/crawling.pdf
7. Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data, 2003.
8. http://www.google.co.in/
9. Marina Buzzi. "Cooperative crawling," Proceedings of the First Latin American Web Congress (LA-WEB 2003), IEEE, 2003.
10. J. Cho, H. Garcia-Molina, L. Page. "Efficient Crawling through URL Ordering," WWW7.
24. Oliver A. McBryan. "GENVL and WWWW: Tools for Taming the Web," Proceedings of the First International World Wide Web Conference, pages 79-90, 1994.
25. Brian Pinkerton. "Finding What People Want: Experiences with the WebCrawler," Proceedings of the Second International World Wide Web Conference, 1994.
26. Mike Burner. "Crawling towards eternity: Building an archive of the World Wide Web," Web Techniques Magazine, 2(5), May 1997.
27. Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. "Efficient crawling through URL ordering."
- S. Chakrabarti, K. Punera, M. Subramanyam. "Accelerated focused crawling through online relevance feedback," WWW 2002, pp. 148-159.
- M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, M. Gori. "Focused Crawling Using Context Graphs," VLDB 2000, pp. 527-534.
- C. Aggarwal, F. Al-Garawi, P. Yu. "Intelligent crawling on the World Wide Web with arbitrary predicates," WWW 2001, pp. 96-105.
- C. Chung, C. Clarke. "Topic-oriented collaborative crawling," CIKM 2002.
- Brin, Sergey and Page, Lawrence. "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, April 1998.
- Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. "WebBase: A repository of Web pages," Proceedings of the Ninth International World Wide Web Conference, pages 277-293, May 2000.
- The Internet Archive, http://www.archive.org
- Martijn Koster. The Web Robots Pages, http://info.webcrawler.com/mak/projects/robots/robots.html