Original Foreign-Language Text

Efficient URL Caching for World Wide Web Crawling
Marc Najork
BMJ (International Edition), 2009

Abstract: Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
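To make the basic algorithm concrete, the following is a minimal sketch of the crawl loop in Python. It is an illustration only, not the crawler used in this paper: fetch and extract_urls are hypothetical helper functions, the frontier is a simple FIFO queue (i.e., a breadth-first traversal), and the "seen" test of step (c) is an ordinary in-memory set, which is precisely the part that stops scaling once a crawl reaches billions of URLs.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_pages=1000):
    """Minimal sketch of steps (a)-(c); fetch(url) and extract_urls(page)
    are assumed helpers, not part of the original paper."""
    frontier = deque(seed_urls)          # URLs waiting to be downloaded (BFS order)
    seen = set(seed_urls)                # step (c): the membership test, in memory for now
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch(url)                # step (a): fetch a page
        fetched += 1
        for link in extract_urls(page):  # step (b): parse out the linked URLs
            if link not in seen:         # step (c): only enqueue URLs not seen before
                seen.add(link)
                frontier.append(link)
    return seen
```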
If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.
1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.
2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
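A back-of-envelope calculation (a sketch using the figures quoted above, plus an assumed average number of links per page that is not a figure from this paper) shows where the rates of roughly 1,000 fetches and well over 10,000 membership tests per second come from:

```python
# Rough arithmetic behind the rates quoted above (illustrative only).
pages_per_day = 100_000_000        # target crawl rate for a reasonably fresh snapshot
seconds_per_day = 24 * 60 * 60

fetch_rate = pages_per_day / seconds_per_day
print(f"step (a): about {fetch_rate:,.0f} fetches per second")          # ~1,157

links_per_page = 10                # assumed average out-degree, not a figure from the paper
test_rate = fetch_rate * links_per_page
print(f"step (c): about {test_rate:,.0f} membership tests per second")  # ~11,574
```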
A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits into their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers a URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
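The following sketch illustrates the kind of host-based assignment Mercator uses: hash the URL's host component and map it to one of the crawler processes. The hash function and routing helpers here are illustrative stand-ins rather than Mercator's actual implementation; in particular, local_queue and send_to_peer are assumed interfaces, and the real system batches URLs destined for a peer over TCP instead of sending them one at a time.

```python
import hashlib
from urllib.parse import urlsplit

def responsible_crawler(url: str, num_crawlers: int) -> int:
    """Assign a URL to a crawler process by hashing its host component."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers

def route(url: str, my_id: int, num_crawlers: int, local_queue, send_to_peer):
    """Keep the URL if this process owns its host; otherwise hand it to the owner.

    local_queue.append(url) and send_to_peer(owner_id, url) are assumed interfaces.
    """
    owner = responsible_crawler(url, num_crawlers)
    if owner == my_id:
        local_queue.append(url)      # this process is responsible for the host
    else:
        send_to_peer(owner, url)     # a real crawler would batch these to cut TCP overhead
```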
Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning a URL to a C-proc based on a hash of the entire URL), site-based (assigning a URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning a URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers a URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and caching popular URLs in memory is a win: caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.
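The previous paragraph can be pictured as a two-level "seen" test: an in-memory cache of recently seen URLs is consulted first, and only cache misses fall through to the much slower disk-resident set. The sketch below is one possible arrangement, not the paper's implementation; disk_set is a hypothetical object with contains() and add() methods, and the cache shown here is a plain LRU purely for illustration.

```python
from collections import OrderedDict

class CachedSeenTest:
    """In-memory cache in front of a disk-based set of seen URLs (sketch)."""

    def __init__(self, disk_set, cache_size=50_000):
        self.disk_set = disk_set      # assumed: .contains(url) -> bool and .add(url)
        self.cache = OrderedDict()    # URL -> None, kept in least-recently-used order
        self.cache_size = cache_size

    def seen_before(self, url: str) -> bool:
        if url in self.cache:                     # hit: no disk access needed
            self.cache.move_to_end(url)
            return True
        on_disk = self.disk_set.contains(url)     # miss: consult the disk-based set
        if not on_disk:
            self.disk_set.add(url)                # record the newly discovered URL
        self.cache[url] = None                    # remember this URL in the cache
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)        # evict the least recently used URL
        return on_disk
```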
Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina's crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for them. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to another crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37].
In the web context, caching is often mentioned in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.
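As an example of this terminology, here is a sketch of one of the practical policies examined in this paper, CLOCK, which approximates LRU using a circular buffer and a single "recently used" bit per slot. The code is a generic illustration of the algorithm, not the paper's implementation (the data structures actually recommended for the URL setting are discussed in Section 6).

```python
class ClockCache:
    """CLOCK replacement policy for a cache of at most k items (sketch)."""

    def __init__(self, k: int):
        self.k = k
        self.slots = [None] * k      # cached items arranged in a circle
        self.used = [False] * k      # one "recently used" bit per slot
        self.index = {}              # item -> slot number
        self.hand = 0                # current position of the clock hand

    def request(self, item) -> bool:
        """Return True on a hit, False on a miss (the item is inserted on a miss)."""
        slot = self.index.get(item)
        if slot is not None:
            self.used[slot] = True   # hit: mark the slot as recently used
            return True
        # Miss: sweep the hand until a slot with a cleared bit is found,
        # clearing bits along the way, then evict whatever occupies that slot.
        while self.used[self.hand]:
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % self.k
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]
        self.slots[self.hand] = item
        self.index[item] = self.hand
        self.used[self.hand] = True
        self.hand = (self.hand + 1) % self.k
        return False
```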
Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:
- Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to a highly popular portal site is a much more common occurrence than a link to the authors' home pages.
- Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model: data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page (we found that on average about 30% are duplicates, cf. Section 4.2), and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.

We now describe some standard caching algorithms, whose performance we evaluate in Section 5.
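To connect the terminology above to the simulations reported later, the following sketch shows how the miss ratio of a policy can be measured by replaying a trace of URL requests. It only illustrates the evaluation methodology; the paper's own experiments replay a trace of over one billion requests from a real crawl. ClockCache refers to the illustrative sketch earlier in this section, and the trace can be any iterable of URL strings.

```python
def miss_ratio(policy_factory, trace, cache_size: int) -> float:
    """Replay a request trace against a cache policy and report its miss ratio."""
    cache = policy_factory(cache_size)
    misses = 0
    total = 0
    for url in trace:
        total += 1
        if not cache.request(url):   # request() returns True on a hit, False on a miss
            misses += 1
    return misses / total if total else 0.0

# Usage sketch: compare a few cache sizes on some URL trace (an iterable of strings).
# for k in (5_000, 50_000, 500_000):
#     print(k, miss_ratio(ClockCache, url_trace, k))
```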
Appendix B: Chinese Translation

Efficient URL Caching for Web Crawlers
BMJ (International Edition), 2009

Abstract: Crawling the web is very simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all the URLs, and (c) repeat (a)-(c) for every URL that has not been seen before. However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) turn this plan from a trivial programming exercise into a very serious algorithmic and system design challenge. In fact, these two factors alone mean that to crawl the web in a timely and complete manner, step (a) must be executed about 1,000 times per second, and therefore the membership test (c) must be executed more than 10,000 times per second, against a set far too large to keep in main memory. This calls for a distributed architecture, which makes the membership test even more complicated.

A very important way to speed up this test is to use a cache, that is, to keep a (dynamic) subset of the URLs already seen in main memory. The main contribution of this work is a careful study of several URL caching techniques for web crawling. We consider the practical algorithms: random replacement, static cache, LRU, and CLOCK, as well as the theoretical limits: clairvoyant caching and an infinite cache. We ran about 1,800 simulations, applying these algorithms with different cache sizes to real log data obtained from a very large 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setting, a cache of about 50,000 entries achieves a hit rate of almost 80%. Interestingly, this cache size lies at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings only a small additional benefit. We conjecture that this critical point is inherent to the problem and venture an explanation for the phenomenon.

1. Introduction

A Pew Foundation study states that "search engines have become an indispensable tool for Internet users" and estimates that by mid-2002 slightly more than half of all Americans had used web search to find information. Powerful search engine technology therefore has enormous practical value, and in this work we concentrate on one aspect of that technology, namely the process of collecting the web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained through a process called crawling or SPIDERing, which explores the web recursively. The basic algorithm is:
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)

Web crawls usually start from a set of seed URLs. Sometimes a crawl starts from a single well-connected page or from a directory, but in that case a relatively large portion of the web (estimated at over 20%) can never be reached.

If we view web pages as nodes in a graph and hyperlinks as directed edges moving between those nodes, then crawling becomes a process much like graph traversal in mathematics. Different traversal strategies decide which node to explore first and which to explore next. Two standard strategies are depth-first search and breadth-first search; they are easy to implement and are taught in many introductory algorithms courses.

However, crawling the web is not a trivial programming exercise but a very serious algorithmic and system design challenge, for the following two reasons:
1. The web is very large. Currently over 3 billion pages need to be indexed, and many studies show that, historically, the web has doubled every 9-12 months.
2. Web pages change very frequently. If "change" means any change at all, then about 40% of web pages change every week. Even if we only count pages that change by a third or more, about 7% of pages change every week.

These two factors mean that to obtain a timely and complete snapshot of the web, a search engine must crawl 100 million pages every day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be executed more than 10,000 times per second, against a set of URLs far too large to keep in main memory. In addition, web crawlers generally use a distributed architecture to crawl more pages in parallel, which further complicates the membership test.