現(xiàn)代信息檢索導(dǎo)論英文課件:13-WebSearchBasics_第1頁(yè)
現(xiàn)代信息檢索導(dǎo)論英文課件:13-WebSearchBasics_第2頁(yè)
現(xiàn)代信息檢索導(dǎo)論英文課件:13-WebSearchBasics_第3頁(yè)
現(xiàn)代信息檢索導(dǎo)論英文課件:13-WebSearchBasics_第4頁(yè)
現(xiàn)代信息檢索導(dǎo)論英文課件:13-WebSearchBasics_第5頁(yè)
已閱讀5頁(yè),還剩97頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、An Introduction to IR13th CourseChapter 19 Web search basicsWeb characteristicsSearch use (iProspect Survey, 4/04, http:/premiumPDFs/iProspectSurveyComplete.pdf)Without search engines the web wouldnt scaleNo incentive in creating content unless it can be easily found other finding methods havent kep

2、t pace (taxonomies, bookmarks, etc)The web is both a technology artifact and a social environment “The Web has become the “new normal” in the American way of life; those who dont go online constitute an ever-shrinking minority.” Pew Foundation report, January 2005Search engines make aggregation of i

3、nterest possible: Create incentives for very specialized niche players Economical specialized stores, providers, etc Social narrow interests, specialized communities, etc The acceptance of search interaction makes “unlimited selection” stores possible: Amazon, Netflix, etcSearch turned out to be the

4、 best mechanism for advertising on the web, a $15+ B industry. Growing very fast but entire US advertising industry $250B huge room to growSponsored search marketing is about $10BClassical IR vs. Web IRBasic assumptions of Classical Information RetrievalCorpus: Fixed document collectionGoal: Retriev

5、e documents with information content that is relevant to users information needClassic IR GoalClassic relevanceFor each query Q and stored document D in a given corpus assume there exists relevance Score(Q, D)Score is average over users U and contexts COptimize Score(Q, D) as opposed to Score(Q, D,

6、U, C)That is, usually: Context ignoredIndividuals ignoredCorpus predeterminedBad assumptionsin the web contextWeb IRThe coarse-level dynamicsContent creatorsContent aggregatorsFeedsCrawlsContent consumersAdvertisementEditorialSubscriptionTransactionBrief (non-technical) historyEarly keyword-based en

7、ginesAltavista, Excite, Infoseek, Inktomi, ca. 1995-1997Paid placement ranking: G (morphed into O Yahoo!)Your search ranking depended on how much you paidAuction for keywords: casino was expensive!Brief (non-technical) history1998+: Link-based ranking pioneered by GoogleBlew away all early engines s

8、ave InktomiGreat user experience in search of a business modelMeanwhile Goto/Overtures annual revenues were nearing $1 billionResult: Google added paid-placement “ads” to the side, independent of search resultsYahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)Algori

9、thmic results.AdsAds vs. search resultsGoogle has maintained that ads (based on vendors bidding for keywords) do not affect vendors rankings in search resultsSearch =mieleAds vs. search resultsOther vendors (Yahoo, MSN) have made similar statements from time to timeAny of them can change anytimeWe w

10、ill focus primarily on search results independent of paid placement adsAlthough the latter is a fascinating technical subject in itselfWeb search basicsThe WebAd indexesWeb spiderIndexerIndexesSearchUserUser NeedsNeed Brod02, RL04Informational want to learn about something (40% / 65%)Navigational wa

11、nt to go to that page (25% / 15%)Transactional want to do something (web-mediated) (35% / 20%)Access a serviceDownloads ShopGray areasFind a good hubExploratory search “see whats there” Low hemoglobinUnited AirlinesSeattle weatherMars surface imagesCanon S410 Car rental BrasilWeb search usersMake il

12、l defined queriesShort AV 2001: 2.54 terms avg, 80% 3 words)AV 1998: 2.35 terms avg, 88% 3 words Silv98Imprecise termsSub-optimal syntax (most queries without operator)Low effort Wide variance inNeedsExpectationsKnowledgeBandwidthSpecific behavior85% look over one result screen only (mostly above th

13、e fold)78% of queries are not modified (one query/session)Follow links “the scent of information” .Query DistributionPower law: few popular broad queries, many rare specific queriesHow far do people look for results?(Source: WhitePaper_2006_SearchEngineUserBehavior.pdf)True example* CorpusTASK Info

14、NeedQuery Verbal formResultsSEARCHENGINEQueryRefinementNoisy building fan in courtyardInfo about EPA regulations What are the EPA rules about noise pollutionEPA sound pollution Mis-conceptionMis-translationMis-formulationPolysemySynonimy* To Google or to GOTO, Business Week Online, September 28, 200

15、1 Users empirical evaluation of resultsQuality of pages varies widelyRelevance is not enoughOther desirable qualities (non IR!)Content: Trustworthy, new info, non-duplicates, well maintained,Web readability: display correctly & fastNo annoyances: pop-ups, etcPrecision vs. recallOn the web, recall se

16、ldom mattersWhat mattersPrecision at 1? Precision above the fold?Comprehensiveness must be able to deal with obscure queriesRecall matters when the number of matches is very smallUser perceptions may be unscientific, but are significant over a large aggregateUsers empirical evaluation of enginesRele

17、vance and validity of resultsUI Simple, no clutter, error tolerantTrust Results are objectiveCoverage of topics for poly-semic queriesPre/Post process tools providedMitigate user errors (auto spell check, syntax errors,)Explicit: Search within results, more like this, refine .Anticipative: related s

18、earchesDeal with idiosyncrasiesWeb specific vocabularyImpact on stemming, spell-check, etcWeb addresses typed in the search box Loyalty to a given search engine(iProspect Survey, 4/04)The Web corpusNo design/co-ordinationDistributed content creation, linking, democratization of publishingContent inc

19、ludes truth, lies, obsolete information, contradictions Unstructured (text, html, ), semi-structured (XML, annotated photos), structured (Databases)Scale much larger than previous text corpora but corporate records are catching up.Growth slowed down from initial “volume doubling every few months” bu

20、t still expandingContent can be dynamically generatedThe WebThe Web: Dynamic contentA page without a static html versionE.g., current status of flight AA129Current availability of rooms at a hotelUsually, assembled at the time of a request from a browserTypically, URL has a ? character in itApplicat

21、ion serverBrowserAA129Back-enddatabasesDynamic contentMost dynamic content is ignored by web spidersMany reasons including malicious spider trapsSome dynamic content (news stories from subscriptions) are sometimes delivered as dynamic contentApplication-specific spideringSpiders commonly view web pa

22、ges just as Lynx (a text browser) wouldNote: even “static” pages are typically assembled on the fly (e.g., headers are common)The web: sizeWhat is being measured?Number of hostsNumber of (static) html pagesVolume of dataNumber of hosts netcraft survey/archives/web_server_survey.htmlMonthly report on

23、 how many web hosts & servers are out thereNumber of pages numerous estimates (will discuss later)Netcraft Web Server Survey/archives/web_server_survey.html The web: evolutionAll of these numbers keep changingRelatively few scientific studies of the evolution of the web Fetterly & al, 2003/research/

24、sv/sv-pubs/p97-fetterly/p97-fetterly.pdfSometimes possible to extrapolate from small samples (fractal models) Dill & al, 2001/conf/2001/P069.pdfRate of changeCho00 720K pages from 270 popular sites sampled daily from Feb 17 Jun 14, 1999 Any changes: 40% weekly, 23% dailyFett02 Massive study 151M pag

25、es checked over few monthsSignificant changed - 7% weeklySmall changes 25% weeklyNtul04 154 large sites re-crawled from scratch weekly8% new pages/week 8% die5% new content25% new links/week Static pages: rate of changeFetterly et al. study (2002): several views of data, 150 million pages over 11 we

26、ekly crawlsBucketed into 85 groups by extent of changeOther characteristicsSignificant duplicationSyntactic 30%-40% (near) duplicates Brod97, Shiv99b, etc.Semantic ?High linkage More than 8 links/page in the average Complex graph topologyNot a small world; bow-tie structure Brod00SpamBillions of pag

27、esSpamSearch Engine OptimizationThe trouble with paid placementIt costs money. Whats the alternative?Search Engine Optimization:“Tuning” your web page to rank highly in the search results for select keywordsAlternative to paying for placementThus, intrinsically a marketing functionPerformed by compa

28、nies, webmasters and consultants (“Search engine optimizers”) for their clientsSome perfectly legitimate, some very shadySimplest formsFirst generation engines relied heavily on tf/idf The top-ranked pages for the query maui resort were the ones containing the most mauis and resortsSEOs responded wi

29、th dense repetitions of chosen termse.g., maui resort maui resort maui resort Often, the repetitions would be in the same color as the background of the web pageRepeated terms got indexed by crawlersBut not visible to humans on browsersPure word density cannot be trusted as an IR signalVariants of k

30、eyword stuffingMisleading meta-tags, excessive repetitionHidden text with colors, style sheet tricks, etc.Meta-Tags = “ London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, ”Search engine optimization (Spam)MotivesCommercial, political, religio

31、us, lobbiesPromotion funded by advertising budgetOperatorsContractors (Search Engine Optimizers) for lobbies, companiesWeb mastersHosting servicesForumsE.g., Web master world ( )Search engine specific tricks Discussions about academic papers CloakingServe fake content to search engine spiderDNS cloa

32、king: Switch IP address. Impersonate Is this a SearchEngine spider?YNSPAMRealDocCloakingThe spam industryMore spam techniquesDoorway pagesPages optimized for a single keyword that re-direct to the real target pageLink spammingMutual admiration societies, hidden links, awards more on these laterDomai

33、n flooding: numerous domains that point or re-direct to a target page RobotsFake query stream rank checking programs“Curve-fit” ranking programs of search enginesMillions of submissions via Add-UrlThe war against spamQuality signals - Prefer authoritative pages based on:Votes from authors (linkage s

34、ignals)Votes from users (usage signals) Policing of URL submissionsAnti robot test Limits on meta-keywords Robust link analysisIgnore statistically implausible linkage (or text)Use link analysis to detect spammers (guilt by association)Spam recognition by machine learningTraining set based on known

35、spamFamily friendly filtersLinguistic analysis, general classification techniques, etc.For images: flesh tone detectors, source text analysis, etc.Editorial interventionBlacklistsTop queries auditedComplaints addressedSuspect pattern detectionMore on spamWeb search engines have policies on SEO pract

36、ices they tolerate/block/help/us/ysearch/index.html /intl/en/webmasters/ Adversarial IR: the unending (technical) battle between SEOs and web search enginesResearch http:/Answering “the need behind the query” Semantic analysisQuery language determinationAuto filtering Different ranking (if query in

37、Japanese do not return English)Hard & soft (partial) matchesPersonalities (triggered on names)Cities (travel info, maps)Medical info (triggered on names and/or results)Stock quotes, news (triggered on stock symbol)Company infoEtc. Natural Language reformulationIntegration of Search and Text Analysis

38、The spatial context - geo-search Two aspectsGeo-coding - encode geographic coordinates to make search effectiveGeo-parsing - the process of identifying geographic context. Geo-codingGeometrical hierarchy (squares)Natural hierarchy (country, state, county, city, zip-codes, etc)Geo-parsingPages (infer

39、 from phone nos, zip, etc). About 10% can be parsed.Queries (use dictionary of place names) UsersExplicit (tell me your location - used by NL, registration, from ISP)From IP dataMobile phones In its infancy, many issues (display size, privacy, etc)Yahoo!: britney spearsAsk Jeeves: las vegasYahoo!: s

40、alvador hotelsYahoo shortcutsVarious types of queries that are “understood” Google andrei broder new yorkAnswering “the need behind the query”: ContextContext determination spatial (user location/target location)query stream (previous queries)personal (user profile) explicit (user choice of a vertic

41、al search, ) implicit (use Google from France, use google.fr)Context useResult restrictionKill inappropriate resultsRanking modulationUse a “rough” generic ranking, but personalize laterGoogle: dentists bronxYahoo!: dentists (bronx)Query expansionContext transferNo transferContext transferTransfer f

42、rom search resultsSize of the webEstimating web size and search engine index sizeWhat is the size of the web ?IssuesThe web is really infinite Dynamic content, e.g., calendar Soft 404: / is a valid pageStatic web contains syntactic duplication, mostly due to mirroring (30%)Some servers are seldom co

43、nnectedWho cares?Media, and consequently the userEngine designEngine crawl policy. Impact on recall.What can we attempt to measure?The relative sizes of search engines The notion of a page being indexed is still reasonably well defined.Already there are problemsDocument extension: e.g. engines index

44、 pages not yet crawled, by indexing anchortext.Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.) The coverage of a search engine relative to another particular crawling process.New definition?The statically indexable web is whatever search engines

45、index.Different engines have different preferences max url depth, max count/host, anti-spam rules, priority rules, etc.Different engines index different things under the same URL:frames, meta-keywords, document restrictions, document extensions, .Statistical methods Random queries Random searches Ra

46、ndom IP addresses Random walksA B = (1/2) * Size AA B = (1/6) * Size B(1/2)*Size A = (1/6)*Size B Size A / Size B = (1/6)/(1/2) = 1/3 Sample URLs randomly from ACheck if contained in Band vice versa A B Each test involves: (i) Sampling (ii) CheckingRelative Size from Overlap Bharat & Broder, 98Sampl

47、ing URLsIdeal strategy: Generate a random URL and check for containment in each index. Problem: Random URLs are hard to find! Enough to generate a random URL contained in a given Engine. Key lesson from this lecture.Random URLs from random queries Bharat & B, 98 Generate random query: how? Lexicon:

48、400,000+ words from a crawl of Yahoo! Conjunctive Queries: w1 and w2e.g., vocalists AND rsiGet 100 result URLs from the source engineChoose a random URL as the candidate to check for presence in other engines.This distribution induces a probability weight W(p) for each page. Conjecture:W(SE1) / W(SE

49、2) |SE1| / |SE2|Query Based CheckingStrong Query to check for a document D: Download document. Get list of words. Use 8 low frequency words as AND queryCheck if D is present in result set.Problems:Near duplicatesFramesRedirectsEngine time-outsMight be better to use e.g. 5 distinct conjunctive querie

50、s of 6 words each.Computing Relative Sizes and Total Coverage BB98Six pair-wise overlapsfah* a - fha* h = e1fai* a - fia* i = e2fae* a - fea* e = e3fhi* h - fih* i = e4fhe* h - feh* e = e5fei* e - fie* i = e6Arbitrarily, let a = 1. We have 6 equations and 3 unknowns.Solve for e, h and i to minimize

51、S ei2 Compute engine overlaps.Re-normalize so that the total joint coverage is 100%a = AltaVista, e = Excite, h = HotBot, i = Infoseekfxy = fraction of x in yAdvantages & disadvantagesStatistically sound under the induced weight.Biases induced by random query Query Bias: Favors content-rich pages in

52、 the language(s) of the lexiconRanking Bias: Solution: Use conjunctive queries & fetch allChecking Bias: Duplicates, impoverished pages omittedDocument or query restriction bias: engine might not deal properly with 8 words conjunctive queryMalicious Bias: Sabotage by engine Operational Problems: Tim

53、e-outs, failures, engine inconsistencies, index modification.Random searchesChoose random searches extracted from a local log Lawrence & Giles 97 or build “random searches” NotessUse only queries with small results sets. Count normalized URLs in result sets.Use ratio statisticsAdvantages & disadvant

54、agesAdvantageMight be a better reflection of the human perception of coverageIssuesSamples are correlated with source of logDuplicatesTechnical statistical problems (must have non-zero results, ratio average, use harmonic mean?)Random searches Lawr98, Lawr99575 & 1050 queries from the NEC RI employe

55、e logs6 Engines in 1998, 11 in 1999Implementation:Restricted to queries with 80% = Documents are “near duplicates”Not transitive, though sometimes used transitivelyComputing SimilarityFeatures:Segments of a document (natural or artificial breakpoints)Shingles (Word N-Grams)a rose is a rose is a rose

56、 a_rose_is_a rose_is_a_rose is_a_rose_is Similarity Measure between two docs (= sets of shingles)Set intersection Brod98 (Specifically, Size_of_Intersection / Size_of_Union )Jaccard measureShingles + Set Intersection Computing exact set intersection of shingles between all pairs of documents is expe

57、nsive/intractable; Approximate using a cleverly chosen subset of shingles from each doc - a sketch Estimate (size_of_intersection / size_of_union) based on a short sketch of each of a doc pairSketch of a documentCreate a “sketch vector” (of size 200) for each documentFor doc D, sketchD i is as follo

58、ws:Let f map all shingles in the universe to 0.2m (e.g., f = fingerprinting/hashing)Let pi be a random permutation on 0.2mPick MIN pi(f(s) over all shingles s in DDocuments that share t (say 80%) corresponding vector elements are near duplicatesComputing Sketchi for Doc1Document 1264264264264Start w

59、ith 64-bit f(shingles)Permute on the number linewith piPick the min valueTest if Doc1.Sketchi = Doc2.Sketchi Document 1Document 2264264264264264264264264Are these equal?Test for 200 random permutations: p1, p2, p200 to see the %ABHoweverDocument 1Document 2264264264264264264264264A = B iff the shing

60、le with the MIN value in the union of Doc1 and Doc2 is common to both (I.e., lies in the intersection)This happens with probability: Size_of_intersection / Size_of_unionBAWhy?Set SimilaritySet Similarity (Jaccard measure)View sets as columns of a matrix; one row for each element in the universe. aij

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論