9 [IR] Web information retrieval
Web and Web search
anchor text
a million pieces of anchor text with "ibm" send a strong signal
indexing anchor text
when indexing a document D, include anchor text from links pointing to D
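A minimal sketch of this idea, assuming a toy inverted index that maps each term to a set of URLs. The names `build_index` and `GENERIC_ANCHORS`, and the example URLs, are hypothetical and chosen only for illustration. Anchor terms are filed under the link target, and generic anchors such as "click here" are skipped because they carry no topical signal.

```python
from collections import defaultdict

# Assumption: a tiny hand-picked list of uninformative anchors.
GENERIC_ANCHORS = {"click here"}

def build_index(pages, links):
    """pages: {url: body text}; links: iterable of (source, target, anchor_text)."""
    index = defaultdict(set)  # term -> set of URLs
    for url, body in pages.items():
        for term in body.lower().split():
            index[term].add(url)
    # Anchor text is indexed under the link *target* (document D),
    # not under the page that contains the link.
    for _source, target, anchor in links:
        anchor = anchor.lower().strip()
        if anchor in GENERIC_ANCHORS:
            continue  # skip anchors that say nothing about the target
        for term in anchor.split():
            index[term].add(target)
    return index

pages = {
    "a.example": "big blue computers",
    "b.example": "hardware news",
}
links = [
    ("b.example", "a.example", "IBM"),
    ("b.example", "a.example", "click here"),
]
index = build_index(pages, links)
```

Here `a.example` becomes findable under "ibm" even though the word never appears in its own body text.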
many times anchor text is not useful
"click here"
Bowtie theory of the web
Web crawler
aka: web spider, web robot, web scutter
main idea
use known sites as the seeds or starting points
download information from these sites
follow the links from each to other sites
repeat the download and link-following steps for each newly discovered site
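The steps above can be sketched as a breadth-first traversal. This is an illustrative sketch, not a production crawler: `crawl`, `fetch_links`, and `toy_web` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access. A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which is shown here.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch_links(url) is assumed to download the page and return its out-links.
    """
    frontier = deque(seeds)   # pages waiting to be downloaded
    seen = set(seeds)         # pages already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)               # "download" the page
        for link in fetch_links(url):     # follow its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the web (assumption for the example).
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl(["a"], lambda url: toy_web.get(url, []))
```

The `seen` set is what keeps the loop from cycling forever on links such as `c -> a`.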
Link analysis
Exploiting hyperlink structure of web pages to find relevant and important pages for a user query
Assumptions
Hyperlink from page A to page B is a recommendation of page B from the author of page A
If page A and page B are connected by a hyperlink, they might be on the same topic
used for crawling, ranking, computing the geographic scope of a web page, finding mirrored hosts, computing statistics of web pages and search engines, web page categorization
Popular methods
Hypertext Induced Topic Search (HITS)
two components
Authority
An authority is a page with many in-links
Hub
A hub is a page with many out-links
Strength
It ranks pages according to the query topic, so it may return more relevant authority and hub pages
Weaknesses
It is easily spammed
Topic drift
Inefficiency at query time
Query-time evaluation is slow: collecting the root set, expanding it, and performing the eigenvector computation are all expensive operations
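The eigenvector computation in HITS is commonly done by power iteration, where authority and hub scores reinforce each other: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below assumes the expanded page set has already been collected and is given as an adjacency dictionary. The names `hits`, `normalize`, and `toy_graph` are hypothetical, and a fixed iteration count stands in for a proper convergence test.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}

def hits(graph, iterations=50):
    """graph: {page: [pages it links to]}. Returns (authority, hub) dicts."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub update: sum the authority scores of pages linked to.
        new_hub = {n: sum(new_auth[t] for t in graph.get(n, [])) for n in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Toy graph (assumption): h1 and h2 act as hubs, a1 and a2 as authorities.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(toy_graph)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Because this loop has to run on a freshly built page set for each query, it illustrates why the query-time cost noted above is a concern.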
PageRank