The Deep Web: Surfacing Hidden Value
--Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it.
--Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request.
--Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter technique returns the bulk of the listings.
--Cross-referencing Web sites against one another gives better results (e.g., Google's link-based ranking).
--BrightPlanet's technology is a "directed-query engine."
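As a rough illustration of the directed-query idea, the sketch below posts a query straight to a site's search form instead of following hyperlinks; the endpoint https://example.com/search and its parameter name "q" are purely hypothetical placeholders.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    def directed_query(query):
        """Send a query directly to a (hypothetical) deep-Web search form and
        return the dynamically generated result page."""
        # The form URL and the "q" parameter name are assumptions for illustration.
        url = "https://example.com/search?" + urlencode({"q": query})
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    # A directed-query engine would issue many such queries across many
    # deep-Web sources and aggregate the dynamically generated result pages.
    # results_html = directed_query("hidden value")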
--The deep Web is about 500 times larger than the surface Web and, on a per-document basis, about three times higher in quality according to BrightPlanet's document-scoring methods.
--Serious information seekers can no longer avoid the importance or quality of deep Web information. But deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web.
How Things Work: Web Search Engines, Part 1
--Within a data center, clusters or individual servers can be dedicated to specialized functions, such as crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content.
--Currently, the amount of Web data that search engines crawl and index is on the order of 400 TB, placing heavy loads on server and network infrastructure.
--The crawler initializes the queue with one or more seed URLs. A good seed URL will link to many high-quality Web sites.
--Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. Crawling continues until the queue is empty.
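A minimal sketch of that queue-based crawl loop, using only the Python standard library; it deliberately ignores the speed, politeness, exclusion, duplicate-content, and spam issues listed next, and the names and limits are illustrative.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        """Queue-based crawl: fetch a page, extract links, enqueue unseen URLs."""
        queue = deque(seed_urls)        # initialize the queue with seed URLs
        seen = set(seed_urls)           # previously seen URLs are never re-queued
        pages = {}                      # saved page content, keyed by URL
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as resp:    # HTTP request for the page
                    html = resp.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                                  # skip unreachable or bad URLs
            pages[url] = html                             # save content for indexing
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)             # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    # pages = crawl(["https://example.com/"])   # example seed, placeholder URL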
--The simple crawling algorithm must be extended to address the following issues:
----Speed
----Politeness
----Excluded Content: the crawler fetches the site's robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled (see the robots.txt sketch after this list).
----Duplicate Content
----Continuous crawling: carrying out full crawls at fixed intervals would mean a slow response to important changes on the Web.
----Spam rejection: primitive spamming techniques include inserting misleading keywords, invisible to the viewer, into pages.
------Spammers also engage in cloaking, the process of delivering different content to crawlers than to site visitors.
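A minimal sketch of the excluded-content check mentioned above, using Python's standard urllib.robotparser; the user-agent string "ExampleCrawler" is a placeholder.

    from urllib import robotparser
    from urllib.parse import urljoin, urlparse

    def allowed_by_robots(url, user_agent="ExampleCrawler"):
        """Return True if the site's robots.txt permits fetching this URL."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(root, "/robots.txt"))   # robots.txt sits at the site root
        try:
            rp.read()                              # fetch and parse the exclusion rules
        except OSError:
            return True                            # no reachable robots.txt: assume allowed
        return rp.can_fetch(user_agent, url)

    # The crawl loop would call this before fetching each queued URL, e.g.:
    # if allowed_by_robots(url): ... fetch the page ...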
Web Search Engines: Part 2
--Search engines use an inverted file to rapidly identify the documents that contain each indexing term.
--An indexer can create an inverted file in two phases: scanning and inversion.
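A minimal sketch of that two-phase construction: the scan emits (term, document ID) pairs, and the inversion sorts them into per-term postings lists. The whitespace tokenization is deliberately naive.

    from collections import defaultdict

    def build_inverted_file(docs):
        # Phase 1: scanning -- emit a (term, doc_id) pair for every occurrence.
        occurrences = []
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                occurrences.append((term, doc_id))
        # Phase 2: inversion -- sort by term and group into postings lists.
        index = defaultdict(list)
        for term, doc_id in sorted(occurrences):
            postings = index[term]
            if not postings or postings[-1] != doc_id:   # skip duplicate doc IDs
                postings.append(doc_id)
        return index

    docs = ["the deep web is large", "search engines crawl the web"]
    index = build_inverted_file(docs)
    print(index["web"])   # -> [0, 1]: the term "web" occurs in both documents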
--Scaling up: search engines scale by document partitioning, splitting the collection across many servers so that each server indexes and searches only a subset of the documents.
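A small sketch of document partitioning under simple assumptions: each partition holds its own inverted file over a subset of the documents, and a broker scatters the query to every partition and merges the postings it gets back.

    def partition_documents(docs, num_partitions):
        """Assign documents to partitions, here simply by document ID modulo."""
        subsets = [[] for _ in range(num_partitions)]
        for doc_id, text in enumerate(docs):
            subsets[doc_id % num_partitions].append((doc_id, text))
        return subsets

    def broker_search(term, partition_indexes):
        """Scatter the query to each partition's local index and merge results."""
        merged = []
        for local_index in partition_indexes:    # in practice, parallel requests
            merged.extend(local_index.get(term, []))
        return sorted(merged)

    # Each partition would build a local inverted file over its own subset,
    # e.g. with the build_inverted_file sketch above, keyed by global doc IDs.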
--Term lookup: the Web's vocabulary is unexpectedly large, containing hundreds of millions of distinct terms.
--Compression: indexers can reduce demands on disk space and memory by using compression algorithms for key data structures.
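One common compression technique for postings lists (an illustrative choice here, not necessarily the exact scheme the article describes) is to store the gaps between sorted document IDs and encode each gap in a variable number of bytes.

    def vbyte_encode(gaps):
        """Variable-byte encode a list of positive integers (doc-ID gaps)."""
        out = bytearray()
        for n in gaps:
            chunk = []
            while True:
                chunk.append(n & 0x7F)   # take the low 7 bits
                n >>= 7
                if n == 0:
                    break
            chunk[0] |= 0x80             # flag the final (lowest-order) byte
            out.extend(reversed(chunk))  # emit high-order bytes first
        return bytes(out)

    def compress_postings(doc_ids):
        """Turn sorted doc IDs into gaps, then variable-byte encode the gaps."""
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        return vbyte_encode(gaps)

    postings = [3, 7, 8, 150, 152]
    print(len(compress_postings(postings)))   # 6 bytes instead of five full integers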
--Phrases: special indexing tricks permit a more rapid response to phrase queries.
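One such trick (again an illustrative assumption rather than the article's specific method) is a positional index, which records where each term occurs in a document so that adjacent positions can confirm a phrase.

    from collections import defaultdict

    def build_positional_index(docs):
        """Map each term to {doc_id: [positions]} so phrases can be verified."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in enumerate(docs):
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    def phrase_match(index, first, second):
        """Doc IDs where `second` occurs immediately after `first`."""
        hits = []
        for doc_id, positions in index[first].items():
            following = set(index[second].get(doc_id, []))
            if any(p + 1 in following for p in positions):
                hits.append(doc_id)
        return hits

    docs = ["deep web search", "search the deep blue sea"]
    idx = build_positional_index(docs)
    print(phrase_match(idx, "deep", "web"))   # -> [0]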
--Anchor text: Web browsers highlight words in a Web page to indicate the presence of a link that users can click on; search engines also index this anchor text with the page the link points to.
--Link popularity score: based on the frequency of incoming links to a page.
--Query-independent score: a ranking signal computed for each page without reference to any particular query.
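A minimal sketch of a query-independent, link-popularity style score, here just the count of incoming links in a toy link graph (a stand-in for more elaborate measures such as PageRank); the file names are placeholders.

    from collections import Counter

    def link_popularity(link_graph):
        """Score each page by how many other pages link to it (incoming links)."""
        scores = Counter()
        for source, targets in link_graph.items():
            for target in set(targets):      # count each linking page once
                if target != source:         # ignore self-links
                    scores[target] += 1
        return scores

    link_graph = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": ["a.html"],
    }
    print(link_popularity(link_graph))   # c.html has the most incoming links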