To answer this, first understand how people subjectively judge whether a page is important (put yourself in their place). In fact, it comes down to nothing more than the following:
As we all know, an important page has a history of accumulated weight (an old, high-quality, well-qualified domain); many people point to it (inbound links); many people cite it (reprints or mirrors); it can be reached quickly (it sits at a shallow level of the site); and it often carries new content (it is updated), and so on.
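To make the list above concrete, here is a minimal sketch that folds those signals into one number. The `PageSignals` fields, the `importance` function, and every weight are assumptions made up for illustration; the article does not say how any real search engine weighs these signals.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    domain_age_years: float   # accumulated history of the domain
    inbound_links: int        # other pages linking to this one
    copies: int               # reprints or mirrors of the page
    depth: int                # clicks from the home page (0 = home page)
    updates_per_month: float  # how often new content appears


def importance(p: PageSignals) -> float:
    """Combine the signals into one score; the weights are arbitrary placeholders."""
    return (
        1.0 * min(p.domain_age_years, 10)  # cap the age so it cannot dominate
        + 0.5 * p.inbound_links
        + 0.3 * p.copies
        + 2.0 / (1 + p.depth)              # shallower pages score higher
        + 0.2 * p.updates_per_month
    )


old_home = PageSignals(domain_age_years=8, inbound_links=40, copies=5,
                       depth=0, updates_per_month=10)
new_deep = PageSignals(domain_age_years=1, inbound_links=2, copies=0,
                       depth=4, updates_per_month=1)
print(importance(old_home) > importance(new_deep))
```

The only point is that an old, well-linked, shallow, frequently updated page comes out ahead of a new, deep, rarely linked one.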
Data collection is the phase in which web pages are gathered from the vast ocean of the Internet into the search engine's own database for storage.
With so much data to process, many questions have to be settled in advance. For example, should data be "captured on demand" or "fetched ahead of time"? And when maintaining the data, should the engine use a "regular crawl" (a periodic deep crawl that replaces the existing data) or an "incremental crawl" (taking the existing data as the foundation, to turn
A spider crawls along links and fetches pages. How to quickly fetch the information that matters most to users while still achieving broad coverage is undoubtedly the key problem a search engine has to consider.
First, look at a search engine's simple "sanbanfu" (its three basic moves): data collection -> preprocessing [indexing] -> ranking.
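The three stages can be sketched in a few lines. This is a toy, not how a production engine works: the `PAGES` dictionary stands in for whatever a spider has collected, and the ranking simply sums word counts, which is an assumption chosen to keep the example short.

```python
from collections import Counter, defaultdict

# Step 1, data collection: a hypothetical stand-in for pages a spider has stored.
PAGES = {
    "p1": "search engine spider crawls links",
    "p2": "spider spider follows links between pages",
    "p3": "ranking orders pages for a query",
}


def build_index(pages):
    """Step 2, preprocessing: map each word to {page id: occurrence count}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][page_id] = count
    return index


def rank(index, query):
    """Step 3, ranking: order pages by the summed counts of the query words."""
    scores = Counter()
    for word in query.lower().split():
        for page_id, count in index.get(word, {}).items():
            scores[page_id] += count
    return [page_id for page_id, _ in scores.most_common()]


index = build_index(PAGES)
print(rank(index, "spider links"))  # → ['p2', 'p1']
```

Page `p2` comes first because it mentions "spider" twice; `p3` contains neither query word and is left out.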
Crawl maintenance strategy
An SEOer who does not understand how search engines work is running naked in SEO.
2. Link tracking
Information coverage comes down to the two strategies a spider uses when following links: depth-first crawling and breadth-first crawling.
It takes no thought to see that breadth-first crawling helps gather more information, while depth-first crawling helps gather more complete information. When fetching data, search engine spiders usually use both, but comparatively speaking, breadth-first crawling is used more than depth-first crawling.
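The difference between the two strategies is easiest to see in the order pages get fetched. Below is a small sketch over a made-up link graph (`LINKS` and the page names are hypothetical); the same loop does either strategy, depending on which end of the frontier it takes the next page from.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "blog"],
    "news": ["news/1", "news/2"],
    "blog": ["blog/1"],
    "news/1": [],
    "news/2": [],
    "blog/1": ["home"],  # a link back to the start; must not cause a loop
}


def crawl(seed: str, breadth_first: bool) -> list[str]:
    """Return pages in the order a spider would fetch them."""
    frontier = deque([seed])
    seen = set()
    order = []
    while frontier:
        # Breadth-first takes the oldest discovered page (queue);
        # depth-first takes the newest one (stack).
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        links = LINKS.get(page, [])
        # Reverse for depth-first so the first link on a page is followed first.
        for link in (links if breadth_first else reversed(links)):
            if link not in seen:
                frontier.append(link)
    return order


print(crawl("home", breadth_first=True))
# → ['home', 'news', 'blog', 'news/1', 'news/2', 'blog/1']
print(crawl("home", breadth_first=False))
# → ['home', 'news', 'news/1', 'news/2', 'blog', 'blog/1']
```

Breadth-first touches every section of the site before going deeper, while depth-first finishes the whole "news" branch before it ever looks at "blog".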
When a search engine is first set up, a seed list must be entered by hand; otherwise the link-tracking spider has nowhere to start. From the seed list, the spider follows links onward.
Well, before the small talk ends, one more aside: China's first index-based search engine was Peking University's Skynet (Tianwang).
In the link-tracking stage, the only signal actually available is how quickly the page can be reached (its shallow level); the other signals cannot yet be obtained.
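If depth is the only signal known while following links, one plausible way to use it is to fetch the shallowest known pages first and stop when a page budget runs out. The sketch below assumes exactly that; the `SITE` graph, the `budget` parameter, and the function name are hypothetical, and real spiders may schedule quite differently.

```python
import heapq

# Hypothetical site structure: each URL maps to the URLs it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/a/1", "/b/1"],
    "/a/1": ["/a/1/x"],
    "/b/1": [],
    "/a/1/x": [],
}


def crawl_shallow_first(seed: str, budget: int) -> list[tuple[str, int]]:
    """Fetch up to `budget` pages, always taking the shallowest known page next."""
    heap = [(0, seed)]       # entries are (depth, url); the smallest depth pops first
    best_depth = {seed: 0}   # shallowest depth at which each URL has been seen
    fetched = []
    while heap and len(fetched) < budget:
        depth, url = heapq.heappop(heap)
        if depth > best_depth[url]:
            continue  # stale entry: a shallower path to this URL was found later
        fetched.append((url, depth))
        for link in SITE.get(url, []):
            if link not in best_depth or depth + 1 < best_depth[link]:
                best_depth[link] = depth + 1
                heapq.heappush(heap, (depth + 1, link))
    return fetched


print(crawl_shallow_first("/", budget=4))
# → [('/', 0), ('/a', 1), ('/b', 1), ('/a/1', 2)]
```

With a budget of four pages, the deep page `/a/1/x` is never fetched, which is the practical consequence of treating shallow pages as the important ones.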
First, how to capture important information.