<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-17485143</id><updated>2011-12-13T19:58:49.456-08:00</updated><title type='text'>Search Engine Robots How They Work, What They Do.</title><subtitle type='html'>Search engines acquire, store, crawl, index, and organize all that data to help you find what you're looking for. Examples, tips, and hints for getting the most out of your search engine, for people who work on the web.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>7</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-17485143.post-113214342735501596</id><published>2005-11-16T04:12:00.000-08:00</published><updated>2005-11-16T04:17:07.356-08:00</updated><title type='text'></title><content type='html'>&lt;span style="font-family:arial;"&gt;Google: Scaling with the Web&lt;/span&gt;&lt;br 
/&gt;&lt;div align="justify"&gt;&lt;span style="font-family:verdana;"&gt;&lt;span style="font-size:85%;"&gt;Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.&lt;br /&gt;These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access&lt;/span&gt;&lt;span style="font-size:85%;"&gt;. Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available&lt;/span&gt;&lt;span style="font-size:85%;"&gt;. 
This will result in favorable scaling properties for centralized systems like Google.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-113214342735501596?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/113214342735501596/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=113214342735501596' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113214342735501596'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113214342735501596'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/11/google-scaling-with-web-creating.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-113214059817235406</id><published>2005-11-16T03:26:00.000-08:00</published><updated>2005-11-16T04:12:12.186-08:00</updated><title type='text'></title><content type='html'>Web Search Engines {Scaling Up: 1994 - 2000}&lt;br /&gt;&lt;div align="justify"&gt;&lt;br /&gt;&lt;p&gt;&lt;span style="font-family:verdana;"&gt;&lt;span style="font-size:85%;"&gt;Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents. 
As of November 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.searchenginewatch.com/"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;Search Engine Watch)&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;. It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.&lt;/span&gt;&lt;/p&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;span class="style3"&gt;&lt;strong&gt;(&lt;span class="style4"  style="color:#663333;"&gt;&lt;span style="color:#006600;"&gt;tip:&lt;/span&gt;&lt;span style="color:#000000;"&gt; if you need a niche&lt;/span&gt;&lt;/span&gt;&lt;span style="color:#000000;"&gt; &lt;span class="style5"&gt;search, use the Google search&lt;/span&gt; &lt;span class="style6"&gt;at the bottom right of this blog&lt;/span&gt;&lt;/span&gt;)&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-113214059817235406?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/113214059817235406/comments/default' title='Post 
Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=113214059817235406' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113214059817235406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113214059817235406'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/11/web-search-engines-scaling-up-1994.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-113023705003317057</id><published>2005-10-25T03:40:00.000-07:00</published><updated>2005-10-25T03:46:16.860-07:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Crawling the Web &lt;/h3&gt;&lt;div align="justify"&gt;Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;p align="justify"&gt;In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. 
This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.&lt;br /&gt;&lt;p align="justify"&gt;It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries, generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. 
Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up. &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-113023705003317057?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/113023705003317057/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=113023705003317057' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113023705003317057'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113023705003317057'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/10/crawling-web-running-web-crawler-is.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-113023548430302271</id><published>2005-10-25T03:07:00.000-07:00</published><updated>2005-10-25T03:23:21.500-07:00</updated><title type='text'></title><content type='html'>&lt;div align="justify"&gt;&lt;span style="font-size:130%;color:#330033;"&gt;Google and its data centres.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;p align="justify"&gt;Ever wondered how Google manages its 3 billion plus index? 
In this article I explain how big G is organised. &lt;/p&gt;&lt;p align="justify"&gt;Google uses data centres, which spread the workload across more than 10,000 computers. &lt;/p&gt;&lt;p align="justify"&gt;How many Google data centres are there? As of 2nd January 2004, there are 13 data centres, located across the globe. More are added from time to time. &lt;/p&gt;&lt;p align="justify"&gt;Domain IP Address&lt;br /&gt;www-mc.google.com 66.102.7.100&lt;br /&gt;www-lm.google.com 66.102.9.100&lt;br /&gt;www-kr.google.com 66.102.11.100&lt;br /&gt;www-ex.google.com 216.239.33.100&lt;br /&gt;www-sj.google.com 216.239.35.100&lt;br /&gt;www-va.google.com 216.239.37.100&lt;br /&gt;www-dc.google.com 216.239.39.100&lt;br /&gt;www-fi.google.com 216.239.41.100&lt;br /&gt;www-ab.google.com 216.239.51.100&lt;br /&gt;www-in.google.com 216.239.53.100&lt;br /&gt;www-zu.google.com 216.239.55.100&lt;br /&gt;www-cw.google.com 216.239.57.100&lt;br /&gt;www-gv.google.com 216.239.59.100 &lt;/p&gt;&lt;p align="justify"&gt;Google crawls sites continually, then periodically updates the data centres with a "fresh index". Because of the sheer volume of data involved, this can't happen instantly, so results appear to jump around at the time of a re-index. &lt;/p&gt;&lt;p align="justify"&gt;Each time you search Google you receive data from one of the data centres. This is often the geographically closest one, but the choice also depends on traffic and other factors. &lt;/p&gt;&lt;p align="justify"&gt;Interested to see your results on a different data centre? Copy one of the domains above into your browser. If the results differ between centres, a re-index is in progress. 
&lt;/p&gt;&lt;p align="justify"&gt;Google also has two test domains, www2.google.com and www3.google.com, which are used to try out new search algorithms. If you're interested to see what your results might be in the future, try searching on these. Bear in mind that not all the experimental algorithms &lt;strong&gt;Google&lt;/strong&gt; develops become part of the main one.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-113023548430302271?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/113023548430302271/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=113023548430302271' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113023548430302271'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/113023548430302271'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/10/google-and-its-data-centres.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-112893678901721672</id><published>2005-10-10T02:19:00.000-07:00</published><updated>2005-10-10T02:33:09.023-07:00</updated><title type='text'></title><content type='html'>&lt;strong&gt;Google Architecture Overview&lt;/strong&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;Most of 
&lt;strong&gt;Google&lt;/strong&gt; is implemented in C or C++ for efficiency and can run in either Solaris or Linux.&lt;/span&gt;&lt;/div&gt;&lt;div align="justify"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;In Google, the web crawling (downloading of web pages) is done by several distributed &lt;strong&gt;crawlers&lt;/strong&gt;. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the &lt;strong&gt;storeserver&lt;/strong&gt;. The storeserver then compresses and stores the web pages into a &lt;strong&gt;repository&lt;/strong&gt;. Every web page has an associated ID number called a &lt;strong&gt;docID&lt;/strong&gt; which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "&lt;strong&gt;barrels&lt;/strong&gt;", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.&lt;br /&gt;The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. 
The links database is used to compute PageRanks for all the documents.&lt;br /&gt;The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-112893678901721672?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/112893678901721672/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=112893678901721672' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112893678901721672'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112893678901721672'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/10/google-architecture-overview-most-of.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-112851474201937550</id><published>2005-10-05T05:16:00.000-07:00</published><updated>2005-10-06T02:08:20.046-07:00</updated><title type='text'></title><content type='html'>&lt;p&gt;&lt;span style="font-family:verdana;"&gt;&lt;strong&gt;Description of PageRank Calculation&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;&lt;br /&gt;&lt;p align="justify"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:&lt;br /&gt;We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. &lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;div align="left"&gt;&lt;br /&gt;&lt;p style="FONT-SIZE: 85%; FONT-FAMILY: verdana"&gt;The PageRank of a page A is given as follows:&lt;/p&gt;&lt;/div&gt;&lt;div align="left"&gt;&lt;br /&gt;&lt;/div&gt;&lt;p style="FONT-SIZE: 85%; FONT-FAMILY: verdana" align="center"&gt;PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) &lt;/p&gt;&lt;div align="center"&gt;&lt;br /&gt;&lt;/div&gt;&lt;p align="justify"   style="font-family:verdana;font-size:85%;"&gt;&lt;span style="font-family:verdana;"&gt;Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. 
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation. There are many other details which are beyond the scope of this paper.&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-112851474201937550?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/112851474201937550/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=112851474201937550' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112851474201937550'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112851474201937550'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/10/description-of-pagerank-calculation.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17485143.post-112851147988790221</id><published>2005-10-05T03:27:00.000-07:00</published><updated>2005-10-10T03:09:10.970-07:00</updated><title type='text'></title><content type='html'>&lt;div align="justify"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;You'd think with all the fuss about &lt;strong&gt;indexing web pages&lt;/strong&gt; to add to search engine databases, that 
robots would be great and powerful beings. Wrong. Search engine robots have only basic functionality like that of early browsers in terms of what they can understand in a web page. Like early browsers, robots just can't do certain things. &lt;strong&gt;Robots&lt;/strong&gt; &lt;strong&gt;don't understand frames&lt;/strong&gt;, &lt;strong&gt;Flash movies&lt;/strong&gt;, &lt;strong&gt;images or JavaScript&lt;/strong&gt;. They can't enter &lt;strong&gt;password-protected areas&lt;/strong&gt; and they can't click all those buttons you have on your website. They can be stopped cold while indexing a dynamically generated URL and slowed to a stop with JavaScript navigation.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How Do Search Engine Robots Work?&lt;/strong&gt;&lt;br /&gt;Think of search engine robots as automated data retrieval programs, traveling the web to find information and links.&lt;br /&gt;When you submit a web page to a search engine at the "Submit a URL" page, the new URL is added to the robot's queue of websites to visit on its next foray out onto the web. Even if you don't directly submit a page, many robots will find your site because of links from other sites that point back to yours. This is one of the reasons why it is important to build your link popularity and to get links from other topical sites back to yours.&lt;br /&gt;When arriving at your website, the automated robots first check to see if you have a &lt;strong&gt;robots.txt&lt;/strong&gt; file. This file is used to tell robots which areas of your site are off-limits to them. Typically these may be directories containing only binaries or other files the robot doesn't need to concern itself with.&lt;br /&gt;Robots collect links from each page they visit, and later follow those links through to other pages. In this way, they essentially follow the links from one page to another. 
The entire World Wide Web is made up of links, the original idea being that you could follow links from one place to another.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;span style="font-family:verdana;font-size:85%;"&gt;&lt;em&gt;This is how robots get around.&lt;/em&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17485143-112851147988790221?l=searchenginerobots.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://searchenginerobots.blogspot.com/feeds/112851147988790221/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17485143&amp;postID=112851147988790221' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112851147988790221'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17485143/posts/default/112851147988790221'/><link rel='alternate' type='text/html' href='http://searchenginerobots.blogspot.com/2005/10/youd-think-with-all-fuss-about.html' title=''/><author><name>Webcrawler</name><uri>http://www.blogger.com/profile/12996321960360773095</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry></feed>
