Abstract:
Today’s search engines rely on specialized agents known as Web crawlers (download robots) to fetch large volumes of Web content online. Crawlers interact with thousands of Web servers over periods ranging from a few weeks to several years. Large-scale search engines such as Google use distributed crawlers to crawl the entire WWW. A distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the Web. This paper presents the design and implementation of a scalable distributed crawler built on the distributed programming facilities provided by Java RMI. Hash-based partitioning is used to divide the URLs among the crawlers, and communication among crawlers is done by Remote Method Invocation. The system can run many crawler instances at the same time.
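
As an illustration of hash-based partitioning, the following minimal sketch assigns each URL to one of N crawler instances. It is not the implementation described in this paper: the class name UrlPartitioner, the method ownerOf, and the choice to hash the lowercased host name (so that all pages of one site go to the same crawler) are assumptions made for the example.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not taken from the paper: maps a URL to a crawler index.
final class UrlPartitioner {

    /**
     * Returns the index (0 .. crawlerCount - 1) of the crawler assumed to be
     * responsible for the given URL. Throws IllegalArgumentException if the
     * URL is malformed or crawlerCount is not positive.
     */
    static int ownerOf(String url, int crawlerCount) {
        if (crawlerCount <= 0) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        // Assumption: partition by host so one site stays with one crawler.
        String host = URI.create(url).getHost();
        // Fall back to the full URL string when no host can be parsed.
        String key = (host != null) ? host.toLowerCase(Locale.ROOT) : url;
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), crawlerCount);
    }
}
```

Under these assumptions, a crawler that discovers a URL would call ownerOf and either keep the URL, if the index is its own, or hand it to the owning crawler, for example through a remote method call.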