A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically operated by search engines for the purpose of Web indexing (web spidering).

The whole web could be represented using a graph, where each url would be vertex and there would be a directed edge from url A to url B if web content of url A contains (i.e, points to) url B.

In order to crawl web we should start from the starting url, read the html content for the url and discover all the urls found in the html content. Now repeat the same thing for all the discovered urls.

Now a very important thing to consider: we cannot just go on crawling forever. The Indexed World Wide Web (The Internet) contains at least 5.27 billion web pages as of Wednesday, 31 March, 2021. So the depths of the paths in the graph representing WWW from root (url from where web crawling is started) to the leaves could be very huge. If we are not interested in discovering all the web pages out there:
  • we should use BFS and not DFS, because we might be interested more in discovering the nodes (web pages / links / urls) which are closer to the root (starting web page). Think of it as how we are more interested in discovering first degree and second degree connections compared to the higher degree ones.

    Since DFS goes deeper till it reaches leaf and is forced to backtrack, by using DFS we would discover higher degree connections first, before discovering all the lower degree connections.

    Since BFS is a level order search (searches level by level) we would discover lower order connections first.
  • we should have a termination condition so that we know when to stop crawling if the crawling takes a long time.

Enjoy building a simple web crawler as shown in code below. Don't forget to run the code in your local machine and play around with it, and learn as much as possible from it.

Login to Access Content

To learn about a bit more advanced Web Crawler, take a look at Multithreaded Web Crawler.


If you have any feedback, please use this form: https://thealgorists.com/Feedback.

Help Your Friends save 40% on our products