Start with a page, any page. Crawl a total of 100 pages and build a basic search index over the words they contain.
- Fetch a page (a fetch-and-extract sketch in Python follows this list).
- Extract content: the page title and the body text.
- Extract all the links on the page.
- Visit links in a breadth-first manner, i.e. all links on the current page get queued first, then links from the first linked page, then from the second, and so on (see the crawl sketch below).
- Store the extracted content in an organised fashion so it can be retrieved later (an inverted-index sketch appears below).
- Avoid crawling in circles by maintaining a set of URLs already visited.
- Respect `robots.txt` (a robotparser-based check is sketched below).
- Watch for cache headers such as `Cache-Control` and `Expires`, and revisit or re-index a page after its cached copy expires (an expiry helper is sketched below).
- Make the script configurable, e.g. seed URL, page limit, and output location (an argparse sketch closes the examples below).
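
The sketches below show one possible Python shape for these steps. The helper names (`fetch_page`, `crawl`, `build_index`, and so on) and the use of the third-party `requests` and `beautifulsoup4` packages are assumptions for illustration, not requirements of the task. First, fetching a page and extracting its title, body text, and links:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_page(url, user_agent="simple-crawler"):
    """Download one page and return (title, body_text, links)."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    body = soup.get_text(separator=" ", strip=True)

    # Resolve relative hrefs against the page URL so every link is absolute.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, body, links
```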
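
Breadth-first order falls out of a FIFO queue: every link discovered on the seed page is dequeued before any link discovered one level deeper. A `visited` set prevents the crawl from going in circles. This sketch builds on `fetch_page` above:

```python
from collections import deque

import requests


def crawl(seed_url, max_pages=100):
    """Breadth-first crawl starting at seed_url, capped at max_pages."""
    queue = deque([seed_url])
    visited = set()  # URLs already attempted; guards against cycles
    pages = {}       # url -> (title, body), kept for indexing later

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            title, body, links = fetch_page(url)  # from the sketch above
        except requests.RequestException:
            continue  # unreachable or broken page: skip it
        pages[url] = (title, body)
        queue.extend(link for link in links if link not in visited)
    return pages
```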
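
For the word index, a plain inverted index (word to the URLs containing it) is the simplest organised store, and persisting it as JSON keeps retrieval trivial. The file name here is an assumption:

```python
import json
import re
from collections import defaultdict


def build_index(pages):
    """Map every word in a page's title or body to the URLs containing it."""
    index = defaultdict(set)
    for url, (title, body) in pages.items():
        for word in re.findall(r"[a-z0-9]+", f"{title} {body}".lower()):
            index[word].add(url)
    return index


def save_index(index, path="index.json"):
    # JSON has no set type, so store each posting list as a sorted list.
    with open(path, "w") as f:
        json.dump({word: sorted(urls) for word, urls in index.items()}, f)
```

A lookup is then a dictionary access: `index["python"]` returns every crawled URL whose text contains that word.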
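
Respecting `robots.txt` needs no third-party code: the standard library's `urllib.robotparser` fetches and evaluates the file. Caching one parser per host avoids re-downloading `robots.txt` for every URL:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> parsed robots.txt (or None if unreachable)


def allowed_by_robots(url, user_agent="simple-crawler"):
    """Return True if robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    host = f"{parts.scheme}://{parts.netloc}"
    if host not in _robots_cache:
        parser = RobotFileParser(f"{host}/robots.txt")
        try:
            parser.read()
        except OSError:
            parser = None  # robots.txt unreachable: this sketch allows the fetch
        _robots_cache[host] = parser
    parser = _robots_cache[host]
    return parser is None or parser.can_fetch(user_agent, url)
```

Inside `crawl`, the check goes just before `fetch_page`: skip any URL for which `allowed_by_robots` returns False.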
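
For expiry, the relevant HTTP response headers are `Cache-Control: max-age=...` and `Expires`. Recording an expiry timestamp per page lets the crawler re-fetch and re-index pages that have gone stale:

```python
import time
from email.utils import parsedate_to_datetime


def expiry_time(response):
    """Derive a Unix expiry timestamp from a response's cache headers.

    Returns None when the response carries no expiry information.
    """
    cache_control = response.headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return time.time() + int(directive.split("=", 1)[1])
    expires = response.headers.get("Expires")
    if expires:
        return parsedate_to_datetime(expires).timestamp()
    return None
```

Storing `expiry_time(response)` alongside each page, then re-queuing any URL whose timestamp has passed, gives the revisit-after-expiry behaviour.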
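
Finally, configurability via `argparse`; the flag names here are illustrative:

```python
import argparse


def parse_args():
    """Expose the crawl parameters on the command line."""
    parser = argparse.ArgumentParser(description="Crawl pages and build a word index.")
    parser.add_argument("seed", help="URL to start crawling from")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="total number of pages to crawl")
    parser.add_argument("--index-file", default="index.json",
                        help="path to write the word index to")
    parser.add_argument("--user-agent", default="simple-crawler",
                        help="User-Agent sent with requests and checked "
                             "against robots.txt")
    return parser.parse_args()
```

Running `python crawler.py https://example.com --max-pages 50` would then crawl fifty pages starting from `example.com`.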