gotoint
Web search engine.
Tasks
- Crawler
- HTML parser
- Content extraction
- Database
- Visited pages bloom filter
- Multithreading
- Message queue
- Priority queue
- Politeness
- Re-crawling
- Handling crawler traps and overly long URLs
- Distributed
- Language detection
- Duplicate detection
- DNS cache
- Index
- Query
- Webapp
- Project name
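
As an illustration of the "Visited pages bloom filter" task above, here is a minimal Python sketch of how a crawler might remember which URLs it has already seen. It is not taken from this project's code: the `BloomFilter` class, its default sizes, and the double-hashing scheme over SHA-256 are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for visited URLs (hypothetical, illustrative only)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        # Set every bit position derived from the URL.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "possibly visited"; False means "definitely not visited".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


visited = BloomFilter()
visited.add("https://example.com/")
print("https://example.com/" in visited)
```

A Bloom filter never reports a visited URL as unseen, but it can occasionally report an unseen URL as visited, so a crawler using it may skip a small fraction of pages in exchange for a fixed memory footprint.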
Check out
Deploy for development
Crawl pages.
docker-compose -f deploy/crawler.dev.yml up
Build the inverted index.
docker-compose -f deploy/index.dev.yml up
Start the web server.
docker-compose -f deploy/dev.yml up
References
[1] Web Crawling http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
[2] Introduction to Information Retrieval https://nlp.stanford.edu/IR-book/