Crawler and web search engine

Primary language: Rust

gotoint

Web search engine.

Tasks

  • Crawler
    • HTML parser
    • Content extraction
    • Database
    • Visited pages bloom filter
    • Multithreading
    • Message queue
    • Priority queue
    • Politeness
    • Re-crawling
    • Handling crawler traps and overly long URLs
    • Distributed
    • Language detection
    • Duplicate detection
    • DNS cache
  • Index
  • Query
    • Webapp
  • Project name
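
As an illustration of the "Visited pages bloom filter" task above, here is a minimal sketch using only the Rust standard library. It is not taken from the gotoint source: the type name `VisitedFilter`, its methods, and the double-hashing scheme built on `DefaultHasher` are all assumptions chosen for brevity. A real crawler would probably size the filter from the expected number of URLs and a target false-positive rate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size Bloom filter over URL strings.
/// Hypothetical type for illustration; not part of the gotoint code base.
pub struct VisitedFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl VisitedFilter {
    /// `num_bits` is rounded up to a multiple of 64. Both arguments must be non-zero.
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        VisitedFilter {
            bits: vec![0; words],
            num_bits: words as u64 * 64,
            num_hashes,
        }
    }

    /// Two base hashes of the URL; the second is forced odd so the step is never zero.
    fn base_hashes(url: &str) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        url.hash(&mut a);
        let mut b = DefaultHasher::new();
        (url, 0x9e37_79b9_u32).hash(&mut b);
        (a.finish(), b.finish() | 1)
    }

    /// Bit position for the i-th probe: (h1 + i * h2) mod num_bits (double hashing).
    fn position(&self, h1: u64, h2: u64, i: u64) -> u64 {
        h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits
    }

    /// Returns false if `url` was definitely never inserted,
    /// true if it was probably inserted (false positives are possible).
    pub fn contains(&self, url: &str) -> bool {
        let (h1, h2) = Self::base_hashes(url);
        (0..self.num_hashes as u64).all(|i| {
            let pos = self.position(h1, h2, i);
            self.bits[(pos / 64) as usize] & (1u64 << (pos % 64)) != 0
        })
    }

    /// Marks `url` as visited. Returns true if every probed bit was already set,
    /// meaning the URL was probably seen before.
    pub fn check_and_insert(&mut self, url: &str) -> bool {
        let (h1, h2) = Self::base_hashes(url);
        let mut seen = true;
        for i in 0..self.num_hashes as u64 {
            let pos = self.position(h1, h2, i);
            let (word, mask) = ((pos / 64) as usize, 1u64 << (pos % 64));
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}
```

A crawler loop could call `check_and_insert` on each discovered URL and enqueue it only when the call returns `false`. The trade-off is that a Bloom filter never forgets a visited URL but may occasionally report an unvisited one as seen, so a small fraction of pages can be skipped.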

Check out

Deploy for development

Crawl pages.

docker-compose -f deploy/crawler.dev.yml up

Build inverted index.

docker-compose -f deploy/index.dev.yml up

Start web server.

docker-compose -f deploy/dev.yml up

References

[1] Web Crawling http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf

[2] Introduction to Information Retrieval https://nlp.stanford.edu/IR-book/