Simple Distributed Web Crawler Library

A simple distributed web crawler library that is written in Go.

The library is implemented completely from scratch. As a Golang practice project, it focuses mainly on the distributed structure. Users need to implement their own web parsers, as shown in the examples.

It is the capstone project of imooc's Golang course.

Architecture

As a distributed web crawler, it contains several components

The components communicate with each other using JSON-RPC.
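To illustrate what JSON-RPC communication between two components can look like, here is a minimal sketch built on Go's standard `net/rpc` and `net/rpc/jsonrpc` packages. The names `WorkerService`, `ParseRequest`, `ParseResult`, `ServeWorker`, and `CallWorker` are hypothetical and chosen only for this example; they are not taken from the library, whose actual services and message types may differ.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// ParseRequest and ParseResult are hypothetical message types for this sketch.
type ParseRequest struct {
	URL string
}

type ParseResult struct {
	Links []string
}

// WorkerService is a hypothetical component exposed over JSON-RPC.
type WorkerService struct{}

// Process has the shape net/rpc requires: an exported method with an
// argument, a pointer reply, and an error return. Here it only fakes a result.
func (WorkerService) Process(req ParseRequest, res *ParseResult) error {
	res.Links = []string{req.URL + "/next"}
	return nil
}

// ServeWorker registers the service and serves JSON-RPC on every connection
// accepted on addr. The caller closes the returned listener to stop serving.
func ServeWorker(addr string) (net.Listener, error) {
	server := rpc.NewServer()
	if err := server.Register(WorkerService{}); err != nil {
		return nil, err
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener was closed
			}
			go server.ServeCodec(jsonrpc.NewServerCodec(conn))
		}
	}()
	return ln, nil
}

// CallWorker dials the worker and invokes WorkerService.Process over JSON-RPC.
func CallWorker(addr, url string) (ParseResult, error) {
	client, err := jsonrpc.Dial("tcp", addr)
	if err != nil {
		return ParseResult{}, err
	}
	defer client.Close()

	var res ParseResult
	err = client.Call("WorkerService.Process", ParseRequest{URL: url}, &res)
	return res, err
}

func main() {
	// Port 0 lets the OS pick a free port for this demo.
	ln, err := ServeWorker("127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	res, err := CallWorker(ln.Addr().String(), "http://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res.Links) // prints [http://example.com/next]
}
```

In a real deployment the server and client would run in separate processes on different hosts; they share a process here only so the sketch runs on its own.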

Algorithm

The crawler uses breadth-first search to scrape websites.
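The breadth-first strategy can be sketched as a queue of pending URLs plus a set of URLs already seen. The code below is an illustration only, not the library's implementation: `CrawlBFS` and `LinkFetcher` are hypothetical names, the depth limit is an assumption added to keep the example bounded, and an in-memory link graph stands in for real HTTP fetching and parsing.

```go
package main

import "fmt"

// LinkFetcher is a hypothetical hook: given a URL, it returns the URLs
// linked from that page. A real crawler would fetch and parse the page here.
type LinkFetcher func(url string) ([]string, error)

// CrawlBFS visits pages breadth-first starting from seed, following links
// up to maxDepth hops away, and returns the URLs in the order visited.
func CrawlBFS(seed string, maxDepth int, fetch LinkFetcher) []string {
	type item struct {
		url   string
		depth int
	}
	visited := map[string]bool{seed: true}
	queue := []item{{seed, 0}}
	var order []string

	for len(queue) > 0 {
		// Pop from the front so that closer pages are handled first.
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.url)

		if cur.depth == maxDepth {
			continue // do not expand pages at the depth limit
		}
		links, err := fetch(cur.url)
		if err != nil {
			continue // skip pages that fail to load
		}
		for _, l := range links {
			if !visited[l] {
				visited[l] = true
				queue = append(queue, item{l, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	// In-memory link graph standing in for real web pages.
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"d", "a"},
		"c": {"d"},
		"d": {"e"},
	}
	fetch := func(u string) ([]string, error) { return graph[u], nil }

	fmt.Println(CrawlBFS("a", 2, fetch)) // prints [a b c d]
}
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps the same page from being queued twice when several pages link to it.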

Examples

There are two simple examples included:

TODO

  • separate service for saving data
  • separate service for parsing web data
  • frontend for displaying search results
  • use testcontainers in tests
  • separate service for checking duplicates
  • Kubernetes deployment
  • gRPC and Protobuf version