This is a domain crawler service that accepts a list of URLs to crawl and keywords to search for.
The OpenAPI specification can be found in the `api/openapi/api.yaml` file. It defines the API documentation and the request/response schemas, and I've also integrated it to validate request bodies.
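The README doesn't name the validation library, so the sketch below is only one plausible approach, using `github.com/getkin/kin-openapi` to load the spec and reject malformed request bodies before the handler runs; the middleware shape and all names here are assumptions, not the project's actual code.

```go
package middleware

import (
	"net/http"

	"github.com/getkin/kin-openapi/openapi3"
	"github.com/getkin/kin-openapi/openapi3filter"
	legacyrouter "github.com/getkin/kin-openapi/routers/legacy"
)

// ValidateRequests wraps a handler and rejects requests whose bodies do not
// match the OpenAPI schema. Assumes the spec lives at api/openapi/api.yaml.
func ValidateRequests(next http.Handler) (http.Handler, error) {
	loader := openapi3.NewLoader()
	doc, err := loader.LoadFromFile("api/openapi/api.yaml")
	if err != nil {
		return nil, err
	}
	if err := doc.Validate(loader.Context); err != nil {
		return nil, err
	}
	router, err := legacyrouter.NewRouter(doc)
	if err != nil {
		return nil, err
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		route, pathParams, err := router.FindRoute(r)
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		input := &openapi3filter.RequestValidationInput{
			Request:    r,
			PathParams: pathParams,
			Route:      route,
		}
		if err := openapi3filter.ValidateRequest(r.Context(), input); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		next.ServeHTTP(w, r)
	}), nil
}
```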
```sh
# Run the service locally
make start-local

# Run the service via Docker Compose; the environment variables injected into
# the container can be updated in the docker-compose.yml file.
make start-local-docker
```
- `PORT` - The port the server will listen on.
- `EXTRACTOR_CONCURRENT_LIMIT` - The maximum number of concurrent requests the extractor will make.
- `RATE_LIMIT_RPM` - The rate limit for the service, in requests per minute.
This project uses Go's [`errgroup`](https://pkg.go.dev/golang.org/x/sync/errgroup) to manage concurrency. The number of concurrent requests can be configured using the `EXTRACTOR_CONCURRENT_LIMIT` environment variable.
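As a rough sketch of how the bounded fan-out might look with `errgroup.SetLimit` (the `fetch` helper, the default limit, and function names are illustrative, not the service's actual code):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"

	"golang.org/x/sync/errgroup"
)

// fetch is a hypothetical helper that downloads a single page.
func fetch(ctx context.Context, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func crawl(ctx context.Context, urls []string) error {
	g, ctx := errgroup.WithContext(ctx)

	// Bound the number of in-flight requests with EXTRACTOR_CONCURRENT_LIMIT.
	limit, err := strconv.Atoi(os.Getenv("EXTRACTOR_CONCURRENT_LIMIT"))
	if err != nil || limit < 1 {
		limit = 10 // illustrative default
	}
	g.SetLimit(limit)

	for _, url := range urls {
		url := url // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			body, err := fetch(ctx, url)
			if err != nil {
				return err
			}
			fmt.Printf("fetched %s (%d bytes)\n", url, len(body))
			return nil
		})
	}
	// Wait returns the first non-nil error from any goroutine, if any.
	return g.Wait()
}
```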
The service is rate limited to 60 requests per minute by default. This can be configured using the `RATE_LIMIT_RPM` environment variable.
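The design notes at the end of this section describe the limiter as a simple one that resets every minute; a minimal fixed-window sketch along those lines (type and function names are illustrative, not the project's actual code):

```go
package ratelimit

import (
	"sync"
	"time"
)

// FixedWindow allows up to rpm requests per minute; the counter resets at
// each one-minute window boundary rather than sliding.
type FixedWindow struct {
	mu          sync.Mutex
	rpm         int
	count       int
	windowStart time.Time
}

func New(rpm int) *FixedWindow {
	return &FixedWindow{rpm: rpm, windowStart: time.Now()}
}

// Allow reports whether another request fits in the current window.
func (f *FixedWindow) Allow() bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	// Start a fresh window once a minute has elapsed.
	if time.Since(f.windowStart) >= time.Minute {
		f.windowStart = time.Now()
		f.count = 0
	}
	if f.count >= f.rpm {
		return false
	}
	f.count++
	return true
}
```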
The service uses a simple in-memory cache to store the HTML document responses. The cache is not persisted and is cleared on every restart; I've decided not to implement a TTL for the cache.
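A minimal sketch of what such a cache might look like, assuming a mutex-guarded map keyed by URL (the names are illustrative, not the project's actual types):

```go
package cache

import "sync"

// HTMLCache stores raw HTML documents keyed by URL. Entries are never
// evicted (no TTL); the cache lives only as long as the process.
type HTMLCache struct {
	mu   sync.RWMutex
	docs map[string][]byte
}

func New() *HTMLCache {
	return &HTMLCache{docs: make(map[string][]byte)}
}

// Get returns the cached document for url, if present.
func (c *HTMLCache) Get(url string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	doc, ok := c.docs[url]
	return doc, ok
}

// Set stores the document for url, overwriting any previous entry.
func (c *HTMLCache) Set(url string, doc []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.docs[url] = doc
}
```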
The CI pipeline is configured using GitHub Actions and runs the tests and linting on every push to the repository.
```sh
make test
```
```sh
# Install (or upgrade) golangci-lint via Homebrew
brew install golangci-lint
brew upgrade golangci-lint

make lint
```
- Cache key - I'm caching the entire HTML document returned by the URL. I didn't cache the `ExtractResult` because different keywords can be used for the same URL. The only caveat is that the cache footprint can grow large, since we're caching the entire HTML document.
- Rate limiting - I'm using a simple rate limiter which resets every minute.