Solution consists of two packages:
- com.vjache.crawler.engine - actually a crawler
- com.vjache.crawler.restapi - a RESTul API wor crawler
Also there is an embedded simple web site with dummy but linked pages for demo & testing purposes (see 'main/resources').
The server starts on localhost:8080 hence check port is free. The documents are stored at './crawler-data' directory.
TODO
- To make server restartable and scalable, I can use external queue server (e.g. RabbitMQ) instead of internal 'Crawler.queue', an also I need to store the fact that crawler loaded some page, this can be done in two ways:
- store downloaded URL in a data base
- or, store document files in such a way that allows fast check if document for particular URL is already downloaded. This is partially already done -- documents files are distributed over 256 buckets, and URL of a document stored in a separate file. But it is better to enhance this aspect to protect performance and concurrent file readings(files can be partiallu written). Also if we want a set of such a crawlers work on different computers in a collaboration, we would need to share such a store to make it possible for concurrent crawlers do not download the same URL, and of course they must be connected to the same queue on the queue server.