A scalable web crawler. Here is a list of its features:
- The service recursively crawls the web, storing each link, its text, and the corresponding text embedding.
- We use a large language model (e.g. BERT) to obtain the text embeddings, i.e. a vector representation of the text found on each website.
- The service is scalable: we use Ray to spread the work across multiple workers.
- The entries are stored in a vector database. Vector databases are ideal for saving and retrieving samples according to a vector representation.

By saving the representations in a vector database, you can retrieve similar pages according to how close two vectors are, which is critical for retrieving the most relevant results.
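As a toy illustration of "closeness" between two embeddings (plain NumPy, with made-up 3-dimensional vectors; the actual retrieval is performed by the vector database):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction (very similar pages)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings"; real BERT embeddings have 768 dimensions
page_a = np.array([0.9, 0.1, 0.0])
page_b = np.array([0.8, 0.2, 0.1])
print(cosine_similarity(page_a, page_b))  # close to 1.0 -> similar content
```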
Run the crawler from the terminal:

```
$ python cli_crawl.py --help

options:
  -h, --help            show this help message and exit
  -u INITIAL_URLS [INITIAL_URLS ...], --initial-urls INITIAL_URLS [INITIAL_URLS ...]
  -lm LANGUAGE_MODEL, --language-model LANGUAGE_MODEL
  -m MAX_DEPTH, --max-depth MAX_DEPTH
```
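For example, to crawl two levels deep starting from a seed URL (the seed URL here is a placeholder):

```
$ python cli_crawl.py --initial-urls https://example.com --language-model bert-base-uncased --max-depth 2
```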
Host the API with uvicorn and FastAPI:

```
uvicorn api_app:app --host 0.0.0.0 --port 80
```
Take a look at the example in start_api_and_head_node.sh. Note that the Ray head node needs to be initialized first.
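For illustration only, here is a hypothetical sketch of what an app like api_app.py could look like; the actual endpoints and payloads in this repository may differ:

```python
# Hypothetical sketch; see api_app.py for the real implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CrawlRequest(BaseModel):
    initial_urls: list[str]
    max_depth: int = 2

@app.post("/crawl")
def crawl(request: CrawlRequest) -> dict:
    # The real service would submit the crawl job to the Ray cluster here
    return {"status": "submitted", "urls": request.initial_urls}
```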
For our use case, we simply use the BERT model implemented by Hugging Face to extract embeddings from the web text. More precisely, we use bert-base-uncased. Note that the code is model-agnostic, and new models can be registered and added with a few lines of code; take a look at llm/best.py.
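For illustration, a minimal sketch of extracting an embedding with the Hugging Face transformers library; mean-pooling the last hidden state is one common choice, and the repository's own pooling strategy may differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Tokenize, truncating to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state into a single 768-dimensional vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

print(embed("Hello, world!").shape)  # torch.Size([768])
```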
We use Milvus as our main database. We chose a vector database due to its inherent capability of saving and searching entries based on vector representations (embeddings).
Start your standalone Milvus server as follows; I suggest using a terminal multiplexer such as tmux:

```
tmux new -s milvus
milvus-server
```
Take a look under scripts/ to see some of the basic requests to Milvus.
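For reference, here is a minimal pymilvus sketch of the kind of insert and search requests those scripts perform; the collection name, schema, and index parameters below are illustrative, not the ones used by the crawler:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Connect to the standalone Milvus server (default port 19530)
connections.connect(host="localhost", port="19530")

# Illustrative schema: an auto-generated id, the page URL, and its embedding
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("url", DataType.VARCHAR, max_length=2048),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("pages_demo", CollectionSchema(fields))

# Insert one entry; column order matches the schema (id is auto-generated)
collection.insert([["https://example.com"], [[0.0] * 768]])

# Build an index and load the collection before searching
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Retrieve the 5 entries whose embeddings are closest to a query vector
results = collection.search(
    data=[[0.0] * 768],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["url"],
)
```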
You can also use the official Docker Compose template:

```
docker compose --file milvus-docker-compose.yml up -d
```
We use Ray, a great Python framework for distributed and parallel processing. Ray follows the master-worker paradigm, where a head node requests tasks to be executed by the connected workers.
- Set up the head node:

```
ray start --head
```

- Connect your program to the head node:

```python
import ray

# Connect to the running head node
ray.init("auto")
```

In case you want to stop a Ray node:

```
ray stop
```

Or check its status:

```
ray status
```

- Initialize a worker node (the head node prints the exact command, including its address, when it starts):

```
ray start --address=<head-node-ip>:6379
```
The worker node does not need a local copy of the code: the head node serializes and submits both the arguments and the implementation to the workers.
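For illustration, a minimal sketch of how a task gets shipped to workers; the function and URL here are toy examples, not the crawler's actual tasks:

```python
import ray

# Connect to the running head node
ray.init("auto")

# Ray serializes this function and ships it to whichever worker runs it,
# which is why workers do not need a local copy of the code.
@ray.remote
def page_size(url: str) -> int:
    import urllib.request
    with urllib.request.urlopen(url) as response:
        return len(response.read())

# Each .remote() call is scheduled on one of the connected workers
futures = [page_size.remote(u) for u in ["https://example.com"]]
print(ray.get(futures))
```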
The current implementation is a PoC. Many improvements can be made:
- [Important] Add a new entrypoint in the API to search for similar URLs given a text query.
- Optimize search and the API.
- Add new LLMs and new chunking strategies with popular libraries, e.g. LangChain.
- Store more features in the vector DB; perhaps generate summaries.
All issues and PRs are welcome 🙂.