Database generation infrastructure for scraping projects.
- Derive and generate a database structure from API responses.
- Analyze websites for scraping processes.
- Archive websites.
- Derive crawling and scraping processes from website architecture. (This uses LLMs and is therefore more hardware-intensive!)
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
(https://docs.docker.com/engine/install/ubuntu/)
For GPU support, additionally install the NVIDIA-Docker-Runtime:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2 nvidia-container-runtime
sudo systemctl restart docker
(https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0))
Note: If you choose GPU support, replace the command docker with nvidia-docker from now on!
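The choice between the two commands can be captured once in a variable so that later commands do not have to be edited by hand. The following is a minimal sketch; the function name pick_docker_bin and the variable DOCKER_BIN are hypothetical names chosen for illustration, and the check only tests whether an nvidia-docker executable is on the PATH, not whether a GPU is actually usable.

```shell
#!/bin/sh
# Print the Docker command to use: "nvidia-docker" when that wrapper is
# installed, otherwise plain "docker".
# pick_docker_bin and DOCKER_BIN are hypothetical names, not part of the project.
pick_docker_bin() {
    if command -v nvidia-docker >/dev/null 2>&1; then
        printf '%s\n' "nvidia-docker"
    else
        printf '%s\n' "docker"
    fi
}

DOCKER_BIN="$(pick_docker_bin)"
echo "Using: $DOCKER_BIN"
# Later commands can then be written as, for example:
#   "$DOCKER_BIN" ps --all
```

Quoting "$DOCKER_BIN" keeps the later commands identical for CPU and GPU setups.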
After cloning or downloading and unpacking the repository, you might want to add to or adjust the code. One example is the Dockerfile, specifically the list of installed packages.
docker build -f <path to repo folder>/Dockerfile.cpu -t scraping-database-generator-container:v0.1 <path to repo folder>
docker run \
-it --net=host -p 7860:7860 --gpus all \
--mount type=bind,source=<my local text generation model folder>,target=/scraping-database-generator-container/machine_learning_models/GENERATION_MODELS \
--mount type=bind,source=<my local text embedding model folder>,target=/scraping-database-generator-container/machine_learning_models/EMBEDDING_MODELS \
--mount type=bind,source=<my local LoRA model folder>,target=/scraping-database-generator-container/machine_learning_models/LORAS \
--mount type=bind,source=<my local working data folder>,target=/scraping-database-generator-container/data \
"scraping-database-generator-container:v0.1"
Note that mounting the four shared folders above is not mandatory but strongly recommended. Without these mounts, models and working data are downloaded into the container's own folder structure, possibly duplicating files you already have locally and inflating the container size.
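A bind mount given with --mount type=bind fails if the source folder does not exist on the host, so it can help to create the four folders before the first run. The following is a minimal sketch; the function name prepare_mount_dirs and the sub-folder names are assumptions made for illustration, and any existing local folders can be used instead.

```shell
#!/bin/sh
# Create the four host folders that serve as bind-mount sources.
# The sub-folder names are illustrative assumptions, not project requirements.
prepare_mount_dirs() {
    base_dir="$1"
    for name in generation_models embedding_models loras data; do
        # mkdir -p leaves folders that already exist untouched.
        mkdir -p "$base_dir/$name" || return 1
    done
}

# Example: create the folders below a base directory of your choice.
# prepare_mount_dirs "$HOME/scraping-db-generator"
```

The resulting paths can then be used as the source= values in the docker run command.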
Note that you can also open a terminal by appending /bin/bash to the command above. This gives you a shell inside the running container, from which you can start the explorer manually with bash run.sh.
If you exit the container and it stops, you can use
docker ps --all
to retrieve the name of the scraping-database-generator-container:v0.1 container, then restart it and enter it interactively with
docker restart <container name> && docker exec -it <container name> /bin/bash
Inside the docker container's shell, you can run the explorer again by using
bash run.sh
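The lookup and restart steps above can be combined into two small helper functions. This is a sketch only: the function names are hypothetical, it relies on the ancestor filter and the --format option of docker ps to list containers created from the image, and it simply picks the first container listed, which may not be the one you want if several exist.

```shell
#!/bin/sh
IMAGE="scraping-database-generator-container:v0.1"

# Print the names of all containers (running or stopped) created from IMAGE.
list_generator_containers() {
    docker ps --all --filter "ancestor=$IMAGE" --format '{{.Names}}'
}

# Restart the first listed container and open a shell inside it.
# (Hypothetical helper; replace docker with nvidia-docker for GPU setups.)
reenter_generator_container() {
    name="$(list_generator_containers | head -n 1)"
    if [ -z "$name" ]; then
        echo "No container found for image $IMAGE" >&2
        return 1
    fi
    docker restart "$name" && docker exec -it "$name" /bin/bash
}

# Usage:
#   reenter_generator_container
```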