This project was developed for the Bajaj HackRx 2.0 hackathon, based on the given problem statement. Its features are:
- Typo-resistant search engine
- Real-time search
- Breadth First Search web crawler
- Scheduling with Apache Airflow
- Container-first approach
The project consists of four parts:
- Apache Airflow
- Search Engine
- Web Crawlers
- Backend (incomplete; exists to serve the static HTML file in the `website` folder)
The project can be started with docker-compose.
To get started, clone the repo and cd into it:

```sh
git clone https://github.com/HackRx2-0/ps1_drop_table.git && cd ps1_drop_table
```

Create an `.env` file:
```sh
mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
```

Run the initial setup. This will create the required databases:
```sh
docker-compose up airflow-init
```

Then let's start the services:
```sh
docker-compose up
```

And finally, install dependencies:
```sh
pip install meilisearch newspaper3k tld
```

Refer to the newspaper3k documentation if you face errors.
Apache Airflow is used for scheduling the web-crawlers through the DAGs provided. However, you can skip its setup and follow the steps mentioned here.
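If you do use Airflow, a crawler DAG conceptually looks like the following. This is a minimal sketch, not the project's actual DAG: the `dag_id`, schedule, and scraper path are assumptions you would replace with your own values.

```python
# dags/crawler_dag.py - hypothetical example, not the DAG shipped with the project
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the crawler on a schedule; the hourly interval and script path are placeholders.
with DAG(
    dag_id="web_crawler",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    run_crawler = BashOperator(
        task_id="run_crawler",
        # Assumed location of the scraper script inside the Airflow container
        bash_command="python3 /opt/airflow/dags/scraper.py",
    )
```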
Start an instance of Meilisearch using Docker, or in another way as described in its documentation:
```sh
docker run -it --rm \
    -p 7700:7700 \
    -v $(pwd)/meili_data:/meili_data \
    getmeili/meilisearch:v1.0
```
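Once the container is up, you can verify the instance from Python. This is a minimal sketch assuming the default host and a hypothetical index named `articles`; it also illustrates the typo-tolerant search that Meilisearch provides out of the box.

```python
import time
import meilisearch

# Connect to the local Meilisearch instance started above (host is an assumption).
client = meilisearch.Client("http://127.0.0.1:7700")
index = client.index("articles")  # hypothetical index name

# Add a couple of sample documents; the index is created implicitly.
index.add_documents([
    {"id": 1, "title": "Insurance claims explained"},
    {"id": 2, "title": "Health policy renewal guide"},
])

# Document indexing is asynchronous; give it a moment before searching.
time.sleep(1)

# A misspelled query like "insurence" still matches "Insurance".
print(index.search("insurence"))
```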
Using the web crawler of your choice, replace `SCRAPE_URL`, the client host URL, and the client index. Once this is done, run the scraper with `python3`.
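For illustration, here is a minimal sketch of such a scraper built on the installed dependencies (newspaper3k, tld, meilisearch). It is not the project's actual breadth-first crawler, and `SCRAPE_URL`, the host URL, and the index name are all placeholders you would replace.

```python
# scraper.py - hypothetical sketch; replace the placeholders before running
import meilisearch
import newspaper
from tld import get_fld

SCRAPE_URL = "https://example.com"       # site to crawl (placeholder)
CLIENT_HOST = "http://127.0.0.1:7700"    # Meilisearch host (placeholder)
CLIENT_INDEX = "articles"                # Meilisearch index (placeholder)

client = meilisearch.Client(CLIENT_HOST)
index = client.index(CLIENT_INDEX)

# Discover article links on the target site and index their parsed contents.
paper = newspaper.build(SCRAPE_URL, memoize_articles=False)
documents = []
for i, article in enumerate(paper.articles[:20]):  # small limit for the example
    try:
        article.download()
        article.parse()
    except Exception:
        continue  # skip articles that fail to download or parse
    documents.append({
        "id": i,
        "title": article.title,
        "text": article.text,
        "url": article.url,
        "domain": get_fld(article.url),  # first-level domain via tld
    })

index.add_documents(documents)
```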
Install dependencies:

```sh
pip install meilisearch newspaper3k tld
```

Refer to the newspaper3k documentation if you face errors.
The docker-compose.yml file will set up Apache Airflow along with the required dependencies.
Put any Python setup code you have into airflow/envsetup.py. This will run during the build stage.
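As an illustration of what such setup code might do (this is an assumption, not something the project necessarily ships), you could pre-fetch the NLTK data that newspaper3k needs so it is baked into the image at build time:

```python
# airflow/envsetup.py - hypothetical example of build-time setup code
import nltk

# newspaper3k's Article.nlp() relies on NLTK's punkt tokenizer;
# downloading it at build time avoids fetching it on every crawler run.
nltk.download("punkt")
```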
Using the web crawler of your choice, replace `SCRAPE_URL`, the client host URL, and the client index (see the scraper sketch above). Once this is done, run the scraper with `python3`.
