Comp-550 Fall 2023 term group project
This README provides instructions for setting up and running a project that uses Data Version Control (DVC). DVC is an open-source tool for data science and machine learning projects. It allows for tracking and versioning of datasets and machine learning models, making it easier to share and reproduce experiments and analyses.
Before you start, ensure you have the following installed:
- Python (version as per project requirements)
- pip (Python package manager)
- venv (bundled with Python 3) or virtualenv (Python environment management tools)
- DVC (Data Version Control)
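As a quick sanity check before continuing, the prerequisites above can be verified with a small script. This is a minimal sketch, not part of the project: the function name `missing_tools` and the command names in `REQUIRED_TOOLS` are assumptions for illustration (on some systems the interpreter is called `python3`, for example).

```python
import shutil

# Command names assumed for illustration; adjust to your system
# (for example, "python3" instead of "python").
REQUIRED_TOOLS = ["python", "pip", "dvc"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The script only checks that each command is discoverable on PATH; it does not check versions.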
Clone the project from the provided source and navigate to the project directory.
Create a virtual environment:
python -m venv venv
Activate it:
On Windows: venv\Scripts\activate
On macOS and Linux: source venv/bin/activate
Install the dependencies:
pip install -r requirements.txt
To download the NLTK data (into data/nltk_data) and the spaCy English model, run the following in the terminal:
python -m nltk.downloader -d data/nltk_data all
python -m spacy download en_core_web_sm
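Because the NLTK data is downloaded into data/nltk_data rather than one of NLTK's default locations, NLTK has to be told where to look. The project code may already handle this; if it does not, one option is to set the `NLTK_DATA` environment variable, which NLTK reads when it is imported. The sketch below illustrates that approach; the helper name `configure_nltk_data` is hypothetical and not part of the project.

```python
import os
from pathlib import Path

# Directory used by the download command above (relative to the project root).
NLTK_DATA_DIR = Path("data") / "nltk_data"


def configure_nltk_data(project_root="."):
    """Point NLTK at the project-local data directory via NLTK_DATA.

    Call this before importing nltk, because NLTK reads the variable
    when it is first imported. Any existing NLTK_DATA value is replaced.
    """
    data_dir = (Path(project_root) / NLTK_DATA_DIR).resolve()
    os.environ["NLTK_DATA"] = str(data_dir)
    return data_dir


if __name__ == "__main__":
    path = configure_nltk_data()
    print("NLTK_DATA set to", path, "(exists:", path.is_dir(), ")")
```

Setting the same variable in the shell before running the pipeline would have an equivalent effect.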
To run the project, run:
dvc repro
This executes the entire DVC pipeline from start to finish. DVC manages the data processing stages as defined in the dvc.yaml file.
To see the DAG (directed acyclic graph) of the pipeline stages, run:
dvc dag
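If the pipeline needs to be started from Python (for example, from a helper script), the DVC command line can be called through `subprocess`. This is a minimal sketch under that assumption; the helpers `dvc_command` and `run_pipeline` are hypothetical names, and the stage name passed as a target in the usage note is a placeholder, not a stage known to exist in this project's dvc.yaml.

```python
import shutil
import subprocess


def dvc_command(subcommand, *args):
    """Build the argument list for a DVC CLI call, e.g. ["dvc", "repro"]."""
    return ["dvc", subcommand, *args]


def run_pipeline(*targets):
    """Run `dvc repro`, optionally limited to the given stage targets."""
    if shutil.which("dvc") is None:
        raise RuntimeError("dvc was not found on PATH; is the virtual environment active?")
    # check=True raises CalledProcessError if dvc exits with a non-zero status.
    subprocess.run(dvc_command("repro", *targets), check=True)


if __name__ == "__main__":
    try:
        run_pipeline()
    except (RuntimeError, subprocess.CalledProcessError) as exc:
        print("Pipeline did not run:", exc)
```

Calling `run_pipeline("some_stage")` would reproduce only that stage and its dependencies, where `some_stage` stands in for a stage name defined in dvc.yaml.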