This project is an Automated Data Validation Tool that processes and validates data from CSV files. It generates summary and CSV reports based on the validation results. The tool consists of several modules and integrates with a CI/CD pipeline to automate testing and report generation.
- .github/workflows/ci.yml: GitHub Actions configuration for continuous integration and deployment.
- data/sample_data.csv: Sample data file used for validation.
- reports/: Directory where generated reports are saved.
- src/:
  - data_loader.py: Contains the DataLoader class for loading data from CSV files.
  - data_processing_workflow.py: Main script that processes the data and integrates validation and report generation.
  - data_validator.py: Contains the DataValidator class for validating data columns and types.
  - report_generator.py: Contains the ReportGenerator class for generating and saving summary and CSV reports.
- tests/:
  - test_data_loader.py: Tests for the DataLoader class.
  - test_data_validator.py: Tests for the DataValidator class.
  - test_report_generator.py: Tests for the ReportGenerator class.
- Clone the repository:
  git clone https://github.com/Igorth/data-validation-automated
  cd data-validation-automated
- Install dependencies:
  pip install -r requirements.txt
- Prepare Data:
  Place your CSV data file in the data/ directory. The sample file sample_data.csv is provided as an example.
- Run the Data Processing Workflow:
  Execute the main data processing script to load, validate, and generate reports:
  python src/data_processing_workflow.py
  This script will:
  - Load the data from the specified CSV file.
  - Validate the data columns and types.
  - Generate a summary report and a CSV report in the reports/ directory.
- Check Generated Reports:
  After running the script, you can find the generated reports in the reports/ directory. The reports include:
  - summary_report.txt: A summary of the data validation results.
  - data_report.csv: The CSV version of the validated data.
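The load, validate, and report steps above can be sketched end to end as follows. This is a minimal illustration that uses only the standard library, not the project's actual code: the function names `load_rows` and `run_workflow`, the `required_columns` parameter, and the exact contents of the summary file are assumptions made for the example. Only the output file names come from this README.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV file and return (header, rows), with rows as dicts of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def run_workflow(csv_path, reports_dir, required_columns):
    """Hypothetical pipeline: load, check required columns, write both reports."""
    header, rows = load_rows(csv_path)
    errors = [f"missing column: {name}" for name in required_columns if name not in header]

    out_dir = Path(reports_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Summary report: record count followed by any validation errors.
    summary_lines = [f"Total records: {len(rows)}"] + (errors or ["No validation errors"])
    (out_dir / "summary_report.txt").write_text("\n".join(summary_lines) + "\n", encoding="utf-8")

    # CSV report: the loaded rows written back out in the original column order.
    with open(out_dir / "data_report.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=header, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    return errors
```

Under these assumptions, calling `run_workflow("data/sample_data.csv", "reports", ["id", "name"])` would create the reports/ directory if needed and write both files into it. The column names passed here are placeholders.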
To run the tests for the individual modules, use pytest:
pytest tests/test_data_loader.py
pytest tests/test_data_validator.py
pytest tests/test_report_generator.py
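For readers new to pytest, a test in one of these files might look roughly like the sketch below. The `load_csv` function is a stand-in written for this example, not the project's DataLoader API, which is not shown in this README. The `tmp_path` argument is pytest's built-in fixture that supplies a fresh temporary directory.

```python
import csv


def load_csv(path):
    """Stand-in loader (hypothetical): returns the CSV rows as a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def test_load_csv_returns_one_dict_per_row(tmp_path):
    # Write a small CSV file into the temporary directory provided by pytest.
    sample = tmp_path / "sample.csv"
    sample.write_text("id,name\n1,Ann\n2,Bob\n", encoding="utf-8")

    rows = load_csv(sample)

    # csv.DictReader yields every value as a string, keyed by the header row.
    assert len(rows) == 2
    assert rows[0] == {"id": "1", "name": "Ann"}
```

Running `pytest tests/` collects every test file in the directory in a single invocation, which can be more convenient than listing the files one by one.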
The GitHub Actions configuration file .github/workflows/ci.yml sets up a continuous integration pipeline that:
- Runs tests for each push or pull request to the main branch.
- Generates reports using the data_processing_workflow.py script.
- Uploads generated reports as artifacts.
- Validation: The tool checks for required columns, validates data types, and ensures no missing values.
- Reports: Generated reports provide insights into the data validation process, including the total number of records and unique values per column.
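The checks and statistics described in these notes could be implemented along the following lines. This is a sketch under stated assumptions rather than the project's DataValidator or ReportGenerator code: rows are assumed to be dicts of strings (as produced by `csv.DictReader`), and the names `validate_rows`, `summarize`, and `schema` are hypothetical.

```python
def validate_rows(rows, schema):
    """Return a list of error strings; `schema` maps column name to a type constructor."""
    errors = []
    for index, row in enumerate(rows, start=1):
        for column, caster in schema.items():
            # Required-column check.
            if column not in row:
                errors.append(f"row {index}: missing column '{column}'")
                continue
            # Missing-value check: None or blank strings count as missing.
            value = row[column]
            if value is None or str(value).strip() == "":
                errors.append(f"row {index}: missing value in '{column}'")
                continue
            # Type check: the value must be convertible by the constructor.
            try:
                caster(value)
            except (TypeError, ValueError):
                errors.append(f"row {index}: '{column}' is not a valid {caster.__name__}")
    return errors


def summarize(rows):
    """Count total records and the number of distinct values in each column."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    return {
        "total_records": len(rows),
        "unique_values": {column: len(values) for column, values in seen.items()},
    }
```

For example, `validate_rows(rows, {"id": int, "age": int})` would report a row whose `age` field is empty or non-numeric, and `summarize(rows)` would return the record count alongside a per-column count of distinct values.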