The tool can be used as a Docker container.
- Prepare a folder with all CSV files that you would like to clean, e.g.
/path/to/files
- Build the docker image from the root directory of this project:
docker build -t data-cleaning:1.1.0 .
- Run the image with the folder you prepared mounted into the container, and set the TABLES_DIRECTORY environment variable to the mount point inside the container:
docker run -d -p 5000:5000 --name "data_cleaning_tool" -v /path/to/files:/var/data:ro,Z -e TABLES_DIRECTORY=/var/data data-cleaning:1.1.0
You should be able to reach the tool at http://localhost:5000. Note that this deployment is not meant for production use. See the Gunicorn documentation for how to deploy it properly (with SSL, among other things).
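To confirm that the container is actually serving requests, a small check like the following may help. It is only a sketch using the Python standard library; the helper name `is_reachable` is made up for this example and is not part of the tool.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP request to `url` gets any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    # URL taken from the port mapping in the docker run command (-p 5000:5000).
    print(is_reachable("http://localhost:5000"))
```

If this prints `False`, checking the output of `docker logs data_cleaning_tool` is a reasonable next step.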
The backend of the tool is written in Python3 and depends on the following libraries:
Name | URL | Install |
---|---|---|
scikit-learn | https://scikit-learn.org/stable/index.html | pip install scikit-learn |
Flask | https://flask.palletsprojects.com/en/1.1.x/ | pip install Flask |
Jinja2 | https://jinja.palletsprojects.com/en/2.11.x/ | pip install Jinja2 |
Pandas | https://pandas.pydata.org/ | pip install pandas |
Unidecode | https://pypi.org/project/Unidecode/ | pip install Unidecode |
Make sure Python 3 and pip are installed, along with the dependencies in the table above.
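One way to verify the installation is to check that each dependency can be found by the interpreter. The snippet below is a minimal sketch; the function name `missing_modules` is hypothetical, and the list uses the import names of the packages, which can differ from the pip names (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages in the table above.
REQUIRED_MODULES = ["sklearn", "flask", "jinja2", "pandas", "unidecode"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```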
- Go to data_cleaning/tables.txt and insert the absolute paths of the .csv-files that you want to clean. Do not move, rename or delete this file!
- Go to the root folder of this project and execute
python -m data_cleaning.start_server
This will start the server and process the data.
  - To run the program on a specific port, run
  python -m data_cleaning.start_server -p PORTNUMBER
  where PORTNUMBER is any port number you want. By default, the program runs on port 5000.
  - Functional dependencies are disabled by default since version v1.1.0. Add
  --enable-functional-dependencies
  to the command to enable this functionality.
- Go to http://127.0.0.1:5000/ (or the port you chose) to use the client and start your cleaning procedure.
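Filling in data_cleaning/tables.txt by hand gets tedious when there are many files. The sketch below collects the absolute paths of all .csv files in a folder and writes them to a file. It assumes that tables.txt expects one path per line, which is not stated above, so check the existing file before relying on it. The function names `csv_paths` and `write_tables_file` are made up for this example.

```python
from pathlib import Path


def csv_paths(folder):
    """Return sorted absolute paths of the .csv files directly in `folder`."""
    return sorted(str(p.resolve()) for p in Path(folder).glob("*.csv"))


def write_tables_file(folder, tables_file):
    """Write one absolute CSV path per line to `tables_file`.

    The one-path-per-line layout is an assumption about the file format.
    """
    paths = csv_paths(folder)
    Path(tables_file).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths
```

From the root folder of the project, it could be called as `write_tables_file("/path/to/files", "data_cleaning/tables.txt")`. This overwrites the file's contents but does not move or rename the file.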