Matching App
Dependencies
To run this, you need Python 3.10.
Installation
Clone the repository:
git clone https://gitlab-dogen.group.echonet/gf/csr/methodology_data/c2a/matching_app
Then open the repository in your terminal or IDE.
Install with pdm
I recommend using pdm for all Python projects: it makes a project easy to share and ensures everyone installs the same dependencies.
Instructions for installing pdm are in the README_PDM.md file.
pdm install
Install with pip
pip install -r requirements.txt
Running the dashboard
Running with pdm
pdm run dashboard
Running with pip
python -m streamlit run src/page_streamlit.py
TODO
- Support other file formats (xlsx and parquet)
- Add encoding option for csv
- Show more clearly which step is in progress (especially during string cleaning)
- Add new distances (with their options)
- Add all distances available in rapidfuzz
- Add the different functions from these distances
- Show options for JaroWinkler (prefix weight) and Levenshtein (weights)
- Add JaccardModified distance
- Add option to add a column id in addition to the column name
- Add possibility to remove company suffixes
- Add mapping functionality
- Interface to select the correct match directly on the dashboard
- Add comments column to provide info on the matching for audit trail
- Create another sheet for metadata
- Be able to change the output filename
- Try to use polars str function for better performance when cleaning
- Define a process for the case where a mapping file already exists
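
As a starting point for the "remove company suffixes" item, here is a minimal sketch using only the standard library. The suffix list, the `COMPANY_SUFFIXES` name and the `strip_company_suffix` function are hypothetical and not part of the current code base; the real list would need to be adapted to the data being matched.

```python
import re

# Hypothetical suffix list (assumption): extend it to fit your data.
COMPANY_SUFFIXES = ("inc", "ltd", "llc", "sa", "sas", "gmbh", "plc", "corp")

# Matches one trailing suffix preceded by whitespace or a comma,
# with an optional final dot, case-insensitively.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in COMPANY_SUFFIXES) + r")\.?\s*$",
    flags=re.IGNORECASE,
)


def strip_company_suffix(name: str) -> str:
    """Remove one trailing legal-form suffix (e.g. 'Ltd', 'Inc.') from a name."""
    return _SUFFIX_RE.sub("", name).strip()


if __name__ == "__main__":
    for raw in ["Acme Inc.", "Siemens, GmbH", "BNP Paribas", "Visa"]:
        print(repr(raw), "->", repr(strip_company_suffix(raw)))
```

The pattern only strips a suffix that forms a separate trailing token, so names such as "Visa" are left untouched. Dotted forms such as "S.A." are not handled by this sketch.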