Matching App

Dependencies

To run this, you need Python 3.10.

Installation

Clone the repository:

git clone https://gitlab-dogen.group.echonet/gf/csr/methodology_data/c2a/matching_app

Then open the repository in your terminal or IDE.

Install with pdm

I recommend using pdm for all Python projects. It makes a project easier to share and ensures that everyone installs the same dependencies.

Instructions for installing pdm are in the README_PDM.md file. Once pdm is installed, install the project dependencies:

pdm install

Install with pip

pip install -r requirements.txt

Running the dashboard

Running with pdm

pdm run dashboard

Running with pip

python -m streamlit run src/page_streamlit.py

TODO

  • Support other file formats (xlsx and parquet)
  • Add encoding option for csv
  • Indicate the ongoing process more clearly (especially during string cleaning)
  • Add new distances (with their options)
    • Add all distances available in rapidfuzz
    • Add the different functions from these distances
    • Show options for JaroWinkler (prefix weight) and Levenshtein (weights)
  • Add JaccardModified distance
  • Add option to add a column id in addition to the column name
  • Add possibility to remove company suffixes
  • Add mapping functionality
  • Interface to select the correct match directly on the dashboard
    • Add comments column to provide info on the matching for audit trail
  • Create another sheet for metadata
  • Be able to change the output filename
  • Try polars str functions for better performance when cleaning strings
  • Define a process for the case where a mapping file already exists
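
As a starting point for the "remove company suffixes" item above, here is a minimal sketch using only the standard library. It is not part of the app: the function name remove_company_suffix and the COMPANY_SUFFIXES list are hypothetical, and the list is deliberately short and would need to be extended for real data.

```python
import re

# Hypothetical, non-exhaustive list of legal-form suffixes; extend as needed.
COMPANY_SUFFIXES = ("inc", "ltd", "llc", "plc", "corp", "gmbh", "sa", "sas", "ag", "nv", "bv")

# Matches one trailing suffix, preceded by whitespace or a comma,
# optionally followed by a dot.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(COMPANY_SUFFIXES) + r")\.?\s*$",
    flags=re.IGNORECASE,
)


def remove_company_suffix(name: str) -> str:
    """Strip trailing legal-form suffixes such as 'Ltd' or 'Inc.' from a name."""
    cleaned = name.strip()
    # Repeat until nothing changes, so stacked suffixes ("Corp. Ltd") are all removed.
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped


if __name__ == "__main__":
    for raw in ("Acme Inc.", "Acme Holdings, Ltd", "BNP Paribas SA", "Salsa"):
        print(f"{raw!r} -> {remove_company_suffix(raw)!r}")
```

The suffix must be preceded by whitespace or a comma, so a name that merely ends in the same letters (for example "Salsa") is left untouched. Dotted forms such as "S.A." are not handled by this sketch.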