SimilarIt will sort entities by their textual properties similarity

Based on Pandas, Gensim with an Hebrew twist, originally developed for OpenBudget needs for quick entity comparison.

Notice: Currently, comparison will consider ONLY textual properties. Roadmap includes a planned step to consider quantitative variables as well.

Installation

  • pip install -r requirements.txt
  • verify correct settings in the sources.json file (you might prefer to create a new sources.json file, make sure to point it from settings.py)

Usage

  • define on settings.py what kind of entity you would like to explore. Notice it must be one of the entities defined in the sources.json file.
  • `python3 run.py <entity_id>``

Steps and modules (see comments in each of the modules files for further information)

  1. settings.py: to define sources file path, define which entity you would like to compare, how many results (similar entities) would you like to get.
  2. data_loader: load data packages and transform into pandas Dataframe
  3. tokenization: narrow and unify textual elements
  4. tfidf: similarity calculation based on term frequencies and cosinus similarity