Fast, flexible name matching for large datasets

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

NAMA: The NAme MAtching tool

Installation

Recommended install via pip

  1. Create a virtual environment (optional).
  2. Install nama: pip install git+https://github.com/bradhackinen/nama.git@master

Install from source with conda

  1. Install Anaconda.
  2. Clone nama:
     git clone https://github.com/bradhackinen/nama.git
  3. Enter the conda directory, which contains the conda environment file:
     cd conda
  4. Create a new conda environment:
     conda create --name <env-name>
  5. Activate the new environment:
     conda activate <env-name>
  6. Download and install pytorch-mutex:
     conda install pytorch-mutex-1.0-cuda.tar.bz2
  7. Download and install pytorch:
     conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2
  8. Install the rest of the dependencies:
     conda install --file conda_env.txt
  9. Exit the conda directory:
     cd ..
  10. Install the package:
     pip install .

Installing from source with pip

  1. Clone nama: git clone https://github.com/bradhackinen/nama.git
  2. Create and activate a virtual environment: python -m venv nama_env && source nama_env/bin/activate
  3. Install dependencies: pip install -r requirements.txt
  4. Install the package: pip install ./nama
  • From the project root directory: pip install .
  • From another directory: pip install /path-to-project-root
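
Whichever install route is used, it can help to confirm that Python can actually find the package in the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of nama.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether the active environment can see the nama package.
print("nama installed:", is_installed("nama"))
```

If this prints False, the most likely cause is that the virtual or conda environment used for installation is not the one currently active.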

Demo

Usage

Using the Matcher()

Importing data

To import data into the matcher, we can either pass nama a pandas DataFrame:

import nama

training_data = nama.from_df(
    df,
    group_column='group_column',
    string_column='string_column')
print(training_data)

or we can pass nama a .csv file directly

import nama

testing_data = nama.read_csv(
    'path-to-data',
    match_format=match_format,
    group_column=group_column,
    string_column=string_column)
print(testing_data)

See from_df and read_csv for parameters and function details.
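
For trying out the import calls above, a small input file is useful. The sketch below writes a toy CSV with one row per name string and a label for the group it belongs to. This layout, the column names `group_column` and `string_column`, and the example company names are all assumptions made for illustration; the layout nama expects depends on the match_format argument, so check the read_csv documentation before relying on it.

```python
import csv
import os
import tempfile

# Hypothetical toy data: each row is one name string plus the group it belongs to.
rows = [
    {"group_column": "acme", "string_column": "ACME Corp"},
    {"group_column": "acme", "string_column": "Acme Corporation"},
    {"group_column": "globex", "string_column": "Globex Inc."},
]

# Write the rows to a temporary CSV file with a header line.
csv_path = os.path.join(tempfile.mkdtemp(), "names.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["group_column", "string_column"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm its contents.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(csv_path)
print(loaded[0])
```

The resulting path could then be tried as the first argument to nama.read_csv, or the file could be loaded with pandas and passed to nama.from_df.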

Using the EmbeddingSimilarityModel()

Initialization

We can initialize a model like so:

from nama.embedding_similarity import EmbeddingSimilarityModel

sim = EmbeddingSimilarityModel()

If using a GPU, we need to send the model to the GPU device:

sim.to(gpu_device)
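
The snippet above does not show where gpu_device comes from. One common way to pick a device is sketched below. It assumes PyTorch is installed and that sim.to accepts the same device arguments as a PyTorch module's to method, which is not confirmed here; the helper name `choose_device_name` is hypothetical.

```python
def choose_device_name() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to query, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


gpu_device = choose_device_name()
print("Selected device:", gpu_device)

# Assumption: sim is the EmbeddingSimilarityModel created earlier and
# accepts a device string in the same way a torch.nn.Module does.
# sim.to(gpu_device)
```

Falling back to "cpu" keeps the same script usable on machines without a GPU, at the cost of slower training and embedding.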

Training

To train a model, we simply need to specify the training parameters and training data

train_kwargs = {
    'max_epochs': 1,
    'warmup_frac': 0.2,
    'transformer_lr': 1e-5,
    'score_lr': 30,
    'use_counts': False,
    'batch_size': 8,
    'early_stopping': False,
}

history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)
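
To get a feel for how max_epochs, batch_size, and warmup_frac interact, the sketch below works through the arithmetic under a common convention: the warmup phase covers a fixed fraction of all optimizer steps. Whether nama computes warmup this way is an assumption, and the dataset size of 1000 examples and the helper name `warmup_schedule` are hypothetical.

```python
import math


def warmup_schedule(n_examples: int, batch_size: int, max_epochs: int, warmup_frac: float):
    """Return (warmup_steps, total_steps) assuming warmup is a fraction of all steps."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * max_epochs
    warmup_steps = round(total_steps * warmup_frac)
    return warmup_steps, total_steps


# Same batch_size, max_epochs, and warmup_frac as train_kwargs; n_examples is made up.
warmup, total = warmup_schedule(n_examples=1000, batch_size=8, max_epochs=1, warmup_frac=0.2)
print(f"{warmup} warmup steps out of {total} total steps")
```

Under these assumptions, 1000 examples at a batch size of 8 give 125 steps in the single epoch, of which the first 25 would be warmup.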

We can also save the trained model for later

sim.save("path-to-save-model")

Testing

We can use the model we trained above directly:

embeddings = sim.embed(testing_data)

Or load a previously trained model

from nama.embedding_similarity import load_similarity_model

new_sim = load_similarity_model("path-to-saved-model")
embeddings = new_sim.embed(testing_data)

MORE TO COME