Algorithms for "schema matching"

Primary language: Python. License: Apache License 2.0.

wit

Algorithms for string classification and string embeddings using 'weak' supervision, with eventual application to 'schema alignment'.

NB: This package is in the middle of an API redefinition and simplification. The master branch is functional, but keep an eye out for changes. Ongoing work is being done on the api-v3 branch.

Method Overview

For schema alignment, the basic idea is to:

  • learn an embedding of strings into dense N-dimensional vectors such that instances of the same variable are closer to each other than to instances of other variables (using recurrent neural networks)
  • align variables whose embedded distributions are "close" (by solving an assignment problem)
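
The alignment step can be sketched as follows. This is a minimal illustration, not the code in this repo: it assumes each variable is summarized by the centroid of its embedded strings (a simplification of comparing full distributions), and it brute-forces the assignment, which is only practical for a handful of variables. The names `centroid` and `align`, and the toy 2-d vectors, are made up for illustration.

```python
from itertools import permutations
from math import dist


def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def align(source, target):
    """Match each source variable to a distinct target variable.

    `source` and `target` map variable names to lists of embedded strings.
    Assumes len(source) <= len(target). Tries every one-to-one assignment
    and keeps the one with the smallest total centroid distance.
    """
    src_names, tgt_names = list(source), list(target)
    src_c = {name: centroid(vecs) for name, vecs in source.items()}
    tgt_c = {name: centroid(vecs) for name, vecs in target.items()}

    def total_cost(perm):
        return sum(dist(src_c[s], tgt_c[t]) for s, t in zip(src_names, perm))

    best = min(permutations(tgt_names, len(src_names)), key=total_cost)
    return dict(zip(src_names, best))


# Toy 2-d "embeddings" (hypothetical values, not produced by a real model)
forum_a = {
    "username": [[0.9, 0.1], [1.0, 0.0]],
    "post_id": [[0.0, 1.0], [0.1, 0.9]],
}
forum_b = {
    "id": [[0.05, 0.95], [0.0, 1.1]],
    "handle": [[1.1, 0.1], [0.95, 0.05]],
}

mapping = align(forum_a, forum_b)
print(mapping)
```

For more than a few variables, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the pairwise distance matrix would be the usual choice.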

Notes

Here are two ways that we could think about similarity of strings:

  • syntactic : strings are similar because they have similar structure

    • usernames : 'ben46' is close to 'frank123'
    • subject_line : 'Re: good morning' is close to 'Re: circling back'
  • semantic : strings are similar because of extrinsic information about the world

    • date : '2016-01-01' is close to 'Jan 1st 2016'
    • country : 'AR' is close to 'Argentina'
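
To make the syntactic notion concrete, here is a small hand-written sketch. It is not the learned embedding this package uses: it simply collapses each string to a character-class "shape" and compares shapes with `difflib.SequenceMatcher` from the standard library. The function names `shape` and `syntactic_similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def shape(s):
    """Collapse a string to its character-class pattern.

    Digits become '9', lowercase letters 'a', uppercase letters 'A';
    everything else (punctuation, spaces) is kept as-is.
    """
    out = []
    for ch in s:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def syntactic_similarity(a, b):
    """Similarity in [0, 1] between the shapes of two strings."""
    return SequenceMatcher(None, shape(a), shape(b)).ratio()


print(shape("ben46"), shape("frank123"))  # aaa99 aaaaa999
print(syntactic_similarity("ben46", "frank123"))
print(syntactic_similarity("ben46", "2016-01-01"))
```

A shape-based score like this rates the two usernames as more similar to each other than a username is to a date, but it cannot capture the semantic cases: 'AR' and 'Argentina' have quite different shapes, and relating them needs outside knowledge.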

Here are two ways we could think about similarity of sets of strings:

  • distributional : sets have similar distributions

    • forum post_id : (near?) unique key
    • forum username : may follow similar distributions across domains
  • relational : sets have similar relationships to other sets of strings

    • the relationship (e.g. mutual information) between post_id and username may be similar across domains
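
Both set-level notions can be illustrated with simple statistics. The sketch below is an assumption about what such features might look like, not code from this repo: `uniqueness` is a crude distributional feature (1.0 means the column behaves like a unique key), and `mutual_information` is the standard empirical estimate, in bits, between two paired columns. The toy forum data is made up.

```python
from collections import Counter
from math import log2


def uniqueness(values):
    """Fraction of distinct values; 1.0 means every value is unique."""
    return len(set(values)) / len(values)


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two paired columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical forum table: four posts written by two users
post_ids = [1, 2, 3, 4]
usernames = ["ben46", "ben46", "frank123", "frank123"]

print(uniqueness(post_ids))    # 1.0
print(uniqueness(usernames))   # 0.5
print(mutual_information(post_ids, usernames))
```

Two columns from different forums could then be compared on such features, for example by checking whether both look like unique keys, or whether the mutual information between candidate (post_id, username) pairs is similar in each domain.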

Software

Prototype code for calculating syntactic and semantic similarity is included in this repo.

Scripts

  • wit/examples/string-example.py shows how to build a string classifier (i.e. semantic)
  • wit/examples/simple-embedding-example.py shows how to use the triplet loss function to learn a string embedding (i.e. syntactic)
  • wit/examples/simple-alignment-example.py shows how to split and re-align a simple dataset
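
For readers unfamiliar with the triplet loss mentioned above, here is the generic definition written out in plain Python. It is a sketch of the standard formulation, not an excerpt from the example script; the `margin` default and the toy vectors are assumptions, and some formulations use squared distances instead.

```python
from math import dist


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    The loss is zero once the anchor is closer to the positive (same
    variable) than to the negative (different variable) by at least
    `margin`; otherwise it grows with the size of the violation.
    """
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


# Hypothetical 2-d embeddings of three strings
anchor = [0.0, 0.0]      # e.g. a username
positive = [0.0, 1.0]    # another username, distance 1 from the anchor
negative = [3.0, 4.0]    # a string from a different variable, distance 5

print(triplet_loss(anchor, positive, negative))  # 0.0
print(triplet_loss(anchor, negative, positive))  # 5.0
```

Minimizing this loss over many (anchor, positive, negative) triplets pushes instances of the same variable together and instances of different variables apart, which is the property the embedding step relies on.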

Notebooks

  • wit/notebooks/address-matching.ipynb -- trying to learn a good metric for addresses
  • wit/notebooks/simple-forum-notebook.py -- aligning schemas of multiple forums at once

More

See https://github.com/gophronesis/census-schema-alignment for some more concrete examples, developed during the January 2016 XDATA census hackathon.