scad-tool

Code and resources for the Author Name Disambiguation (AND) tool developed in the Scalable Author Disambiguation (SCAD) project.

Currently under construction!

Setup

$ mkdir scad
$ cd scad
$ git clone https://github.com/nlpAThits/scad-tool.git
$ cd scad-tool
$ conda create --name scad-env python=3.6
$ source activate scad-env
$ pip install -r scad-requirements-wo-wombat.txt
$ git clone https://github.com/nlpAThits/WOMBAT.git
$ pip install WOMBAT/.
$ git clone https://github.com/conll/reference-coreference-scorers.git
$ unzip 'resources/wombat/*.zip' -d resources/wombat/

Starting the SCAD server

The following will start an instance of the SCAD server on the local machine on port 50001.

$ python scad-server/app.py localhost 50001 &

The trailing & runs the server in the background. The shell prints the PID of the server process, which can be used in

$ kill PID

to stop the server.
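
For scripted runs, the same start/stop cycle can be driven from Python instead of the shell. The following is a minimal sketch using only the standard library; the helper names server_command, start_background and stop_background are hypothetical and not part of this project.

```python
import subprocess
import sys


def server_command(host, port):
    """Build the argument list for the server call shown in this README."""
    return [sys.executable, "scad-server/app.py", host, str(port)]


def start_background(cmd):
    """Launch cmd without waiting for it, like a trailing & in the shell."""
    return subprocess.Popen(cmd)


def stop_background(proc, timeout=10):
    """Send SIGTERM (the signal plain `kill PID` sends) and wait for exit."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate to SIGKILL if the process ignores SIGTERM.
        proc.kill()
        proc.wait()
    return proc.returncode


# Example usage (not executed here; run from the scad-tool directory):
#   server = start_background(server_command("localhost", 50001))
#   print(server.pid)        # the PID the shell would have printed
#   stop_background(server)
```

Keeping the Popen object around avoids having to record the PID by hand, and waiting after terminate() ensures the port is released before a new server instance is started.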

Starting the demo SCAD client

This project includes a simple Python client which reads publications from a JSON file and disambiguates them by making API calls against the SCAD server. The following will process the publications belonging to the block "a smith" from the KISTI corpus, using the semantic matching method avg_of_cos with a dblp-trained word2vec resource (cf. below):

$ python scad-client/run_simple_scad_client.py \
   --scad_url             http://localhost:50001 \
   --pubfile              publicationdata/full-kisti-plain-sng-sorted.json \
   --blocking_pattern     "'name': '(a[^\']* smith)'" \
   --name_matching_method match:shortname \
   --paramfile            resources/scad_params.json \
   --resourcefile         resources/scad_resources.json \
   --evaluate

The matching methods to use are specified in resources/scad_params.json.
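
To see which author names the --blocking_pattern value in the command above selects, the regular expression can be tried out on its own. The sketch below assumes that the pattern is applied with a regex search to a text form of each publication record in which the name appears as 'name': '...'. How the client applies the pattern internally is not documented here, and the sample records and the helper block_key are made up for illustration.

```python
import re

# The pattern exactly as it reaches the client after shell quoting.
BLOCKING_PATTERN = r"'name': '(a[^\']* smith)'"


def block_key(record_text):
    """Return the captured author name, or None if the record is outside the block."""
    match = re.search(BLOCKING_PATTERN, record_text)
    return match.group(1) if match else None


# Hypothetical record snippets, for illustration only.
samples = [
    "{'name': 'a smith', 'title': 'some title'}",
    "{'name': 'adam smith', 'title': 'some title'}",
    "{'name': 'b smith', 'title': 'some title'}",
    "{'name': 'a smithson', 'title': 'some title'}",
]

for sample in samples:
    print(block_key(sample), "<-", sample)
```

The character class [^\'] matches anything except a single quote, so the match cannot run past the end of the name value. Names must start with "a" and end in " smith", which keeps "a smith" and "adam smith" and excludes "b smith" and "a smithson".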

Visualized example results are available at https://nlpathits.github.io/scad-tool/ (use 'Open in new tab/window' to view them).