/entity-fishing-client-python

Repository hosting the common code for the entity-fishing clients

Primary LanguagePythonApache License 2.0Apache-2.0

HIRMEOS Entity Fishing client

https://travis-ci.org/hirmeos/entity-fishing-client-python.svg?branch=master

Python client to query the Entity Fishing service API developed in the context of the EU H2020 HIRMEOS project (WP3). For more information about entity-fishing, please check the Entity Fishing Documentation.

Installation

The client can be installed using pip:

pip install entity-fishing-client

Usage

Disambiguation

from nerd import nerd_client
client = nerd_client.NerdClient()

To disambiguate text (> 5 words):

client.disambiguate_text(
    "Linux is a name that broadly denotes a family of free and open-source software operating systems (OS) built around the Linux kernel."
)

To disambiguate a search query

client.disambiguate_query(
    "python method acronym concrete"
)

To process a PDF:

client.disambiguate_pdf(pdfFile, language)

you can supply the language (iso form of two digits, en, fr, etc..) and the entities (only for text) you already know, using the format:

{
    'offsetEnd': 5,
    'offsetStart': 0,
    'rawName': 'Linux',
    'wikidataId': 'Q388',
    'wikipediaExternalRef': 6097297
}

The response is a tuple where the first element is the response body as a dictionary and the second element the error code. Here an example:

 (
     {'entities': [
         {
             'domains': ['Computer_Science'],
             'nerd_score': 0.3753,
             'nerd_selection_score': 0.7268,
             'offsetEnd': 5,
             'offsetStart': 0,
             'rawName': 'Linux',
             'type': 'PERSON',
             'wikidataId': 'Q388',
             'wikipediaExternalRef': 6097297
         },
         {
             'domains': ['Computer_Science'],
             'nerd_score': 0.7442,
             'nerd_selection_score': 0.85,
             'offsetEnd': 78,
             'offsetStart': 49,
             'rawName': 'free and open-source software',
             'wikidataId': 'Q506883',
             'wikipediaExternalRef': 1721496
         },
         {
             'domains': ['Electrotechnology', 'Electronics',
             'Computer_Science'],
             'nerd_score': 0.7442,
             'nerd_selection_score': 0.4487,
             'offsetEnd': 96,
             'offsetStart': 79,
             'rawName': 'operating systems',
             'wikidataId': 'Q9135',
             'wikipediaExternalRef': 22194
         },
         {
             'domains': [
                 'Electrotechnology', 'Electronics', 'Computer_Science'
             ],
             'nerd_score': 0.7442,
             'nerd_selection_score': 0.4487,
             'offsetEnd': 100,
             'offsetStart': 98,
             'rawName': 'operating systems',
             'wikidataId': 'Q9135',
             'wikipediaExternalRef': 22194
         },
         {
             'domains': ['Electronics', 'Computer_Science'],
             'nerd_score': 0.743,
             'nerd_selection_score': 0.8383,
             'offsetEnd': 131,
             'offsetStart': 119,
             'rawName': 'Linux kernel',
             'wikidataId': 'Q14579',
             'wikipediaExternalRef': 21347315
         }
     ],
     'global_categories': [
         {'category': 'Finnish inventions',
         'page_id': 27421536,
         'source': 'wikipedia-en',
         'weight': 0.09684039970133569},
        {'category': 'Free software programmed in C',
         'page_id': 11241711,
         'source': 'wikipedia-en',
         'weight': 0.06433942787438053},
        {'category': 'Unix variants',
         'page_id': 10429397,
         'source': 'wikipedia-en',
         'weight': 0.09684039970133569},
        {'category': 'Operating systems',
         'page_id': 693664,
         'source': 'wikipedia-en',
         'weight': 0.12888888710813473},
        {'category': 'Free software',
         'page_id': 693287,
         'source': 'wikipedia-en',
         'weight': 0.06444444355406737},
        {'category': 'Free system software',
         'page_id': 6721544,
         'source': 'wikipedia-en',
         'weight': 0.06433942787438053},
        {'category': 'Software licenses',
         'page_id': 703100,
         'source': 'wikipedia-en',
         'weight': 0.06444444355406737},
        {'category': 'Linux kernel',
         'page_id': 13215678,
         'source': 'wikipedia-en',
         'weight': 0.06433942787438053},
        {'category': 'Monolithic kernels',
         'page_id': 10730969,
         'source': 'wikipedia-en',
         'weight': 0.06433942787438053},
        {'category': '1991 software',
         'page_id': 11167446,
         'source': 'wikipedia-en',
         'weight': 0.09684039970133569},
        {'category': 'Linus Torvalds',
         'page_id': 53479567,
         'source': 'wikipedia-en',
         'weight': 0.09684039970133569}
     ],
     'language': {'conf': 0.9999973266294648, 'lang': 'en'},
     'nbest': False,
     'onlyNER': False,
     'runtime': 107,
     'sentences': [{'offsetEnd': 132, 'offsetStart': 0}],
     'text': 'Linux is a name that broadly denotes a family of free and open-source software operating systems (OS) built around the Linux kernel.'
     },
     200
)

Batch processing

The batch processing is implemented in the class NerdBatch. The class can be instantiated by defining the entity-fishing url in the constructor, else the default one is used.

To run the processing, the method process requires the input directory, a callback and the number of threads/processes. There is an already ready implementation in script/batchSample.py.

To run it:

  • under this work branch, prepare two folders: input which containing the input Pdf files to be processed and output which collecting the processing result
  • we recommend to create a new virtualenv, activate it and install all the requirements needed in this virtual environment using $ pip install -r /path/of/entity-fishing-client-python/source/requirements.txt
  • (temporarly, until this branch is not merged) install entity-fishing multithread branch in edit mode (pip install -e /path/of/entity-fishing-client-python/source)
  • run it with python runFile.py input output 5

KB access

nerd.get_concept("Q456")

with response

(
   {
     'rawName': 'Lyon',
     'preferredTerm': 'Lyon',
     'nerd_score': 0,
     'nerd_selection_score': 0,
     'wikipediaExternalRef': 8638634,
     'wikidataId': 'Q456',
     'definitions': [
       {
         'definition': "'''Lyon''' ( or ;, locally: ; ), also known as ''Lyons'', is a city in east-central [[France]], in the [[Auvergne-Rhône-Alpes]] [[Regions of France|region]], about from [[Paris]], from [[Marseille]] and from [[Saint-Étienne]]. Inhabitants of the city are called ''Lyonnais''.",
         'source': 'wikipedia-en',
         'lang': 'en'
       }
     ],
     'domains': [
       'Geology',
       'Sociology'
     ],
     'categories': [
       {
         'source': 'wikipedia-en',
         'category': 'World Heritage Sites in France',
         'page_id': 1178961
       },
       [...]
     ],
     'multilingual': [
       {
         'lang': 'de',
         'term': 'Lyon',
         'page_id': 13964
       },
       {
         'lang': 'es',
         'term': 'Lyon',
         'page_id': 46490
       },
       {
         'lang': 'fr',
         'term': 'Lyon',
         'page_id': 802627
       },
       {
         'lang': 'it',
         'term': 'Lione',
         'page_id': 41786
       }
     ],
     'statements': [
       {
         'conceptId': 'Q456',
         'propertyId': 'P1082',
         'propertyName': 'population',
         'valueType': 'quantity',
         'value': {
           'amount': '+500716',
           'unit': '1',
           'upperBound': '+500717',
           'lowerBound': '+500715'
         }
       },
       {
         'conceptId': 'Q456',
         'propertyId': 'P1082',
         'propertyName': 'population',
         'valueType': 'quantity',
         'value': {
           'amount': '+500716',
           'unit': '1',
           'upperBound': '+500717',
           'lowerBound': '+500715'
         }
       },
       {
         'conceptId': 'Q456',
         'propertyId': 'P1464',
         'propertyName': 'category for people born here',
         'valueType': 'wikibase-item',
         'value': 'Q8061504'
       },
       {
         'conceptId': 'Q456',
         'propertyId': 'P190',
         'propertyName': 'sister city',
         'valueType': 'wikibase-item',
         'value': 'Q5687',
         'valueName': 'Jericho'
       },
       {
         'conceptId': 'Q456',
         'propertyId': 'P190',
         'propertyName': 'sister city',
         'valueType': 'wikibase-item',
         'value': 'Q2079',
         'valueName': 'Leipzig'
       },
       {
         'conceptId': 'Q456',
         'propertyId': 'P190',
         'propertyName': 'sister city',
         'valueType': 'wikibase-item',
         'value': 'Q580',
         'valueName': 'Łódź'
       },
       [...]
     ]
   },
   200
)

Utilities

Language detection

nerd.get_language("This is a sentence. This is a second sentence.")

with response:

(
   {
      'sentences':
      [
         {'offsetStart': 0, 'offsetEnd': 19},
         {'offsetStart': 19, 'offsetEnd': 46}
      ]
   },
   200
)

Segmentation

nerd.segment("This is a sentence. This is a second sentence.")

with response:

(
    {
        "lang": "en",
        "conf": 0.9
    },
    200
)