/Template_Extraction_NLP

This project performs template extraction from documents using NLP techniques.

CS 6320 – Natural Language Processing [Spring 2020]

Problem Description

The goal is to represent entities of interest in a text, and the relations between them, by identifying and extracting templates from a text corpus with Information Extraction (IE) techniques. Information Extraction is useful in real-world applications such as question answering systems, contact information search, and removal of noisy data. However, the complexity of natural language can make it very difficult to access information in unstructured text. Information Extraction aims to recover knowledge such as entity relations by answering questions like "Is this location part of another location?" or "Who is employed by which company?".

This project extracts templates for Work, Buy, and Part Of. The Work template describes a person employed by a company in a given position. The Buy template captures a transaction between a source (seller) and a buyer. The Part Of template captures a containment relationship between two locations. Our focus is to leverage NLP features and techniques to achieve this goal.

Proposed Solution

Information Extraction involves several sub-tasks: identifying the type of text corpus, segmenting the corpus, tokenizing sentences into words, lemmatizing the words, Part-of-Speech (POS) tagging, extracting hypernyms and hyponyms, and other such NLP tasks.

Information Extraction is performed on Wikipedia articles. Because the articles are large and complex, it is hard to run natural language processing tasks on them directly, which makes sentence segmentation necessary.

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics; sentence segmentation divides it into sentences.

The next step is to make each segmented sentence meaningful on its own, which is done by resolving the pronouns in the sentences using coreference resolution.

Coreference resolution is the task of finding all expressions that refer to the same entity in a text.

Coreference resolution is performed using the AllenNLP library. Because the library is unable to handle large chunks of text, we split the input into batches to obtain the resolution.
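The batching step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `max_chars` budget, the function names, and the `resolve` callable (standing in for whatever coreference call is used) are all assumptions.

```python
from typing import Callable, List


def batch_sentences(sentences: List[str], max_chars: int = 2000) -> List[str]:
    """Group consecutive sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for sent in sentences:
        # Start a new batch when adding this sentence would exceed the budget.
        if current and size + len(sent) + 1 > max_chars:
            batches.append(" ".join(current))
            current, size = [], 0
        current.append(sent)
        size += len(sent) + 1
    if current:
        batches.append(" ".join(current))
    return batches


def resolve_in_batches(sentences: List[str],
                       resolve: Callable[[str], str],
                       max_chars: int = 2000) -> List[str]:
    """Apply a (hypothetical) coreference resolver to each batch in turn."""
    return [resolve(batch) for batch in batch_sentences(sentences, max_chars)]
```

Batching on sentence boundaries keeps each sentence intact, at the cost of losing coreference links that cross a batch boundary.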

To recognize the entities in a sentence, the Named Entity Recognizer of spaCy is used.

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Buy Template

In this template, we extract the buyer, item, price, quantity, and source. Sentences for this template are triggered by nouns and verbs. There are many variations in sentence structure for this template. Some of them are as follows:

  1. Buyer bought Item from Source for Price.
  2. Source sold Item to Buyer for Price.

The above examples are verb-triggered sentences.

Here, the Buyer and Source can each be a person or an organization. The Item can be an organization or a product. The Named Entity Recognizer is used to recognize these noun entities as buyer, source, and item.

To discover the relationship between the entities, we need to know the semantic meaning of the sentence. Sentences triggering the Buy template have roles like agent, experiencer, benefactive, etc. To detect these roles in a sentence, Semantic Role Labeling (SRL) is used.

In natural language processing, semantic role labeling is the process that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.

SRL is used only when a sentence contains a verb. Each verb sense has numbered arguments such as Arg0, Arg1, and Arg2, along with modifiers for location, direction, manner, time, etc. In our sentences, Arg0 contains the buyer, which acts as the agent; Arg1 describes the item in the transaction; and Arg2 contains the source, which acts as the beneficiary.

Heuristics were applied to identify the right entities in the sentence based on these arguments.
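One way to picture the argument-to-slot mapping is the sketch below. It assumes the SRL output has already been reduced to a verb lemma plus a dictionary of argument texts. The verb lists and the role swap for selling verbs are illustrative assumptions, not the project's actual heuristics. Price and quantity are left out because the text above does not say which arguments carry them.

```python
from typing import Dict, Optional

# Assumption: for selling verbs the seller is ARG0 and the buyer is ARG2,
# the mirror image of the buying frame described above.
BUY_VERBS = {"buy", "acquire"}
SELL_VERBS = {"sell"}


def frame_to_buy_template(verb_lemma: str,
                          args: Dict[str, str]) -> Optional[Dict[str, Optional[str]]]:
    """Map one SRL frame to Buy template slots, or None if the verb is no trigger."""
    if verb_lemma in BUY_VERBS:
        buyer, source = args.get("ARG0"), args.get("ARG2")
    elif verb_lemma in SELL_VERBS:
        buyer, source = args.get("ARG2"), args.get("ARG0")
    else:
        return None
    return {"buyer": buyer, "item": args.get("ARG1"), "source": source}


# Hypothetical frame for "Buyer bought Item from Source".
print(frame_to_buy_template("buy", {"ARG0": "Buyer", "ARG1": "Item", "ARG2": "Source"}))
```

In practice the argument texts would still need to be checked against the NER output, since an argument span can contain more than the entity itself.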

Work Template

In this template, we need to extract a person, their position, organization, and location. Work template sentences are triggered by a verb or by a word with the lemma "be". For example:

  1. PERSON, POSITION of ORG
  2. PERSON (worked|appointed|etc) [prep] POSITION at ORG
  3. PERSON [lemma = be]? POSITION [prep] ORG

The above patterns are expressed as a rule-based grammar. Phrase structure plays a vital role here, especially noun phrases. To detect the noun phrases, we used a constituency parser.
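As an illustration of how such a rule can be checked, the sketch below matches only the first pattern (PERSON, POSITION of ORG). It assumes each entity has already been merged into a single token and labelled. The `Token` shape, the `POSITION` label, and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (text, entity label or None for plain words)


def match_person_position_of_org(tokens: List[Token]) -> Optional[Dict[str, str]]:
    """Return the first match of the pattern: PERSON , POSITION of ORG."""
    for i in range(len(tokens) - 4):
        person, comma, position, prep, org = tokens[i:i + 5]
        if (person[1] == "PERSON" and comma[0] == ","
                and position[1] == "POSITION" and prep[0].lower() == "of"
                and org[1] == "ORG"):
            return {"person": person[0],
                    "position": position[0],
                    "organization": org[0]}
    return None


sentence = [("Tim Cook", "PERSON"), (",", None), ("CEO", "POSITION"),
            ("of", None), ("Apple", "ORG"), ("spoke", None)]
print(match_person_position_of_org(sentence))
```

The other two patterns would follow the same windowed matching idea, with the verb or the "be" lemma checked in place of the comma.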

A constituency parse tree displays the syntactic structure of a sentence using a context-free grammar.

The main reason for using a constituency grammar was to obtain additional phrase-level information and to see the entire sentence structure rather than just the grammatical dependencies. Using constituency parsing, we obtain noun phrases from which we extract the required entities.
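To make the noun-phrase step concrete, here is a small dependency-free sketch that reads a bracketed parse string and collects every NP. It assumes the parser output is available in Penn Treebank bracket notation. The helper names are hypothetical and the example tree is hand-written, not parser output.

```python
import re
from typing import List, Tuple

Tree = Tuple[str, list]  # (label, children); a child is a Tree or a word string


def parse_bracketed(text: str) -> Tree:
    """Parse a bracketed tree such as "(NP (DT the) (NN cat))"."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(tree) -> List[str]:
    """Words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


def noun_phrases(tree) -> List[str]:
    """All NP constituents in pre-order, nested ones included."""
    if isinstance(tree, str):
        return []
    found = [" ".join(leaves(tree))] if tree[0] == "NP" else []
    for child in tree[1]:
        found.extend(noun_phrases(child))
    return found


parse = ("(S (NP (NNP Tim) (NNP Cook)) "
         "(VP (VBZ is) (NP (NP (NN CEO)) (PP (IN of) (NP (NNP Apple))))))")
print(noun_phrases(parse_bracketed(parse)))
```

Nested noun phrases are returned as well as the outer ones, so a later step has to decide which level to match against the PERSON, POSITION, and ORG entities.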

Part Of Template

In this template, we need to identify two locations that have a PART OF relationship. The Named Entity Recognizer is used to find locations in the sentence.

Here, a dependency parser is used to check the connection between the two locations. If the sentence is triggered by a verb, the relation is obtained by looking at the subtrees of the verb.

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
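The subtree check can be sketched without any parser library by representing a parse as a list of head indices, where the root token is its own head. The function names and the hand-written parse below are assumptions for illustration only.

```python
from typing import Dict, List, Set


def subtree(heads: List[int], root: int) -> Set[int]:
    """Indices of `root` and every token that depends on it, directly or not."""
    children: Dict[int, List[int]] = {}
    for i, head in enumerate(heads):
        if head != i:                 # the sentence root points to itself
            children.setdefault(head, []).append(i)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def connected_through_verb(heads: List[int], verb: int,
                           loc_a: int, loc_b: int) -> bool:
    """True if both location tokens sit somewhere under the trigger verb."""
    under_verb = subtree(heads, verb)
    return loc_a in under_verb and loc_b in under_verb


# Hand-written parse of "Dallas is located in Texas":
# tokens 0..4, with "located" (index 2) as the root.
heads = [2, 2, 2, 2, 3]
print(connected_through_verb(heads, verb=2, loc_a=0, loc_b=4))
```

Which of the two locations is the part and which is the whole would still have to come from the dependency labels or the preposition linking them.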

To reinforce this result, we used a public-domain dictionary of cities, states, and countries.
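A dictionary lookup of this kind might look like the sketch below. The `PARENT` table is a tiny hypothetical stand-in for the real dictionary, whose format is not described here.

```python
from typing import Dict

# Hypothetical gazetteer: each place maps to the place that directly contains it.
PARENT: Dict[str, str] = {
    "Dallas": "Texas",
    "Texas": "United States",
    "Paris": "France",
}


def is_part_of(place: str, container: str,
               parent: Dict[str, str] = PARENT) -> bool:
    """Follow containment links upward from `place` looking for `container`."""
    seen = set()
    while place in parent and place not in seen:
        seen.add(place)               # guard against cycles in the data
        place = parent[place]
        if place == container:
            return True
    return False


print(is_part_of("Dallas", "United States"))
print(is_part_of("Texas", "Dallas"))
```

Following the links transitively lets a city be confirmed as part of a country even when the sentence skips the state.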

Full Implementation Details

Programming Tools

The programming tools used for this project are listed below:

  1. Programming platform – Google Colab
  2. Segmentation, Named Entity Recognition, dependency parsing, constituency parsing, coreference resolution, and Semantic Role Labeling – spaCy, AllenNLP, NLTK

spaCy - https://spacy.io/

AllenNLP - https://demo.allennlp.org/semantic-role-labeling

NLTK - https://www.nltk.org/

Pandas - https://pandas.pydata.org/

Architectural Diagram

The diagram below represents the whole NLP pipeline used to extract the templates.

Results and Error Analysis

Buy Template:

Work Template:

Part Of Template:

Problems Encountered and Resolution

In Information Extraction, the main task is to extract entities of interest, which first requires identifying them. We used spaCy for Named Entity Recognition to identify PERSON, ORGANIZATION, LOCATION, MONEY, PRODUCT, and CARDINAL.

Problem: spaCy produced wrong NER results for PERSON and ORGANIZATION.

Solution: We used AllenNLP to identify PERSON and ORG because its results were more accurate, and combined the NER results from both tools.
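How the two sets of results are combined is not spelled out above. One plausible scheme, shown here purely as an assumption, is to take PERSON and ORG spans from AllenNLP and every other label from spaCy. Entities are represented as hypothetical `(start, end, label)` tuples whose labels are assumed to be normalized to a shared naming scheme beforehand.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start token, end token, label)

# Labels for which the AllenNLP output is preferred (per the solution above).
PREFERRED_FROM_ALLENNLP = {"PERSON", "ORG"}


def merge_entities(spacy_ents: List[Entity],
                   allennlp_ents: List[Entity]) -> List[Entity]:
    """Take PERSON/ORG from AllenNLP and all other labels from spaCy."""
    merged = [e for e in allennlp_ents if e[2] in PREFERRED_FROM_ALLENNLP]
    merged += [e for e in spacy_ents if e[2] not in PREFERRED_FROM_ALLENNLP]
    return sorted(merged)


spacy_ents = [(0, 2, "ORG"), (5, 6, "MONEY")]
allennlp_ents = [(0, 2, "PERSON")]
print(merge_entities(spacy_ents, allennlp_ents))
```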

Problem: During coreference resolution, some entities were replaced by the entire phrase or sentence describing them. As a result, when the resolved sentences were used for template extraction, constituency parsing and dependency parsing ran into problems.

Solution: After careful observation of the coreference results, we concluded that a mention should be replaced with the coreferent text only when it is a pronoun; a mention that is not a pronoun and is already an entity is left unchanged. This helped us get clean and precise results from the constituency and dependency parsers.
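The replacement rule can be sketched as follows. The sketch assumes clusters arrive as lists of inclusive `(start, end)` token spans and, as a simplification, treats the first mention of each cluster as the antecedent. The pronoun list is illustrative and ignores possessive forms.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def resolve_pronouns(tokens: List[str], clusters: List[List[Span]]) -> List[str]:
    """Replace single-token pronoun mentions with their cluster's first mention.

    Mentions that are not pronouns are left exactly as they appear.
    """
    replacement = {}
    for cluster in clusters:
        first_start, first_end = cluster[0]
        antecedent = tokens[first_start:first_end + 1]
        for start, end in cluster[1:]:
            if start == end and tokens[start].lower() in PRONOUNS:
                replacement[start] = antecedent
    resolved: List[str] = []
    for i, token in enumerate(tokens):
        resolved.extend(replacement.get(i, [token]))
    return resolved


tokens = ["Steve", "Jobs", "founded", "Apple", ".",
          "He", "led", "the", "company", "."]
clusters = [[(0, 1), (5, 5)], [(3, 3), (7, 8)]]
print(" ".join(resolve_pronouns(tokens, clusters)))
```

In this example "He" is rewritten while "the company" is kept as written, which avoids injecting long descriptive spans into the sentence before parsing.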

Problem: Determiners in a sentence prevented the constituency parser from producing clean noun phrases.

Solution: We removed all determiners before performing constituency parsing.
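A minimal sketch of this filtering step, assuming the sentence is already POS-tagged with Penn Treebank tags (where `DT` marks a determiner). The function name is hypothetical.

```python
from typing import List, Tuple


def drop_determiners(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep every word whose POS tag is not DT (determiner)."""
    return [word for word, tag in tagged if tag != "DT"]


tagged = [("The", "DT"), ("CEO", "NN"), ("of", "IN"),
          ("the", "DT"), ("company", "NN")]
print(" ".join(drop_determiners(tagged)))
```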

Pending Issues

In the Buy template, sentences are triggered by verbs and nouns, e.g., acquired, buy, sell, sold, acquisition. Our NLP pipeline can only extract the Buy template from sentences that are triggered by verbs.

For example: Amazon's acquisition of Whole Foods Market in 2017 increased their share value exponentially.

Here, "acquisition" is a noun, and Semantic Role Labeling works on verb-argument patterns.

In the Work template, our methodology depends solely on constituency parsing. Entities extracted from the noun phrases produced by constituency parsing carry no semantic information. Therefore, if a sentence contains more than one PERSON entity along with a POSITION and an ORGANIZATION, syntactic structure alone makes it hard to know how each PERSON relates to the other detected entities, and this can lead to false positives.

For example: Steve Jobs and Powell Jobs are the co-founders of Apple, Inc.

In this case, constituency parsing will give the correct template, but the result depends solely on how the noun phrases are formed.

Potential Improvements

In the Buy template, we use Semantic Role Labeling to get arguments and look for entities in those arguments. However, there is still ambiguity in the correctness of the QUANTITY identified from the arguments. To detect QUANTITY, we use the CARDINAL entities provided by spaCy's NER. To further improve the heuristic, we can use a dependency parser to check whether the quantity depends on the item. In this way, we can search the subtree of an item for any CARDINAL entity attached to it.
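The proposed check can be sketched by following head links upward from each CARDINAL token to see whether it reaches the item. The list-of-heads representation (the root is its own head), the function names, and the hand-written parse are assumptions for illustration.

```python
from typing import List, Optional


def depends_on(heads: List[int], token: int, ancestor: int) -> bool:
    """True if `ancestor` is reached by following head links up from `token`."""
    while heads[token] != token:      # the sentence root points to itself
        token = heads[token]
        if token == ancestor:
            return True
    return False


def quantities_for_item(words: List[str], heads: List[int],
                        labels: List[Optional[str]], item: int) -> List[str]:
    """CARDINAL tokens that sit inside the dependency subtree of the item."""
    return [words[i] for i, label in enumerate(labels)
            if label == "CARDINAL" and depends_on(heads, i, item)]


# Hand-written parse of "Amazon bought 20 trucks for 3 cities".
words = ["Amazon", "bought", "20", "trucks", "for", "3", "cities"]
heads = [1, 1, 3, 1, 1, 6, 4]
labels = ["ORG", None, "CARDINAL", None, None, "CARDINAL", None]
print(quantities_for_item(words, heads, labels, item=3))
```

Only the number attached to "trucks" is accepted as a quantity here; the "3" belongs to a different noun and is ignored.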

Another potential improvement is to use constituency parsing on the sentences triggered by a noun. In this way, we can work on the noun phrases obtained from constituency parsing and extract templates from them.

In the Work template, a potential improvement is to use SRL and a dependency parser. Using SRL, verb-triggered sentences can be processed to obtain POSITION, PERSON, and ORG from the arguments. A dependency parser would also let us answer more confidently the question "Does this position actually refer to this person?".

References

http://docs.allennlp.org/master/

https://spacy.io/usage/spacy-101

https://www.kaggle.com/nsharan/h-1b-visa

https://www.nltk.org/howto/wordnet.html

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.3746&rep=rep1&type=pdf

Authors

Parva Shah

Sagar Patel