ORKG_challenge: Build a research problem extraction system

The NLPContributionGraph (NCG) shared task was organized at SemEval-2021. The general task information is available at https://ncg-task.github.io/

Approach: the task is organized into two parts:

  • Build a graph-based model that helps capture the relationships between entities, in order to visualize the research problems addressed by researchers in a shared space

  • Build a topic model (LDA) that helps us extract the abstract "topics" that appear in a collection of documents.

  1. Graph-based model

In this part, we build knowledge graph embeddings using the AmpliGraph library. We trained the dataset on four different models; the scoring results are reported below.

Prerequisites:

  • Python ≥ 3.6
  • Linux box
  • TensorFlow 1.x
  • AmpliGraph, installed with: pip install ampligraph

  • Knowledge graph creation: building the knowledge graph requires the dataset in the form of <source, target, edge> triples. Using our dataset, we obtained the graph shown in the image below; a loading sketch follows.
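A minimal loading sketch, assuming the triples live in a hypothetical triples.csv file with source, target and edge columns (the file name, column order and test-set size are illustrative, not taken from the notebook):

```python
import pandas as pd
from ampligraph.evaluation import train_test_split_no_unseen

# Hypothetical input file: one <source, target, edge> triple per row.
triples = pd.read_csv('triples.csv', names=['source', 'target', 'edge'])

# AmpliGraph expects an (n, 3) array of <subject, predicate, object> strings,
# so the columns are reordered to <source, edge, target> here (an assumption).
X = triples[['source', 'edge', 'target']].values

# Hold out a small test set without introducing entities unseen at training time.
X_train, X_test = train_test_split_no_unseen(X, test_size=100)
```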

  • Models: the models used in our experiments are ComplEx, TransE, DistMult, and HolE.

    The hyper-parameters are the same for each model, and the average loss of each model is reported in the following table; a training sketch is given after the table.

Model     Avg Loss
ComplEx   0.006
TransE    0.029
DistMult  0.010
HolE      0.923
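A minimal training sketch with AmpliGraph 1.x; the shared hyper-parameter values below are illustrative assumptions, not the exact settings used in the notebook:

```python
from ampligraph.latent_features import ComplEx, TransE, DistMult, HolE

# Shared, illustrative hyper-parameters (assumed values).
params = dict(k=100, eta=5, epochs=200, batches_count=10,
              loss='multiclass_nll', optimizer='adam',
              optimizer_params={'lr': 1e-3}, seed=0, verbose=True)

models = {
    'ComplEx': ComplEx(**params),
    'TransE': TransE(**params),
    'DistMult': DistMult(**params),
    'HolE': HolE(**params),
}

# Fit every model on the same training triples (X_train from the loading sketch).
for name, model in models.items():
    model.fit(X_train)
```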
  • Performance evaluation: the evaluation of the neural embedding models involves a model-selection routine, negatives generation, and a learning-to-rank-based protocol. The set of known true triples used to filter out false negatives is obtained simply by concatenating the train and test sets. We report the filtered MR (Mean Rank), MRR (Mean Reciprocal Rank) and Hits@N for each model; an evaluation sketch is given after the table.
Model     MR    MRR   Hits@10  Hits@3  Hits@1
ComplEx   3.00  0.33  1.00     1.00    0.00
TransE    1.00  1.00  1.00     1.00    1.00
DistMult  1.00  1.00  1.00     1.00    1.00
HolE      1.00  1.00  1.00     1.00    1.00
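A sketch of the filtered ranking evaluation with AmpliGraph 1.x, reusing X_train, X_test and the models dict from the sketches above (the exact evaluation options used in the notebook may differ):

```python
import numpy as np
from ampligraph.evaluation import (evaluate_performance, mr_score,
                                   mrr_score, hits_at_n_score)

# Known true triples: train + test, used to filter out corruptions that are
# actually valid (the "filtered" setting of the metrics).
filter_triples = np.concatenate((X_train, X_test))

for name, model in models.items():
    ranks = evaluate_performance(X_test, model=model,
                                 filter_triples=filter_triples, verbose=False)
    print(name,
          'MR:', mr_score(ranks),
          'MRR:', round(mrr_score(ranks), 2),
          'Hits@10:', hits_at_n_score(ranks, n=10),
          'Hits@3:', hits_at_n_score(ranks, n=3),
          'Hits@1:', hits_at_n_score(ranks, n=1))
```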
  • Prediction: the code provides a possible way to restore a saved model and make predictions on the test set. We can then inspect the scores and transform them into probabilities (bounded between 0 and 1), as sketched below.
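A sketch of saving, restoring and scoring with one of the trained models (the file name is a hypothetical placeholder):

```python
from ampligraph.utils import save_model, restore_model
from scipy.special import expit  # logistic function, maps scores into (0, 1)

# Persist a trained model and restore it later (hypothetical file name).
save_model(models['ComplEx'], 'complex_model.pkl')
restored = restore_model('complex_model.pkl')

# Raw triple scores are unbounded; expit turns them into pseudo-probabilities.
scores = restored.predict(X_test)
probabilities = expit(scores)
print(list(zip(X_test[:5].tolist(), probabilities[:5])))
```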
  2. Topic modeling

Topic modeling is an unsupervised technique for identifying the natural topics present in text, and it is used here to extract research problems from text documents. It does not require a labeled training dataset. There are several algorithms for topic modeling, but here we only experiment with LDA (Latent Dirichlet Allocation). In this tutorial, we perform the following tasks:

  • Loading data
  • Data preprocessing
  • Data exploration
  • Data preparation
  • LDA model training
  • Analyzing the LDA model
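A condensed sketch of the preprocessing and data-preparation steps with gensim; the variable `documents`, the example strings, and the exact cleaning rules are illustrative assumptions:

```python
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import STOPWORDS

# Hypothetical input: a list of raw documents (e.g. paper abstracts).
documents = ["Deep learning methods for named entity recognition in scholarly text.",
             "A survey of topic models applied to scientific literature."]

# Tokenize, lowercase, and drop stop words and very short tokens.
texts = [[tok for tok in simple_preprocess(doc)
          if tok not in STOPWORDS and len(tok) > 3]
         for doc in documents]

# Map tokens to ids and build the bag-of-words corpus that LDA expects.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
```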

  • LDA model training: here we import the model from the gensim library and keep the default parameters, except for the number of topics, which is set to 10; see the sketch below.
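A minimal training sketch with gensim's LDA, reusing `corpus` and `dictionary` from the preparation sketch above; only `num_topics` deviates from the defaults, as described:

```python
from gensim.models import LdaModel

# Train LDA with default hyper-parameters except the number of topics.
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# Inspect the top words of each learned topic.
for topic_id, topic in lda_model.print_topics(num_topics=10, num_words=5):
    print(topic_id, topic)
```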

  • Analyzing the LDA model: here we visualize the topics from the trained model. To do so, we use pyLDAvis, a Python library for interactive topic-model visualization that helps us understand and interpret individual topics and the relationships between them. The output is shown in the image at the end of this section; a sketch of the visualization call follows.
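A visualization sketch; in recent pyLDAvis releases the gensim bridge lives in `pyLDAvis.gensim_models`, while older releases expose it as `pyLDAvis.gensim`:

```python
import pyLDAvis
import pyLDAvis.gensim_models  # use `pyLDAvis.gensim` on older pyLDAvis releases

# Render the interactive topic map inside the notebook.
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)
```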

[Image: pyLDAvis interactive topic visualization]