dbpedia/GSoC

A Neural QA Model for DBpedia

Closed this issue · 20 comments

mgns commented

Description

In recent years, the Linked Data Cloud has grown to over 100 billion facts spanning a multitude of domains. The DBpedia knowledge base alone describes 4.58 million things. However, accessing this information is challenging for lay users, as they cannot use SPARQL as a query language without extensive training.
Recently, Deep Learning architectures based on Neural Networks, called seq2seq, have been shown to achieve state-of-the-art results at translating sequences into sequences. In this direction, we suggest a GSoC topic around Neural Networks that translate any natural language expression into a sentence encoding a SPARQL query. Our preliminary work on Question Answering with Neural SPARQL Machines (NSpM) shows promising results, but it is restricted to selected DBpedia classes.
In this GSoC project, the candidate will extend NSpM to cover more DBpedia classes and to enable high-quality Question Answering.
The source code can be found here; however, we will use this repository as our workspace.
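
To make the idea of "sentences encoding SPARQL queries" concrete, here is a minimal sketch of the kind of token substitution such an encoding could use. The substitution table below is illustrative only; the actual NSpM vocabulary is defined in its source code.

    # Minimal sketch: rewrite a SPARQL query as a flat token sequence
    # that a seq2seq model can emit one token at a time.
    # The substitution table is illustrative, not the exact NSpM vocabulary.
    REPLACEMENTS = [
        ("dbr:", "dbr_"),
        ("dbo:", "dbo_"),
        ("{", " brack_open "),
        ("}", " brack_close "),
        ("?", "var_"),
    ]

    def encode(sparql):
        """Turn a SPARQL query into a whitespace-separated token sequence."""
        for old, new in REPLACEMENTS:
            sparql = sparql.replace(old, new)
        return " ".join(sparql.split()).lower()

    print(encode("SELECT ?x WHERE { dbr:Edward_VII_Monument dbo:location ?x }"))
    # select var_x where brack_open dbr_edward_vii_monument dbo_location var_x brack_close

Decoding back to executable SPARQL is the inverse substitution, which is what makes the target queries trainable with a standard seq2seq pipeline.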

Goals

  • Create query templates for DBpedia (see the sketch after this list).
  • Train the NSpM recurrent neural network for complex question answering on DBpedia.
  • (Optional) Evaluate the model against the QALD benchmark.
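
As a rough illustration of the first goal: a query template pairs a natural language question containing a placeholder with a SPARQL pattern containing the same placeholder. The format below is a hypothetical simplification of the annotations CSV used for monuments, purely for illustration.

    # Hypothetical sketch of a query-template pair and its instantiation,
    # assuming a placeholder <A> that is filled with entities of the
    # template's target class (simplified from the annotations CSV format).
    TEMPLATE = {
        "class": "dbo:Monument",
        "question": "where is <A> located in?",
        "sparql": "select ?x where { <A> dbo:location ?x }",
    }

    def instantiate(template, entity_uri, entity_label):
        """Fill the <A> placeholder to produce one training pair."""
        return (
            template["question"].replace("<A>", entity_label),
            template["sparql"].replace("<A>", entity_uri),
        )

    question, query = instantiate(TEMPLATE, "dbr:Edward_VII_Monument", "edward vii monument")
    print(question)  # where is edward vii monument located in?
    print(query)     # select ?x where { dbr:Edward_VII_Monument dbo:location ?x }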

Impact

The project will allow users to access DBpedia knowledge using natural language.

Warm-up tasks

Mentors

Tommaso Soru, Edgard Marx, Ricardo Usbeck

Keywords

question answering, deep learning, neural networks, sparql, tensorflow, python

Sounds interesting! I will have a look at this after my exams.

@mgns Hi. This sounds very interesting to me. I have a few questions, though.

You are welcome to ask them here.

@RicardoUsbeck Yeah, so I was attempting the warm-up tasks. Can you please explain how I should attempt the second task?

@mommi84 is the better person to explain that :)

@abhinavralhan Hi! I added a link in the project description.

@mommi84 I have a small problem with the last inference bit. I got past most issues except this one, which I cannot understand. I ran the command below, but for some reason the vocab file is not being generated. I'm guessing it has to do with something in the ask.sh file.

sudo sh ask.sh data/monument_300_model "where is edward vii monument located in?"

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 495, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 488, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 452, in run_main
    hparams = create_or_load_hparams(out_dir, default_hparams, flags.hparams_path)
  File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 418, in create_or_load_hparams
    hparams = extend_hparams(hparams)
  File "/home/abhinavralhan/Desktop/os/NSpM/nmt/nmt/nmt.py", line 350, in extend_hparams
    unk=vocab_utils.UNK)
  File "nmt/utils/vocab_utils.py", line 66, in check_vocab
    raise ValueError("vocab_file does not exist.")
ValueError: vocab_file does not exist.
train_prefix=None
dev_prefix=None
test_prefix=None
out_dir=../data/monument_300_model_model

ANSWER IN SPARQL SEQUENCE:
cat: output.txt: No such file or directory

@abhinavralhan Hi, I faced a similar issue. The problem is an incorrect data directory being passed to ask.sh. Currently it is data/monument_300_model; you need to change the input directory for ask.sh to data/monument_300.
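
For anyone else hitting this, here is a minimal sketch of the failure mode, assuming ask.sh derives the model directory by appending _model to its first argument (the actual script may differ; this only illustrates why the doubled suffix appears in out_dir above):

    # Assumption: ask.sh builds the model directory as <data_dir> + "_model".
    def model_dir(data_dir):
        return data_dir + "_model"

    print(model_dir("data/monument_300"))        # data/monument_300_model (the trained model)
    print(model_dir("data/monument_300_model"))  # data/monument_300_model_model (nonexistent,
                                                 # hence "vocab_file does not exist")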

Thank you @abhinavralhan for sharing the bug and thanks @gyanesh-m for fixing it.

@mommi84 Hi, I have been trying to build some queries for the dbo:EducationalInstitution class and I have a question about the query templates. In all the examples in annotations_monument.csv, the initial statement

    ?a a dbo:Monument

is never used. Is that because you have already specified the domain for <A> in the first column? Also, if I use the initial statement for my example class in a query template, will that be fine?

Hey @gyanesh-m,
No, you don't need to add that explicitly: prepare_generator_query adds the "initial statement" to the generator queries.
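
For reference, here is a hedged sketch of what that step might look like; the function name comes from the comment above, but the signature and body are assumptions for illustration only:

    # Illustrative only: prepend the class-membership triple (the
    # "initial statement") to a template's WHERE clause, so the
    # generator only retrieves entities of the target class.
    def prepare_generator_query(where_clause, dbo_class):
        return "select ?a ?x where { ?a a %s . %s }" % (dbo_class, where_clause)

    print(prepare_generator_query("?a dbo:location ?x", "dbo:Monument"))
    # select ?a ?x where { ?a a dbo:Monument . ?a dbo:location ?x }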

@piyush96chawla Oh, thanks.

Added an example of a successful project proposal. Once ready, please invite my username at gmail dot com to your proposal document.

The GSoC 2018 student applications are officially open! Please elaborate your proposal in a Google doc. When you're done, share it with my username at gmail.com, so I can also invite the other mentors. Deadline: March 27.

Interesting project! I have started working on it already.

Only 6 days to go!

Please share your document with us now if you would like feedback from the mentors before the final submission to the GSoC console.

@mommi84 I have completed my warm-up task by training the NSpM model on the class dbo:Garden.
I have experience in QA and deep learning, and I really like the idea of a neural QA model. What do you suggest I do next in order to get ready to write a proposal? Sorry for such a late request for help.

@amanmehta-maniac Great. Please elaborate your view of the project and your findings in a Google document and share it with me at gmail.com. Use this one as a reference for a good proposal. Send it out by Sunday the 25th at the latest, so we can give you some feedback.

@mommi84, there are around 770 DBpedia classes in total, so what would be a realistic X, where X is the number of classes I would train this NN on during the GSoC period?
Another concern: will I get access to a server? This project involves training a neural network, which is highly CPU-intensive and takes on the order of hours even on a machine with a 64-core CPU and 512 GB of RAM.

@amanmehta-maniac We expect to have one final QA model, not one model per class. Moreover, these 770 classes are organized in a taxonomy. We will likely be able to grant the student access to our servers, subject to availability, for a limited time (up to 14 days in a row). Of course, additional hardware provided by the student's institution is welcome.