Answering where questions using generic geographic information

Abstract -- Automatic generation of answers to where-questions is a challenge to current Web search and question answering systems. While human-generated answers of where-questions are short, selective, and informative, machine responses are typically provided in a form of ranked documents and text snippets. Several approaches have been proposed to answer questions using the information available in documents and knowledge-bases. These methods assume that the answers can be retrieved completely inside the sources without any further modifications. In this research, we present an approach to generate answers to where-questions by selecting relevant pieces of information that can form responses similar to human-generated answers. We derive and use patterns of generic geographic information (e.g., type, scale, and prominence) encoded from the largest available machine comprehension dataset, MS MARCO v2.1. In our approach, the toponyms in the questions and answers of the dataset are encoded into sequences of generic information. Next, sequence prediction methods are used to model the relation between the generic information in the questions and their answers. Finally, we evaluate the performance of the predictive models generating generic form of answers to where-questions. The proposed approach can be used to augment querying databases and knowledge graphs to identify relevant information and to construct responses similar to human-generated answers.

Read the Paper

Installation

The implementation is mainly developed in Java (Oracle Java - version 8) using Maven v3 for library management. Several R scripts are used for postprocessing of the results and generating plots. The R scripts can be found in the R-scripts folder.

Prerequisites

Install a JEE Java IDE (-- e.g., Intellij IDEA), dependencies will be downloaded from Web repositories automatically. List of java dependencies:

Dependency	Version	Description
SMPF	2.40	Sequence Mining Library
slf4j	1.7.30	Logging Framework
jackson	2.10.1	JSON Serialization
gson	2.8.5	JSON I/O
geonames	1.0	Gazetteer Lookup
nominatim-api	3.4	Gazetteer Lookup
junit	4.11	Unit Testing

R libraries should be installed manually (uncomment installation code in the header of each script).

Dependency	Version	Description
arules	1.6-4	Association Rule Mining
TraMineR	2.0-14	Sequence Mining
TraMineRextras	0.4.6	Sequence Mining
classInt	0.4-2	Class Intervals (Jenk)
ggplot2	3.2.1	Generating Plots
ggpubr	0.2.4	Generating Vector Plots
gridExtra	2.3	Generating Plots
RColorBrewer	1.1-2	Color Coding
pastecs	1.3.21	Data Manipulation
Matrix	1.2-18	Data Manipulation
qlcMatrix	0.9.7	Data Manipulation
plyr	1.8.5	Data Manipulation
dplyr	0.8.3	Data Manipulation
data.table	1.12.8	Data Manipulation
stringr	1.4.0	Data Manipulation
fpc	2.2-4	Data Manipulation

Before running the code, you should download Microsoft MS MARCO dataset v2.1 and put the dev, train and eval datasets inside the dataset folder.

Running Source Code

Import the project into your JEE IDE. Wait until maven fetch the libraries from Web repositories. Check the configuration file and change the parameters Run the Workflow java file in the root package folder.

Configuration

Set the dataset folder, output folder (processed data, gazetteers local folder, sequence generation and prediction) in configuration file.

Workflow

The workflow includes the following steps which can be run in batch or separate (e.g., preprocessing) ways.

Packages

Source code contains 5 main package which are listed and described below:

dataset: reading dataset files and preprocessing
parser: reading parse results and filtering toponym-based where-questions
gazetteers: disambiguation of extracted place names
sequence: generating type, scale and prominence sequences
predict: sequence prediction

Each package has a runnable source file which can be run separately.

Build

To build the runnable jar file from the source run the following Maven command:

clean dependency:copy-dependencies insall

Then move the jar file (where_questions-1.0.jar) and dependency folder from the target folder to the root folder.

The resulted jar file of the project can be run using the following command.

java -jar where_questions-1.0-jar-with-dependencies

The results can be found in their folders (check the configuration file: properties.properties).

Running R-Scripts

R scripts can be found in R-Scripts folder. In this folder, you could run the batch script or you could test the scripts separately. Note that the scripts should be run after running java codes because their inputs is the outputs of the java-based program. In the following list the scripts are briefly introduced:

AssociationRuleMining.R: association rule mining based on sequences of place types, scale and prominence of questions and their answers.
Batch_Run.R: batch file that runs all of the other scripts sequentially.
importanceDistribution.R: analysis of importance value extracted from OpenStreeMap results.
OrdinalAnalysis.R: Analysis of ordinal values (scale and prominence)
OrdinalRepresentation.R: Analysis of ordinal values (scale and prominence sequences)
prominence-categorical-distribution.R: Analysis of prominence sequences
scale-categorical-distribution.R: Analysis of scale sequences
type-categorical-distribution.R: Analysis of type sequences
tsp_complexity.R:

Built With

Java 8 - The programming language used
R 3.6.2 - The scripting language used
Maven - Dependency Management
Intellij IDEA 2019.1.4 - Used to generate RSS Feeds

Authors

Ehsan Hamzei
Stephan Winter
Martin Tomko

License

This project is licensed under the MIT License.

Acknowledgments

Some parts of the code in prediction package is borrowed from IPredict Project (a sub-project of SMPF) and is extended to suit our goals.

hamzeiehsan/Template-for-answering-where-questions