The press briefing claim dataset


💡 Info:

This repository holds the code to create the press briefing claim dataset. The main modules can be found in the src directory. Three notebooks in the root directory interface with these modules and guide you through the dataset creation process.

This repository is part of my bachelor thesis, titled Automated statement extraction from press briefings. For more in-depth information, see the Statement Extractor repository.

⚙️ Setup:

This repository uses Pipenv to manage a virtual environment with all required Python packages. Information about how to install Pipenv can be found here. To create the virtual environment and install all required packages, run pipenv install from the root directory.

Default directories and parameters can be defined in config.py.
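As a purely hypothetical sketch of what such defaults might look like (the names DATA_DIR and DB_PATH are illustrative assumptions, not the actual contents of config.py):

```python
# Hypothetical sketch of config.py defaults; the names below are
# illustrative assumptions, not the file's actual contents.
from pathlib import Path

# Root directory for scraped and parsed data (assumed name)
DATA_DIR = Path("data")

# Location of the SQLite database produced by the import step (assumed name)
DB_PATH = DATA_DIR / "dataset.db"
```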

The wikification module relies on two wikification services, Dandelion and TagMe. API keys for these services can be created for free. The wikify module expects the environment variables DANDELION_TOKEN and TAGME_TOKEN.
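A minimal sketch of how these tokens can be provided, assuming the wikify module reads them straight from the process environment:

```python
import os

# The wikify module expects both tokens to be set in the environment,
# e.g. exported in the shell before starting the notebooks.
DANDELION_TOKEN = os.environ["DANDELION_TOKEN"]  # raises KeyError if unset
TAGME_TOKEN = os.environ["TAGME_TOKEN"]
```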

📋 Content:

Data:

Code:

  • src contains the main modules to scrape, parse, and import the data.

Notebooks:

💾 Dataset:

Besides the dataset from the SMC press briefings, a translated version of the IBM Debater® Claim Sentences Search dataset (IBM_Debater_(R)_claim_sentences_search) from the claim model comparison is used to balance the dataset. To create the training data, the Claim Sentences Search dataset needs to be preprocessed as in the claim model comparison repository and translated into German.

The SQLite database dataset.db has the structure shown in the database schema diagram.
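To inspect the structure locally, the schema can also be read directly from the database with Python's built-in sqlite3 module; this sketch assumes only that dataset.db is in the current directory:

```python
import sqlite3

# Print the schema of every table in dataset.db.
con = sqlite3.connect("dataset.db")
for name, sql in con.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
):
    print(f"-- {name}\n{sql}\n")
con.close()
```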