Source code, data, and supplementary materials for our EMNLP 2015 article. Please use the following citation:
@InProceedings{habernal-gurevych:2015:EMNLP,
author = {Habernal, Ivan and Gurevych, Iryna},
title = {Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse},
booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
month = {September},
year = {2015},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {2127--2137},
url = {http://aclweb.org/anthology/D15-1255}
}
Abstract: Analyzing arguments in user-generated Web discourse has recently gained attention in argumentation mining, an evolving field of NLP. Current approaches, which employ fully-supervised machine learning, are usually domain dependent and suffer from the lack of large and diverse annotated corpora. However, annotating arguments in discourse is costly, error-prone, and highly context-dependent. We asked whether leveraging unlabeled data in a semi-supervised manner can boost the performance of argument component identification and to which extent is the approach independent of domain and register. We propose novel features that exploit clustering of unlabeled data from debate portals based on a word embeddings representation. Using these features, we significantly outperform several baselines in the cross-validation, cross-domain, and cross-register evaluation scenarios.
Contact person: Ivan Habernal, habernal@ukp.informatik.tu-darmstadt.de
http://www.ukp.tu-darmstadt.de/
Don't hesitate to send me an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.
For license information, see LICENSE files in code/*
and NOTICE.txt
.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
code/experiments
-- the core experimental source codescode/dependencies
-- several in-house dependencies (see README.txt)data/argumentation-...
-- gold data from "Ivan Habernal and Judith Eckle-Kohler and Iryna Gurevych (2014) Argumentation Mining on the Web from Information Seeking Perspective In: Elena Cabrio and Serena Villata and Adam Wyner : Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, p. 26--39, CEUR-WS, July 2014. http://ceur-ws.org/Vol-1341/."data/debates
-- extracted data from debate portalsdata/error-analysis
-- supplementary documents for manual error analysis of system predictions in PDF format
- Java 1.7 and higher, Maven
- tested on 64-bit Linux versions
- we recommend 32 GB RAM for running all the experiments
- Install all Maven dependencies from
code/dependencies/
to your local Maven repository
$cd code/dependencies
$chmod +x installDependencies.sh
$./installDependencies.sh
-
You don't have to setup any other 3-rd party Maven repository location, all dependencies are located either in this folder or on Maven central.
-
Compile the main experiment package in
code/experiments
$cd code/experiments
$mvn package
$cd code/experiments/target
$LC_ALL=en_US.UTF-8 java -XX:+UseSerialGC -Xmx32g \
-cp lib/*:de.tudarmstadt.ukp.experiments.argumentation.emnlp2015-0.0.3-SNAPSHOT.jar \
de.tudarmstadt.ukp.experiments.argumentation.sequence.evaluation.ArgumentSequenceLabelingEvaluation \
--featureSet fs0 \
--corpusPath ../../../data/argumentation-gold-annotated-sentiment-discourse-rst-full-bio-embeddings-emnlp2015-final-fixed \
--outputPath /tmp \
--scenario cd \
--clusters a100,s1000
The output will be stored in the outputPath
with sub-folders corresponding to the feature set and other parameters, including the timestamp.
--cl, --clusters
- Which clusters? Comma-delimited, e.g., s100,a500 (only for feature set 4)
--corpusPath, --c
- Corpus path with gold-standard annotated XMI files
--featureSet, --fs
- Feature set name (e.g., fs0, fs0fs1fs2, fs3fs4, ...)
--outputPath, --o
- Main output path (folder)
--paramE, --e
- Parameter e for SVMHMM (optional)
- Default: 0
--paramT, --t
- Parameter T for SVMHMM (optional)
- Default: 1
--scenario, --s
- Evaluation scenario (
cv
= cross-validation,cd
= cross domain,id
= in domain)
- Evaluation scenario (
- Preprocessing pipeline - from debates in XML to UIMA XMI files
- Extract XML debates in
data/debates
- Run
de.tudarmstadt.ukp.experiments.argumentation.comments.pipeline.DebatesToXMIPipeline
with two parametersinputFolderWithXMLFiles
-- extracted XML files with debatesoutputFolderWithXMIFiles
-- output dir
- (optional) You may want to select relevant debates; we used Lucene search
- Look in to the
de.tudarmstadt.ukp.experiments.argumentation.clustering.debatefiltering
packageLuceneIndexer
for indexing the XMI filesLuceneSearcher
for searching using some search terms- There are some hard-coded paths and search terms -- you need to modify the sources here
- Prepare data for CLUTO clustering
- Run
de.tudarmstadt.ukp.experiments.argumentation.clustering.ClutoMain word2VecFile sourceDataDir cacheFile tfidfModel clutoMatrixFile
- Provide
word2VecFile
- download
GoogleNews-vectors-negative300.bin.gz
from https://code.google.com/p/word2vec/
- download
- source dir (outputFolderWithXMIFiles from the previous step)
- the other three files will be newly created
- Run CLUTO
- Download from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
$vcluster -clmethod=rbr -crfun=i2 -sim=cos clutoMatrixFile numberOfClusters
- Create centroids
- Run
de.tudarmstadt.ukp.experiments.argumentation.clustering.ClusterCentroidsMain clutoMatrixFile clutoOutput outputCentroids
- Inject centroids into the experiment code to
main/src/main/resources/clusters
and modifyde.tudarmstadt.ukp.dkpro.argumentation.sequence.feature.clustering.ArgumentSpaceFeatureExtractor
, then run the experiments as described above
- Crawl createdebate.com using e.g. apache Nutch and extract the HTML content (using e.g. https://github.com/habernal/nutch-content-exporter)
- Convert HTML to internal XML documents
de.tudarmstadt.ukp.experiments.web.comments.createdebate.CorpusPreparator htmlFolder outputFolder
- In principle, the code can be used to predict argument components on unlabeled plain-text data (but I haven't tried that)
- You need to preprocess your data sequentially using the UIMA pipelines in
code/dependencies/de.tudarmstadt.ukp.dkpro.argumentation.annotations
basic
,advanced
- Label your data with argument BIO annotations (everything will be O)
- Modify
ArgumentSequenceLabelingEvaluation
so it uses all gold data for training and your data for test - Some nonsense evaluation will be printed-out, but the labeled data will be in the output directory (in CSV format, for instance; but they can put back to the documents - drop me a line if you need some advice)