The goal of this project is to build a pipeline that automates the generation of SVO (subject-verb-object) triplets for social science research. For example, character relationships can be visualized as networks in Gephi based on SVO triplets. Ultimately, we want to integrate the pipeline into PC-ACE, the NLP software developed by Professor Roberto Franzosi of the Sociology Department at Emory.
The whole pipeline is composed of three steps:
- Data Cleaning
- Anaphora Resolution
- SVO Triplets Extraction
- Clean the data converted from PDF format
- Extract titles and contents of Emory Lynching articles and separate them into two parts
- Replace mentions of entities (e.g. pronouns like "he" and "she") with their most representative mention using Stanford CoreNLP's coreference (anaphora) resolution
- Used to maximize and validate SVO extraction by correctly identifying actors
For example:
Bill Cato Attempted to Assault Mrs. Vickers. He was shot to death.
will look like
Bill Cato Attempted to Assault Mrs. Vickers. Bill Cato was shot to death.
after anaphora resolution.
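The substitution step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the coreference chains have already been obtained (for example from Stanford CoreNLP) and converted into a simple, hypothetical structure in which each chain carries a representative mention and the token spans to replace. The names `Mention`, `Chain`, and `resolve` are invented for this example.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """A token span [start, end) inside one sentence (hypothetical structure)."""
    sentence: int
    start: int
    end: int


class Chain(NamedTuple):
    """One coreference chain: representative text plus the mentions to replace."""
    representative: str
    mentions: List[Mention]


def resolve(sentences: List[List[str]], chains: List[Chain]) -> List[str]:
    """Replace every listed mention with its chain's representative text."""
    # Work on copies so the caller's token lists are left untouched.
    tokens = [list(s) for s in sentences]
    # Gather all replacements, then apply them right-to-left within each
    # sentence so earlier token indices stay valid after a replacement.
    edits = [(m.sentence, m.start, m.end, c.representative)
             for c in chains for m in c.mentions]
    for sent, start, end, rep in sorted(edits, key=lambda e: (e[0], e[1]), reverse=True):
        tokens[sent][start:end] = rep.split()
    return [" ".join(s) for s in tokens]


sentences = [
    ["Bill", "Cato", "Attempted", "to", "Assault", "Mrs.", "Vickers", "."],
    ["He", "was", "shot", "to", "death", "."],
]
# Assumed chain: "He" (sentence 1, token 0) refers to "Bill Cato".
chains = [Chain("Bill Cato", [Mention(sentence=1, start=0, end=1)])]
resolved = resolve(sentences, chains)
print(resolved[1])
```

Tokens are joined with single spaces, so punctuation stays separated from the preceding word; a detokenizer would be needed to restore the original spacing.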
- Format the Emory Lynching Corpus file cleaned_corenlp_lynching.txt into clausie_input.txt so that it is ready for ClausIE to extract triplets
- Extract only the SVOs from sentences-test-out.txt to svo.txt
- Filter the SVO sets into terminal_svo.txt by preserving only triplets with a confirmed social actor as the subject
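The filtering step can be illustrated with a short sketch. How the project confirms that a subject is a social actor is not described here, so the example assumes a hypothetical, hand-written set of actor nouns; the function name `filter_by_social_actor` and the sample triplets are likewise illustrative only.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, verb, object)


def filter_by_social_actor(triplets: Iterable[Triplet],
                           social_actors: Iterable[str]) -> List[Triplet]:
    """Keep only triplets whose subject appears in the social-actor set."""
    # Compare case-insensitively and ignore surrounding whitespace.
    actors = {a.strip().lower() for a in social_actors}
    return [t for t in triplets if t[0].strip().lower() in actors]


# Hypothetical actor list; the real pipeline may confirm actors differently.
SOCIAL_ACTORS = {"mob", "girl", "prisoner", "sheriff"}

# Illustrative input: the second triplet has a non-actor subject.
triplets = [
    ("mob", "estimate", "shooting"),
    ("weather", "be", "cold"),
    ("Sheriff", "convene", "court"),
]
kept = filter_by_social_actor(triplets, SOCIAL_ACTORS)
for s, v, o in kept:
    print(f"S: {s} , V: {v} , O: {o}")
```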
The SVO results look like the following (verbs are reduced to their stems, so "estim" means "estimate"):
S: mob , V: estimate , O: shooting
S: girl , V: protect , O: negro
S: prisoner , V: have , O: neck
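Lines in this format are easy to read back into tuples for later steps. The sketch below is one possible parser, assuming every triplet sits on its own line exactly as shown above; the helper name `parse_svo_line` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "S: mob , V: estimate , O: shooting".
SVO_LINE = re.compile(r"^S:\s*(.+?)\s*,\s*V:\s*(.+?)\s*,\s*O:\s*(.+?)\s*$")


def parse_svo_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (subject, verb, object), or None if the line does not match."""
    match = SVO_LINE.match(line.strip())
    if match is None:
        return None
    return (match.group(1), match.group(2), match.group(3))


lines = [
    "S: mob , V: estimate , O: shooting",
    "S: girl , V: protect , O: negro",
    "S: prisoner , V: have , O: neck",
]
parsed = [parse_svo_line(line) for line in lines]
print(parsed[0])
```

Returning `None` for malformed lines lets the caller decide whether to skip or report them.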
- The output file is ready to be opened in Gephi:
| | Node1 | Edge | Node2 |
|---|---|---|---|
| 0 | people | have | wrath |
| 1 | people | have | hands |
| 2 | county | have | duty |
| 3 | sheriff | convene | court |
| 4 | sheriff | try | criminals |
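A table like this can be produced from the triplets with Python's standard `csv` module. The following is a minimal sketch rather than the project's actual export code; the function name `write_edge_table` is hypothetical, and the column names simply mirror the table above. Gephi's spreadsheet import may expect differently named columns (such as Source and Target), so the header may need adjusting.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_edge_table(triplets: Iterable[Tuple[str, str, str]], handle: TextIO) -> None:
    """Write (subject, verb, object) triplets as a three-column CSV edge table."""
    writer = csv.writer(handle)
    writer.writerow(["Node1", "Edge", "Node2"])  # header mirrors the table above
    for subject, verb, obj in triplets:
        writer.writerow([subject, verb, obj])


triplets = [
    ("people", "have", "wrath"),
    ("sheriff", "convene", "court"),
]

# Write to an in-memory buffer here; when writing to disk, open the file with
# newline="" so the csv module controls line endings itself.
buffer = io.StringIO()
write_edge_table(triplets, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```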
- Stanford CoreNLP
- NLTK
- ClausIE
- enchant
Alpha version. It is still subject to change. Comments and advice are welcome.