/Processing-ELTeC-corpus

Some scripts to exploit the ELTeC corpus


What can we do (and what do we need) to exploit the ELTeC corpus? Some examples.

Borja Navarro Colorado | University of Alicante

Introduction

This INTELE webinar shows, through some examples, how to exploit the ELTeC corpus for literary studies. Except for the last one, the examples are implemented and explained in COLAB notebooks, so you can run them yourself. They cover the following topics:

  • how to open and process the ELTeC corpus with Python in COLAB;
  • how to extract information annotated in XML;
  • how to analyze the ELTeC corpus with basic NLP techniques;
  • and finally a simple proposal to overcome language barriers.

Where is the ELTeC corpus?

Extracting information from XML

  1. Extracting author and gender from one collection (ELTeC-SPA)
  2. Extracting author and gender from two (or more) collections (ELTeC-SPA and ELTeC-ENG)
  3. Extracting code switching
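
As a rough illustration of the three tasks above (this is a sketch, not the code of the notebooks), the snippet below reads the author, the author gender and the languages of `<foreign>` passages from one TEI file using only the Python standard library. It assumes that gender is encoded as `<authorGender key="..."/>` and code switching as `<foreign xml:lang="...">`; elements are matched by local name, so the exact namespaces do not matter. The sample document and the helper names `local` and `describe` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree expands the xml:lang attribute to this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, heavily shortened ELTeC-style file, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="es">
  <teiHeader>
    <fileDesc><titleStmt>
      <title>Example novel</title>
      <author>Doe, Jane (1850-1920)</author>
    </titleStmt></fileDesc>
    <profileDesc><textDesc>
      <authorGender key="F"/>
    </textDesc></profileDesc>
  </teiHeader>
  <text><body><p>Dijo <foreign xml:lang="fr">bonjour</foreign> y luego
    <foreign xml:lang="en">good bye</foreign>.</p></body></text>
</TEI>"""


def local(tag):
    """Strip the '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def describe(root):
    """Return author, author gender and code-switching counts for one novel."""
    info = {"author": None, "gender": None, "foreign": Counter()}
    for el in root.iter():
        name = local(el.tag)
        if name == "author" and info["author"] is None:
            # Keep only the first <author> found (the one in the title statement).
            info["author"] = (el.text or "").strip()
        elif name == "authorGender":
            info["gender"] = el.get("key")
        elif name == "foreign":
            # Count foreign-language passages per declared language.
            info["foreign"][el.get(XML_LANG, "unknown")] += 1
    return info


info = describe(ET.fromstring(SAMPLE))
print(info["author"], info["gender"], dict(info["foreign"]))
```

For a whole collection, or for several collections, the same function can be applied to `ET.parse(path).getroot()` for every XML file found with `pathlib.Path(folder).glob("*.xml")`, where `folder` is wherever the collection has been downloaded.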

Applying basic NLP techniques to analyze the ELTeC corpus (with SpaCy)

  1. Analyzing Part of Speech of a novel from the ELTeC-SPA with SpaCy.
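
A minimal sketch of this kind of analysis, not taken from the notebook: the novel is tagged with spaCy and the relative frequency of each coarse part-of-speech tag is computed. The model name `es_core_news_sm` and the helper names `pos_distribution` and `analyze` are assumptions; any Spanish spaCy pipeline with a tagger should work in the same way.

```python
from collections import Counter


def pos_distribution(doc):
    """Relative frequency of each coarse POS tag, ignoring whitespace tokens."""
    counts = Counter(tok.pos_ for tok in doc if not tok.is_space)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in counts.most_common()}


def analyze(text, model="es_core_news_sm"):
    """Tag `text` with spaCy and return its POS distribution.

    The model name is an assumption; it must be installed beforehand.
    """
    import spacy  # imported here so the counting helper works without spaCy

    nlp = spacy.load(model)
    return pos_distribution(nlp(text))
```

Calling `analyze(plain_text_of_the_novel)` returns a dictionary such as `{"NOUN": ..., "VERB": ...}`. A full novel can exceed spaCy's default `nlp.max_length`, so it may be necessary to raise that limit or to process the text paragraph by paragraph and add up the counts.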

Overcoming language barriers

This is only an example of how to extract stylometric relations between novels in several languages. Unfortunately, it cannot be run in COLAB.

The inter-lingual representation is based on WordNet synsets. The stylometric relations are extracted with the R package "Stylo".
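
The idea behind such a representation can be illustrated with a toy sketch in Python. It is not the pipeline used for the results: the lexicon below is a hypothetical stand-in for a multilingual WordNet, and the concept IDs are invented placeholders. Each lemma is replaced by a language-independent identifier, so that novels written in different languages end up sharing one vocabulary whose frequencies a stylometric tool can compare.

```python
from collections import Counter

# Toy lexicon (hypothetical): (language, lemma) -> language-independent concept ID.
# A real setup would take these IDs from the synsets of a multilingual WordNet.
TOY_LEXICON = {
    ("spa", "casa"): "C001", ("eng", "house"): "C001",
    ("spa", "perro"): "C002", ("eng", "dog"): "C002",
    ("spa", "noche"): "C003", ("eng", "night"): "C003",
}


def to_concepts(lemmas, lang, lexicon=TOY_LEXICON):
    """Replace each lemma with its concept ID; drop lemmas with no entry."""
    return [lexicon[(lang, w)] for w in lemmas if (lang, w) in lexicon]


# Two short "texts" in different languages become comparable frequency tables.
spa = Counter(to_concepts(["casa", "perro", "casa", "y"], "spa"))
eng = Counter(to_concepts(["house", "dog", "house", "and"], "eng"))
print(spa)
print(eng)
```

In this sketch, function words without an entry are simply dropped, and word sense ambiguity is ignored; both are simplifications of the toy example.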

Some results: