Some experiments with extracting information from a subtitle file

Applying some NLP tools and techniques to a subtitle file.

Each file here writes one or more .csv files to the data folder. These files are usually read by subsequent files, so make sure to run the files in order, from step 1 onward. For further information, see the inline documentation.

See the Jupyter notebook (information_extraction_from_subtitle_file.ipynb) for some results and interpretations (in German).


  • install requirements (python -m pip install -r requirements.txt)
  • set up the .env variables
  • run the files in order, from step 1 onward

Adjust .env settings

  • subtitles_url = a URL to a subtitles.xml file
  • extracted_text_file = filename of the extracted contents
  • data_folder = folder where the data lives, usually './data/' (keep the trailing slash)
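
As a rough illustration of how these three settings fit together, here is a minimal sketch that parses .env-style text using only the standard library. The function name parse_env and all example values are hypothetical and not taken from the repository, which may well read its settings differently (for example with a dedicated dotenv package).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY = value lines; blank lines and # comments are skipped."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# Hypothetical example values, for illustration only.
EXAMPLE_ENV = """
# settings used by the scripts
subtitles_url = https://example.org/subtitles.xml
extracted_text_file = 1_example.txt
data_folder = './data/'
"""

settings = parse_env(EXAMPLE_ENV)

# Because data_folder keeps its trailing slash, plain string
# concatenation already yields a usable path.
text_path = settings["data_folder"] + settings["extracted_text_file"]
print(text_path)
```

This also shows why the trailing slash matters: without it, the concatenated path would run the folder name and the filename together.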

  1. Takes a subtitle XML file (using the XML namespace of Timed Text Markup Language (TTML) 1.0) and extracts the text content. The text content is saved to the file 1_{NAME}.txt. Make sure to run this step, since its results are not in the repo due to some copyright concerns.

  2. spaCy NER does not really work with these types of texts; it needs better fine-tuning. The file Common_words.csv seems to be useful instead. It contains simple lemmatized nouns ordered by their frequency.

  3. Performs named entity extraction with the flair library (LINK). Switch the model in use inside the file. This step is supposed to produce CSV files named after the model, e.g. _flair_ners_german-large.csv

  4. Takes the results from the previous steps and aggregates the data into a more readable format. It takes named entities of type "ORG" and "MISC" from the files produced by step 3.

  5. Splits German compound words (zusammengesetzte Substantive, i.e. compound nouns) into their parts,
    e.g. "Gasheizung" -> "Gas", "Heizung"

  6. Runs some SPARQL queries to get synonyms for terms. The term list used here was manually extracted from the website's meta keywords. Produces a CSV file with the columns Search-Term, Wikidata-Result, Wikidata-URL, Wikidata-ID, Aliases.

  7. Same as step 6, but the terms are extracted from the files produced by steps 2 and 5.
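
To make step 1 more concrete, the following is a minimal sketch of pulling text out of a TTML document with Python's standard library. It is not the repository's actual implementation: the function name extract_text and the sample document are hypothetical, and it assumes the subtitle text sits in <p> elements under the TTML namespace http://www.w3.org/ns/ttml.

```python
import xml.etree.ElementTree as ET

# Namespace of TTML documents (assumed to match the subtitle file in use).
TTML_NS = "http://www.w3.org/ns/ttml"


def extract_text(ttml_source: str) -> list[str]:
    """Return one string per <p> element, with nested markup flattened."""
    root = ET.fromstring(ttml_source)
    lines = []
    for paragraph in root.iter(f"{{{TTML_NS}}}p"):
        # itertext() also yields text inside <span> and after <br/>.
        # Joining with spaces is a simplification; it can put a space
        # before punctuation that directly follows a nested element.
        parts = (part.strip() for part in paragraph.itertext())
        text = " ".join(part for part in parts if part)
        if text:
            lines.append(text)
    return lines


# Hypothetical sample document, for illustration only.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.000">Die <span>Gasheizung</span> ist alt.</p>
      <p begin="00:00:03.500" end="00:00:05.000">Erste Zeile<br/>zweite Zeile</p>
    </div>
  </body>
</tt>"""

subtitle_lines = extract_text(SAMPLE)
for line in subtitle_lines:
    print(line)
```

Writing the returned lines to a text file, one per line, would give the kind of plain-text input the later steps work on.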


Experiments with some NLP tools.
Any hints welcome.