Some experiments with extracting information from a subtitle file

Applying some NLP tools and techniques to a subtitle file.

Each script here produces one or more .csv files in the data folder. These files are usually consumed by subsequent scripts, so make sure to run the files in numerical order. For further information see the inline documentation.

See the Jupyter notebook (information_extraction_from_subtitle_file.ipynb) for some results and interpretations (in German).

Install

  • install requirements (python -m pip install -r requirements.txt)
  • set up the .env variables (see below)
  • run the files in numerical order

Adjust .env settings

subtitles_url = a URL to a subtitles .xml file
extracted_text_file = filename for the extracted contents
data_folder = folder where the data lives, usually './data/', keep the trailing slash
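
A minimal example .env, assuming the usual KEY=value dotenv syntax; the values are placeholders, not real settings:

```
subtitles_url=https://example.org/subtitles.xml
extracted_text_file=subtitles.txt
data_folder=./data/
```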

  1. 01_extract_textinfo_from_subtitles_xml.py takes a subtitle XML file (with the XML namespace of Timed Text Markup Language (TTML) 1.0) and extracts its text content. The text content is saved to the file 1_{NAME}.txt. Make sure to run this step, since the results of step 1) are not in the repo due to copyright concerns. (A sketch of the extraction follows after this list.)

  2. 02_spacy_term_extraction.py spaCy NER does not really work with these types of texts and would need better fine-tuning. Common_words.csv, however, seems to be useful: it contains lemmatized nouns ordered by their frequency (see the sketch below).

  3. 03_flair_term_extraction.py does named entity extraction with the flair library (https://github.com/flairNLP/flair). Switch the model in use inside the file. Produces CSV files named after the model, e.g. _flair_ners_german-large.csv (see the sketch below).

  4. 04_aggr_csvs_and_do_some_counting.py takes the results of the former steps and aggregates the data into a more readable format. It takes named entities of type "ORG" and "MISC" from the files produced by step 3) (see the sketch below).

  5. 05_split_german_compound_nouns.py splits German compound nouns (zusammengesetzte Substantive) into their parts (see the sketch below),
    e.g. "Gasheizung" -> "Gas", "Heizung"

  6. 06_synonyms_from_wikidata.py runs some SPARQL queries to get synonyms for terms. The term list used here was manually extracted from the website's meta keywords. Produces a CSV file with the columns Search-Term, Wikidata-Result, Wikidata-URL, Wikidata-ID, and Aliases (see the sketch below).

  7. 07_synonyms_from_wikidata_with_results_from_05.py is the same as 6), but the terms are taken from the files produced by steps 2) and 5).
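
For step 1), a minimal sketch of the TTML text extraction, assuming the TTML 1.0 namespace http://www.w3.org/ns/ttml (older files may use http://www.w3.org/2006/10/ttaf1); file names are illustrative, not the repo's actual code:

```python
import xml.etree.ElementTree as ET

# TTML 1.0 namespace (assumption; older files may use http://www.w3.org/2006/10/ttaf1)
TTML_NS = {"tt": "http://www.w3.org/ns/ttml"}

tree = ET.parse("subtitles.xml")
# Collect the text of every <p> element, including nested spans
lines = ["".join(p.itertext()).strip()
         for p in tree.getroot().iterfind(".//tt:p", TTML_NS)]

with open("data/1_subtitles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(line for line in lines if line))
```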
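For step 2), a minimal sketch of the lemmatized-noun frequency count, assuming the German model de_core_news_sm is installed (python -m spacy download de_core_news_sm); the file and column names are assumptions:

```python
from collections import Counter
import spacy

nlp = spacy.load("de_core_news_sm")
text = open("data/1_subtitles.txt", encoding="utf-8").read()

# Count lowercased lemmas of all tokens tagged as nouns
counts = Counter(tok.lemma_.lower() for tok in nlp(text) if tok.pos_ == "NOUN")

with open("data/common_words.csv", "w", encoding="utf-8") as f:
    f.write("lemma,count\n")
    for lemma, n in counts.most_common():
        f.write(f"{lemma},{n}\n")
```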
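For step 3), a minimal sketch of flair NER tagging; the model name is a placeholder, since the script switches models internally:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Loading the large German model downloads several GB on first use
tagger = SequenceTagger.load("flair/ner-german-large")

sentence = Sentence("Die Stadtwerke München haben die Gasheizung installiert.")
tagger.predict(sentence)

# Each span carries the entity text, its tag (PER/LOC/ORG/MISC) and a confidence score
for span in sentence.get_spans("ner"):
    print(span.text, span.tag, span.score)
```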
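For step 4), a sketch of the aggregation with pandas; the glob pattern and the column names ("text", "tag") are assumptions about the CSVs from step 3):

```python
import glob
import pandas as pd

# Read all per-model NER result files produced by step 3)
frames = [pd.read_csv(path) for path in glob.glob("data/*_flair_ners_*.csv")]
df = pd.concat(frames, ignore_index=True)

# Keep only ORG and MISC entities and count how often each surface form occurs
df = df[df["tag"].isin(["ORG", "MISC"])]
counts = (df.groupby(["text", "tag"]).size()
            .reset_index(name="count")
            .sort_values("count", ascending=False))
counts.to_csv("data/4_aggregated_entities.csv", index=False)
```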
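For step 5), a naive greedy sketch of the idea behind compound splitting; the repo may well use a dedicated splitter library, and the tiny vocabulary here is purely illustrative (real splitters also handle linking elements such as the "s" in "Arbeitszimmer"):

```python
# Purely illustrative vocabulary (assumption); a real splitter uses a large lexicon
VOCAB = {"gas", "heizung", "wasser", "pumpe"}

def split_compound(word: str) -> list[str]:
    """Greedily split a German compound into known vocabulary parts."""
    word = word.lower()
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in VOCAB:
            rest = split_compound(tail)
            if rest:
                return [head] + rest
    return [word] if word in VOCAB else []

print(split_compound("Gasheizung"))  # -> ['gas', 'heizung']
```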
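For step 6), a minimal sketch of a Wikidata alias lookup with SPARQLWrapper; the query shape is an assumption, not the repo's actual query:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="subtitle-nlp-experiments")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
SELECT ?item ?itemLabel ?alias WHERE {
  ?item rdfs:label "Heizung"@de .
  OPTIONAL { ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "de") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
LIMIT 20""")

# Each binding gives the entity URI (the Wikidata-ID is its last path segment),
# a label, and optionally one German alias per row
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"],
          row["itemLabel"]["value"],
          row.get("alias", {}).get("value", ""))
```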

About

Experiments with some NLP tools.
Any hints are welcome: me@larslo.de