This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
pip install wiktionary-de-parser
- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
- Allows you to add your own extraction methods (pass them as argument)
- Yields per section, not per page (a word can have multiple meanings, so a single Wiktionary page can contain multiple sections)
from bz2 import BZ2File
from wiktionary_de_parser import Parser
bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)
for record in Parser(bz_file):
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'
Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
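Because the parser yields one record per section, a page with several sections produces several records. The sketch below shows one possible way to collect them under their page title. It assumes only the 'title' and 'lang_code' keys shown elsewhere in this README; the helper name group_by_title and the sample records are hypothetical stand-ins for real parser output.

```python
from collections import defaultdict


def group_by_title(records, lang_code='de'):
    """Collect records under their page title, keeping one language only.

    `records` can be any iterable of dicts, for example Parser(bz_file).
    """
    grouped = defaultdict(list)
    for record in records:
        # Some records may lack 'lang_code', so use .get() instead of indexing
        if record.get('lang_code') != lang_code:
            continue
        grouped[record['title']].append(record)
    return dict(grouped)


# Hand-written stand-ins for parser output (hypothetical, for illustration only)
sample_records = [
    {'title': 'Bank', 'lang_code': 'de', 'lemma': 'Bank'},
    {'title': 'Bank', 'lang_code': 'de', 'lemma': 'Bank'},
    {'title': 'bank', 'lang_code': 'en', 'lemma': 'bank'},
    {'title': 'Abend', 'lang_code': 'de', 'lemma': 'Abend'},
]

grouped = group_by_title(sample_records)
for title, sections in grouped.items():
    print(title, len(sections))
```

With a real dump, passing Parser(bz_file) instead of sample_records should work the same way, though holding every record in memory may be costly for a full dump.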
An extraction method takes the following arguments:
- title (string): The title of the current Wiktionary page
- text (string): The Wikitext of the current word entry/section
- current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])
It must return a Dict with the results, or False if the record was processed unsuccessfully.
# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
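As a more concrete illustration, here is a minimal sketch of an extraction method that collects the targets of wiki links ([[...]]) found in a section's Wikitext. The method name extract_links, the field name 'links', and the sample text are hypothetical; only the three-argument signature and the "Dict or False" return convention come from the description above.

```python
import re

# Matches the target part of [[target]], [[target|label]] or [[target#anchor]]
LINK_PATTERN = re.compile(r'\[\[([^\]|#]+)')


def extract_links(title, text, current_record):
    """Return the wiki link targets found in `text`, or False if there are none."""
    links = [match.strip() for match in LINK_PATTERN.findall(text)]
    return {'links': links} if links else False


# The method can be tried on its own, without the parser:
sample_text = 'Der [[Abend]] ist die [[Tageszeit|Zeit]] vor der [[Nacht]].'
result = extract_links('Abend', sample_text, {'lang_code': 'de'})
print(result)
```

Calling the method directly like this is a convenient way to check it before passing it in via custom_methods=[extract_links].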
Example output for the word "Abend":
{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
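A record like the one above is a plain nested dictionary, so individual values can be read with ordinary dictionary access. The following sketch pulls a few fields out of a shortened copy of the "Abend" record. The helper name noun_summary and its output keys are hypothetical, and it uses .get() on the assumption that not every record carries every field.

```python
def noun_summary(record):
    """Pick a few commonly needed values out of a parser record."""
    flexion = record.get('flexion', {})
    return {
        'lemma': record.get('lemma'),
        'genus': flexion.get('Genus'),
        'genitive_singular': flexion.get('Genitiv Singular'),
        'nominative_plural': flexion.get('Nominativ Plural'),
        'is_noun': 'Substantiv' in record.get('pos', {}),
    }


# Shortened copy of the example output for "Abend"
abend = {
    'flexion': {
        'Genitiv Singular': 'Abends',
        'Genus': 'm',
        'Nominativ Plural': 'Abende',
        'Nominativ Singular': 'Abend',
    },
    'lang_code': 'de',
    'lemma': 'Abend',
    'pos': {'Substantiv': []},
    'title': 'Abend',
}

summary = noun_summary(abend)
print(summary)
```

Fields missing from a record come back as None instead of raising a KeyError.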
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run poetry install inside of the project folder to install dependencies.
- Change wiktionary_de_parser/run.py to your needs.
- Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run tests.
MIT © Gregor Weichbrodt