attardi/wikiextractor

templates are not extracted correctly

vrnmthr opened this issue · 0 comments

I am trying to use this project to clean wikitext and extract plain text, but it appears that not all templates are being expanded. I am calling the following function to clean the wikitext, with expand_templates=True:

def clean_markup(markup, keep_links=False, ignore_headers=True):

    if not keep_links:
        ignoreTag('a')

    extractor = Extractor(0, '', [])

    # returns a list of strings
    paragraphs = extractor.clean_text(markup,
                                      mark_headers=True,
                                      expand_templates=True,
                                      escape_doc=True)
    resetIgnoredTags()

    if ignore_headers:
        paragraphs = filter(lambda s: not s.startswith('## '), paragraphs)

    return paragraphs

The wikitext is:

The {{nihongo|'''Japan Women's Football League'''|\u65e5\u672c\u5973\u5b50\u30b5\u30c3\u30ab\u30fc\u30ea\u30fc\u30b0|lead=yes|extra=''Nihon Joshi Sakk\u0101 R\u012bgu''}}, commonly known as the {{nihongo|'''Nadeshiko League'''|\u306a\u3067\u3057\u3053\u30ea\u30fc\u30b0|lead=yes|extra=''Nadeshiko R\u012bgu''}}, is a semi-professional [[women's association football]] [[Sports league|league]] in Japan.\n\nThe Nadeshiko League consists of two divisions that correspond to the second and third levels of the [[Japanese association football league system#Women's|Japanese women's football pyramid]] respectively. 

The output of the function is:

The , commonly known as the , is a semi-professional women's association football league in Japan.\n\nThe Nadeshiko League consists of two divisions that correspond to the second and third levels of the Japanese women's football pyramid respectively. 

As we can see, the strings "Japan Women's Football League" and "Nadeshiko League" are missing from the output: both {{nihongo}} templates were dropped entirely instead of being expanded. Capturing these strings is very important for my use case.
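To make the expected result concrete, here is a minimal sketch of a pre-processing step that would preserve the missing names. It is only a workaround idea, not part of wikiextractor: `unwrap_nihongo` and `keep_first_argument` are hypothetical helpers, and the regex assumes the template contains no nested braces and that its first positional argument contains no `|`.

```python
import re

# Matches {{nihongo|...}} templates that contain no nested braces.
# Assumption: nested templates or links with pipes inside the first
# argument are not handled by this simple pattern.
NIHONGO_RE = re.compile(r"\{\{\s*nihongo\s*\|([^{}]*)\}\}", re.IGNORECASE)


def keep_first_argument(match):
    """Return the first positional argument of a matched template."""
    first_arg = match.group(1).split("|", 1)[0]
    return first_arg.strip()


def unwrap_nihongo(markup):
    """Replace each {{nihongo|...}} with its first positional argument."""
    return NIHONGO_RE.sub(keep_first_argument, markup)


sample = "The {{nihongo|'''Nadeshiko League'''|なでしこリーグ|lead=yes}}, is a league."
print(unwrap_nihongo(sample))
```

Running this on the markup before passing it to clean_markup would keep the English names in the text; the remaining ''' bold markers would still need to be removed by the normal cleaning step.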