barrust/mediawiki

Method `page.sections` returns HTML markup in some cases

delchiaro opened this issue · 6 comments

Hello,
I'm using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I fetch the corresponding Wikipedia page and inspect its sections through the page.sections property.
In some cases I get HTML markup inside the section titles, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output:
[u'Taxonomy and name', u'Subspecies<sup>&#91;8&#93;</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>&#91;8&#93;</sup>':

print(page.section(page.sections[1]))

output: None

The correct string to pass to the page.section(str) method is simply 'Subspecies'.

I managed to work around this issue with the following function:

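def fixed_sections(page_content, verbose=False):
    """Return the section titles parsed from the plain-text page content."""
    import re
    sections = []
    # Headings look like '\n== Title ==\n' (more '=' for sub-sections).
    section_regexp = r'\n==* .* ==*\n'
    # Use the argument rather than the global `page`;
    # re.findall always returns a list, so no None check is needed.
    for obj in re.findall(section_regexp, page_content):
        obj = obj.lstrip('\n= ').rstrip(' =\n')
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections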

correct_sections = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires running a regular expression over each page's content, so it may not be optimal.
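
An alternative that avoids scanning the full page text would be to clean the titles that page.sections already returns. The following is only a minimal sketch, assuming Python 3 and assuming the unwanted markup is limited to footnote markers in <sup> tags, simple inline tags, and HTML entities. clean_section_title is a hypothetical helper, not part of the mediawiki library:

```python
import html
import re


def clean_section_title(raw_title):
    """Hypothetical helper: reduce an HTML-tainted section title to plain text."""
    # Drop footnote markers such as <sup>&#91;8&#93;</sup> together with their content.
    text = re.sub(r'<sup\b[^>]*>.*?</sup>', '', raw_title, flags=re.DOTALL)
    # Remove any remaining tags but keep the text they wrap.
    text = re.sub(r'<[^>]+>', '', text)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return html.unescape(text).strip()


print(clean_section_title('Subspecies<sup>&#91;8&#93;</sup>'))  # Subspecies
```

The cleaned title could then be passed to page.section(str). Whether this covers every kind of markup Wikipedia puts in headings is not something I have checked.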

Thank you for your interest. I noticed something like this long ago but forgot to get back to it. As sections are only used on demand I am not opposed to using regex. If you want to submit a PR to fix the sections title parsing I would love to review it!

I think two or three tests will fail once this is changed. If you submit a PR and they fail, I can help fix them!
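
To illustrate why the regex cost is acceptable when sections are only used on demand, here is a rough sketch of lazy, cached parsing. LazySectionsPage and parse_count are hypothetical names made up for this example; this is not the library's actual implementation:

```python
import re


class LazySectionsPage:
    """Hypothetical stand-in for a page object; not the library's class."""

    # Same heading pattern as the function proposed in the report above.
    _HEADING = re.compile(r'\n==* .* ==*\n')

    def __init__(self, content):
        self.content = content
        self._sections = None   # filled in on first access
        self.parse_count = 0    # demo-only counter of regex passes

    @property
    def sections(self):
        # Run the regex only the first time the property is read.
        if self._sections is None:
            self.parse_count += 1
            self._sections = [
                match.lstrip('\n= ').rstrip(' =\n')
                for match in self._HEADING.findall(self.content)
            ]
        return self._sections


demo = LazySectionsPage(
    "Intro text.\n\n\n== Taxonomy and name ==\nText.\n\n\n=== Subspecies ===\nMore text.\n"
)
print(demo.sections)     # ['Taxonomy and name', 'Subspecies']
print(demo.sections)     # served from the cache
print(demo.parse_count)  # 1
```

With this shape, pages whose sections are never requested pay nothing, and repeated reads pay for a single regex pass.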

@nagash91 I had some time this evening so I incorporated your change into the 0.3.17 branch. Thank you for the code to make this change! I will likely merge this into the main branch in a day or so and then push an updated version to pypi.

This has been published in version 0.4.0; please let me know if you encounter further issues!

@barrust I tried your latest version and the bug is fixed.
Thank you, and sorry for not uploading the fix; I was really busy during this period.

No problem! Glad it worked and thank you for reporting and providing the solution!