barrust/mediawiki

parse_section_links does not allow parsing links in the intro (summary)

ldorigo opened this issue · 3 comments

Hi, would it be possible to let page.parse_section_links also parse links in the first paragraph of a page? Since it isn't part of a section, it's currently impossible to extract links from it...

@ldorigo, can you provide an example of what you are seeing and what you would like to accomplish, such as a wiki page I can review? I am not sure when I will be able to get to it, but PRs are always welcome!

For instance, for https://en.wikipedia.org/wiki/Caffeine, if I do

from mediawiki import MediaWiki
mw = MediaWiki()  # defaults to English Wikipedia
page = mw.page("Caffeine")
page.sections

It gives me the list

['Use',
 'Medical',
 'Enhancing performance',
 'Cognitive',
 'Physical',
 'Specific populations',
 'Adults',
...]

And I can then get the links from those sections with page.parse_section_links(<section_name>). However, the first part of the page ("Caffeine is a central nervous system (CNS) stimulant of the methylxanthine class....") is not part of any section, and thus it isn't currently possible to extract links from it.
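To make the request concrete, here is a rough sketch of the behaviour I have in mind: collect every link that appears before the first heading. It is not how the library works internally. It uses only the standard library's html.parser, and the names IntroLinkParser and intro_links, as well as the sample HTML fragment, are made up for illustration.

```python
from html.parser import HTMLParser


class IntroLinkParser(HTMLParser):
    """Collect (text, href) pairs for links that appear before the first heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._done = False   # set once the first heading is reached
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments inside the open <a>

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag in self.HEADINGS:
            # Everything from here on belongs to a named section.
            self._done = True
            self._href = None
        elif tag == "a":
            # Anchors without an href leave self._href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if not self._done and self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._done and tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def intro_links(html):
    """Return the links found in the part of the HTML that precedes any heading."""
    parser = IntroLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical fragment shaped like the start of a wiki page (not real page HTML).
sample = (
    '<p><b>Caffeine</b> is a '
    '<a href="/wiki/Central_nervous_system">central nervous system</a> '
    '(CNS) <a href="/wiki/Stimulant">stimulant</a>.</p>'
    '<h2>Use</h2>'
    '<p><a href="/wiki/Medicine">Medical</a></p>'
)
print(intro_links(sample))
```

On the sample above, only the two links in the opening paragraph are returned; the link under the "Use" heading is left out. In practice the input would be the rendered HTML of the page, assuming that is available from the page object, and the real markup may need extra filtering (infoboxes, footnote markers and so on).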

Thanks for looking into it and for providing the library!

I've submitted two pull requests: one is a small optimization, and the other addresses this issue. You can review them when you have time.