5j9/wikitextparser

include_subsections seems to be ignored in get_sections()

matthew-at-qamcom opened this issue · 4 comments

I downloaded an example wikitext from Wikipedia: anarchism.wiki.txt

I do not understand why I get the same number of sections irrespective of the value of include_subsections.

import wikitextparser as wtp

filename = "anarchism.wiki.txt"
with open(filename, "rt", encoding="utf-8") as infile:
    wikitext = infile.read()

parsed = wtp.parse(wikitext)

# There are 29 sections here
print("USING SECTIONS")
print(len(parsed.sections))

# There are also 29 sections here
print("\nWITH SUBSECTIONS")
sections = parsed.get_sections(include_subsections=True)
print(len(sections))

# But this also has 29 sections !?!
print("\nWITHOUT SUBSECTIONS")
no_subsections = parsed.get_sections(include_subsections=False)
print(len(no_subsections))

outputs:

USING SECTIONS
29

WITH SUBSECTIONS
29

WITHOUT SUBSECTIONS
29

The Wikipedia page for the same content shows that the article is structured into sections, subsections, and sub-subsections.

5j9 commented

The include_subsections parameter only determines whether each parent section object includes the text of its subsections. It does not change the number of sections returned:

import wikitextparser as wtp

wikitext = """
lead

=== 1 ===
text1

== 2 ==
text2

=== 3 ===
text3
"""

print("WITH SUBSECTIONS")
parsed = wtp.parse(wikitext)
sections = parsed.get_sections(include_subsections=True)
print(sections[2].string)

print("WITHOUT SUBSECTIONS")
no_subsections = parsed.get_sections(include_subsections=False)
print(no_subsections[2].string)

This prints:

WITH SUBSECTIONS
== 2 ==
text2

=== 3 ===
text3


WITHOUT SUBSECTIONS
== 2 ==
text2

wikitextparser currently does not provide an easy way to fetch only the sections that are not part of another section. You could pass the level parameter if you know the top heading level (usually 2), or you could iterate over all sections, find the smallest level number present, and filter the results to that level.
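The iterate-and-filter workaround could be sketched roughly as below. The helper name top_level_sections and the StandInSection tuple are hypothetical, introduced only so the snippet runs without the library. The one thing assumed about a real section object is an integer level attribute, with 0 for the lead section.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed section, so the sketch runs on its own.
# Only the integer `level` attribute is assumed (0 for the lead section).
StandInSection = namedtuple("StandInSection", ["level", "title"])


def top_level_sections(sections):
    """Return the sections whose heading level is the smallest one present.

    The lead section (level 0) is ignored when looking for that level.
    """
    levels = [s.level for s in sections if s.level > 0]
    if not levels:
        return []
    top = min(levels)
    return [s for s in sections if s.level == top]


# Same heading layout as the example above: lead, === 1 ===, == 2 ==, === 3 ===
sections = [
    StandInSection(0, None),
    StandInSection(3, "1"),
    StandInSection(2, "2"),
    StandInSection(3, "3"),
]
print([s.title for s in top_level_sections(sections)])
```

With real data, the list returned by parsed.get_sections(include_subsections=True) would be passed in instead of the stand-in list. Note one limitation of this filter: a deeper heading that appears before the first top-level heading (like "=== 1 ===" above) has no parent, yet it is dropped because its level is not the smallest.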

If you find it useful, I could add a new parameter, e.g. top_level_only or parentless_only, for this purpose.

5j9 commented

The top_levels_only argument was added in v0.55.0.

matthew-at-qamcom commented

Thanks very much for your help (and for creating and maintaining wikitextparser!). In the end, my approach was to process the output of wikitextparser and turn it into a tree structure (using the heading level of each section). This allowed me to then extract the parts of the document that I was looking for.

Thanks again,

Matthew
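
For readers who want to try the tree approach described in the last comment, here is a minimal sketch. It is an illustration and not the code Matthew actually used. The function names build_section_tree and outline, the dict-based node layout, and the sample titles are all hypothetical. The input is assumed to be a flat list of (level, title) pairs in document order.

```python
def build_section_tree(flat):
    """Nest a flat, document-ordered list of (level, title) pairs.

    Each node is a dict with "level", "title" and "children" keys.
    The returned root is a synthetic node with level 0.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # path from the root to the most recently added node
    for level, title in flat:
        node = {"level": level, "title": title, "children": []}
        # Pop until the top of the stack is a shallower heading: the parent.
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Return the tree below `node` as a list of indented title lines."""
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["title"])
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up headings, not taken from the Anarchism article.
flat = [(2, "A"), (3, "A.1"), (3, "A.2"), (2, "B"), (3, "B.1"), (4, "B.1.a")]
tree = build_section_tree(flat)
print("\n".join(outline(tree)))
```

With wikitextparser, the pairs could presumably be collected from each section's level and title attributes, skipping the lead section (level 0) because it has no title to print.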