include_subsections seems to be ignored in get_sections()
matthew-at-qamcom opened this issue · 4 comments
I downloaded an example wikitext from Wikipedia: anarchism.wiki.txt
I do not understand why I get the same number of sections irrespective of the value of include_subsections.
import wikitextparser as wtp
filename = "anarchism.wiki.txt"
with open(filename, "rt", encoding="utf-8") as infile:
    wikitext = infile.read()
parsed = wtp.parse(wikitext)
# There are 29 sections here
print("USING SECTIONS")
print(len(parsed.sections))
# There are 29 sections here
print("\nWITH SUBSECTIONS")
sections = parsed.get_sections(include_subsections=True)
print(len(sections))
# But this also has 29 sections !?!
print("\nWITHOUT SUBSECTIONS")
no_subsections = parsed.get_sections(include_subsections=False)
print(len(no_subsections))
outputs:
USING SECTIONS
29
WITH SUBSECTIONS
29
WITHOUT SUBSECTIONS
29
The Wikipedia page for the same content shows that the article is structured with a number of sections, subsections, and sub-subsections.
The include_subsections parameter only determines whether the parent section objects include the text of their subsections. It does not change the number of returned sections:
import wikitextparser as wtp
wikitext = """
lead
=== 1 ===
text1
== 2 ==
text2
=== 3 ===
text3
"""
print("WITH SUBSECTIONS")
parsed = wtp.parse(wikitext)
sections = parsed.get_sections(include_subsections=True)
print(sections[2].string)
print("WITHOUT SUBSECTIONS")
no_subsections = parsed.get_sections(include_subsections=False)
print(no_subsections[2].string)
Which will print:
WITH SUBSECTIONS
== 2 ==
text2
=== 3 ===
text3
WITHOUT SUBSECTIONS
== 2 ==
text2
wikitextparser currently does not provide an easy way to fetch only the sections that are not part of another section. You could specify the level parameter if you know what the lowest level is (usually it's 2), or you could iterate over all sections, determine the lowest level, and then filter the results.
If you find it useful, I could add a new parameter, e.g. top_level_only or parentless_only, for this purpose.
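The second workaround mentioned above (iterate over all sections, find the lowest level, then filter) could look roughly like the sketch below. It only assumes that each section object exposes an integer level attribute and that the lead section reports level 0. The helper name lowest_level_sections and the SimpleNamespace stand-ins are made up for illustration so the snippet runs without any input file; in real use you would pass the list returned by parsed.get_sections(include_subsections=False) instead.

```python
from types import SimpleNamespace


def lowest_level_sections(sections):
    """Return the sections whose heading level is the smallest one present.

    Hypothetical helper, not part of wikitextparser. Works on any objects
    that have an integer ``level`` attribute.
    """
    # Assumption: level 0 marks the lead section, which has no heading.
    headed = [s for s in sections if s.level > 0]
    if not headed:
        return []
    lowest = min(s.level for s in headed)
    return [s for s in headed if s.level == lowest]


# Stand-ins mirroring the small example above: lead, === 1 ===, == 2 ==, === 3 ===
demo = [
    SimpleNamespace(level=0, title=None),
    SimpleNamespace(level=3, title="1"),
    SimpleNamespace(level=2, title="2"),
    SimpleNamespace(level=3, title="3"),
]

top = lowest_level_sections(demo)
print([s.title for s in top])
```

Note that this keeps only the sections at the lowest level. A deeper section that appears before the first lowest-level heading (such as "=== 1 ===" here) has no parent but is still dropped, so this is not quite the same as "sections that are not part of another section".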
The top_levels_only argument was added in v0.55.0.
Thanks very much for your help (and for creating and maintaining wikitextparser!). In the end, my approach was to process the output of wikitextparser and turn it into a tree structure (using the index level of each section). This allowed me to then extract out the parts of the document that I was looking for.
Thanks again,
Matthew
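For anyone landing here later, the tree approach described above might be sketched as follows. This is not Matthew's actual code, just one possible way to do it: it takes a flat list of (level, title) pairs in document order and nests them with a stack. The function name build_section_tree and the dict layout are assumptions made for this example; with wikitextparser you could presumably build the input from the level and title attributes of each section, skipping the lead section.

```python
def build_section_tree(headings):
    """Nest a flat, ordered list of (level, title) pairs into a tree.

    Hypothetical helper: each node is a dict with "level", "title" and
    "children". The returned root is a placeholder with level 0.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # path from the root to the most recent node
    for level, title in headings:
        node = {"level": level, "title": title, "children": []}
        # Pop until the top of the stack is a strictly shallower heading,
        # which is then the parent of the new node. The root is never popped.
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# Headings from the small example earlier in the thread.
tree = build_section_tree([(3, "1"), (2, "2"), (3, "3")])
for child in tree["children"]:
    print(child["title"], [c["title"] for c in child["children"]])
```

Here "1" and "2" both end up directly under the root, and "3" becomes a child of "2", so the parts of the document you need can be pulled out by walking the children lists.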