attardi/wikiextractor

TagRE Causes Loss of Large Portion of Page Text

lorr1 opened this issue · 0 comments

The tag regex (tagRE) here causes a break whenever it sees a <br> or <ref> tag, both of which occur frequently in article text.

For example, take the first portion of the <text> tag here:

<text bytes="96841" xml:space="preserve">{{short description|Mental or emotional state of well-being characterized by pleasant emotions}} {{Redirect-several|Happiness|Happy|Gladness|Jolly|}} {{Redirect|Enjoyment|the 2005 video album by Kaiser Chiefs|Enjoyment (video)}} {{Redirect|Cheerful|Royal Navy destroyer|HMS Cheerful (1897)}} {{pp|small=yes}} {{Use dmy dates|date=September 2021}} [[File:My Grandfather Photo from January 17.JPG|thumb|upright=1.2|A smiling 95-year-old man from [[Pichilemu]], Chile]] {{Emotion}} The term '''''happiness''''' is used in the context of [[Mental health|mental]] or [[emotion]]al states, including positive or [[Pleasure|pleasant]] emotions ranging from [[contentment]] to intense [[joy]].<ref name="auto">{{cite web |url=http://www.wolframalpha.com/input/?i=happiness&a=*C.happiness-_*Word- |title=happiness |publisher=Wolfram Alpha |access-date=24 February 2011 |archive-url=https://web.archive.org/web/20110718075432/http://www.wolframalpha.com/input/?i=happiness&a=*C.happiness-_*Word- |archive-date=18 July 2011 |url-status=dead }}</ref> It is also used in the context

Extraction breaks and stops recording text at the first instance of <ref name="auto">. As a result, the extractor sees only the first few sentences of this page, and since <ref> and <br> tags occur in many important pages, a large portion of their text is lost as well.
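
A minimal reproduction may help here. The sketch below assumes the tag regex has roughly the shape shown (the pattern in the repository may differ) and uses a shortened version of the line above. The names tag_re and extract_text_start are hypothetical, chosen only for illustration; extract_text_start is one possible way to cut the body only at a literal </text>, not a proposed patch.

```python
import re

# ASSUMPTION: a pattern modeled on the tag regex this issue refers to.
# Group 3 is [^<]*, so it stops at the first '<' of any kind;
# group 4 then matches whatever tag follows.
tag_re = re.compile(r'(.*?)<(/?\w+)[^>]*?>(?:([^<]*)(<.*?>)?)?')

# Shortened version of the example line from this issue.
line = ('<text bytes="96841" xml:space="preserve">The term happiness is used '
        'for pleasant emotions.<ref name="auto">{{cite web}}</ref> '
        'It is also used')

m = tag_re.search(line)
tag = m.group(2)        # name of the first tag on the line
captured = m.group(3)   # text kept by the regex
next_tag = m.group(4)   # tag that ended the capture

print(tag)
print(captured)
print(next_tag)


def extract_text_start(line):
    """Hypothetical helper: return (body, closed) for a line opening <text>.

    The body runs to a literal </text> if one is on the line, otherwise to
    the end of the line, so inline tags such as <ref> or <br> do not end it.
    """
    start = line.index('>', line.index('<text')) + 1
    if line[start - 2] == '/':          # self-closing <text ... />
        return '', True
    end = line.find('</text>', start)
    if end == -1:
        return line[start:], False      # text continues on later lines
    return line[start:end], True


body, closed = extract_text_start(line)
print(body)
print(closed)
```

With the assumed pattern, group 3 ends right before <ref name="auto"> and group 4 is that <ref> tag, so code that treats a non-empty group 4 as the closing tag would stop collecting text at that point. The helper keeps the rest of the line and reports the text as still open.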