Filimoa/open-parse

No whitespace in text?

Closed this issue · 2 comments

Discussed in #49

Originally posted by JBGruber June 7, 2024
I tried to parse a few complex PDFs, which worked really well. Then I tried a simpler one and was surprised to see that the result contains no whitespace. Not sure if I'm doing something wrong or if this might be a bug:

import openparse
import urllib.request
urllib.request.urlretrieve("https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0207996&type=printable", "test.pdf")

basic_doc_path = "test.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

print(parsed_basic_doc.nodes[2].text)
#> Abstract<br><br>a1111111111
#> a1111111111
#> a1111111111
#> a1111111111
#> a1111111111<br><br>**Introduction**<br><br>Exploitinginformationinhealth-relatedsocialmediaservicesisofgreatinterestforpatients,
#> researchersandmedicalcompanies.Thechallengeis,however,toprovideeasy,quickand
#> relevantaccesstothevastamountofinformationthatisavailable.Onesteptowardsfacili-
#> tatinginformationaccesstoonlinehealthdataisopinionmining.Eventhoughtheclassifica-
#> tionofpatientopinionsintopositiveandnegativehasbeenpreviouslytackled,mostworks
#> makeuseofmachinelearningmethodsandbagsofwords.Ourfirstcontributionisanexten-
#> siveevaluationofdifferentfeatures,includinglexical,syntactic,semantic,network-based,
#> sentiment-basedandwordembeddingsfeaturestorepresentpatient-authoredtextsfor
#> polarityclassification.Thesecondcontributionofthisworkisthestudyofpolarfacts(i.e.
#> objectiveinformationwithpolarconnotations).Traditionally,thepresenceofpolarfactshas
#> beenneglectedandresearchinpolarityclassificationhasbeenboundedtoopinionated
#> texts.Wedemonstratetheexistenceandimportanceofpolarfactsforthepolarityclassifica-
#> tionofhealthinformation.
#> **Received:**January30,2018

Using copy and paste in a PDF reader, it looks like this:

Exploiting information in health-related social media services is of great interest for patients,
researchers and medical companies. The challenge is, however, to provide easy, quick and
relevant access to the vast amount of information that is available. One step towards facili-
tating information access to online health data is opinion mining. Even though the classifica-
tion of patient opinions into positive and negative has been previously tackled, most works
make use of machine learning methods and bags of words. Our first contribution is an exten-
sive evaluation of different features, including lexical, syntactic, semantic, network-based,
sentiment-based and word embeddings features to represent patient-authored texts for
polarity classification. The second contribution of this work is the study of polar facts (i.e.
objective information with polar connotations). Traditionally, the presence of polar facts has
been neglected and research in polarity classification has been bounded to opinionated
texts. We demonstrate the existence and importance of polar facts for the polarity classifica-
tion of health information.

Same issue here. Have you solved it?

Fixed with v0.5.7
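
For anyone who wants to check whether their install is still affected, here is a rough sanity check. It is only a sketch and not part of open-parse: the helper names `whitespace_ratio` and `looks_collapsed` and the 5% threshold are assumptions chosen for illustration. It measures the share of whitespace characters in a piece of text and compares the first line from the output above with the same line copied from a PDF reader.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_collapsed(text: str, threshold: float = 0.05) -> bool:
    """Heuristic (hypothetical helper): flag text with almost no whitespace.

    The 5% threshold is an arbitrary assumption, not a value used by open-parse.
    """
    return whitespace_ratio(text) < threshold


# First line of the parsed output reported in this issue (no spaces).
broken = (
    "Exploitinginformationinhealth-relatedsocialmediaservices"
    "isofgreatinterestforpatients,"
)

# The same line as copied from a PDF reader.
expected = (
    "Exploiting information in health-related social media services "
    "is of great interest for patients,"
)

print(looks_collapsed(broken))
print(looks_collapsed(expected))
```

The same check can be pointed at `parsed_basic_doc.nodes[2].text` from the snippet in the original post. If it still flags the text after upgrading to v0.5.7, the problem is probably a different one from the bug reported here.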