summary() doesn't work on this website
MChrys opened this issue · 0 comments
MChrys commented
summary() doesn't seem to work on websites where the text is split across many tags.
I encountered this problem specifically on this website:
https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php
import requests
from readability import Document

url = "https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php"
page = requests.get(url).text
doc = Document(page)
doc.summary()
<html><body><div><div id="outer-main">\n\n\n\n<p class="ads tag1">\n\n</p>\n\n\n\n\n\n\n<a
href="" target="_blank" class="btn-piston "/>\n\n\n\n<article>\n<div id="content">\n<div
id="news">\n<div class="grid">\n<div class="contain">\n<div class="row">\n\n<div class="col
full">\n\n<span class="cat">Délices sucrés</span>\n<h1 class="page-title nobg">\nA 19 ans, il est le
plus jeune pâtissier primé au Guide Michelin</h1>\n<p class="meta">\n<span class="author">\nPar
Camille Wong</span>\n|\n<time datetime="2019-01-22T13:12">\n22/01/2019 à 14:30,</time>\nmis à
jour le 22/01/2019</p>\n\n\n<div class="picture first">\n<figure>\n\n<figcaption>\n<p
class="legend">Jessy Rhinn-Auvray (à gauche), 19 ans, et son mentor Nicolas Stamm, 46 ans, lors de la
cérémonie du Guide Michelin, le 21 janvier.\n <strong>@DR</strong>
</p>\n</figcaption>\n</figure>\n</div>\n</div>\n\n</div>\n</div>\n</div>\n</div>\n</div>\n
</article>\n\n</div>\n\n\n</div></body></html>
Almost all of the article's paragraphs are missing from the output.
Maybe you could add an option to the Document object, something like:
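# Sketch of the proposed option; select_worst_candidate() is a suggested new helper
if aggregation_mean:
    best = self.select_best_candidate(candidates).score
    worst = self.select_worst_candidate(candidates).score
    # Concatenate the text of every candidate scoring at least best - worst
    aggregation = ""
    for c in candidates:
        if c.score >= best - worst:
            aggregation += c.text
    return aggregation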
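For illustration, here is a self-contained sketch of that aggregation idea, outside of readability itself. The Candidate class, the aggregate() function, and the sample scores are all hypothetical and not part of the library; the cutoff of best - worst is taken as written in the proposal above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored candidate node."""
    score: float
    text: str


def aggregate(candidates):
    """Join the text of every candidate scoring at least best - worst."""
    if not candidates:
        return ""
    best = max(c.score for c in candidates)
    worst = min(c.score for c in candidates)
    threshold = best - worst
    # Keep candidates in their original order so the text reads naturally
    return "".join(c.text for c in candidates if c.score >= threshold)


# Made-up scores, only to show which candidates pass the cutoff
candidates = [
    Candidate(score=40.0, text="First paragraph. "),
    Candidate(score=35.0, text="Second paragraph. "),
    Candidate(score=5.0, text="Cookie banner. "),
]
print(aggregate(candidates))
```

One caveat with this cutoff: if the worst score is negative, best - worst is larger than the best score and nothing is kept, so a real option would probably need a different threshold.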
I just tried Reader mode in Safari and it works perfectly on this page. It seems to be based on Arc90's Readability as well.