laughingclouds/Scrapia-World

Formatting the saved pages

laughingclouds opened this issue · 2 comments

Is your feature request related to a problem? Please describe.

  • Opening the saved HTML page is a nightmare. It's ugly.
  • There's no dark mode.
  • The SVGs embedded in the HTML are oversized.
  • Someone probably used Tailwind/Bootstrap for styling the page.

Describe the solution you'd like

  • Dark mode (by default) for the text
  • A way to clean out the unnecessary HTML

Describe alternatives you've considered

  • Create one more CLI (lol) to work with the HTML.
  • Use bs4 (Beautiful Soup), maybe

Additional context
None

I tried bs4 a little bit. There are many ways of separating the content from the rest of the HTML document.

One way might be

# p.text is the text inside a paragraph tag
for p in soup.find_all("p"):
    print(p.text)

But when I ran this against a document, the output was garbled.
I checked the document and found many stray '\n' characters inside the paragraphs.

What we could do is clean up the text within every paragraph.
Then we keep only the tags we want and insert them all into an HTML template.

I was also thinking of storing that "template" HTML, along with its style rules, in a separate place.
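A rough sketch of what that template idea could look like, using only the standard library. Everything here is an assumption for illustration: the names PAGE_TEMPLATE and build_page are hypothetical, and the colours and widths in the CSS are placeholders rather than anything decided for the project.

```python
from html import escape
from string import Template

# Hypothetical template; the CSS values are placeholders, not project decisions.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>$title</title>
<style>
body { background: #121212; color: #e0e0e0; max-width: 40em; margin: 2em auto; }
</style>
</head>
<body>
$body
</body>
</html>
""")


def build_page(title: str, paragraphs: list[str]) -> str:
    """Wrap already-cleaned paragraph texts in <p> tags and fill the template."""
    # escape() keeps characters such as '<' and '&' from being read as markup
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), body=body)
```

The template string could just as well live in its own file next to a stylesheet; keeping it inline here only makes the sketch self-contained.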

This piece of code does a good job of handling the text formatting, though it still needs improvement.

from bs4 import BeautifulSoup


def fixLine(lineText: str) -> str:
    """lineText is a single line of a paragraph"""
    # split() with no arguments already drops all whitespace, including '\n',
    # so the words need no extra filtering before being re-joined
    return " ".join(lineText.split())


def fixPara(pText: str) -> str:
    """pText is text within a paragraph tag"""
    # collapse every run of whitespace (including '\n') into a single space
    return " ".join(pText.split())

fName = "HTML_FILE_NAME"
with open(fName) as fp:
    soup = BeautifulSoup(fp, "html.parser")

# one cleaned paragraph per line, without trailing newlines
s = "\n".join(fixPara(p.text) for p in soup.find_all("p")).rstrip("\n")
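
For the oversized SVGs and the other unnecessary markup, one option might be to drop those tags before extracting the paragraphs. This is only a sketch: strip_unwanted and UNWANTED_TAGS are hypothetical names, and the list of tags is a guess at what the saved pages contain.

```python
from bs4 import BeautifulSoup

# Assumed set of tags to drop; adjust to whatever the saved pages contain.
UNWANTED_TAGS = ["svg", "script", "style"]


def strip_unwanted(html: str) -> BeautifulSoup:
    """Parse html and remove every tag named in UNWANTED_TAGS, children included."""
    soup = BeautifulSoup(html, "html.parser")
    # Look up one match at a time so a tag nested inside an
    # already-removed tag is never touched twice.
    while True:
        tag = soup.find(UNWANTED_TAGS)
        if tag is None:
            break
        tag.decompose()
    return soup
```

The returned soup can then go through the same find_all("p") loop as above.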