Formatting the saved pages

Question

laughingclouds opened this issue 3 years ago · 2 comments

laughingclouds commented 3 years ago

Is your feature request related to a problem? Please describe.

Opening the saved HTML page is a nightmare. It's ugly.
There's no dark mode.
The SVGs embedded in the HTML are oversized.
Someone probably used Tailwind/Bootstrap to style the page.

Describe the solution you'd like

(Default) Dark mode for the text
A way to clean out the unnecessary HTML code

Describe alternatives you've considered

Create one more CLI (lol) to work with the HTML code.
Maybe use bs4 for it

Additional context
None

Answer 1 · 2022-02-07T12:49:02.000Z

I tried bs4 a little bit. There are many ways to separate the content from the rest of the HTML document.

One way might be:

# Assumes soup is a BeautifulSoup object built from the saved page
# p.text is the text inside a paragraph tag
for p in soup.find_all("p"):
    print(p.text)

But when I ran this against a document, the output was badly broken up.
I checked the doc and there were a lot of stray '\n' characters inside the paragraphs.

What we could do is clean up the text within every paragraph.
Then we keep just the tags we want and insert them all into an HTML template.

I was also thinking of storing that "template" HTML code, along with its style rules, in a separate place.
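
A minimal sketch of the template idea, using only the standard library. The names PAGE_TEMPLATE and build_page are hypothetical, and the colors and layout values are placeholders rather than anything decided in this thread. It assumes the paragraph text has already been pulled out and cleaned, and it only shows the "insert into a template with dark-mode styles" step.

```python
from html import escape
from string import Template

# Hypothetical template: the page skeleton and the style rules live in one
# place, separate from the code that extracts the content.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>$title</title>
<style>
  body { background: #121212; color: #e0e0e0; }
  body { max-width: 45em; margin: 2em auto; line-height: 1.6; }
</style>
</head>
<body>
$body
</body>
</html>
""")


def build_page(paragraphs, title="Saved page"):
    """Wrap already-cleaned paragraph strings in the dark-mode template."""
    # Escape the text so characters like < and & stay literal in the output,
    # and skip paragraphs that ended up empty after cleaning.
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs if text)
    return PAGE_TEMPLATE.substitute(title=escape(title), body=body)


if __name__ == "__main__":
    print(build_page(["First paragraph.", "Second paragraph."]))
```

Keeping the template as a single constant (or a separate file loaded at startup) means the styling can change without touching the extraction code.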

Answer 2 · 2022-02-07T13:06:30.000Z

This piece of code does a good job of handling the text formatting, though it still needs improvement.

from bs4 import BeautifulSoup


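def fixLine(lineText: str) -> str:
    """lineText is a single line of a paragraph"""
    # split() with no arguments drops all runs of whitespace,
    # so joining with one space collapses the extra spacing
    return " ".join(lineText.split())


def fixPara(pText: str) -> str:
    """pText is text within a paragraph tag"""
    # Clean each line, drop the lines that were only whitespace,
    # and rejoin the rest into a single line of text
    lines = [fixLine(line) for line in pText.splitlines()]
    return " ".join(line for line in lines if line)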

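fName = "HTML_FILE_NAME"  # placeholder: path to the saved HTML page
with open(fName) as fp:
    soup = BeautifulSoup(fp, "html.parser")

# One cleaned paragraph per line, with trailing newlines removed
s = "\n".join(fixPara(p.text) for p in soup.find_all("p")).rstrip("\n")
print(s)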