Formatting the saved pages
laughingclouds opened this issue · 2 comments
Is your feature request related to a problem? Please describe.
- Opening the saved HTML page is a nightmare. It's ugly.
- There's no dark mode.
- The SVGs embedded in the HTML code are oversized.
- Someone probably used Tailwind/Bootstrap for styling the page.
Describe the solution you'd like
- (Default) Dark mode for the text
- A way to clean the unnecessary html code
Describe alternatives you've considered
- Create one more CLI (lol) to work with the html code.
- Use bs4 maybe
Additional context
None
I tried bs4 a little bit. There are many ways of separating the content from the rest of the html document.
One way might be:
    # p.text is the text inside a paragraph tag
    for p in soup.find_all("p"):
        print(p.text)
But when I ran this against a document, the output was garbled.
I checked the doc and there were too many '\n' characters within the paragraphs.
What we could do is format the text within every paragraph.
So we save a bunch of desired tags, and insert them all in an html template.
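A rough sketch of what inserting the saved text into a template could look like, using only the standard library. The template markup, the `buildPage` name, and the dark-mode colours are all assumptions made for illustration, not something that exists in the project.

```python
from html import escape
from string import Template
from typing import Iterable

# Hypothetical page template; $title, $style and $body are filled in later.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>$title</title>
<style>
$style
</style>
</head>
<body>
$body
</body>
</html>
""")

# Assumed dark-mode rules; the colours are placeholders.
DARK_STYLE = "body { background: #121212; color: #e0e0e0; max-width: 45em; margin: auto; }"


def buildPage(title: str, paragraphs: Iterable[str], style: str = DARK_STYLE) -> str:
    """Wrap already-cleaned paragraph texts in <p> tags and fill the template."""
    # escape() keeps characters like < and & from being read as markup
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), style=style, body=body)
```

Calling `buildPage("Some title", ["first paragraph", "second paragraph"])` returns the whole page as one string, which could then be written to a new `.html` file.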
I was also thinking of storing that "template" html code, along with the style rules, in a separate place.
The following code does a decent job of handling the text formatting, but it still needs improvements.
    from bs4 import BeautifulSoup


    def fixLine(lineText: str) -> str:
        """lineText is a single line of a paragraph"""
        # split() with no arguments drops every run of whitespace,
        # so joining on one space normalizes the spacing
        return " ".join(lineText.split())


    def fixPara(pText: str) -> str:
        """pText is text within a paragraph tag"""
        # fix each line, skip the ones that end up empty, join into one line
        lines = (fixLine(line) for line in pText.splitlines())
        return " ".join(line for line in lines if line)


    fName = "HTML_FILE_NAME"
    with open(fName) as fp:
        soup = BeautifulSoup(fp, "html.parser")

    # one cleaned paragraph per line
    s = ""
    for p in soup.find_all("p"):
        s += fixPara(p.text) + "\n"
    s = s.rstrip("\n")
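For the oversized SVGs and the "unnecessary html" from the original request, one possible approach is to drop those tags from the parsed tree before reading the paragraphs. This is only a sketch: the `extractParagraphs` name and the list of unwanted tags are assumptions, and the real saved pages may need a different list.

```python
from bs4 import BeautifulSoup

# Assumed list of tags to drop; adjust to whatever the saved pages contain.
UNWANTED_TAGS = ["script", "style", "svg"]


def extractParagraphs(html: str) -> list:
    """Return the whitespace-normalized text of every non-empty <p> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes the tag and everything inside it from the tree
    for tag in soup.find_all(UNWANTED_TAGS):
        tag.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        text = " ".join(p.get_text().split())
        if text:  # skip paragraphs that were only whitespace
            paragraphs.append(text)
    return paragraphs
```

The returned list could be joined with `"\n"` the same way `s` is built above.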