Question: Extracting Text by Chapters

BradKML opened this issue 2 years ago · 5 comments

Currently, I am trying to use a keyword extractor on chapters and paragraphs to create a reading aid, but EPUB is particularly tricky in structure. What can be done?

Answer 1 · 2022-01-23T23:19:55.000Z

Very good question. What part of the structure are you referring to? When I was importing content from EPUB and DOCX into our editing system, my approach was to do some analysis first. This was mostly needed because not many people used proper styles in Word, and the EPUB was usually the result of a conversion with 3rd party tools, which created strange output. Otherwise I would do a proper import, knowing that h tags were used for headers and blockquote for quotes, for instance.

So after import I would clean up everything I didn't need and try to figure out whether they used span or div with CSS to create headers, etc. For DOCX I would analyse the font size and so on, but for EPUB I would do something simple: if there is one short line followed by a lot of p elements or blocks of text, I would assume it is a title. If there were a lot of short blocks of text one after the other, I would assume these are not titles. It never worked properly, but the result was imported into the editing system where you could always change and reformat it, so it was more than good enough for me.
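The heuristic described above could be sketched roughly as follows. This is only a minimal illustration using Python's standard library, not the code used in that import. The names `BlockCollector`, `extract_blocks`, and `guess_titles`, the choice of block tags, and the length thresholds are all assumptions made for the example and would need tuning on real books.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption; adjust for your files).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class BlockCollector(HTMLParser):
    """Collect the text of each block-level element, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty blocks only.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def extract_blocks(html):
    """Return the list of text blocks found in an (X)HTML string."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up any trailing text outside a block tag
    return collector.blocks


def guess_titles(blocks, max_title_len=60, min_body_len=200, min_body_blocks=2):
    """Guess titles: a short block followed directly by several long blocks.

    A run of short blocks is not treated as titles, because the blocks
    after each of them are short too.
    """
    titles = []
    for i, block in enumerate(blocks):
        if len(block) > max_title_len:
            continue
        following = blocks[i + 1 : i + 1 + min_body_blocks]
        if len(following) == min_body_blocks and all(
            len(b) >= min_body_len for b in following
        ):
            titles.append(block)
    return titles
```

As the answer says, a rule like this will misfire on some books (short opening paragraphs, epigraphs, dialogue), so it is best treated as a first pass whose output gets reviewed.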

What I would do is look at how, for instance, the web scraper behind the instapaper.com service, Rocket Readability, or Readability in Safari works. I am sure there are Python projects that try to do the same. They seem to do a fairly good job of cleaning up the garbage in a page and presenting only the proper content without custom CSS. I guess that would be my starting point.

Answer 2 · 2022-01-24T03:01:38.000Z

There are libraries out there that do KPE, but right now I want to find a way to get a list of chapters from the EPUB so I can pipe them into KPE algorithms. Don't EPUBs store individual chapters separately? If it is <p>, wouldn't it be a paragraph rather than a chapter?
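On the question of separate storage: an EPUB is a ZIP archive whose `META-INF/container.xml` points to a package (OPF) file, and that file's spine lists the content documents in reading order. The sketch below, using only Python's standard library, returns those paths. The function name `spine_documents` is made up for this example, and note the assumption it cannot check: many books use one content document per chapter, but the format does not require it, so a spine entry may hold several chapters or only part of one.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub):
    """Return archive paths of the content documents in reading order.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        # container.xml tells us where the package (OPF) file lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory holding the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    paths = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.attrib["idref"])
        if href is None:
            continue  # spine entry without a manifest item; skip it
        paths.append(posixpath.normpath(posixpath.join(base, unquote(href))))
    return paths
```

Each returned path can then be read from the archive with `zipfile.ZipFile(...).read(path)` and its text passed on to the extraction step.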