python_pandoc_webscrapper
Running python scripts (not included), converting html files to Pandoc using Docker
Install
docker compose build
Python requirements can be configured in the Dockerfile, right now the main requirement for webscrapping is BeautifulSoup4, if you need anything else, configure as needed.
How to use
From python
Create your python script in the py
folder. The idea is for you to print HTML, which will then be converted by Pandoc.
An example is given through the py/test.py
script, you then have to add the webscrapper logic as you see fit.
./python.sh 'filename'
The result will be in the epub
folder, under the same name of the file.
From Ready-made HTML file
If you already have an HTML file that you made in another way, just put it in the html
folder, and run the command below.
./pandoc.sh 'filename'
The result will be in the epub
folder, under the same name of the file.
Logic
Instead of having to install all the requirements in different machines, with this method I can just use docker instead, which makes it more plataform independant. However because Docker requires an static command to run, I'm using shellscript to deal with that.
- Check if the
{filename}
(.py) you gave exists, and writes a script calledshell-run-for-python.sh
which will then be run by the Docker container for python. - The script must write the whole HTML in the output, which the shellscript itself will forward to an
html\{filename}.html
file. - Runs the
pandoc.sh
script under the same rules, creating a scriptshell-run-for-pandoc.sh
which will then convert the HTML file toepub\{filename}.epub
If you already have the HTML, you start from step 3.
Testing
I'm not a python programmer (just did out of necessity), so I'm unsure how things are usually done professionally. So I made the test-python.sh
to see how things return. Of course, most Linux systems (Or Windows under WSL2) have Python3 natively, so in that regard, you should be able to debug it natively.
All of what the scripts do are to ease and organize a way to run the scripts and convert files going from script to epub.
Some websites have protections, so you may need another solution (front-end), where you will aim for a straight HTML response, you then save that into a file and just use the 2nd part of this script to convert into an EPUB.
Of course, you are welcome to change the types of files you want to convert FROM and TO via the scripts