Wextract.py

A CLI program that reads an HTML/XML stream from stdin, extracts text from it, and prints the result to stdout.

Install

Install Python 3 and pip on Ubuntu/Debian Linux:

sudo apt install python3 python3-pip

Install dependencies:

pip3 install lxml bs4

Download wextract.py and make it executable:

curl -LJO https://raw.githubusercontent.com/byte-cook/wextract/main/wextract.py
chmod +x wextract.py

(Optional) Use opt.py to install it to the /opt directory:

sudo opt.py install wextract wextract.py

Usage examples

Show help:

wextract.py -h

Build a simple list from an HTML table, skipping the header row:

cat file.html | wextract.py -l td -s "table tr" td text - ": " "td:nth-child(2)" text

Explanation:
-l td : skip the line if the text is empty
-s "table tr" : select each tr tag of the table as root element (all of its sub-elements are processed)
td text : print text of td tag
- ": " : print ": " as separator
"td:nth-child(2)" text : print text of the second td tag
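
To make the selector logic above more concrete, here is a rough Python sketch of what the example command computes, written directly against bs4. It is not the code of wextract.py. The function name table_to_list, the sample HTML, and the whitespace stripping are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: one header row (th cells) and two data rows.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


def table_to_list(html, separator=": "):
    """Return one 'first: second' line per table row that has a non-empty td."""
    # The built-in "html.parser" backend is used here so lxml is not required.
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # Counterpart of -s "table tr": every tr inside a table is a root element.
    for row in soup.select("table tr"):
        first = row.select_one("td")
        second = row.select_one("td:nth-child(2)")
        first_text = first.get_text(strip=True) if first else ""
        # Counterpart of -l td: skip the row if the td text is empty.
        # The header row only has th cells, so it is dropped here.
        if not first_text:
            continue
        second_text = second.get_text(strip=True) if second else ""
        lines.append(first_text + separator + second_text)
    return lines


if __name__ == "__main__":
    for line in table_to_list(SAMPLE_HTML):
        print(line)
```

The header row disappears because it contains no td element, which is the same effect the -l td option is used for in the command above.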