Grep tool for HTML
$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
<p>hello</p>
<div>
<p>world</p>
</div>
</body>
EOM
hello
world
# Filter by attribute
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
world
# The placeholder {} can also stand for an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
here
This is just a CLI for an awesome library, tanakh/easy-scraper.
- Install cargo
- Recommended way: install rustup
- Then run:
cargo install web-grep
$ web-grep <QUERY> [INPUT]
The QUERY is an HTML (XML) pattern.
Patterns are valid HTML structures that contain placeholders for innerHTML or attribute values.
web-grep has several kinds of placeholders for different cases.
If you need exactly one placeholder in the pattern, use {}.
<p>{}</p>
<p class="here">
<q>{}</q>
</p>
web-grep outputs every text that matches {}.
$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
<a href="{1}">{2}</a>
web-grep outputs the matched texts for {1}, {2}, ... in order, separated by \t.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga hoge
The delimiter can be specified with -F.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
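Because the default output is one match per line with tab-separated fields, it can be fed to ordinary text tools. The following is a minimal sketch, not part of web-grep: `links_to_markdown` is a hypothetical helper that uses awk to turn each "text, href" line into a Markdown link. A `printf` line stands in for web-grep's output so the snippet runs on its own.

```shell
# Hypothetical helper (not provided by web-grep): read "text<TAB>href"
# lines on stdin and print each one as a Markdown link.
links_to_markdown() {
  awk -F '\t' '{ printf "[%s](%s)\n", $1, $2 }'
}

# Sample line standing in for web-grep's default tab-separated output.
printf 'fuga\thoge\n' | links_to_markdown
# prints: [fuga](hoge)

# With web-grep installed, the helper could be fed directly, e.g.:
#   echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" | links_to_markdown
```

If the matched text may itself contain tabs, choosing a different delimiter with -F and adjusting awk's -F to match would be the safer choice.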
<a href="{href}">{innerHTML}</a>
The output can be formatted as JSON with --json.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}