/webgrep

Webgrep: Recursively grep the web

Primary LanguageRacket

Webgrep: grep the web

What?

This is a tool for performing recursive regular expression content searches of a webpage. Recursive here means exploring pages linked to in the content of a page.

Usage

usage: webgrep.rkt [ <option> ... ]
 where <option> is one of
 *-c <regexp>, --content-pat <regexp> : Patterns to find in page contents.
  -u <regexp>, --url-pat <regexp> : Pattern of links to find in page contents.
    Default: any link
  -r <url>, --root <url> : Url at which to root search.
  -d <natural>, --depth <natural> : Depth of linked pages to traverse.
    Default: 2
  -l, --links-only : Only show links of pages with matching content,
    not content matches as well.
  -j <natural>, --thread-limit <natural> : Maximum number of threads to search in parallel.
    Default: 50
  --help, -h : Show this help
  -- : Do not treat any remaining argument as a switch (at this level)
 * Asterisks indicate options allowed multiple times.
 Multiple single-letter switches can be combined after one `-'; for
  example: `-h-' is the same as `-h --'
usage: wgrep.py [-h] [-l LINK] [-u URLS] [-c CONTENT] [-d DEPTH] [-n]

optional arguments:
  -h, --help            show this help message and exit
  -l LINK, --link LINK  The link to root the search. *This is a mandatory
                        argument.*
  -u URLS, --urls URLS  Regexp for urls to search. By default all urls are
                        searched.
  -c CONTENT, --content CONTENT
                        Regexp for page content to find. *This is a mandatory
                        argument.*
  -d DEPTH, --depth DEPTH
                        The page depth to search. Default: 2.
  -n, --negate          Negate content regexp: Print only page urls that do
                        *not* contain the content pattern.

Dependencies

Racket version:

Python version:

Future work

  • Pretty printing with colors
  • More machine-parsable output
  • Option for case sensitivity
  • Multiple patterns
  • Improve performance (?) Use the racket implementation!

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.