CLI utility for finding dead URLs inside a lot of files - 🏹 🧟

Primary language: Python. License: MIT.

derl

Overview / Features / Install / Run / Usage / Development / Structures / Links

Command Line Interface (CLI) utility for finding dead URLs inside files

The utility takes a directory, finds all files in it recursively, and searches them for valid URLs. With --dispatch, it sends an HTTP GET request to every URL and records the returned HTTP status code. The results are written to stdout as a list that can be sorted, filtered, and further processed with tools such as sed, awk, or grep.
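
The first two steps (collecting files and searching them for URLs) can be pictured with a minimal sketch. It is not derl's actual implementation: the function names `iter_files` and `find_urls` and the deliberately simple URL pattern are assumptions made for illustration.

```python
import os
import re
from typing import Iterable, Iterator, Tuple

# Simplified pattern (an assumption): an http or https scheme followed by
# characters that are not whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def iter_files(directory: str) -> Iterator[str]:
    """Yield the path of every file below `directory`, recursively."""
    for root, _dirs, names in os.walk(directory):
        for name in sorted(names):
            yield os.path.join(root, name)


def find_urls(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, url) for every URL found in the given lines."""
    for line_number, line in enumerate(lines, start=1):
        for match in URL_PATTERN.finditer(line):
            yield line_number, match.group(0)


if __name__ == "__main__":
    for path in iter_files("."):
        try:
            with open(path, encoding="utf-8") as handle:
                for line_number, url in find_urls(handle):
                    print(f"{path}:{line_number}, {url}")
        except UnicodeDecodeError:
            # Skip files that are not valid UTF-8 text.
            continue
```

Line numbers start at 1 to match the `file:line` form used in the utility's output.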

  • Iterate over directories and gather a list of all files

  • Search for valid URLs (http and https) inside the files and store all found URLs

  • Optionally send an HTTP GET request to every URL, with a configurable timeout and retry count (multi-threading is planned)

  • Record all returned HTTP status codes

  • Output a list of files, URLs and line numbers (optionally with up to 3 lines of context)

  • Standard verbosity arguments (-v|-vv) for additional informational and debugging output

  • Collect statistics about processed directories, files, lines, URLs and sent requests

  • Track running time for processing files, searching URLs and dispatching requests

  • The utility's name sounds like that one guy who is hunting other dead things, in the 10th season already ;)
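
The request step with timeout and retry could look roughly like the sketch below. It uses only the standard library and is not taken from derl's source; `fetch_status` and `status_with_retry` are hypothetical names, and treating `retry` as the total number of attempts is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float) -> int:
    """Send one HTTP GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx answers raise HTTPError, but the code is still a valid result.
        return error.code


def status_with_retry(
    url: str,
    retry: int = 3,
    timeout: float = 10.0,
    fetch: Callable[[str, float], int] = fetch_status,
) -> Optional[int]:
    """Return the status code, or None if every attempt failed to connect.

    Assumption: `retry` is the total number of attempts.
    """
    for _attempt in range(retry):
        try:
            return fetch(url, timeout)
        except OSError:
            # URLError and socket timeouts are OSError subclasses: try again.
            continue
    return None
```

Passing the fetch function as a parameter keeps the retry logic testable without network access.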

Limitations

  • At the moment, only UTF-8 encoded files are supported, file paths are stored as relative paths, and binary files are not processed.
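
One way to picture the UTF-8 and binary-file limitation is a helper that refuses anything it cannot decode. This is only a sketch under assumptions: `decode_utf8_lines` is a hypothetical name, and the NUL-byte check is a common heuristic for binary content, not necessarily what derl does.

```python
from typing import List, Optional


def decode_utf8_lines(data: bytes) -> Optional[List[str]]:
    """Return the lines of `data` decoded as UTF-8, or None if it is not UTF-8 text."""
    if b"\x00" in data:
        # Heuristic (assumption): NUL bytes indicate binary content.
        return None
    try:
        return data.decode("utf-8").splitlines()
    except UnicodeDecodeError:
        return None
```

A caller would read each file in binary mode, pass the bytes to this helper, and skip the file when the result is `None`.
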

Install

# Makefile targets without a Python Virtual Environment
make requirements install-user

# Or without makefile inside a Python Virtual Environment
python -m venv .venv_run
source .venv_run/bin/activate
pip install -r requirements.txt
python setup.py install --user --record files.log
deactivate

This installation copies files to $HOME/.local/ and creates files.log, which lists all installed files for convenience. To uninstall, run the following:

# Makefile target
make uninstall

# Or without makefile something like this:
xargs rm -rvf < files.log && rm -fv files.log

Run

derl --dispatch directory

Output

$ derl --dispatch tests/test-directory/

tests/test-directory/dir-1/dir-2/test-4-dir-2.txt:1, 200, http://www.python.org/
tests/test-directory/dir-1/dir-2/test-4-dir-2.txt:4, 404, http://docs.python.org/something

# [...]

$ derl --context --dispatch tests/test-directory/

tests/test-directory/dir-1/dir-2/test-4-dir-2.txt:1, 200, http://www.python.org/
  Sed condimentum efficitur orci, sed mollis tellus mollis a. Nullam http://www.python.org/
  tempus magna ac felis iaculis rhoncus. Ut in sodales lectus. Integer vestibulum malesuada

tests/test-directory/dir-1/dir-2/test-4-dir-2.txt:4, 404, http://docs.python.org/something
  ullamcorper. Integer quis ultricies odio. Fusce tincidunt a ligula id blandit. Integer
  dignissim blandit turpis ac maximus. Donec http://docs.python.org/something eget justo tempus,
  mauris.

# [...]

$ derl --stats --dispatch tests/test-directory/

# [...]
tests/test-directory/test-2-dir-0.txt:3, 404, http://www.dlqx.de/test

Finished checking URLs after 1.00 second(s).
Processed Directories/Files/Lines/Tokens/URLs: 3/7/42/491/7
Sent HTTP GET Requests: 7
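
The result lines shown above follow the pattern `path:line, status, url`, so they can also be post-processed in Python instead of sed, awk, or grep. The parser below is a sketch based only on the sample output with --dispatch; `Result` and `parse_result` are hypothetical names, and lines that do not match (context lines, statistics) are skipped.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g. "dir/file.txt:4, 404, http://docs.python.org/something".
RESULT_LINE = re.compile(
    r"^(?P<path>\S.*):(?P<line>\d+), (?P<status>\d{3}), (?P<url>\S+)$"
)


class Result(NamedTuple):
    path: str
    line_number: int
    status_code: int
    url: str


def parse_result(line: str) -> Optional[Result]:
    """Parse one output line, or return None for context and statistics lines."""
    match = RESULT_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return Result(
        match["path"], int(match["line"]), int(match["status"]), match["url"]
    )


if __name__ == "__main__":
    import sys

    # Usage: derl --dispatch directory | python this_script.py
    for raw in sys.stdin:
        result = parse_result(raw)
        # Treating every status >= 400 as dead is an assumption of this sketch.
        if result is not None and result.status_code >= 400:
            print(result.url)
```
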

Usage

derl [-h] [-c] [-d] [-r RETRY] [-s] [-t TIMEOUT] [--version] [-v] [-vv] directory

Dead URL searching utility

positional arguments:
  directory                      directory for looking for dead URLs

optional arguments:
  -h, --help                     show this help message and exit
  -c, --context                  showing up to 3 lines of context
  -d, --dispatch                 dispatching HTTP requests for every found URL
  -r RETRY, --retry RETRY        set how often to retry a request (default is 3)
  -s, --stats                    track and print statistics at the end
  -t TIMEOUT, --timeout TIMEOUT  set timeout for requests in seconds (default is 10)
  --version                      show program's version number and exit
  -v, --verbose                  set loglevel to INFO
  -vv, --very-verbose            set loglevel to DEBUG
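
An interface like the one above can be declared with `argparse`. The following is a sketch that mirrors the documented options, not a copy of derl's own parser; `build_parser` is a hypothetical name, the `float` type for the timeout is an assumption, and `--version` is left out because the version string is not given here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented derl options."""
    parser = argparse.ArgumentParser(
        prog="derl", description="Dead URL searching utility"
    )
    parser.add_argument("directory", help="directory for looking for dead URLs")
    parser.add_argument("-c", "--context", action="store_true",
                        help="showing up to 3 lines of context")
    parser.add_argument("-d", "--dispatch", action="store_true",
                        help="dispatching HTTP requests for every found URL")
    parser.add_argument("-r", "--retry", type=int, default=3,
                        help="set how often to retry a request (default is 3)")
    parser.add_argument("-s", "--stats", action="store_true",
                        help="track and print statistics at the end")
    parser.add_argument("-t", "--timeout", type=float, default=10,
                        help="set timeout for requests in seconds (default is 10)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set loglevel to INFO")
    parser.add_argument("-vv", "--very-verbose", action="store_true",
                        dest="very_verbose", help="set loglevel to DEBUG")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```
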

Requirements, Tests and Development

# Makefile targets
make requirements test develop

# Or without Makefile
python -m venv .venv_run
source .venv_run/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
python setup.py test
python setup.py develop
deactivate

Linting

# Linting project
make lint

# Generating report
make report

Data structure

files: [
  {
    filename,
    urls: [
      (0): {
        url,
        status_code,
        line_number,
        context: [
          "line above matched line",
          "line with found URL",
          "line below matched line"
        ]
      },
      (1): {
        url,
        status_code,
        line_number,
        context: [
          "line above matched line",
          "line with found URL",
          "line below matched line"
        ]
      },

      ...

      (n): {
        url,
        status_code,
        line_number,
        context: [
          "line above matched line",
          "line with found URL",
          "line below matched line"
        ]
      }
    ]
  }
]
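
In Python, the structure above could be expressed with dataclasses. This is a sketch of the shape only: the class names `URL` and `File` are hypothetical, and using `None` for a status code that has not been requested yet is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class URL:
    url: str
    line_number: int
    # None until a request has been dispatched (assumption).
    status_code: Optional[int] = None
    # Up to 3 lines: the line above, the matched line, the line below.
    context: List[str] = field(default_factory=list)


@dataclass
class File:
    filename: str
    urls: List[URL] = field(default_factory=list)


files = [
    File(
        filename="tests/test-directory/dir-1/dir-2/test-4-dir-2.txt",
        urls=[URL(url="http://www.python.org/", line_number=1, status_code=200)],
    )
]
```

Using `default_factory` gives every instance its own list instead of sharing one mutable default.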

Test directory structure

test-directory/
├── dir-1
│   ├── dir-2
│   │   ├── test-4-dir-2.txt
│   │   └── test-6-dir-2.txt
│   ├── test-3-dir-1.txt
│   ├── test-5-dir-1
│   └── test-7-dir-1.txt
├── test-1-dir-0.txt
└── test-2-dir-0.txt

Recreating reference output

# Makefile target
make update-references

# Or without Makefile
derl tests/test-directory/ > tests/references/output-without-context-without-dispatch.out && \
derl tests/test-directory/ --context > tests/references/output-with-context-without-dispatch.out && \
derl tests/test-directory/ -d > tests/references/output-without-context-with-dispatch.out && \
derl tests/test-directory/ --context --dispatch > tests/references/output-with-context-with-dispatch.out
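
The four commands above differ only in two flags, so the command and file name pairs can also be generated. The sketch below merely builds those pairs; `reference_commands` is a hypothetical helper and not part of the project.

```python
import itertools
from typing import List, Tuple


def reference_commands(
    directory: str = "tests/test-directory/",
    out_dir: str = "tests/references",
) -> List[Tuple[List[str], str]]:
    """Return (command, output file) pairs for all flag combinations."""
    commands = []
    for dispatch, context in itertools.product((False, True), repeat=2):
        args = ["derl", directory]
        if context:
            args.append("--context")
        if dispatch:
            args.append("--dispatch")
        name = "output-{}-context-{}-dispatch.out".format(
            "with" if context else "without",
            "with" if dispatch else "without",
        )
        commands.append((args, f"{out_dir}/{name}"))
    return commands
```

Each pair could then be executed with `subprocess.run(args, stdout=handle, check=True)`, where `handle` is the output file opened for writing.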