pasta-link-checker

A utility to find broken links in .md files in a repo or on an HTML site

URL checker

Checks all links found on a site or in .md files from a GitHub repository. The default starting URL is hardcoded as https://github.com/codedokode/pasta/blob/master/README.md, but it can be changed using CLI arguments.

The script visits all pages on the site, finds all links within them and checks the response status for each of those links. The list of broken links is printed to the console.
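
The crawl described above can be pictured as a breadth-first traversal. The following is a minimal sketch, not the code from checker.php: the function name `findBrokenLinks`, the `$fetch` callable and its return shape are hypothetical, and it assumes that only status 200 counts as success and that only pages on the starting host are scanned for further links.

```php
<?php
declare(strict_types=1);

/**
 * Breadth-first crawl starting at $startUrl (illustrative sketch).
 *
 * $fetch is a hypothetical callable: fn(string $url): array
 * returning ['status' => int, 'links' => string[]].
 * Returns a map of broken URL => status code.
 */
function findBrokenLinks(string $startUrl, callable $fetch): array
{
    $startHost = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $broken = [];

    while ($queue !== []) {
        $url = array_shift($queue);
        $response = $fetch($url);

        // Assumption: anything other than 200 is reported as broken.
        if ($response['status'] !== 200) {
            $broken[$url] = $response['status'];
            continue;
        }

        // Assumption: external pages are checked but not scanned for links.
        if (parse_url($url, PHP_URL_HOST) !== $startHost) {
            continue;
        }

        foreach ($response['links'] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    return $broken;
}

// In-memory stand-in for a small site, so the sketch runs without a network.
$site = [
    'http://example.com/' => ['status' => 200, 'links' => [
        'http://example.com/a',
        'http://example.com/missing',
        'http://other.test/x',
    ]],
    'http://example.com/a' => ['status' => 200, 'links' => ['http://example.com/']],
    'http://other.test/x' => ['status' => 200, 'links' => ['http://other.test/dead']],
];

$fetch = function (string $url) use ($site): array {
    return $site[$url] ?? ['status' => 404, 'links' => []];
};

$broken = findBrokenLinks('http://example.com/', $fetch);
foreach ($broken as $url => $status) {
    echo "$status $url\n";
}
```

Passing the fetcher in as a callable keeps the traversal testable without real HTTP requests. A real fetcher would also need to resolve relative links before returning them.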

The URL checker pauses between requests. It also uses a filesystem cache.
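
To make the two mechanisms concrete, here is a hedged sketch of how a pause and a file cache might look. The names `secondsToWait` and `FileCache`, the SHA-1 file naming and the demo directory are assumptions for illustration and are not taken from the project.

```php
<?php
declare(strict_types=1);

/** Seconds still to wait before the next request to a host may start. */
function secondsToWait(?float $lastRequestAt, float $now, float $minDelay): float
{
    if ($lastRequestAt === null) {
        return 0.0; // no previous request to this host
    }
    return max(0.0, $minDelay - ($now - $lastRequestAt));
}

/** Tiny file cache: one file per URL, named by the SHA-1 of the URL. */
final class FileCache
{
    private string $dir;

    public function __construct(string $dir)
    {
        $this->dir = $dir;
        if (!is_dir($dir)) {
            mkdir($dir, 0777, true);
        }
    }

    /** Returns the cached body, or null on a cache miss. */
    public function get(string $url): ?string
    {
        $path = $this->pathFor($url);
        if (!is_file($path)) {
            return null;
        }
        $body = file_get_contents($path);
        return $body === false ? null : $body;
    }

    public function set(string $url, string $body): void
    {
        file_put_contents($this->pathFor($url), $body);
    }

    private function pathFor(string $url): string
    {
        return $this->dir . DIRECTORY_SEPARATOR . sha1($url);
    }
}

// Usage: a hypothetical demo directory under the system temp dir.
$cache = new FileCache(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'link-checker-cache-demo');
$cache->set('http://example.com/', '<html></html>');
$cached = $cache->get('http://example.com/');
$missing = $cache->get('http://example.com/never-stored-' . uniqid());
```

A caller would sleep for `secondsToWait(...)` before each request (for example with `usleep`) and consult the cache first, so repeated runs do not refetch pages that were already downloaded.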

Installation

  • git clone
  • composer install

Usage

php checker.php -u http://example.com/

Type php checker.php --help for help.

Testing

Choose an unused port number and start a temporary web server:

php -S 127.0.0.1:10001 -t tests/public/

Then run tests using phpunit in a separate console:

export LINK_CHECKER_TEST_SERVER_PORT=10001
phpunit

Alternatively, use the run-tests.sh shell script.
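
For orientation, a test could turn the exported port into a base URL along these lines. This is only a sketch: the helper name `testServerBaseUrl` is hypothetical, and how the project's tests actually read the variable is not shown in this README.

```php
<?php
declare(strict_types=1);

/**
 * Builds the base URL of the temporary test server from the environment.
 * Hypothetical helper; assumes the server listens on 127.0.0.1 as in the
 * php -S command above.
 */
function testServerBaseUrl(): string
{
    $port = getenv('LINK_CHECKER_TEST_SERVER_PORT');
    if ($port === false || preg_match('/^\d+$/', $port) !== 1) {
        throw new RuntimeException(
            'Set LINK_CHECKER_TEST_SERVER_PORT to the port of the test web server'
        );
    }
    return "http://127.0.0.1:$port/";
}
```

Failing with a clear message when the variable is missing makes a forgotten `export` easier to diagnose than a connection error deep inside a test.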

Known problems / TODO

  • the script treats all non-HTML pages (PDF, images) as invalid
  • the script cannot detect parked domains
  • check fragments (page.html#something)
  • use HEAD requests for leaf pages where possible
  • don't cache and don't even load huge files
  • be able to check local HTML files
  • check image/CSS/JS references
  • pick URLs from queue so that we don't have to wait
  • find and report redirects
  • maybe use delay based on last 2 domain parts, not whole domain
  • maybe obey robots.txt?
  • links like https://mega.nz/#!12345 , https://rghost.net/12345 are not checked properly
  • support some other 2xx codes like 203
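
As an illustration of the "last 2 domain parts" idea from the list above, the delay could be keyed as in the sketch below. The function name `delayKey` is hypothetical, and this is one possible reading of that TODO item, not existing behavior.

```php
<?php
declare(strict_types=1);

/**
 * Returns the key under which request delays for $url would be tracked:
 * the last two labels of the host name, or the whole host for IP
 * addresses and single-label hosts.
 */
function delayKey(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        throw new InvalidArgumentException("URL has no host: $url");
    }

    $host = strtolower($host);

    // IP addresses have no domain parts to strip.
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        return $host;
    }

    $parts = explode('.', $host);
    return implode('.', array_slice($parts, -2));
}

echo delayKey('https://docs.example.com/page'), "\n";
```

Note that this heuristic groups unrelated sites under multi-part public suffixes: hosts such as a.co.uk and b.co.uk would share the key co.uk. That may be one reason the item is marked "maybe".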