open_web_data_mining

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

Domain Metadata Analysis

  • Root Domain Crawl

    • JavaScript / Cookie Tracking
    • JavaScript Libs
    • SSL available
    • Page Speed
    • Domain Whois Data
    • Security Issues
    • HTTP Server
    • HTTP Protocol
    • Structured Data (schema.org)
    • Used HTML Tags ("iframe", "svg", ...)
    • Content Management Systems
    • PHP Versions
    • RSS/Atom feeds
  • Full Domain Crawl

    • Match Tracking Data with data privacy statement
    • Referrer
    • Redirects
    • Broken Links
  • Time-Consuming Crawl

    • SSL Implementation / Rating
    • HTML Validation (w3.org)
    • Ports (MySQL, MongoDB, ...)
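As an illustration of one root-crawl check from the list above, "Used HTML Tags", here is a minimal sketch that counts selected tags in a page using only the Python standard library. The names `TagCounter` and `count_tags` and the sample markup are assumptions made for this example, not part of the project.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each HTML start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase.
        self.counts[tag] += 1


def count_tags(html, tags_of_interest=("iframe", "svg")):
    """Return the number of occurrences of each tag of interest."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts.get(tag, 0) for tag in tags_of_interest}


# Hypothetical sample page; a real crawl would pass the fetched HTML here.
sample = (
    '<html><body><iframe src="a"></iframe>'
    "<svg></svg><IFRAME></IFRAME></body></html>"
)
print(count_tags(sample))  # {'iframe': 2, 'svg': 1}
```

The parser is tolerant of malformed markup, which matters for pages found in the wild, but it only sees the static HTML and not tags injected by JavaScript.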

Other similar Projects

Domain Lists

Used Libs and Formats

Splash - Lightweight, scriptable browser as a service with an HTTP API

adblockparser - Parser for Adblock Plus rules

HTTP Archive format (HAR)

HTTP Archive format (HAR) Viewer
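A HAR file is a JSON document whose `log.entries` list holds one record per request, each with a `request.url` field. As a rough sketch of how such a file could feed the tracking analysis, the following counts requests per third-party host. The function name `third_party_hosts`, the first-party matching rule, and the sample data are assumptions for illustration only.

```python
from collections import Counter
from urllib.parse import urlsplit


def third_party_hosts(har, first_party_host):
    """Count requests per host that is not the crawled domain or a subdomain of it."""
    counts = Counter()
    for entry in har.get("log", {}).get("entries", []):
        host = urlsplit(entry["request"]["url"]).hostname or ""
        is_first_party = (
            host == first_party_host or host.endswith("." + first_party_host)
        )
        if host and not is_first_party:
            counts[host] += 1
    return counts


# Hypothetical, heavily trimmed HAR structure; real files carry many more fields.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"}},
            {"request": {"url": "https://static.example.com/app.js"}},
            {"request": {"url": "https://tracker.example.net/pixel.gif"}},
            {"request": {"url": "https://tracker.example.net/collect?id=1"}},
        ]
    }
}
print(third_party_hosts(sample_har, "example.com"))
```

For a file on disk, `json.load` yields the dictionary expected by the function. The suffix comparison is a simplification: it does not consult a public suffix list, so it can misclassify hosts under shared suffixes.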

Publish

Keywords

"Webometrie", "Webometrics", "Cybermetrics", "Web Mining", "Internet Data Mining", "Internet Research", "Internet Technology Trends"

Crawler Performance without Threads

duration in seconds = average seconds per domain * domain count
duration in days = duration in seconds / 86400

Example: 5 * 1,000,000 = 5,000,000 seconds; 5,000,000 / 86,400 ≈ 57.9 days
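The estimate can be reproduced with a small helper. This is a sketch: the function name `crawl_duration_days` is made up here, and the optional `workers` parameter is an added assumption that models ideal, overhead-free parallelism, which the single-threaded figure above does not cover.

```python
SECONDS_PER_DAY = 86400


def crawl_duration_days(avg_seconds_per_domain, domain_count, workers=1):
    """Estimate crawl duration in days.

    workers=1 matches the sequential (no threads) case. Larger values
    assume a perfect linear speedup, which real crawls rarely reach.
    """
    total_seconds = avg_seconds_per_domain * domain_count
    return total_seconds / SECONDS_PER_DAY / workers


# 5 seconds per domain, one million domains, no threads.
print(round(crawl_duration_days(5, 1_000_000), 1))  # 57.9
```

Under the idealized assumption, ten parallel workers would bring the same crawl down to roughly 5.8 days.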