Pinned Repositories
CommonCrawl
Processing tools for Common Crawl data
IpAddressEnumeration
IP address enumerators
RobotsProtocol
Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag
SitemapsProtocol
Parsers for sitemap / sitemap index (aka Sitemaps Protocol)
UrlNormalization
URL normalizer that canonicalizes (standardizes) the text representation of a URL so that differently formatted URLs can be checked for equivalence
WarcProtocol
Parser for WARC (Web ARChive) files
Wikimedia
Processing tools for Wikimedia Downloads
Toimik's Repositories
toimik/WarcProtocol
Parser for WARC (Web ARChive) files
toimik/CommonCrawl
Processing tools for Common Crawl data
toimik/UrlNormalization
URL normalizer that canonicalizes (standardizes) the text representation of a URL so that differently formatted URLs can be checked for equivalence
toimik/IpAddressEnumeration
IP address enumerators
toimik/RobotsProtocol
Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag
toimik/SitemapsProtocol
Parsers for sitemap / sitemap index (aka Sitemaps Protocol)
toimik/Wikimedia
Processing tools for Wikimedia Downloads
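
To illustrate what the UrlNormalization entry above means by canonicalizing a URL, here is a minimal sketch in Python using only the standard library. It is not the repository's API, and the function name `normalize_url` and the particular rules chosen (lowercase scheme and host, drop the default port, use `/` for an empty path, discard the fragment) are assumptions made for illustration only. A real normalizer would likely handle more cases, such as percent-encoding and dot segments.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing what the URL refers to.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Hypothetical helper: return a canonical text form of an absolute URL.

    Rules applied (an illustrative subset, not a complete normalizer):
    - lowercase the scheme and host
    - remove the port when it is the scheme's default
    - replace an empty path with "/"
    - drop the fragment, which is never sent to the server
    Note: any userinfo (user:password@) is also dropped by this sketch.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:
        # IPv6 literals must be re-wrapped in brackets inside the authority.
        host = f"[{host}]"
    netloc = host
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def same_resource(a: str, b: str) -> bool:
    """Two URLs are treated as equivalent if their normalized forms match."""
    return normalize_url(a) == normalize_url(b)
```

With these rules, `same_resource("HTTP://Example.COM:80", "http://example.com/")` evaluates to `True`, while a URL with a non-default port such as `https://example.com:8443/` keeps its port and is treated as distinct.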