Pinned Repositories
aquarium
Splash + HAProxy + Docker Compose
arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
autologin
A project that attempts to automatically log in to a website given a single seed
deep-deep
Adaptive crawler that uses reinforcement learning methods
eli5
A library for debugging/inspecting machine learning classifiers and explaining their predictions
Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
html-text
Extract text from HTML
scrapy-rotating-proxies
Use multiple proxies with Scrapy
sklearn-crfsuite
scikit-learn-inspired API for CRFsuite
tensorboard_logger
Log TensorBoard events without touching TensorFlow
TeamHG-Memex's Repositories
TeamHG-Memex/eli5
A library for debugging/inspecting machine learning classifiers and explaining their predictions
TeamHG-Memex/scrapy-rotating-proxies
Use multiple proxies with Scrapy
TeamHG-Memex/tensorboard_logger
Log TensorBoard events without touching TensorFlow
TeamHG-Memex/sklearn-crfsuite
scikit-learn-inspired API for CRFsuite
TeamHG-Memex/aquarium
Splash + HAProxy + Docker Compose
TeamHG-Memex/deep-deep
Adaptive crawler that uses reinforcement learning methods
TeamHG-Memex/arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
TeamHG-Memex/html-text
Extract text from HTML
TeamHG-Memex/autologin
A project that attempts to automatically log in to a website given a single seed
TeamHG-Memex/Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
TeamHG-Memex/autopager
Detect and classify pagination links
TeamHG-Memex/page-compare
Simple heuristic for measuring web page similarity (& data set)
TeamHG-Memex/scrapy-crawl-once
Scrapy middleware that allows crawling only new content
TeamHG-Memex/undercrawler
A generic crawler
TeamHG-Memex/soft404
A classifier for detecting soft 404 pages
TeamHG-Memex/agnostic
Agnostic Database Migrations
TeamHG-Memex/autologin-middleware
Scrapy middleware for autologin
TeamHG-Memex/json-lines
Read JSON lines (jl) files, including gzipped and broken
TeamHG-Memex/scrapy-kafka-export
Scrapy extension which writes crawled items to Kafka
TeamHG-Memex/MaybeDont
A component that tries to avoid downloading duplicate content
TeamHG-Memex/sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
TeamHG-Memex/domain-discovery-crawler
Broad crawler for domain discovery
TeamHG-Memex/url-summary
Show summary of a large number of URLs in a Jupyter Notebook
TeamHG-Memex/sitehound
This is the facade for installation and access to the individual components
TeamHG-Memex/docker-tor-rotator
A rotating SOCKS proxy using Tor, Delegate and HAProxy
TeamHG-Memex/hh-page-classifier
Headless Horseman Page Classifier service
TeamHG-Memex/scrapy-cdr
Item definition and utils for storing items in CDR format for Scrapy
TeamHG-Memex/scrash-lua-examples
A collection of example Lua scripts and JS utilities
TeamHG-Memex/sitehound-backend
Sitehound's backend
TeamHG-Memex/sshadduser
A simple tool to add a new user with OpenSSH keys