samuelowino/spider
This spider crawls a website and returns a cleaned-up version of the page content as an HTTP response. It also includes PDF text extraction, using regular expressions to extract content from PDF files.
Java
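
The description mentions two steps: cleaning crawled page content and pulling values out of text with regular expressions. The following is a minimal sketch of both, using only the Java standard library. The class `SpiderSketch` and the methods `cleanHtml` and `extractMatches` are hypothetical names chosen for illustration and are not taken from the repository. The sketch assumes the HTML has already been fetched and that any PDF text has already been extracted by a separate library. Regex-based tag stripping is a simplification that will not handle every HTML document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the repository's actual code.
class SpiderSketch {

    // Removes <script>/<style> blocks, strips the remaining tags,
    // and collapses runs of whitespace into single spaces.
    static String cleanHtml(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Returns every substring of text that matches the given regex, in order.
    // The same step could be applied to text already extracted from a PDF.
    static List<String> extractMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for a crawled page.
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Invoice</h1><p>Total: 42 USD</p></body></html>";
        String text = cleanHtml(html);
        System.out.println(text);
        System.out.println(extractMatches(text, "\\d+"));
    }
}
```

A real crawler would more likely rely on an HTML parser and a dedicated PDF library for robustness; the regex approach here only mirrors the technique the description names.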