A tool to extract metadata terms from web pages
This tool, based on crawler4j, crawls a website and extracts text from web pages that are tagged in some way. For example, suppose an HTML document contains an element like:
<span class="item-title">Hello World!</span>
The tool can extract the text "Hello World!" based on the item-title class for further analysis, perhaps by passing it to a semantic annotation tool such as Apache Stanbol.
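As a rough illustration of class-based extraction, the sketch below runs an XPath query over a small, well-formed XHTML string using only the JDK's built-in XPath support. The class name TaggedTermExtractor and the method extract are hypothetical names chosen for this example and are not taken from the tool itself. The sketch also assumes the input is already well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of the tool.
final class TaggedTermExtractor {

    // Returns the trimmed text of every element whose class attribute
    // contains className as a whole, space-separated token.
    // className is assumed to be a plain CSS class name without quotes.
    static List<String> extract(String xhtml, String className)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pad the class attribute with spaces so that "item-title"
        // does not also match "item-title-large".
        String expr = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                + className + " ')]";
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            terms.add(nodes.item(i).getTextContent().trim());
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<span class=\"item-title\">Hello World!</span>"
                + "</body></html>";
        System.out.println(extract(xhtml, "item-title")); // prints [Hello World!]
    }
}
```

Matching on a space-padded class attribute is one common way to approximate a CSS class selector in XPath 1.0; a simpler @class='item-title' test would miss elements that carry more than one class.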
The steps involved in this process are as follows:
- Provide a seed URL as the starting point for the crawl.
- The crawler creates a file of discovered URLs.
- The URLs are read from the file and requested over HTTP. The resulting HTML for each URL is saved to a file.
- The HTML files are converted to XHTML using HTML Tidy.
- An XPath reader extracts the tagged terms and creates a text file of the terms.
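The fetch-and-save step in the list above could be sketched roughly as follows, using the java.net.http client available since Java 11. Everything here is an assumption made for illustration: the class name PageFetcher, the one-URL-per-line file format, the file naming scheme in fileNameFor, and the choice to skip non-200 responses are not taken from the tool.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the "read URLs, fetch, save HTML" step; not the tool's code.
final class PageFetcher {

    // Assumed naming scheme: keep letters, digits, dots and hyphens,
    // replace every other character with an underscore.
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9.-]", "_") + ".html";
    }

    // Reads one URL per line (assumed format), requests each over HTTP,
    // and writes the response body into outputDir.
    static void fetchAll(Path urlFile, Path outputDir)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        Files.createDirectories(outputDir);
        for (String line : Files.readAllLines(urlFile)) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // ignore blank lines
            }
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Assumption: only successful responses are worth saving.
            if (response.statusCode() == 200) {
                Files.writeString(outputDir.resolve(fileNameFor(url)), response.body());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PageFetcher <url-file> <output-dir>");
            return;
        }
        fetchAll(Path.of(args[0]), Path.of(args[1]));
    }
}
```

The saved files would still need the HTML Tidy conversion described above before an XPath query can be run against them, since raw HTML is often not well-formed XML.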