wired-it-scraper

A Scrapy crawler for the articles published on http://www.wired.it

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Overview

This project implements a Scrapy scraper for the articles of the Italian website http://www.wired.it.

Please note that, at the time of writing, the articles are published under a Creative Commons license.

The crawler follows the site map and, for each article, extracts:

  • category
  • copyright
  • text
  • title
  • URL
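
As a rough illustration of the two steps described above (reading URLs from a site map, then storing one record per article), here is a minimal sketch. It uses only the Python standard library rather than Scrapy, and it is not the project's actual spider: the names `Article`, `sitemap_urls`, `to_json_line` and the sample URLs are hypothetical, and the JSON output format is an assumption.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class Article:
    """One scraped article; the fields mirror the list above (hypothetical class)."""
    category: str
    copyright: str
    text: str
    title: str
    url: str


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def to_json_line(article: Article) -> str:
    """Serialize an article as one JSON line (assumed storage format)."""
    return json.dumps(asdict(article), ensure_ascii=False, sort_keys=True)


# Made-up sitemap fragment, used only to exercise the functions.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.wired.it/example/article-one/</loc></url>
  <url><loc>http://www.wired.it/example/article-two/</loc></url>
</urlset>
"""

urls = sitemap_urls(SAMPLE_SITEMAP)
record = Article(
    category="example",
    copyright="Creative Commons",
    text="Body of the article.",
    title="Article one",
    url=urls[0],
)
line = to_json_line(record)
```

In a Scrapy project the site map traversal would normally be delegated to the framework itself; the sketch only shows the shape of the data that ends up in each record.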

Data snapshots

Snapshots of the data can be found in the corpora folder.

Goals

The crawled data are meant to be used as a training corpus for automatic text classification tasks, as explained in the essay "What is the best method for Automatic Text Classification?".
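
To make the intended use more concrete, the following sketch turns scraped records into (text, label) pairs, the usual input of a text classifier, with the article category as the label. It assumes the records are stored one JSON object per line with `text` and `category` keys; that format, the function name `load_labelled` and the sample records are assumptions, not a description of the files in the corpora folder.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def load_labelled(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Build parallel lists of texts and category labels from JSON lines."""
    texts: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get("text")
        category = record.get("category")
        # Keep only records that carry both a text and a label.
        if text and category:
            texts.append(text)
            labels.append(category)
    return texts, labels


# Made-up records, used only to exercise the function.
sample_lines = [
    '{"category": "scienza", "text": "Primo testo di esempio."}',
    '{"category": "economia", "text": "Secondo testo di esempio."}',
    '{"category": "scienza", "text": "Terzo testo di esempio."}',
    '{"category": "scienza", "text": ""}',
]

texts, labels = load_labelled(sample_lines)
label_counts = Counter(labels)
```

Checking `label_counts` before training is a cheap way to spot categories with too few examples to learn from.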