/equipe-crawler

Automatically exported from code.google.com/p/equipe-crawler

Primary language: Perl

			EQUIPE-CRAWLER v1.1
		http://code.google.com/p/equipe-crawler/


=== LICENSE ===

This software is brought to you by Adrien Barbaresi (adrien.barbaresi@gmail.com).
It is freely available under the GNU GPL v3 license (http://www.gnu.org/licenses/gpl.html).

The texts gathered using this software are for personal (or academic) use only; you are not allowed to republish them:
http://www.lequipe.fr/Fonctions/pages_credits.html (in French)


=== DESCRIPTION ===

Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes, extracting the text of each article from the HTML markup and saving it to a raw text file.
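
As an illustration, a minimal Perl sketch of this kind of page processing could look as follows (the start URL and the regular expressions are placeholders, not the ones used by equipe-crawler.pl):

  #!/usr/bin/perl
  # Minimal sketch: fetch one page, collect candidate links, strip the markup.
  use strict;
  use warnings;
  use LWP::UserAgent;

  my $ua  = LWP::UserAgent->new(timeout => 30);
  my $url = 'http://www.lequipe.fr/';                 # placeholder start page
  my $response = $ua->get($url);
  die 'Fetch failed: ' . $response->status_line unless $response->is_success;
  my $html = $response->decoded_content;

  # Gather new links to explore (deliberately rough pattern, illustration only).
  my @links = $html =~ m{href="(http://www\.lequipe\.fr/[^"]+)"}g;

  # Naive markup stripping; the real crawler relies on page-specific regexps.
  (my $text = $html) =~ s{<script.*?</script>}{}gs;
  $text =~ s{<[^>]+>}{ }g;
  $text =~ s{\s+}{ }g;

  print scalar(@links), " links found, ", length($text), " characters of text\n";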

Because it is specialized, it can build a reliable corpus consisting of texts and relevant metadata (info, title, excerpt, photo caption, date and URL). The list of links which comes with the software features about 40,000 articles, which makes it possible to gather more than 10 million tokens.

As nothing but quotations of the texts may be republished, the purpose of this tool is to let others build their own version of the corpus, since crawling is not explicitly forbidden by the rights holders of L'Équipe.

The crawler does not support multi-threading, as that might not be considered fair use: it takes the links one by one and is not built for speed. Even so, gathering a corpus should not take more than a day, depending on your downlink.
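
For illustration only, this one-by-one strategy amounts to a plain sequential loop with a pause between requests (the delay value below is an assumption, not the crawler's actual setting):

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $ua = LWP::UserAgent->new(timeout => 30);
  open my $todo, '<', 'EQUIPE_list_todo' or die "Cannot open link list: $!";
  while (my $url = <$todo>) {
      chomp $url;
      next unless $url =~ m{^http://};
      my $response = $ua->get($url);
      unless ($response->is_success) {
          warn "Skipping $url: ", $response->status_line, "\n";
          next;
      }
      # ... extract the article text from $response->decoded_content here ...
      sleep 1;    # assumed 1-second pause: one request at a time, no parallel downloads
  }
  close $todo;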

The XML export ensures compatibility with other software designed for further analysis of the texts, for example the textometry software TXM: http://txm.sourceforge.net/
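
As a rough idea of what such an export can look like, here is a sketch with hypothetical element and attribute names (xmlize.pl and fragment+xmlize.pl may well use different ones):

  use strict;
  use warnings;

  sub xml_escape {
      my $s = shift;
      $s =~ s/&/&amp;/g;
      $s =~ s/</&lt;/g;
      $s =~ s/>/&gt;/g;
      $s =~ s/"/&quot;/g;
      return $s;
  }

  # Sample data for illustration; the real input is the crawler output.
  my @articles = (
      { url   => 'http://www.lequipe.fr/example',
        date  => '2012-07-01',
        title => 'Example title',
        body  => 'Raw text of the article.' },
  );

  print qq{<?xml version="1.0" encoding="UTF-8"?>\n<corpus>\n};
  for my $a (@articles) {
      # Hypothetical element/attribute names, chosen only for this sketch.
      printf qq{  <text url="%s" date="%s" title="%s">%s</text>\n},
          map { xml_escape($_) } @{$a}{qw(url date title body)};
  }
  print "</corpus>\n";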


=== FILES ===

There are five scripts and one text file in this release:
  ⋅ equipe-crawler.pl	[the crawler itself]
  ⋅ check_flatfile.pl	[to check the integrity of raw text data]
  ⋅ xmlize.pl		[to make one XML document out of the crawler output]
  ⋅ fragment+xmlize.pl	[to make a series of XML documents out of the crawler output]
  ⋅ clean_directory.sh	[to compress the files and remove the original ones]
  ⋅ EQUIPE_list_todo	[a sample list of links that may be used as an input for the crawler]


=== USAGE ===

Once the files are stored in the same directory, a possible scenario is the following:
  1. Edit the file named equipe-crawler.pl to select the number of pages you want to visit.
  2. Run it (the default setting is to take EQUIPE_list_todo as input) as many times as you like.
  3. Check the integrity of the EQUIPE_flatfile by running the check_flatfile.pl script (a minimal sketch of such a check is given after this list).
  4. Run xmlize.pl or fragment+xmlize.pl to export the crawl in a (hopefully valid) XML document.
  5. Clean the directory.
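
For step 3, a bare-bones sanity check could look as follows; the record separator below is a pure assumption made for this sketch, as the actual layout of EQUIPE_flatfile is defined by the crawler and verified by check_flatfile.pl:

  use strict;
  use warnings;

  local $/ = "\n\n";    # assumed record separator, for illustration only
  open my $fh, '<:encoding(UTF-8)', 'EQUIPE_flatfile' or die "Cannot open flatfile: $!";
  my ($records, $suspicious) = (0, 0);
  while (my $record = <$fh>) {
      $records++;
      $suspicious++ if length($record) < 20;    # arbitrary threshold for "too short"
  }
  close $fh;
  print "$records records read, $suspicious suspiciously short\n";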


=== RESTRICTIONS ===

The crawler was designed to pick up as little noise as possible. However, it is highly dependent on the content management system and on the HTML markup used by the newspaper L'Équipe. It worked as of July 1st, 2012, but it could break on future versions of the website. I may not update it on a regular basis.

The XML conversion was written with robustness in mind, but it does not provide a ready-made solution for every possible pitfall, especially Unicode character encoding issues. As the input may be a large corpus resulting from a web crawl, the scripts do not, by design, guarantee that the XML file will be valid.
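
One common way to mitigate such character-level problems is to decode the input leniently and drop the control characters that XML 1.0 forbids. This is only a sketch of that approach, not what xmlize.pl actually does:

  use strict;
  use warnings;
  use Encode qw(decode);

  sub sanitize_for_xml {
      my $raw = shift;
      # Replace invalid UTF-8 byte sequences with U+FFFD instead of dying.
      my $text = decode('UTF-8', $raw, Encode::FB_DEFAULT);
      # Remove characters not allowed in XML 1.0 (controls below U+0020
      # except tab, newline and carriage return, plus other excluded ranges).
      $text =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
      return $text;
  }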

All the scripts should work correctly on UNIX-like systems if you set the right permissions. They may need software such as Cygwin (http://www.cygwin.com/) to run on Windows; this case has not been tested.


=== CHANGELOG ===

1st August 2012: v1.1 – Speed and memory-use improvements for the crawler: replacement of the CRC function, better and reordered regular expressions, use of a hash to remove duplicates, a few bugs fixed.
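
The duplicate removal mentioned above can be pictured with the usual Perl "seen hash" idiom (a sketch, not the actual crawler code):

  use strict;
  use warnings;

  my @links  = ('http://www.lequipe.fr/a', 'http://www.lequipe.fr/b', 'http://www.lequipe.fr/a');
  my %seen;
  my @unique = grep { !$seen{$_}++ } @links;    # keeps the first occurrence of each link
  print scalar(@unique), " unique links\n";     # prints "2 unique links"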