This repo contains two tools to scrape EURLEX and process publications:
- A scraper program that fetches the HTML from EURLEX, taking into account certain restrictions on which kinds of documents should be scraped.
- A processor program that processes the HTML and writes a CSV file.
Each EURLEX publication has a unique identifier, the CELEX number.
It follows the pattern `3YYYYLNNNN`, where `3` is fixed, `YYYY` denotes the year, `L` is
a "letter" indicating what kind of publication it is, and `NNNN` is a unique number identifying
the publication within a year and kind of publication.
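A CELEX number of this form can be parsed with a small regular expression. The sketch below is illustrative only — the method name and return shape are not code from this repo:

```ruby
# Matches the 3YYYYLNNNN format described above (assumes NNNN is
# always four digits, as in the pattern given).
CELEX_PATTERN = /\A3(?<year>\d{4})(?<letter>[A-Z])(?<number>\d{4})\z/

# Returns the components as a hash, or nil if the string is not
# a CELEX number of this shape. Hypothetical helper, for illustration.
def parse_celex(celex)
  m = CELEX_PATTERN.match(celex)
  return nil unless m
  { year: m[:year].to_i, letter: m[:letter], number: m[:number].to_i }
end
```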
The scraper can be invoked with the following commands:
java -jar scraper.jar scrape <YEAR> <OPTIONAL OVERRIDE LETTERS> <OPTIONAL START NUMBER AND END NUMBER>
Only the year must be provided. Optionally, multiple letters can be added, separated by spaces. It is also possible to limit the range of numbers; note that both a start and an end number must be provided in that case.
If no overriding letters are provided, the default letters `D`, `L`, `R` and `M` are scraped.
The scraped data is stored in the `data` directory, in the current working directory.
This `data` directory will contain folders for each year.
Each year directory contains two files, `found.txt` and `not-found.txt`, storing state on
which CELEX numbers have been scraped, and which could not be found, respectively.
When a CELEX number does not exist, the HTTP server will return a `404 Not Found` status.
In this case, we mark the document as non-existing and append the number to the `not-found.txt` file.
NOTE: removing or editing this file therefore makes it possible to retry non-existing documents
for a given year or range.
Removing `found.txt` will result in having to re-scrape the entire year and is not recommended;
if you want to re-scrape a complete year, you should also remove the `html` directory to save disk
space.
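How these state files could be consulted to skip already-attempted numbers can be sketched as follows. The file layout matches the description above; the helper itself is a hypothetical illustration, not the repo's code:

```ruby
require "set"

# Collects every CELEX number already recorded for a year, whether it
# was scraped successfully (found.txt) or marked as non-existing
# (not-found.txt). One number per line is assumed.
def numbers_to_skip(year_dir)
  %w[found.txt not-found.txt].flat_map do |name|
    path = File.join(year_dir, name)
    File.exist?(path) ? File.readlines(path, chomp: true) : []
  end.to_set
end
```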
- Scrape all default letters in 2014:
java -jar scraper.jar scrape 2014
- Scrape only R and M letters for 2014, check all document numbers:
java -jar scraper.jar scrape 2014 R M
- Scrape only L and D letters for 2014, check a custom range:
java -jar scraper.jar scrape 2014 L D 20 200
Two types of failures are distinguished: timeouts when fetching the HTML, and any other error raised during the process. The latter are very unlikely to happen.
When a request to fetch a page does not return a result within 60 seconds, it is scheduled to be attempted again after 60 seconds. If any other (not-yet-tried) documents can be scraped, they are, until the 60 seconds have passed. This implies that retries after failure have a higher priority than attempts to scrape not-yet-attempted documents. The reason for this is that the alternative is to queue all failures to the end of the run, after which you would have to attempt to scrape them and wait again for some time if they fail once more. This easily results in significantly longer run times.
Documents that trigger a timeout are attempted at most ten times. Documents that trigger an exception are attempted at most three times.
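The retry policy above can be sketched roughly like this; the names, the `Attempt` struct, and the counter bookkeeping are assumptions for illustration, not the repo's actual implementation:

```ruby
MAX_TIMEOUT_RETRIES = 10  # a timed-out document is attempted at most ten times
MAX_ERROR_RETRIES   = 3   # any other failure is attempted at most three times
RETRY_DELAY         = 60  # seconds to wait before retrying

Attempt = Struct.new(:celex, :timeouts, :errors, :ready_at)

# Records a failure on the attempt. Returns the attempt rescheduled
# RETRY_DELAY seconds from now, or nil when the budget for this kind
# of failure is exhausted and the document should be given up on.
def retry_or_give_up(attempt, failure, now)
  if failure == :timeout
    attempt.timeouts += 1
    return nil if attempt.timeouts >= MAX_TIMEOUT_RETRIES
  else
    attempt.errors += 1
    return nil if attempt.errors >= MAX_ERROR_RETRIES
  end
  attempt.ready_at = now + RETRY_DELAY
  attempt
end
```

A scheduler built on this would prefer any attempt whose `ready_at` has passed over untouched documents, matching the priority described above.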
Note: so far, I have not encountered any document that could not be scraped in the end.
In the worst case, the program can be run multiple times in succession, since it
remembers its progress.
If the processor complains about some document not being scraped, a workaround could be to add the
CELEX number to the `not-found.txt` file.
Processing the scraped data is done with the `process` command.
It accepts exactly the same filters as the `scrape` command, except that an output file name must
be specified.
java -jar scraper.jar process <YEAR> <OPTIONAL OVERRIDE LETTERS> <OPTIONAL START NUMBER AND END NUMBER> <REQUIRED OUTPUT FILE NAME>
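Since the processor writes a CSV file, its output step could look roughly like the sketch below, using Ruby's standard `csv` library. The column names are invented for illustration and are not the processor's real schema:

```ruby
require "csv"

# Writes an array of row hashes to the given path, deriving the header
# from the keys of the first row. Hypothetical helper, not repo code.
def write_output(path, rows)
  CSV.open(path, "w") do |csv|
    csv << rows.first.keys            # header row
    rows.each { |row| csv << row.values }
  end
end
```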
Note that this requires quite some memory and can cause an out-of-memory crash. This is caused by the conservative amount of memory that a Java process can use by default (256 MB). It can be solved by providing a JVM option to increase this limit. The following example sets the maximum heap size to 4 GB of RAM.
java -Xmx4G -jar scraper.jar process ....
- Standard (J)Ruby practices: Bundler, Warbler, etc.
Done via the Makefile. It starts by building a container with a specific version of JRuby. Bundler is run in the container, and finally it cuts a jar using Warbler. When control is returned to the Makefile, it extracts the jar from the built Docker image. (Unfortunately, there is no better way than to create a container from the image, copy the jar out, and remove that container.)
Released under the AGPL-3.0 license.