
legisFrance-data-engineering

This project aims to collect legal text data from the Légifrance website, store it in a NoSQL database (MongoDB), and visualize it.

Step 1 : Legal Text Scraping

The first step involves scraping the 200 most recent legal texts of nature Arrêté, Décret, or Ordonnance.

The approach followed is to collect all the links to the initial versions ("version initiale") of the texts, navigate through each of them, and extract information about the legal text and its articles.

The collected data is stored in a CSV file.


The code implemented to scrape the data can be found in the scrape.py file.
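As an illustration, here is a minimal sketch of this scraping approach with requests and BeautifulSoup. The listing URL, the CSS selectors and the CSV columns below are hypothetical placeholders, not necessarily the ones used in scrape.py:

    # Minimal sketch of the scraping step (URL, selectors and columns are hypothetical).
    import csv

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.legifrance.gouv.fr"  # official website
    LISTING_URL = BASE_URL + "/search/jorf"      # hypothetical listing page


    def get_initial_version_links(listing_html: str) -> list[str]:
        """Collect the links pointing to the "version initiale" of each legal text."""
        soup = BeautifulSoup(listing_html, "html.parser")
        return [
            BASE_URL + a["href"]
            for a in soup.find_all("a", href=True)
            if "version initiale" in a.get_text(strip=True).lower()
        ]


    def parse_legal_text(url: str, page_html: str) -> list[dict]:
        """Return one row per article of the legal text (selectors are illustrative)."""
        soup = BeautifulSoup(page_html, "html.parser")
        h1 = soup.find("h1")
        title = h1.get_text(strip=True) if h1 else ""
        return [
            {
                "title": title,
                "article_text": article.get_text(" ", strip=True),
                "article_link": url,
            }
            for article in soup.select("article")
        ]


    if __name__ == "__main__":
        listing = requests.get(LISTING_URL, timeout=30)
        rows = []
        for link in get_initial_version_links(listing.text)[:200]:  # 200 most recent texts
            page = requests.get(link, timeout=30)
            rows.extend(parse_legal_text(link, page.text))

        with open("legal_texts.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["title", "article_text", "article_link"])
            writer.writeheader()
            writer.writerows(rows)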

Step 2 : Data Modelling and ETL Pipeline

Data modeling

The fields retrieved are :

  • title is the title of the legal text.
  • nature is the nature of the legal text.
  • date is the signature date of the legal text.
  • NOR is the Reference Order Number of the legal text.
  • ELI is the European Legislation Identifier of the legal text.
  • jorf is the Official Journal of the French Republic (JORF) issue in which the legal text was published.
  • jorf_link is the URL of the legal text in the JORF.
  • jorf_text_num is the JORF reference number of the legal text.
  • preface is the introductory text of the legal text.
  • article_title is the title of an individual article within the legal text.
  • article_text is the text of an individual article within the legal text.
  • article_link is the URL of the individual article within the legal text.
  • article_tables contains the tables (in HTML format) included in the article, if any.
  • annexe contains the annex of the legal text, if any.
  • annexe_tables contains the tables included in the annexes, if any.
  • annexe_summary is a summary of the content of the annexes.
  • jorf_pdf is the link to the PDF version of the legal text in the JORF.

The schema :



      legalText: {
        title: 'string',
        nature: 'string',
        date: 'date',
        NOR: 'string',
        ELI: 'string',
        jorf: 'string',
        jorf_link: 'string',
        jorf_text_number: 'string',
        preface: 'string',
        annexe: 'string',
        annexe_tables: 'string',
        annexe_summary: 'string',
        jorf_pdf: 'string',
        articles: [
          {
            article_title: 'string',
            article_text: 'string',
            article_link: 'string',
            article_tables: 'string',
          }
        ]
      }

Assuming that articles are generally queried together with their respective legal texts, each legal text and its articles are stored in the same collection, with the articles embedded in the legal text document.

Assuming that nature and date are the most frequently queried fields, two indexes are created on them.
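A minimal sketch of how these two indexes could be created with pymongo (the database and collection names legifrance / legal_texts are assumptions, not necessarily those used in load_to_db.py):

    # Sketch of the index creation with pymongo (database/collection names are assumptions).
    from pymongo import ASCENDING, DESCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    collection = client["legifrance"]["legal_texts"]

    # One index per frequently queried field.
    collection.create_index([("nature", ASCENDING)])
    collection.create_index([("date", DESCENDING)])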

In this part, we also load the data into MongoDB after performing some transformations so that it matches the structure of the MongoDB collection.

The code used to load the data can be found in the load_to_db.py file.
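As an illustration, here is a minimal sketch of this transform-and-load step, assuming one CSV row per article; the file name, the column names and the grouping key are assumptions rather than the exact ones used in load_to_db.py:

    # Sketch of the transform-and-load step (file, column and collection names are assumptions).
    import pandas as pd
    from pymongo import MongoClient

    # One CSV row per article; everything is read as strings to keep the sketch simple.
    df = pd.read_csv("legal_texts.csv", dtype=str)

    text_fields = ["title", "nature", "date", "NOR", "ELI", "jorf", "jorf_link",
                   "jorf_text_num", "preface", "annexe", "annexe_tables",
                   "annexe_summary", "jorf_pdf"]
    article_fields = ["article_title", "article_text", "article_link", "article_tables"]

    documents = []
    # Group the article-level rows so that each legal text becomes one document
    # with its articles embedded, matching the collection schema above.
    for _, group in df.groupby("NOR"):
        first = group.iloc[0]
        doc = {f: (None if pd.isna(first[f]) else first[f]) for f in text_fields}
        doc["date"] = pd.to_datetime(doc["date"])  # store the signature date as a real date
        doc["articles"] = group[article_fields].to_dict("records")
        documents.append(doc)

    client = MongoClient("mongodb://localhost:27017/")
    client["legifrance"]["legal_texts"].insert_many(documents)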

We can note that only one legal text of nature "Ordonnance" was signed during this period.

ETL Pipeline

In this step we implement the data pipeline.

The code implemented to orchestrate the data pipeline can be found in this file.
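A minimal sketch of what the Airflow DAG could look like, with one task per step; the DAG id, task ids and imported callables are illustrative assumptions, not the project's exact code:

    # Sketch of the Airflow DAG orchestrating the pipeline (names are illustrative).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from scrape import scrape_legal_texts        # hypothetical entry point of scrape.py
    from load_to_db import load_to_mongodb       # hypothetical entry point of load_to_db.py
    from visualize import create_visualizations  # hypothetical visualization module

    with DAG(
        dag_id="legifrance_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@weekly",  # the pipeline is scheduled to run every week
        catchup=False,
    ) as dag:
        scrape = PythonOperator(task_id="scrape", python_callable=scrape_legal_texts)
        load = PythonOperator(task_id="load_to_db", python_callable=load_to_mongodb)
        visualize = PythonOperator(task_id="visualize", python_callable=create_visualizations)

        scrape >> load >> visualize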


Step 3 : Data Visualization

In this step we create visualizations related to the legal texts.

The first two visualizations show, respectively, the number of legal texts by nature per day and the cumulative count of legal texts by nature over time.

legal_text_by_nature_over_time
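A minimal sketch of how these two charts could be produced from the MongoDB collection with pandas and matplotlib (the database and collection names follow the earlier assumption, and the output file names are illustrative):

    # Sketch of the "legal texts by nature over time" charts (names are assumptions).
    import matplotlib.pyplot as plt
    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    cursor = client["legifrance"]["legal_texts"].find({}, {"nature": 1, "date": 1, "_id": 0})
    df = pd.DataFrame(list(cursor))
    df["date"] = pd.to_datetime(df["date"])

    # Daily count of legal texts by nature.
    daily = df.groupby([df["date"].dt.date, "nature"]).size().unstack(fill_value=0)
    daily.plot(kind="bar", stacked=True, title="Legal texts by nature per day")
    plt.savefig("legal_texts_by_nature_per_day.png", bbox_inches="tight")

    # Cumulative count of legal texts by nature over time.
    daily.cumsum().plot(title="Cumulative count of legal texts by nature")
    plt.savefig("cumulative_legal_texts_by_nature.png", bbox_inches="tight")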

Another visualization shows the average number of articles by nature of the legal text (here, for the single legal text of nature "Ordonnance", it simply shows its number of articles).

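A minimal sketch of how this average could be computed with a MongoDB aggregation over the collection described above:

    # Sketch of the average number of articles per nature (collection name is an assumption).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    pipeline = [
        {"$project": {"nature": 1, "n_articles": {"$size": "$articles"}}},
        {"$group": {"_id": "$nature", "avg_articles": {"$avg": "$n_articles"}}},
    ]
    for row in client["legifrance"]["legal_texts"].aggregate(pipeline):
        print(row["_id"], round(row["avg_articles"], 2))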

This visualization shows the average number of characters, words and paragraphs by nature of the legal text.

chars_words_paragraphs

The other visualizations are word clouds built from the article texts and the title texts.

We have created a stopwords list that can be extended as needed to filter out uninformative words.
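A minimal sketch of the word cloud generation, assuming the wordcloud package, the scraped CSV file and a plain-text stopwords file (file names are assumptions):

    # Sketch of the word cloud generation (file names are assumptions).
    import matplotlib.pyplot as plt
    import pandas as pd
    from wordcloud import WordCloud

    df = pd.read_csv("legal_texts.csv")

    with open("stopwords.txt", encoding="utf-8") as f:
        stopwords = set(f.read().split())

    text = " ".join(df["article_text"].dropna())
    cloud = WordCloud(stopwords=stopwords, background_color="white", width=1200, height=600)
    cloud.generate(text)

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.savefig("articles_wordcloud.png", bbox_inches="tight")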


The visualizations are stored in this folder along with the stopwords list.

These visuals can be used later for analysis (reporting or dashboards).

The DAG after adding the data visualization task :


Step 4 : Data Pipeline Monitoring

In this step we track and visualize metrics and indicators related to the performance of the data pipeline using StatsD, Prometheus and Grafana. The dashboard can be configured to display different time periods.

These indicators are used in the dashboard; they consist of :

  • Scheduler heartbeat indicates that the Airflow scheduler is running.
  • Number of DAG Runs indicates the number of DAG runs.
  • Tasks Average Durations indicates, for each task, the average time it takes to complete.
  • Task failures indicates that a task has failed; if so, an alert is fired and the user is notified by email.
  • DAG Duration indicates the duration of the DAG runs over time; the user is alerted if the average exceeds a certain time limit.
  • DAG Run Dependency Check Time measures the time taken to check for DAG run dependencies.


When an alert is fired, an email notification is sent to the user.

Requirements :

  • docker desktop v4.15.0
  • docker v20.10.21
  • docker-compose v2.13.0

Usage :

  • Download the project folder.
  • Navigate to the folder on your machine.
  • Execute docker-compose up --build -d. The first run will take some time, as it downloads the images and dependencies.
  • The data pipeline is scheduled to run every week. However, to run it manually, navigate to localhost:8080, turn on the DAG and trigger it.
