GenderedNews - gender bias dashboard
GenderedNews is a dashboard of gender biases in French news created with Python, MongoDB and Metabase. Check out the website!
Getting Started
Setup
To set up this project, please refer to the initial setup guide.
Usage (simple)
The simplest way to run the examples:
# Fill the database with fake article data
python3 examples/example_fake_data.py
# Fill the database with yesterday's articles from Le Monde
python3 examples/example_rss_extract_store.py
To see the results, you can set up a Metabase dashboard connected to the database.
Usage (with cron job)
Here is how to set up a daily cron job at 01:00 (change the script path to the desired script):
# Open the cron config file
crontab -e
# Add the following line in the config file:
0 1 * * * cd /path/to/genderednews/ && /path/to/genderednews/env/bin/python3 /path/to/genderednews/main_local.py
# See the cron config file
crontab -l
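The fields of the cron entry above can be made explicit with a small helper. This is a minimal sketch: `build_cron_line` and its parameters are hypothetical names, not part of the project, and the function only formats the line without installing it.

```python
from pathlib import PurePosixPath


def build_cron_line(project_dir: str, script: str, hour: int = 1, minute: int = 0) -> str:
    """Format a daily crontab entry that runs `script` with the project's virtualenv."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be in 0-23 and minute in 0-59")
    root = PurePosixPath(project_dir)
    python = root / "env" / "bin" / "python3"  # virtualenv interpreter, as in the entry above
    # Cron fields: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * cd {root}/ && {python} {root / script}"


print(build_cron_line("/path/to/genderednews", "main_local.py"))
```

With the default arguments this prints the same entry as the one shown above; the output can be pasted into `crontab -e`.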
This setup is based on the following folder structure (non-exhaustive):
~/
└── genderednews/
    ├── current -> versions/2021-XX-XX
    ├── versions/
    │   ├── 2020-XX-XX/
    │   │   └── script.py
    │   └── 2021-XX-XX/
    │       └── script.py
    ├── shared/
    └── logs/
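In this layout, `current` is a symbolic link to one dated directory under `versions/`. The sketch below shows one way such a link could be repointed to a newer version. `switch_current` is a hypothetical helper, and the project may manage the link differently.

```python
import os
import tempfile
from pathlib import Path


def switch_current(root: Path, version: str) -> Path:
    """Point root/current at versions/<version> and return the resolved target."""
    target = root / "versions" / version
    if not target.is_dir():
        raise FileNotFoundError(f"unknown version: {version}")
    link = root / "current"
    tmp = root / "current.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Create the new link under a temporary name, then rename it over the old
    # one so that `current` never disappears (rename is atomic on POSIX).
    tmp.symlink_to(Path("versions") / version)
    os.replace(tmp, link)
    return link.resolve()


# Demo in a throwaway directory that mimics the tree above
with tempfile.TemporaryDirectory() as d:
    root = Path(d) / "genderednews"
    for name in ("2020-XX-XX", "2021-XX-XX"):
        (root / "versions" / name).mkdir(parents=True)
    switch_current(root, "2020-XX-XX")
    print(switch_current(root, "2021-XX-XX").name)
```

A cron entry that runs a script through `current/` would then pick up the new version on its next run without being edited.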
Usage (script main_local.py)
In step 1, there are two methods for scraping article links: via RSS feeds or via Twitter.
# To scrape via RSS feeds
collector = collector(scraping_mode='rss')
# To scrape via Twitter
collector = collector(scraping_mode='twitter')
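To illustrate how a `scraping_mode` switch like this can be handled, here is a minimal sketch that maps each mode to a link source and rejects unknown values. Every name in it (`collect_links`, `SOURCES`, `links_from_rss`, `links_from_twitter`) is hypothetical and the sources are empty placeholders; the project's own collector may be organized differently.

```python
from typing import Callable, Dict, List


def links_from_rss() -> List[str]:
    """Placeholder: a real version would parse RSS feeds (e.g. with Feedparser)."""
    return []


def links_from_twitter() -> List[str]:
    """Placeholder: a real version would query the Twitter API (e.g. with Tweepy)."""
    return []


# Hypothetical registry mapping each scraping mode to its link source
SOURCES: Dict[str, Callable[[], List[str]]] = {
    "rss": links_from_rss,
    "twitter": links_from_twitter,
}


def collect_links(scraping_mode: str = "rss") -> List[str]:
    """Return article links for the chosen mode, rejecting unknown modes early."""
    try:
        source = SOURCES[scraping_mode]
    except KeyError:
        raise ValueError(f"scraping_mode must be one of {sorted(SOURCES)}") from None
    return source()


print(collect_links("rss"))
```

Keeping the modes in one dictionary means a typo in `scraping_mode` fails immediately instead of silently collecting nothing.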
Step 3 checks whether any articles are missing a processing step. If the parameter 'fix' is set to 'True', all articles with missing processing are processed again and updated in the database.
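The check-then-fix behaviour of step 3 can be sketched as follows. This is an illustration under assumptions, not the project's code: an in-memory list stands in for the MongoDB collection, and the field names in `REQUIRED_FIELDS` as well as `find_incomplete`, `check_articles` and `fake_process` are hypothetical.

```python
from typing import Callable, Dict, List

Article = Dict[str, object]

# Hypothetical names for the processing results each article should carry
REQUIRED_FIELDS = ("mentions", "quotes")


def find_incomplete(articles: List[Article]) -> List[Article]:
    """Return the articles that lack at least one processing result."""
    return [a for a in articles if any(a.get(f) is None for f in REQUIRED_FIELDS)]


def check_articles(
    articles: List[Article],
    process: Callable[[Article], Article],
    fix: bool = False,
) -> List[Article]:
    """List incomplete articles; when fix is True, reprocess and update them."""
    incomplete = find_incomplete(articles)
    if fix:
        for article in incomplete:
            article.update(process(article))  # stands in for a database update
    return incomplete


def fake_process(article: Article) -> Article:
    """Stand-in for the real processing pipeline."""
    return {"mentions": [], "quotes": []}


# An in-memory list stands in for the MongoDB collection
articles = [
    {"url": "a", "mentions": [], "quotes": []},
    {"url": "b", "mentions": []},
]
print(len(check_articles(articles, fake_process, fix=False)))  # reports only
print(len(check_articles(articles, fake_process, fix=True)))   # reports and repairs
print(len(find_incomplete(articles)))                          # nothing left to fix
```

Running the check with `fix=False` first gives a dry run: it shows how many articles would be reprocessed before anything is modified.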
Built with
A list of the main technologies used within the project (see requirements.txt for the full dependency list):
- Main tools: Python, MongoDB, Metabase
- Main libraries:
  - BeautifulSoup v4.9.3 - Parse HTML
  - Dotenv v0.15.0 - Load .env files
  - Faker v8.1.2 - Generate fake data
  - Feedparser v6.0.2 - Parse RSS feeds
  - Newspaper3k v0.2.8 - Parse articles
  - PyMongo v3.11.3 - MongoDB driver for Python
  - Tweepy v3.10.0 - Connect to the Twitter API and parse tweets
- Others:
Improvements
- The quotation extraction component of this project will soon move from a rule-based system to an ML model!
Data
The data was downloaded from public websites of newspapers only for non-commercial and research purposes.
List of news sources:
- Aujourd’hui en France (national edition of Le Parisien): https://www.leparisien.fr/
- La Croix: https://www.la-croix.com/
- Le Figaro: https://www.lefigaro.fr/
- Le Monde: https://www.lemonde.fr/
- Libération: https://www.liberation.fr/
- L'Équipe: https://www.lequipe.fr/
- Les Échos: https://www.lesechos.fr/
Mentions/Quotes
The data makes it possible to calculate masculinity rates in mentions and quotes, which are shown as graphs on our website.
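As an illustration, a masculinity rate could be computed as the share of men among all gendered mentions (or quotes). This definition is an assumption made for the example; the exact formula used by the project is not stated here, and `masculinity_rate` is a hypothetical helper.

```python
from typing import Optional


def masculinity_rate(men: int, women: int) -> Optional[float]:
    """Share of men among gendered counts, or None when there is nothing to count."""
    if men < 0 or women < 0:
        raise ValueError("counts must be non-negative")
    total = men + women
    return men / total if total else None


print(masculinity_rate(men=3, women=1))  # 0.75
```

Returning `None` for an empty count keeps days without any gendered mention from being plotted as a rate of zero.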
Similar projects
The Canadian project GenderGapTracker (source) has the same goal but for Canadian news.
License
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE
file for details.
Contact
For more information about the research methodology and for questions regarding collaboration, please contact: francois.portet@imag.fr, gilles.bastin@iepg.fr or ange.richard@univ-grenoble-alpes.fr