scraper-french-headlines
Objective
The internet has become an increasingly influential source of news for citizens.
However, since most readers spend little time on news websites, it can be assumed that much of the information they retain comes from the headlines rather than from the actual content of the articles.
To gather data on this, this repository scrapes the headlines of the main online news sources in France.
Implementation
The most popular websites were selected (see appendix).
Other, less popular but more politically biased websites were added for comparison.
The data is output to the data directory. Each CSV file is formatted with a timestamp in UTC.
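As a rough illustration of the output step, the sketch below writes one CSV file per run and puts the UTC timestamp in the file name. The function name `write_headlines`, the file-name pattern, and the `source`/`headline` columns are assumptions made for this example; the repository's actual layout may differ.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_headlines(rows, out_dir="data", now=None):
    """Write (source, headline) rows to a CSV file named after the UTC time."""
    # Use the current time unless a timezone-aware datetime is supplied,
    # and normalise it to UTC before building the file name.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Hypothetical naming scheme, e.g. data/2021-11-30T12-00-00Z.csv
    out_path = Path(out_dir) / f"{now.strftime('%Y-%m-%dT%H-%M-%SZ')}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "headline"])  # assumed column names
        writer.writerows(rows)
    return out_path
```

Keeping the timestamp in UTC avoids ambiguity around daylight saving changes when files from different runs are compared.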
Implementation challenges
- Parsing webpages to extract articles (using beautifulsoup)
- Refactor to limit code duplication with inherited classes
- Ouest France requires JavaScript: replace requests with selenium and webdriver_manager
- Automatically scrape using GitHub actions (inspired by https://simonwillison.net/2020/Oct/9/git-scraping/)
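The refactoring with inherited classes can be sketched roughly as follows: a base class holds the shared parsing logic and each site subclass only declares what differs. This is a minimal sketch, not the repository's code. The class names, the `headline_tag` attribute, and the example site are hypothetical, and the standard-library `html.parser` stands in for beautifulsoup so that the example runs without third-party packages.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0  # > 0 while inside the tag of interest
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")  # start a new headline

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


class HeadlineScraper:
    """Shared logic; subclasses only declare what differs per site."""

    name = "base"
    headline_tag = "h2"  # hypothetical default

    def parse(self, html):
        collector = TagTextCollector(self.headline_tag)
        collector.feed(html)
        collector.close()
        # Normalise whitespace and drop empty headlines.
        headlines = [" ".join(text.split()) for text in collector.texts]
        return [h for h in headlines if h]


class ExampleDailyScraper(HeadlineScraper):
    """Hypothetical site whose headlines sit in <h3> elements."""

    name = "example-daily"
    headline_tag = "h3"
```

In practice a single tag name is rarely enough to describe a site, so a subclass would more likely override `parse` or supply a richer selector, but the shared base keeps fetching and output code in one place.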
The main obstacle to adding more websites is that each one requires some manual customisation, because there are no obvious common patterns in their HTML structure.
Appendix
Choosing media
Fondation Descartes
Media | % Consulted* |
---|---|
Le Figaro | 38% |
Ouest France | 26% |
France Info | 25% |
20 Minutes | 25% |
Journal des Femmes | 23% |
Le Parisien | 22% |
Le Monde | 22% |
Elle | 21% |
L’Internaute.com | 21% |
BFMTV | 21% |
Voici | 19% |
Femme Actuelle | 16% |
Actu.fr | 16% |
Doctissimo | 14% |
L’Équipe | 14% |
Capital | 14% |
Gala | 13% |
France Bleu | 13% |
01.net | 12% |
RTL | 12% |
LCI | 12% |
Challenges | 12% |
Yahoo! Actualités | 12% |
CNews | 11% |
*For example, during the 30 days of the study, 38% of participants consulted "Le Figaro" at least once.
ACPM
Site ranking, November 2021
Site | Total visits |
---|---|
LeFigaro.fr | 161 969 906 |
Orange.fr | 136 872 873 |
Ouest-france.fr | 131 020 339 |
Tele-Loisirs.fr | 122 753 509 |
Bfmtv.com | 115 123 264 |
Franceinfo.fr | 111 545 123 |
LeMonde.fr | 95 018 956 |
L'Equipe.fr | 93 253 564 |
20minutes.fr | 78 468 892 |
LeParisien.fr | 78 125 225 |
Closermag.fr | 70 723 423 |
Actu.fr | 69 798 438 |
Voici.fr | 66 695 986 |
Femmeactuelle.fr | 59 264 537 |
Gala.fr | 55 877 237 |
Ladepeche.fr | 46 042 844 |
Boursorama.com | 45 921 243 |
Footmercato.net | 37 052 497 |
Sudouest.fr | 34 834 486 |
Midilibre.fr | 33 139 944 |
Reuters-Oxford University
Brand | Weekly use | At least 3 days per week |
---|---|---|
20 minutes online | 18% | 9% |
Other regional or local newspaper online | 15% | 8% |
Le Monde online | 14% | 6% |
France Info (public broadcaster) | 13% | 7% |
BFM TV online | 12% | 9% |
TF1 News online | 12% | 7% |
Le Parisien online | 10% | 5% |
Le Figaro online | 10% | 4% |
Brut | 10% | 4% |
M6 online | 10% | 5% |
Le HuffPost | 9% | 3% |
Yahoo! News | 8% | 3% |
Médiapart | 7% | 3% |
Cnews online | 7% | 4% |
Ouest France online | 7% | 3% |
MSN News | 7% | 4% |