factchecking-sites-scraper

A repo to store helper functions for scraping + experiments/visualisations


Why Does This Exist?

One of Tattle's goals is to make fact-checking content circulated on chat apps and social media more accessible to mobile-first users. To make this content accessible, Tattle wants it to be discoverable through image search and video search. To implement search, Tattle needs the multimedia content inside the stories from fact-checking sites, linked to the sites it comes from.

Introduction

This repository contains a collection of scripts to scrape the fact-checking sections of several sites.

At present Tattle only scrapes IFCN-certified fact-checking sites. See factchecking_sites_status.md for the current status of each website.

Running Locally:

Prerequisites:

  • Python libraries: install all required packages with pip install -r requirements.txt

  • Geckodriver support:
    download geckodriver and place it somewhere on your PATH
    install Firefox (xvfb lets it run without a display): sudo apt-get install xvfb firefox
    (a quick sanity-check sketch follows this list)

  • Data Storage:

    • A MongoDB database to which all the scraped content can be pushed.
    • An AWS S3 bucket to which images, videos and other multimedia items can be uploaded.
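
Before running anything, it can help to confirm that the three prerequisites work together. The following is a minimal sanity-check sketch, not a script from this repository; the environment variable names (MONGO_URI, S3_BUCKET) are assumptions and should match whatever your .env file defines (see 'Steps to Run'):

    # sanity_check.py -- a hypothetical helper, not part of this repo
    import os

    import boto3
    from pymongo import MongoClient
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    # 1. Headless Firefox via geckodriver (geckodriver must be on PATH).
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    driver.get("https://example.com")
    print("selenium ok:", driver.title)
    driver.quit()

    # 2. MongoDB connectivity (MONGO_URI is an assumed variable name).
    client = MongoClient(os.environ["MONGO_URI"])
    client.admin.command("ping")
    print("mongo ok")

    # 3. S3 access (S3_BUCKET is an assumed variable name; credentials come
    #    from the usual AWS environment variables or config files).
    s3 = boto3.client("s3")
    s3.head_bucket(Bucket=os.environ["S3_BUCKET"])
    print("s3 ok")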

The code can be amended so that content is written to a local folder instead of an S3 bucket. For conciseness, those steps have been excluded from this documentation; if you need help doing that, please reach out to us (see the 'Get help with developing' section).

Steps to Run:

This scraper has gone through multiple iterations and has different implementations for different fact-checking sites.

  • For Boomlive, Digiteye and Newsmeter, run the script for each of the three sites independently. The script for each site can be found in the scraper_v3 folder. Each script handles scraping the site, downloading images and uploading them to S3, and uploading the text and metadata for each article to MongoDB.

  • For Vishvasnews, run scraping/scrape_data.py. This scraper handles scraping the site and uploading the text and metadata for each article to MongoDB. It also uses a local pickle file.

  • For Newsmobile, Factly and IndiaToday, run scraping/live_scraping.py:

    • This step scrapes the sites and uploads the content from the fact-checking sites to MongoDB as per this data structure. At this stage the images/videos have not yet been uploaded to S3; only the URL of each item on the fact-checking site is recorded.
    • Run Upload_to_s3.py. This retrieves the URLs of the media items on the fact-checking sites, downloads the items, uploads them to an S3 bucket, and updates MongoDB with each item's S3 link (see the sketch after this list).
    • Run Register_to_portal.py (optional). This step registers the media items to the Tattle kosh. If you don't have credentials to write to the kosh, skip this step.
    • Make sure the sites you want the scraper to run for during the day are not commented out in live_scraping.py/scrape_data.py.
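
To make the two-stage flow concrete, here is a minimal sketch of what the Upload_to_s3.py stage does. The collection name ("stories"), the media-item list ("mediaItems") and its fields ("mediaURL", "s3URL") are all hypothetical, since the real schema is defined by the scrapers and the data structure linked above:

    # upload_sketch.py -- an illustrative sketch of the Upload_to_s3.py stage;
    # all collection and field names below are hypothetical.
    import os

    import boto3
    import requests
    from pymongo import MongoClient

    client = MongoClient(os.environ["MONGO_URI"])    # assumed .env variable
    stories = client["factcheck"]["stories"]         # hypothetical names
    s3 = boto3.client("s3")
    bucket = os.environ["S3_BUCKET"]                 # assumed .env variable

    # Find stories whose media items have not been mirrored to S3 yet.
    for doc in stories.find({"mediaItems.s3URL": {"$exists": False}}):
        for item in doc.get("mediaItems", []):
            # Fetch the item from the fact-checking site...
            resp = requests.get(item["mediaURL"], timeout=30)
            resp.raise_for_status()
            # ...push it to S3...
            key = f"{doc['_id']}/{os.path.basename(item['mediaURL'])}"
            s3.put_object(Bucket=bucket, Key=key, Body=resp.content)
            # ...and record where it now lives.
            item["s3URL"] = f"https://{bucket}.s3.amazonaws.com/{key}"
        # Write the enriched document back with its new S3 links.
        stories.replace_one({"_id": doc["_id"]}, doc)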

For each of these scrapers:

  • Add your S3 and MongoDB credentials to a .env file in the folder that contains the scraper.
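
A .env file might look like the following. The variable names are assumptions (matching the sketches above); adjust them to whatever the scraper code actually reads:

    # .env -- hypothetical variable names; check the scraper code for the real ones
    MONGO_URI=mongodb+srv://<user>:<password>@<host>/<database>
    AWS_ACCESS_KEY_ID=<your-access-key-id>
    AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
    S3_BUCKET=<your-bucket-name>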

Request Access

If you want access to the fact-checking sites' data, please fill out this form. If you have a specific requirement not covered by the form, please ping us on Slack.

Generating the Fact-Checking Sites Dashboard

The data collected from the scrapers is used to generate the weekly fact-checking sites dashboard: https://services.tattle.co.in/khoj/dashboard/

The instructions to generate the dashboard can be found in the data-experiments repository.

Contribute

Please see instructions here.

Get help with developing

Join our Slack channel to get quick responses to doubts and queries.

Want to get help deploying it into your organisation?

Contact us at admin@tattle.co.in or ping us on our Slack channel.

Licence

When you submit code changes, your submissions are understood to be under the same licence that covers the project (GPL-3.0). Feel free to contact the maintainers if that's a concern.