sandbox-data: A Python repository from Tattle

scraping

A collection of scripts to scrape the following fact-checking sites:

altnews: english + hindi
quint
boomlive: english + hindi + bangla
vishvasnews: hindi + english + punjabi + assamese
indiatoday
factly: english + telugu

and social media sites:

code

scraping: helper functions for social media sites
sharechat_cron_scraper: refined script to scrape sharechat images

factchecking_news_sites: helper functions for factchecking sites
live_scraping: scraping all sites in one go
live_scraping_cmd: scraping one at a time with command-line arguments

storyScraperAPI: API to query metadata from mongo

how it's done

Many sites render most of their HTML server-side, which can be scraped with the following snippet:

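import requests
from lxml import html

# Placeholder URL; requests needs an explicit scheme (http:// or https://)
url = 'https://www.mysite.com'
# Example request headers; adjust per site
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)

# Parse the returned HTML and select the elements of interest
tree = html.fromstring(r.content)
desired_div = tree.xpath('//div')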

The text and attributes extracted from these divs can then be organized into JSON documents and dumped to a database (MongoDB).
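
As a rough sketch of that step, the extracted text could be shaped into one JSON-ready document per story and then inserted with pymongo. The field names, the `build_story_doc` and `save_docs` helpers, and the connection string, database name, and collection name are all illustrative assumptions, not the repository's actual schema or configuration.

```python
import json
from datetime import datetime, timezone


def build_story_doc(url, headline, paragraphs, domain):
    """Shape scraped text into a JSON-serializable dict (hypothetical schema)."""
    return {
        "url": url,
        "domain": domain,
        "headline": headline.strip(),
        # Drop empty strings and surrounding whitespace left over from the HTML
        "content": [p.strip() for p in paragraphs if p and p.strip()],
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def save_docs(docs, mongo_uri="mongodb://localhost:27017",
              db_name="factcheck_sandbox", collection="stories"):
    """Insert the documents into MongoDB and return how many were written.

    The URI, database name, and collection name are placeholders.
    """
    if not docs:
        return 0  # insert_many rejects an empty list, so exit early
    from pymongo import MongoClient  # imported lazily; requires pymongo

    client = MongoClient(mongo_uri)
    try:
        result = client[db_name][collection].insert_many(docs)
        return len(result.inserted_ids)
    finally:
        client.close()


# Example: build one document and preview it as JSON
doc = build_story_doc(
    "https://www.mysite.com/story-1",
    "  Example headline ",
    ["First paragraph. ", "", " Second paragraph."],
    "mysite.com",
)
print(json.dumps(doc, indent=2))
```

Keeping the document-building step separate from the database write makes it possible to inspect or test the JSON without a running MongoDB instance.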

A few sites (such as quint) render most of their content dynamically on the client; these require a more involved approach with Selenium. Selenium is also used with social media sites to scroll through and load more posts. Find more details in the blog.
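
The scrolling approach can be sketched roughly as follows. This is a generic Selenium pattern, not necessarily the code used in this repository: `scroll_to_load` and `fetch_rendered_html` are hypothetical helper names, and the scroll limit and pause length are arbitrary defaults that would need tuning per site.

```python
import time


def scroll_to_load(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom of the page until its height stops growing.

    `driver` is a Selenium WebDriver. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give client-side scripts time to load more posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so nothing new was loaded
        last_height = new_height
    return scrolls


def fetch_rendered_html(url, max_scrolls=10, pause=2.0):
    """Open `url` in Firefox (via geckodriver) and return the rendered HTML."""
    from selenium import webdriver  # imported lazily; requires selenium

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        scroll_to_load(driver, max_scrolls=max_scrolls, pause=pause)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even on errors
```

The string returned by `fetch_rendered_html` can then be parsed with `html.fromstring`, in the same way as the server-rendered case.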

installation

Install all packages: pip install -r requirements.txt

Geckodriver support:
download geckodriver
install firefox and xvfb: sudo apt-get install xvfb firefox

Lastly, fill in the details in .env_template and rename it to .env
