A collection of scripts to scrape the following fact-checking sites:
- altnews: english + hindi
- quint
- boomlive: english + hindi + bangla
- vishvasnews: hindi + english + punjabi + assamese
- indiatoday
- factly: english + telugu
and social media sites:
Scripts in this collection:
- scraping: helper functions for social media sites
- sharechat_cron_scraper: refined script to scrape sharechat images
- factchecking_news_sites: helper functions for fact-checking sites
- live_scraping: scrapes all sites in one go
- live_scraping_cmd: scrapes one site at a time using command-line arguments
- storyScraperAPI: API to query metadata from MongoDB
Many sites render most of their HTML server-side, so they can be scraped with the following snippet:
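import requests
from lxml import html
# Placeholder URL; requests needs the scheme (http:// or https://)
url = 'https://www.mysite.com'
# Example header only; some sites reject the default requests user agent
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
# Parse the raw response bytes into an element tree
tree = html.fromstring(r.content)
# xpath() returns a list of matching elements (here, every div on the page)
desired_div = tree.xpath('//div')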
These divs can then be organized into JSON documents and dumped to a database (MongoDB).
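As a minimal sketch of that last step, the block below arranges already-extracted text into a JSON-serializable dict and inserts a batch of them into MongoDB with pymongo. The field names, the `build_doc` and `save_docs` helpers, the connection URI, and the database and collection names are all assumptions made for illustration; they are not the schema or settings these scripts actually use.

```python
import json
from datetime import datetime, timezone


def build_doc(url, headline, paragraphs, image_urls):
    """Arrange scraped fields into a JSON-serializable dict (hypothetical schema)."""
    return {
        "url": url,
        "headline": headline.strip(),
        # Drop paragraphs that are empty after trimming whitespace
        "body": [p.strip() for p in paragraphs if p.strip()],
        "images": list(image_urls),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def save_docs(docs, mongo_uri="mongodb://localhost:27017",
              db_name="factcheck", coll_name="stories"):
    """Insert documents into MongoDB; URI, database and collection are placeholders."""
    # Imported here so build_doc can be used without pymongo installed
    from pymongo import MongoClient

    if not docs:
        return []  # insert_many rejects an empty list
    client = MongoClient(mongo_uri)
    try:
        result = client[db_name][coll_name].insert_many(docs)
        return result.inserted_ids
    finally:
        client.close()


# Example with made-up values; in practice these would come from tree.xpath(...)
doc = build_doc(
    "https://www.mysite.com/story",
    "  Example headline ",
    ["First paragraph. ", "   "],
    ["https://www.mysite.com/image.jpg"],
)
print(json.dumps(doc, indent=2))
```

Calling `save_docs([doc])` would then write the document, provided a MongoDB instance is reachable at the given URI.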
A few sites (such as quint) render most content dynamically on the client; these require a more involved approach with Selenium. Selenium is also used with social media sites to scroll through and load more posts. More details are in the blog.
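The scrolling approach can be sketched roughly as follows: scroll to the bottom of the page, wait for new content, and stop once the page height no longer grows. The helper names `scroll_to_load` and `fetch_rendered_html`, the scroll limit, and the pause length are assumptions for illustration, not the logic used by the scripts here. It also assumes headless Firefox with geckodriver available to Selenium.

```python
import time


def scroll_to_load(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_scrolls).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give the page time to load more posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls


def fetch_rendered_html(url, max_scrolls=10):
    """Open url in headless Firefox, scroll, and return the rendered HTML."""
    # Imported here so scroll_to_load can be used without selenium installed
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        scroll_to_load(driver, max_scrolls=max_scrolls)
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can be passed to `html.fromstring` and queried with XPath in the same way as a server-rendered page.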
Install all packages: pip install -r requirements.txt
Geckodriver support:
- download geckodriver
- install Firefox and xvfb: sudo apt-get install xvfb firefox
Lastly, fill in the details in .env_template and rename it to .env