Blogspotscraper

A Python 3 script for recursively scraping a Blogspot blog. Each post is saved as a new HTML file with most of its HTML markup stripped out.

Requirements

This script uses BeautifulSoup. Install with:

pip3 install beautifulsoup4

Usage

Change the url variable to the URL of the latest post in the blog, then save the file.
Run with python3 blogspotscraper.py
Abort scraping by pressing Ctrl-C. Otherwise the script continues until there are no more posts left (or until you get banned for overusing bandwidth).
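
The stopping behaviour described above (run until no posts are left, or until the user interrupts) can be sketched as a simple loop. This is only an illustration and not necessarily how blogspotscraper.py is structured: crawl and fetch_post are hypothetical names, and fetch_post is assumed to return the post's HTML together with the URL of the next older post, or None when there is none.

```python
def crawl(start_url, fetch_post):
    """Follow older-post links from start_url until none remain or Ctrl-C is pressed.

    fetch_post is a hypothetical callable: it takes a URL and returns
    (html, older_post_url), where older_post_url is None on the last post.
    """
    visited = []
    url = start_url
    try:
        # Stop when there is no older post, or if a link loops back on itself.
        while url is not None and url not in visited:
            html, older_url = fetch_post(url)
            visited.append(url)
            url = older_url
    except KeyboardInterrupt:
        # Ctrl-C raises KeyboardInterrupt; keep what was collected so far.
        print("Aborted by user after", len(visited), "posts")
    return visited
```

Catching KeyboardInterrupt around the whole loop means an interrupted run still returns the list of posts that were already processed.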

Limitations

This script does NOT work with the official Google blogs hosted on Blogspot. It has only been tested from a Swedish IP address, so it might not work if URL redirection happens for visitors from other countries.

This is just a quick and dirty script that could work as a scaffold for writing more precise scraping features.

Known error: Some Blogspot blogs identify individual posts in a different way. If the script does not work, change the following line:

div = soup.find(id="post-body-" + findID[0])  # This retrieves each post's content

to:

div = soup.find("div", class_="post-body")
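
Instead of editing the line by hand, both lookups could be tried in order. The following is a minimal sketch under assumptions, not part of the script: find_post_body is a hypothetical helper, and post_id stands in for whatever findID[0] holds in the original code.

```python
from bs4 import BeautifulSoup


def find_post_body(soup, post_id=None):
    """Return the post content element, or None if neither lookup matches."""
    div = None
    if post_id is not None:
        # First try the id-based lookup used by the original line.
        div = soup.find(id="post-body-" + post_id)
    if div is None:
        # Fall back to the class-based lookup for blogs without per-post ids.
        div = soup.find("div", class_="post-body")
    return div
```

For example, find_post_body(BeautifulSoup(html, "html.parser"), "123") returns the element with id post-body-123 if it exists, and otherwise the first div with the class post-body.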

Warning!

Repeatedly downloading web pages might get you temporarily banned from the service. Use at your own risk.
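
One common way to lower that risk is to pause between requests. The sketch below is an assumption about how this could be done, not something the script is known to do: fetch_politely, fetch, and the 2-second default are all hypothetical, and no delay value is guaranteed to prevent a ban.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between requests.

    fetch is a hypothetical download function supplied by the caller.
    sleep is injectable so the pause can be replaced, e.g. in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With three URLs this makes three requests and sleeps twice, so the total added wait is 2 * delay_seconds.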

RyanBabij/blogspotscraper
