great-firewall-notebooks

Exploring automated searches and scraping of Google and Baidu in support of the Firewall Cafe project.


These notebooks are prototypes, research, and sanity checks for the Firewall Cafe project.

Setup

Install these packages at a minimum:

  • Jupyter Notebooks (or the Anaconda stack)

For some of the notebooks, you'll also need:

  • Selenium
  • Google Cloud Translate
  • ipyplot

To run those notebooks, you'll need to set up credentials for Google Cloud Translation and download the Chrome webdriver (ChromeDriver) that matches your installed version of Chrome.
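A quick pre-flight check can save a failed notebook run. The sketch below is not part of the notebooks; it assumes the Google Cloud client library reads its service-account key from the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable and that the driver binary is named `chromedriver` and is on your `PATH`. The function name `missing_prerequisites` is made up for illustration.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # Google Cloud client libraries look for a service-account key file here.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(key_path):
        problems.append(f"credentials file not found: {key_path}")

    # Assumption: the driver binary is called "chromedriver" and is on PATH.
    if which("chromedriver") is None:
        problems.append("chromedriver not found on PATH")

    return problems
```

Calling `missing_prerequisites()` in the first cell of a notebook and printing the result shows what still needs to be set up. The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.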

Prototyping a scraper

1_requests-google-baidu. Reverse-engineering search results.

2_using-google-cloud-translation. Getting some basic automatic translation with Google Translate.

3_compare-languages-Google. Comparing what search results look like in different languages on Google.

4_compare-languages-Baidu. Comparing what search results look like in different languages on Baidu.

5_querying-many-sensitive-words-archive. Testing rate limits to see whether Google or Baidu automatically block clients that query above a certain rate.
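The rate-limit test in notebook 5 could be structured roughly as below. This is a sketch under assumptions, not the notebook's actual code: it assumes a block shows up as HTTP 403 or 429, which may not hold in practice (a search engine might instead return a CAPTCHA page with status 200). The names `find_blocking_delay` and `BLOCK_STATUSES` are hypothetical, and `fetch` stands in for whatever function performs one search request.

```python
import time

BLOCK_STATUSES = {403, 429}  # assumption: a block shows up as one of these codes


def find_blocking_delay(fetch, delays, requests_per_step=5, sleep=time.sleep):
    """Try each delay (seconds, slowest first); return the first that gets blocked.

    `fetch` is any zero-argument callable returning an HTTP status code.
    Returns None if no delay in `delays` triggered a block.
    """
    for delay in delays:
        for _ in range(requests_per_step):
            status = fetch()
            if status in BLOCK_STATUSES:
                return delay
            sleep(delay)
    return None
```

Passing `fetch` and `sleep` in as arguments keeps the probing logic separate from the HTTP client, so it can be dry-run against a fake before sending any real traffic.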

API integration

6_firewall-api. Testing Firewall Cafe API endpoints and demonstrating their use.

7_firewall-babelfish. Demonstrating how to use the Babelfish translate API (if you have a key).

8_image-hashing. Testing different image hashing algorithms.

9_wordpress-node-APIs. Looking at similarities between the old and new Firewall Cafe APIs.
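For readers unfamiliar with the image hashing tested in notebook 8, here is a minimal sketch of one common algorithm, the average hash. Which algorithms the notebook actually compares is not stated here, so treat this as an illustration only. It assumes the image has already been converted to a small grid of grayscale values (real implementations typically downscale to 8x8 first); the function names are made up for this example.

```python
def average_hash(pixels):
    """Compute an average hash from a 2D grid of grayscale values (0-255).

    Each bit is 1 if the pixel is brighter than the grid's mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]   # same picture, uniformly brighter
inverted = [[200, 10], [30, 220]]   # light and dark regions swapped

same = hamming_distance(average_hash(original), average_hash(brighter))
diff = hamming_distance(average_hash(original), average_hash(inverted))
```

A small Hamming distance suggests two images are near-duplicates, which is why this kind of hash tolerates the uniform brightness change in `brighter` but not the swapped regions in `inverted`.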

Migrations

10_transfer-images-http. A first attempt at getting 10k images from one place to another.

11_extract-images-postgres-dump. Extracting images from a PostgreSQL dump; never got it working.

Data integrity checks

12_data-integrity. Checking that search results are getting entered correctly into the API, and returning as expected when we ask for them.

13_clean-up-searches-API. Deleting searches that incorrectly stored far too many images.

14_wordpress-and-db-check. Taking a closer look at the WordPress API vs. the new API to see if there are discrepancies in image results (they all seem to match).
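The round-trip check described for notebook 12 (submit a search, read it back, compare) could look something like the sketch below. The record layout and field names (`query`, `image_urls`) are hypothetical stand-ins, since the real API schema is not documented here, and `find_mismatches` is an illustrative helper rather than code from the notebooks.

```python
def find_mismatches(submitted, returned, fields):
    """Return {field: (sent, received)} for every field that differs."""
    mismatches = {}
    for field in fields:
        sent = submitted.get(field)
        received = returned.get(field)
        if sent != received:
            mismatches[field] = (sent, received)
    return mismatches


# Hypothetical records: what was sent to the API and what came back.
sent_record = {"query": "weather", "image_urls": ["a.jpg", "b.jpg"]}
read_record = {"query": "weather", "image_urls": ["a.jpg"]}

problems = find_mismatches(sent_record, read_record, ["query", "image_urls"])
```

An empty result means the stored search matches what was submitted; anything else pinpoints which field was altered or truncated on the way in or out.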