Bat Exploitation Classification

Code repository for my manuscript: "Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation"

1. Web Searching

  • Bing_Wrapper: Python script (can be run as a command line application) to conduct keyword searches on the Bing API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The number (per query) and language of results to be returned can be specified.
  • Google_wrapper: Python script (can be run as a command line application) to conduct keyword searches on the Google API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The number (per query), language and file type of results to be returned can be specified. Note, a maximum of 100 results per query can be returned.
  • Twitter_Wrapper_github: Python script (can be run as a command line application) to conduct keyword searches on the Twitter full archive search API. Inputs include a csv file containing all queries or Twitter usernames to be run, with the column name 'queries'. The language and time period of results to be returned can be specified. Users can also specify the number of results per query (newest first) or collect all search results within the time period. Note, this script must be edited before use to include the user's bearer token.
  • Crowdtangle_Wrapper: Python script (can be run as a command line application) to conduct keyword searches on the CrowdTangle API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The language and the type of results (photo, video, etc.) can be specified. Users must include a start date for their search and the number of years after that start date they wish to search.
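
All four wrappers take a csv file with a 'queries' column. The sketch below shows one way such a file could be read before the queries are sent to an API. It is an illustration only, not the code used in the wrappers: the function name load_queries and the example queries are hypothetical.

```python
import csv
import io


def load_queries(csv_file):
    """Return the non-empty values of the 'queries' column, in file order.

    Hypothetical helper: the wrappers in this repository may read the
    file differently.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "queries" not in reader.fieldnames:
        raise ValueError("input csv needs a column named 'queries'")
    return [
        row["queries"].strip()
        for row in reader
        if row["queries"] and row["queries"].strip()
    ]


# Made-up example input; a real run would pass an open file instead.
example = io.StringIO("queries\nbat bushmeat\nbat guano harvest\n")
queries = load_queries(example)
print(queries)
```

Failing early on a missing 'queries' column avoids sending an empty batch of searches to a rate-limited API.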

2. Data Cleaning

  • keywordmatch_lang: functions to filter a dataset using keyword matching and removal of non-English texts
  • similarity_filter: Python script (can be run as a command line application) to deduplicate a set of texts using cosine similarity and tf-idf vectorisation
  • webpage_download: Python script (can be run as a command line application) to identify and extract texts from URLs collected using the scripts in 'websearching'. Users can either specify a folder of multiple csvs or a single file for extraction
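
As a rough illustration of the idea behind similarity_filter, the sketch below deduplicates texts with tf-idf vectors and cosine similarity. It is a minimal standard-library version written for this README, not the script itself: the whitespace tokenisation, the smoothed idf formula, the greedy keep-first strategy and the 0.9 threshold are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse tf-idf vector (dict of term -> weight) per text."""
    tokenised = [text.lower().split() for text in texts]
    n_docs = len(tokenised)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(texts, threshold=0.9):
    """Keep a text only if it is below the threshold against every kept text."""
    vectors = tfidf_vectors(texts)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]


# Made-up example texts.
texts = [
    "Bats sold at the market",
    "bats sold at the market",
    "Guano harvested from caves",
]
unique = deduplicate(texts)
print(unique)
```

The pairwise comparison is quadratic in the number of texts, so a larger collection would probably need a vectorised library implementation.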

3. Classification

  • active_learning: functions to carry out active learning with BERT
  • bert_classification_functions: contains functions to carry out model training and training size experiments with BERT
  • classification_functions_baselined: contains functions to carry out training size experiments with naive Bayes and neural network classifiers
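
For readers new to active learning, the sketch below shows one common query strategy, least-confidence sampling: the examples on which the current model is least sure are sent for labelling next. This README does not state which strategy active_learning uses, so treat the strategy, the function name least_confident and the example probabilities as assumptions.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the examples whose top class probability is lowest.

    `probabilities` holds one list of class probabilities per unlabelled
    example, e.g. the softmax output of a classifier.
    """
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:batch_size]


# Made-up predictions for four unlabelled examples and two classes.
probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
to_label = least_confident(probs, 2)
print(to_label)
```

In a full loop the selected examples would be labelled, added to the training set, and the model retrained before the next round of selection.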

4. Utils

  • PREPROCESS: functions to preprocess social media posts for classification
  • bert_chunking: functions to identify most likely relevant chunks of longer documents for classification with BERT
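
Standard BERT models accept at most 512 tokens, so longer documents have to be cut down before classification. The sketch below shows one simple way to do this: split the document into overlapping word windows and keep the windows that contain the most keywords. It is not the method in bert_chunking. The keyword-count score, the window sizes and all function names are assumptions made for illustration.

```python
def chunk_words(text, size=200, stride=100):
    """Split text into overlapping windows of `size` words, `stride` apart."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def top_chunks(text, keywords, k=1, size=200, stride=100):
    """Return the k chunks containing the most keyword occurrences."""
    keyword_set = {kw.lower() for kw in keywords}

    def score(chunk):
        # Count words that match a keyword once punctuation is stripped.
        return sum(
            1 for word in chunk.lower().split()
            if word.strip(".,;:!?") in keyword_set
        )

    chunks = chunk_words(text, size, stride)
    return sorted(chunks, key=score, reverse=True)[:k]


# Made-up document: one relevant sentence surrounded by filler.
filler = "the weather was mild this week " * 3
doc = filler + "bats were sold as bushmeat at the market " + filler
best = top_chunks(doc, ["bats", "bushmeat"], k=1, size=8, stride=4)
print(best)
```

Overlapping windows reduce the chance that a relevant passage is split across two chunks and missed.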