Bat Exploitation Classification

Code repository for my manuscript: "Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation"

1. Web Searching

Bing_Wrapper: Python script (can be run as a command line application) to conduct keyword searches on the Bing API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The number (per query) and language of results to be returned can be specified.
Google_wrapper: Python script (can be run as a command line application) to conduct keyword searches on the Google API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The number (per query), language and file type of results to be returned can be specified. Note that a maximum of 100 results per query can be returned.
Twitter_Wrapper_github: Python script (can be run as a command line application) to conduct keyword searches on the Twitter full archive search API. Inputs include a csv file containing all queries or Twitter usernames to be run, with the column name 'queries'. The language and time period of results to be returned can be specified. Users can also specify the number of results per query (newest first) or collect all search results within the time period. Note that this script must be edited before use to include the user's bearer token.
Crowdtangle_Wrapper: Python script (can be run as a command line application) to conduct keyword searches on the CrowdTangle API. Inputs include a csv file containing all queries to be run, with the column name 'queries'. The language and the type of results (photo, video etc.) can be specified. Users must include a start date for their search and the number of years after that start date they wish to search.
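
All four wrappers take the same kind of input: a csv file with a column named 'queries'. As a minimal sketch of reading that input, assuming nothing beyond the column name stated above, the queries could be loaded as follows. The function name `load_queries` and the sample queries are hypothetical and are not taken from the scripts themselves.

```python
import csv
import io


def load_queries(csv_file):
    """Return the non-empty values of the 'queries' column, in file order."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "queries" not in reader.fieldnames:
        raise ValueError("input csv must contain a 'queries' column")
    queries = []
    for row in reader:
        value = (row["queries"] or "").strip()
        if value:  # skip rows where the query cell is blank
            queries.append(value)
    return queries


# Made-up, in-memory example; a real file would be opened with open(path, newline="")
example = io.StringIO("queries\nbat hunting\nbat bushmeat\n")
print(load_queries(example))
```

Checking for the column name up front gives a clear error when a file uses a different header, rather than a failure partway through a search run.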

2. Data Cleaning

keywordmatch_lang: functions to filter a dataset using keyword matching and to remove non-English texts
similarity_filter: Python script (can be run as command line application) to deduplicate a set of texts using cosine-similarity and tf-idf vectorisation
webpage_download: Python script (can be run as command line application) to identify and extract texts from URLs collected using the scripts in 'websearching'. Users can specify either a folder of multiple csv files or a single file for extraction
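
To illustrate the idea behind similarity_filter, the following is a rough, standard-library-only sketch of deduplication with tf-idf vectors and cosine similarity. It is not the repository's implementation: the function names, the tokenisation, the smoothed idf formula and the 0.9 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Build L2-normalised tf-idf vectors (term -> weight dicts) for the texts."""
    tokenised = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    n_docs = len(tokenised)
    # Document frequency: number of texts each term occurs in
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf (an assumed variant), so no term gets a zero weight
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def deduplicate(texts, threshold=0.9):
    """Keep a text only if it is below the threshold against every kept text."""
    vectors = tfidf_vectors(texts)
    kept = []  # indices of the texts retained so far
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]
```

With this sketch, two texts that differ only in case or punctuation have a similarity of about 1 and the later one is dropped, while texts that share no terms have a similarity of 0 and are both kept.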

3. Classification

active_learning: functions to carry out active learning with BERT
bert_classification_functions: contains functions to carry out model training and training size experiments with BERT
classification_functions_baselined: contains functions to carry out training size experiments with naive Bayes and neural network classifiers
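
The description of active_learning does not say which query strategy is used. Purely as an illustration of one common option, least-confidence sampling, the sketch below picks the unlabelled examples whose highest predicted class probability is lowest. The function name and the probabilities are hypothetical, and the repository may use a different strategy.

```python
def least_confident(probabilities, k):
    """Return the indices of the k examples the model is least sure about.

    `probabilities` is a list of per-class probability lists, one per example,
    such as the softmax output of a classifier.
    """
    confidence = [max(p) for p in probabilities]
    # Sort example indices by ascending confidence; ties keep their original order
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


# Made-up predictions for three unlabelled examples and two classes
predictions = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40]]
print(least_confident(predictions, 2))
```

The selected examples would then be labelled by hand and added to the training set before the model is retrained.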

4. Utils

PREPROCESS: functions to preprocess social media posts for classification
bert_chunking: functions to identify most likely relevant chunks of longer documents for classification with BERT
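
As a rough sketch of the chunking idea, assuming overlapping windows of whitespace-separated words and a simple keyword count as the relevance score (both are assumptions; the repository's functions may split on BERT tokens and score chunks differently), the most relevant chunk of a long document could be selected like this. All names and default values are hypothetical.

```python
def split_into_chunks(text, size=200, stride=100):
    """Split text into overlapping windows of `size` words, stepping by `stride`."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window already reaches the end
            break
    return chunks


def most_relevant_chunk(text, keywords, size=200, stride=100):
    """Return the chunk containing the most keyword occurrences (first on ties)."""
    keywords = {k.lower() for k in keywords}

    def score(chunk):
        # Count words that match a keyword after stripping common punctuation
        return sum(1 for w in chunk.lower().split()
                   if w.strip(".,;:!?\"'()") in keywords)

    return max(split_into_chunks(text, size, stride), key=score)
```

Overlapping windows reduce the chance that a relevant passage is cut in half at a chunk boundary, at the cost of scoring more chunks per document.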

Bronwen-hunter/bat-exploitation-classification
