This repository provides the code and data to reproduce the experimental results from the paper
K. Yezzi-Woodley, A. Terwilliger, J. Li, E. Chen, M. Tappen, J. Calder, P. J. Olver. Using machine learning on new feature sets extracted from 3D models of broken animal bones to classify fragments according to break agent. arXiv preprint arXiv:2205.10430, 2022.
The `data` folder contains the preprocessed CSV files used in the machine learning experiments. The `preprocessing` folder contains all the Python scripts used to preprocess the raw data we collected into a form that can be used by machine learning algorithms. The scripts in the `preprocessing` folder generate the files `data/break_level_ml.csv` and `data/frag_level_ml.csv`, which contain the break-level and fragment-level datasets, respectively, with all data converted to numerical formats. The file `data/moclan.csv` contains the data from Moclan et al., 2019, which we compare against, and `data/moclan_carngrouped.csv` contains the same data except with the carnivores grouped together.
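For a quick look at the preprocessed data without running any of the scripts, the CSV files can be read directly. The snippet below is a minimal sketch that assumes pandas is installed and that it is run from the repository root:

```python
import pandas as pd

# Load the preprocessed break-level dataset; data/frag_level_ml.csv can be
# inspected in the same way.
df = pd.read_csv("data/break_level_ml.csv")

# Report the number of rows and columns and list the column names.
print(df.shape)
print(df.columns.tolist())
```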
All results are saved as `.csv` files in the `results` folder. Likewise, many of the tables from the paper are automatically generated and saved in the `tables` folder, and figures are saved in the `figures` folder.
The Python scripts require a number of Python packages, all listed in the `requirements.txt` file. To install all required packages, run

```
pip install -r requirements.txt
```
All Python scripts are in the `src` folder. Descriptions of each script are given below.

- `ml_test.py` runs all the main machine learning tests in the paper.
- `Moclan_replication.py` runs all of the experiments from our paper concerned with replicating the Moclan et al., 2019 study.
- `SpectralEmbedding.py` runs the unsupervised graph-based learning experiments, including spectral embeddings and clusterings.
- `randomized_experiment.py` runs the experiment from our paper applying bootstrapping and break-level train-test splits to randomized datasets.
- `randomized_experiment_table.py` generates a table summarizing the results from the randomized experiment.
- `utils.py` contains a variety of utility functions that are used across multiple scripts.
The `utils.py` file contains functions for loading our datasets in a numerical form that can be used directly in machine learning. To load the break-level data, run

```python
from utils import break_level_ml_dataset
data, target, specimens, break_numbers, target_names = break_level_ml_dataset()
```

The first two outputs, `data` and `target`, are the features and labels for the dataset, all converted to numerical values. The remaining outputs, `specimens`, `break_numbers`, and `target_names`, are lists of the specimen names and break numbers for all datapoints (breaks), and the target names for each numerical label. The function has several optional arguments that control how the dataset is built and which field is used for the targets; see the documentation in `utils.py` for more information.
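To illustrate how these outputs plug into a standard machine learning workflow, the sketch below fits a simple classifier to the break-level data. This is not one of the experiments from the paper (those are run by `ml_test.py`); it is only a minimal example that assumes scikit-learn is available:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from utils import break_level_ml_dataset

# Load the break-level dataset with its default settings.
data, target, specimens, break_numbers, target_names = break_level_ml_dataset()

# Plain random train/test split, for illustration only; see ml_test.py for
# the splitting protocols used in the paper's experiments.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

# Fit a random forest and report its test accuracy.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```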
To load the fragment-level dataset, run the code

```python
from utils import frag_level_ml_dataset
data, target, specimens, target_names = frag_level_ml_dataset()
```
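As with the break-level loader, the returned values can be checked quickly after loading; the lines below only assume the outputs are array-like:

```python
# Basic checks on the fragment-level dataset: how many fragments were loaded
# and which classes the numerical targets correspond to.
print("number of fragments:", len(data))
print("number of classes:", len(target_names))
print("class names:", list(target_names))
```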
Email jwcalder@umn.edu with any questions or comments.