tsar_PT: Time Series Automated Re-labeling

Re-implementing TSAR with PyTorch

The name of this project is in no way an endorsement of Russian imperialism!

Author: Gentry Atkinson

Organization: Texas State University

Accurately detecting instances in datasets that have been mislabeled is a difficult problem with several imperfect solutions. Hand-reviewing labels is a reliable but expensive approach. Time series datasets present additional challenges because they are not as easily interpreted by reviewers. This paper introduces TSAR, as system for facilitating human review of a small portion of a dataset that it identifies as the most likely to be mislabeled. TSAR's use is demonstrated on real-world time series data.

Use

TSAR provides to main functions: identifying a list of instances in a dataset which are the most likely to be mislabeled and generating interpretable visualizations for each of those instances. As the name indicated TSAR is intended to be used with time series data. This may include: accelerometers, gyroscopes, ECG, EEG, temperatures, flow rates, or many other sensors. TSAR employs a 1D CNN as a tool for identifying its candidate list and as such can be used to review any dataset with a fixed input vector size but your results may vary for data domains other than time series.

Candidate List

A deep neural network is trained on an appropriate dataset and then predicts a label for each instance. The list of instances is sorted by the dot product of the predicted label and the assigned label. Predicted abels that are farther from the assigned labels will have a smaller value (an orthogonal one-hot encoded label will have a value of zero) and they go to the top of the list. TSAR will generate visualizations for some percentage (user input) of the full dataset (these are the candidate list). Values of 2% to 7% are appropriate in our experience. Less won't have much effect and more gets unweidly to review.

Visualizations

TSAR uses several tricks to make your data more interpretable. The first is to generate rich feature sets using deep feature learners (that's where the CNN comes in) and to use that feature set and tSNE to produce a scatter plot of some instances from your data (either all instances or only the two most likelty correct classes. The latter is less cluttered and is prefered by this team). Mislabeled point in this scatter plot will often fall into a miscolored region of the plot. The visualization also includes three waveforms. The first is the instance being reviewed. Second the nearest neighbor in the feature set with the SAME label, Third the nearest neighbor in the feature set with a DIFFERENT label.

This work is licensed under Creative Commons Attribution 4.0 International Public License. Please consult the license file for more information about permitted activities and requirements.