DHI/tsod

Add benchmarking dataset with labelled anomalies for scoring performance of detector algorithms

halvgaard opened this issue · 12 comments

Do you know of any (open source) datasets at DHI that have labelled anomalies that we can use for testing? @ecomodeller @laurafroelich @akfDHI

@ecomodeller I found some datasets with labelled anomalies here: https://github.com/numenta/NAB
There are very few labels, but I guess that is to be expected with anomalies.

@Rhadhi Have you checked the license for that repo? It seems to be quite strict and copyleft, so as far as I can tell, if we want to use material from the numenta/NAB repo we would need to change our license to the same one (AGPL-3.0). What do you think? If I am right, making our repo AGPL would then imply that anyone using our repo would also have to license their work under AGPL, which may not be what we want.

I don't know of any open datasets at DHI that we can use. We have to ask around and see if someone has an annotated dataset they are willing to share. There is a lot of data, but not much of it is labelled and probably even less is public, unfortunately.

I will ask around on DHI Yammer for labelled datasets with anomalies. @ecomodeller Do you have labels for the DMI dataset we have in the repo? Otherwise I will try to label the obvious ones with the algorithms, e.g.
[image: anomaly 1]

@laurafroelich @ecomodeller @akfDHI What do you think of this message for posting on Yammer?

We are trying to establish best practices and automated ways of identifying anomalies/outliers in time series data.
Please let us know if you:

  • have a dataset that needs to be cleaned automatically
  • have algorithms for detecting outliers lying around in your head or in actual code
  • have a dataset, ideally publicly available, with labelled anomalies, i.e. an exact indication of which data points are actually anomalies.

Currently we are working on algorithms ranging from simple range checks to machine learning models. Check out, and potentially contribute to, our open source anomaly detection Python package on DHI's GitHub here: https://github.com/DHI/anomalydetection

Sounds good to me :)

Can we make an interactive application to assist the labelling process?

  1. Upload data
  2. Automatic labelling of obvious outliers with a simple detector
  3. Manually add / remove labels by clicking on the chart
  4. Save the labelled time series in a reusable format, e.g. CSV
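
As a rough illustration of steps 2 and 4, here is a minimal sketch using only the Python standard library. It assumes a plain range check as the "simple detector"; the function names, the column names (`time`, `value`, `is_anomaly`) and the sample values are all made up for this example and are not taken from the package.

```python
import csv
import io


def label_out_of_range(values, lower, upper):
    """Flag every value outside the closed interval [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]


def write_labelled_csv(handle, timestamps, values, labels):
    """Write one row per data point with a 0/1 anomaly flag."""
    writer = csv.writer(handle)
    writer.writerow(["time", "value", "is_anomaly"])  # hypothetical column names
    for t, v, flag in zip(timestamps, values, labels):
        writer.writerow([t, v, int(flag)])


# Made-up series with one obvious spike.
timestamps = ["2020-01-01T00:00", "2020-01-01T01:00",
              "2020-01-01T02:00", "2020-01-01T03:00"]
values = [1.2, 1.3, 99.0, 1.1]

# Step 2: automatic labelling with an assumed valid range of 0 to 10.
labels = label_out_of_range(values, lower=0.0, upper=10.0)

# Step 4: save to CSV (an in-memory buffer stands in for a file here).
buffer = io.StringIO()
write_labelled_csv(buffer, timestamps, values, labels)
print(buffer.getvalue())
```

Manual correction (step 3) would then amount to flipping individual entries of `labels` before saving.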

@ecomodeller There is one open source tool here: https://trainset.geocene.com/

We got a labelled dataset from an actual DHI case based on groundwater measurements. Unfortunately, the dataset cannot be published on GitHub.
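
For reference, scoring a detector against a labelled series could look roughly like the sketch below. Since anomalies are rare, it reports precision, recall and F1 rather than plain accuracy. The function name `score_detector` and the toy labels are assumptions for illustration only; no real data or package API is used.

```python
def score_detector(true_labels, predicted_labels):
    """Compare predicted anomaly flags with ground-truth flags, point by point."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # anomalies correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # normal points wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)    # anomalies that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: two true anomalies, the detector finds one and raises one false alarm.
true_labels = [False, False, True, False, True, False]
predicted_labels = [False, True, True, False, False, False]

scores = score_detector(true_labels, predicted_labels)
print(scores)
```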

Please note that we now have an interactive application for labelling outliers and training a detector.