molgenis/capice-resources

train_data_creator: Better handling of duplicates.

svandenhoek opened this issue · 0 comments

When a duplicate is found (identical '#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'), the current approach simply removes every duplicate after the first. As a result, a duplicate that would yield a higher review score (= a higher weight for training) may be the one that is removed.

A better approach would be to ensure that the highest review score is kept among all sources, so that high-quality data is treated as such.
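
A rough sketch of the idea, assuming the data is held in a pandas DataFrame and the review score lives in a numeric column. The column name `review` and the function name `keep_highest_review` are hypothetical and only for illustration; the actual train_data_creator code may be organised differently. Sorting by score before dropping duplicates means the retained row is the one with the highest score rather than whichever came first:

```python
import pandas as pd

# Columns that define a duplicate, as listed in this issue.
KEY_COLUMNS = ["#CHROM", "POS", "REF", "ALT", "gene", "class"]


def keep_highest_review(df, score_column="review"):
    """Drop duplicates, keeping the row with the highest review score.

    `score_column` is an assumed column name, not taken from the code base.
    """
    # Stable sort: rows with equal scores keep their original relative order.
    ordered = df.sort_values(score_column, ascending=False, kind="mergesort")
    # After sorting, the first row of each duplicate group has the top score.
    deduplicated = ordered.drop_duplicates(subset=KEY_COLUMNS, keep="first")
    # Restore the original row order of the surviving rows.
    return deduplicated.sort_index()


# Made-up example data: rows 0 and 1 are duplicates with different scores.
variants = pd.DataFrame(
    {
        "#CHROM": ["1", "1", "2"],
        "POS": [100, 100, 200],
        "REF": ["A", "A", "G"],
        "ALT": ["T", "T", "C"],
        "gene": ["GENE1", "GENE1", "GENE2"],
        "class": ["LP", "LP", "LB"],
        "review": [1, 3, 2],
    }
)
result = keep_highest_review(variants)
```

With the current behaviour (keep the first occurrence), row 0 with score 1 would survive; in this sketch row 1 with score 3 is kept instead. If two duplicates share the same top score, the one that appeared first is retained.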