train_data_creator: Better handling of duplicates.
svandenhoek opened this issue · 0 comments
If a duplicate is found (identical '#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'), the current approach simply removes every duplicate after the first. As a result, a duplicate that would yield a higher review score (and therefore a higher weight for training) may be the one that is removed.
A better approach would be to ensure that the highest review score among all sources is kept, so that high-quality data is treated as such.
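One possible way to do this with pandas is sketched below. It is an illustration only, not the current train_data_creator code: the function name `drop_duplicates_keep_best` and the column name `review_score` are hypothetical, and it assumes the merged data from all sources is held in a single DataFrame with a numeric review score per row.

```python
import pandas as pd

# Columns that define a duplicate, as listed in this issue.
KEY_COLUMNS = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class']


def drop_duplicates_keep_best(df, score_column='review_score'):
    """Keep, per duplicate group, the row with the highest review score.

    NOTE: 'review_score' is an assumed column name used for illustration.
    """
    # Stable descending sort: the best-scoring row of each group comes first,
    # and rows with equal scores keep their original relative order.
    # Missing scores are placed last, so they are only kept if nothing else exists.
    ordered = df.sort_values(score_column, ascending=False, kind='stable')
    deduplicated = ordered.drop_duplicates(subset=KEY_COLUMNS, keep='first')
    # Restore the original row order.
    return deduplicated.sort_index()


example = pd.DataFrame({
    '#CHROM': ['1', '1', '2'],
    'POS': [100, 100, 200],
    'REF': ['A', 'A', 'G'],
    'ALT': ['T', 'T', 'C'],
    'gene': ['GENE1', 'GENE1', 'GENE2'],
    'class': ['LP', 'LP', 'LB'],
    'review_score': [1, 3, 2],
})
result = drop_duplicates_keep_best(example)
```

In this example the first two rows are duplicates, and the second one (review score 3) is kept instead of the first. Ties are resolved in favour of the row that appears first, which matches the current behaviour when all scores are equal.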