micah5/sponsorship_remover

False Positive Advertisement Identification for Some Videos

aallbrig opened this issue · 3 comments

Issue

The program identifies segments of video as advertisements when they are not. I've only tested Real Engineering's videos, but it seems that almost every video tested has a false positive.

Results

A checked box [x] indicates a false positive (bad); an unchecked box [ ] indicates a true positive (good).

Replicate:

Format:

$ id="<<id of youtube video>>"
$ results=$(python predict.py -i $id | tail -n1 | sed s/\'/\"/g)
$ # Convert HH:MM:SS timestamps into seconds; generate URLs to those timestamps in the video
$ urls=$(echo $results | jq -r '.[].start' | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }' | xargs -I{} echo "https://youtu.be/$id?t={}")
$ # Open each URL to manually verify
$ echo $urls | xargs -I{} open {}
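The HH:MM:SS-to-seconds step that awk handles above can also be sketched in Python, if shell tooling isn't handy. This is a minimal illustration, not part of the project; the segment list below is invented:

```python
# Convert an HH:MM:SS timestamp string to whole seconds,
# mirroring the awk step in the pipeline above.
def to_seconds(timestamp):
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Build verification URLs from predicted segment starts.
# The segments here are made up for illustration only.
video_id = "kMNSAhsyiDg"
segments = [{"start": "00:01:30"}, {"start": "00:12:05"}]
urls = [f"https://youtu.be/{video_id}?t={to_seconds(s['start'])}" for s in segments]
print(urls)  # ['https://youtu.be/kMNSAhsyiDg?t=90', 'https://youtu.be/kMNSAhsyiDg?t=725']
```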

Sample Run Example

$ id="kMNSAhsyiDg"
$ results=$(docker run py-sponsorship-remover python /scripts/predict.py -i $id | tail -n1 | sed s/\'/\"/g)
$ urls=$(echo $results | jq -r '.[].start' | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }' | xargs -I{} echo "https://youtu.be/$id?t={}")
$ # Open each URL to manually verify
$ echo $urls | xargs -I{} open {}

Tips & Tricks

Generate checkmark markup for this issue and copy into clipboard:

echo $urls | xargs -I{} echo "- [ ] [{}]({})" | pbcopy

Pretty-print "results" using jq to ensure the timestamp-to-seconds conversion is correct:

echo $results | jq

Thank you for taking the time to test everything; I really appreciate it.

There is certainly a fair share of false positives. I think we can fix it by updating the dataset; at the moment there are only 200-odd examples of sponsorship (and the rest is just garbage copied from the excess and clipped at a certain number of words).

I think refining and adding to the training dataset should fix this, although I would love to hear your thoughts.

> I think refining and adding to the training dataset should fix this

@98mprice I concur. When/if I loop back to this I could possibly help refine data.csv and/or think up some programmatic before/after verification tests so we can see if the detection is getting better or worse.
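One possible shape for those before/after verification tests — purely a sketch, not anything in the repo: hand-label sponsor segments for a few videos, then score each model version against them. The helper names and data layout below are invented for illustration, and overlap is judged in seconds:

```python
# Score predicted sponsor segments against hand-labeled ground truth.
# Segments are (start_sec, end_sec) tuples; all names here are illustrative.

def overlaps(a, b, tolerance=0):
    # True if segments a and b share any time (within an optional tolerance in seconds).
    return a[0] <= b[1] + tolerance and b[0] <= a[1] + tolerance

def precision_recall(predicted, labeled):
    # Precision: fraction of predictions that overlap a labeled sponsor segment.
    # Recall: fraction of labeled segments matched by at least one prediction.
    true_pos = sum(1 for p in predicted if any(overlaps(p, l) for l in labeled))
    matched = sum(1 for l in labeled if any(overlaps(p, l) for p in predicted))
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = matched / len(labeled) if labeled else 0.0
    return precision, recall

# Example: two predictions, one hand-verified sponsor read.
labeled = [(90, 150)]                 # hand-labeled sponsor segment
predicted = [(85, 140), (300, 360)]   # model output: one hit, one false positive
print(precision_recall(predicted, labeled))  # (0.5, 1.0)
```

Comparing these numbers before and after a dataset change would show whether detection is actually improving.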

Full disclosure: I've only toyed with TensorFlow for less than 30 minutes, so I'll probably need some time to get up to speed. This seems like a great project to learn on :)

Awesome! That sounds like a plan.

Also, the project uses Keras (on top of TensorFlow), so hopefully that'll make it easier to jump in.