False Positive Advertisement Identification for Some Videos
aallbrig opened this issue · 3 comments
Issue
The program identifies segments of video as advertisements when they are not. I've only tested Real Engineering's videos, but it seems that almost every video tested has at least one false positive.
Results
A checked box [x] indicates a false positive (bad); a blank box [ ] indicates a true positive (good).
- How Shazam Works
[x] https://youtu.be/kMNSAhsyiDg?t=267
[x] https://youtu.be/kMNSAhsyiDg?t=304
[x] https://youtu.be/kMNSAhsyiDg?t=311
[x] https://youtu.be/kMNSAhsyiDg?t=345
[x] https://youtu.be/kMNSAhsyiDg?t=530
[ ] https://youtu.be/kMNSAhsyiDg?t=588
[ ] https://youtu.be/kMNSAhsyiDg?t=596
[ ] https://youtu.be/kMNSAhsyiDg?t=602
- Britain's Most Daring WW2 Raid
[x] https://youtu.be/52VQdt0-5EQ?t=91
[ ] https://youtu.be/52VQdt0-5EQ?t=755
[ ] https://youtu.be/52VQdt0-5EQ?t=762
[ ] https://youtu.be/52VQdt0-5EQ?t=768
- The Truth About Vinyl - Vinyl vs. Digital
[ ] https://youtu.be/lzRvSWPZQYk?t=0
[ ] https://youtu.be/lzRvSWPZQYk?t=782
[ ] https://youtu.be/lzRvSWPZQYk?t=826
[ ] https://youtu.be/lzRvSWPZQYk?t=831
- How Machine Learning is Fighting Cancer
[x] https://youtu.be/ALQ_RNSRE40?t=26
[x] https://youtu.be/ALQ_RNSRE40?t=141
[x] https://youtu.be/ALQ_RNSRE40?t=545
[ ] https://youtu.be/ALQ_RNSRE40?t=782
[ ] https://youtu.be/ALQ_RNSRE40?t=789
[ ] https://youtu.be/ALQ_RNSRE40?t=796
- NASA's 150 Million Dollar Coding Error
[x] https://youtu.be/CkOOazEJcUc?t=275
[ ] https://youtu.be/CkOOazEJcUc?t=363
[ ] https://youtu.be/CkOOazEJcUc?t=381
[ ] https://youtu.be/CkOOazEJcUc?t=386
[ ] https://youtu.be/CkOOazEJcUc?t=392
[ ] https://youtu.be/CkOOazEJcUc?t=408
- Designing The Fastest Wheels in History
[ ] https://youtu.be/mPshhkYpCBY?t=9
[x] https://youtu.be/mPshhkYpCBY?t=77
[ ] https://youtu.be/mPshhkYpCBY?t=475
[ ] https://youtu.be/mPshhkYpCBY?t=514
[ ] https://youtu.be/mPshhkYpCBY?t=544
[ ] https://youtu.be/mPshhkYpCBY?t=549
[ ] https://youtu.be/mPshhkYpCBY?t=555
- Fastest Car vs. Fastest Helicopter - Which is Faster?
[x] https://youtu.be/Ua2_IR2CkZU?t=219
[ ] https://youtu.be/Ua2_IR2CkZU?t=452
[ ] https://youtu.be/Ua2_IR2CkZU?t=469
[ ] https://youtu.be/Ua2_IR2CkZU?t=477
Replicate:
Format:
$ id="<<id of youtube video>>"
$ results=$(python predict.py -i "$id" | tail -n1 | sed s/\'/\"/g)
$ # Convert HH:MM:SS into seconds; generate URLs to timestamps in the video
$ urls=$(echo "$results" | jq -r '.[].start' | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }' | xargs -I{} echo "https://youtu.be/$id?t={}")
$ # Open each URL to manually verify
$ echo "$urls" | xargs -I{} open {}
Sample Run Example
$ id="kMNSAhsyiDg"
$ results=$(docker run py-sponsorship-remover python /scripts/predict.py -i "$id" | tail -n1 | sed s/\'/\"/g)
$ urls=$(echo "$results" | jq -r '.[].start' | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }' | xargs -I{} echo "https://youtu.be/$id?t={}")
$ # Open each URL to manually verify
$ echo "$urls" | xargs -I{} open {}
Tips & Tricks
Generate checkbox markup for this issue and copy it to the clipboard:
echo "$urls" | xargs -I{} echo "- [ ] [{}]({})" | pbcopy
See beautified "results" using jq to verify the timestamp-to-seconds conversion is correct:
echo "$results" | jq
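The awk one-liner above can be cross-checked in Python. This is a minimal sketch, assuming predict.py emits a JSON list of segments whose "start" field is an HH:MM:SS string (the field name and format are inferred from the pipeline above, not confirmed against the repo):

```python
import json

def hms_to_seconds(stamp: str) -> int:
    """Convert an HH:MM:SS string to total seconds, mirroring the awk formula."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Stand-in for $results; real values come from predict.py
results = json.loads('[{"start": "00:04:27"}, {"start": "00:05:04"}]')
urls = [f"https://youtu.be/kMNSAhsyiDg?t={hms_to_seconds(seg['start'])}"
        for seg in results]
print(urls)  # t=267 and t=304 match the first two timestamps listed above
```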
Thank you for taking the time to test everything, I really appreciate it.
There certainly is a fair share of false positives. I think we can fix it by updating the dataset; at the moment there are only 200-odd examples of sponsorship (and the rest is just garbage copied from the excess and clipped at a certain number of words).
I think refining and adding to the training dataset should fix this, although I would love to hear your thoughts.
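As a first step, the imbalance could be measured directly. A rough sketch, assuming data.csv has a label column; the column names and label values here are hypothetical, not the repo's actual schema:

```python
import csv
import io
from collections import Counter

def label_counts(rows) -> Counter:
    """Count examples per label in an iterable of CSV dict rows."""
    return Counter(row["label"] for row in rows)

# Inline stand-in for data.csv; in practice: open("data.csv", newline="")
sample = io.StringIO(
    "text,label\n"
    "check out our sponsor,sponsor\n"
    "the bridge was built in 1932,not_sponsor\n"
    "use code ENGINEERING for ten percent off,sponsor\n"
)
counts = label_counts(csv.DictReader(sample))
print(counts)  # a heavy skew here would mirror the ~200-example imbalance
```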
> I think refining and adding to the training dataset should fix this
@98mprice I concur. When/if I loop back to this I could possibly help refine data.csv
and/or think up some programmatic before/after verification tests so we can see if the detection is getting better or worse.
Full disclosure: I've only toyed with tensorflow for less than 30 minutes so I'll probably need some time to get up to speed. This seems like a great project to learn :)
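One possible shape for those before/after verification tests: compare predicted sponsor segments against hand-labelled ground truth and report precision. The segment representation (start/end in seconds) and the tolerance are assumptions for illustration:

```python
def overlaps(pred, truth, tolerance=5):
    """True if a predicted (start, end) pair is within tolerance seconds of a labelled one."""
    return abs(pred[0] - truth[0]) <= tolerance and abs(pred[1] - truth[1]) <= tolerance

def precision(predicted, labelled, tolerance=5):
    """Fraction of predicted segments that match some hand-labelled sponsor segment."""
    if not predicted:
        return 0.0
    hits = sum(any(overlaps(p, t, tolerance) for t in labelled) for p in predicted)
    return hits / len(predicted)

# Example: 2 of 3 predictions match a labelled sponsor segment (values made up)
predicted = [(267, 304), (588, 602), (755, 768)]
labelled = [(588, 602), (755, 768)]
print(precision(predicted, labelled))  # 2/3
```

Running this before and after a dataset change would give a single number to watch.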
Awesome! That sounds like a plan.
Also, the project uses Keras (on top of tensorflow) so hopefully that'll make it easier to jump in.