Question about Videos
How do you know that these videos are specifically the ones in the YouTube-8M dataset? It seems like your script only queries the categories. However, a category might contain more videos that are not part of the selected dataset.
Thanks!
@valeriechen The script downloads only the videos specified in the YouTube-8M dataset. I should have put some information in the README explaining how it works.
Basically, the repository works in the following way (I will link to this issue from the README for reference):
If you go to this page, it displays all 3862 classes of the YouTube-8M dataset. By inspecting the HTML source, you can obtain the links to the JavaScript files on Google's storage that correspond to each of these classes. I have stored the useful part of this in selectedcategories.txt.
Now, for the class Games, the list of the corresponding tf-record files can be accessed via the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/v/03bt1gh.js
Here, 03bt1gh corresponds to the value for Games in selectedcategories.txt.
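As a rough sketch of this step, the per-class link can be built from the class value. The function name `category_list_url` is only illustrative and is not taken from the repository; the URL prefix is the one from the Games example above.

```python
# Prefix taken from the Games example link; assumed to be the same for every class.
CATEGORY_BASE_URL = "https://storage.googleapis.com/data.yt8m.org/2/j/v"


def category_list_url(class_value: str) -> str:
    """Return the link that lists the tf-record ids for one class.

    `class_value` is the per-class value stored in selectedcategories.txt,
    for example "03bt1gh" for Games.
    """
    return f"{CATEGORY_BASE_URL}/{class_value}.js"


if __name__ == "__main__":
    print(category_list_url("03bt1gh"))
```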
Each of the fields in the JSON array obtained by accessing the above link contains a 4-character <tf-record-id> corresponding to one video of the Games class. There should be a total of 788288 such ids for Games.
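One possible way to pull the ids out of the downloaded text is sketched below. It assumes that each <tf-record-id> appears as a double-quoted string of exactly four URL-safe characters; the exact payload layout is not described here, so treat the pattern and the sample strings (including the made-up id ab3X) as assumptions rather than the repository's actual parsing code.

```python
import re
from typing import List

# Assumption: every tf-record id is a quoted 4-character token made of
# letters, digits, "-" or "_". Longer quoted strings are ignored.
RECORD_ID_PATTERN = re.compile(r'"([A-Za-z0-9_-]{4})"')


def extract_record_ids(body: str) -> List[str]:
    """Return all quoted 4-character ids found in the response text, in order."""
    return RECORD_ID_PATTERN.findall(body)


if __name__ == "__main__":
    # Hypothetical payload; only "19Mn" comes from this thread.
    sample = '["19Mn","ab3X"]'
    print(extract_record_ids(sample))
```

For Games, the length of the returned list could then be compared against the expected 788288 as a sanity check.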
Each of these ids can then be translated to the corresponding youtube-id by placing the <tf-record-id> into a lookup link in a specific way (Reference). Say the record id is 19Mn; then you need to access the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/i/19/19Mn.js
to get the actual youtube-id for that tf-record id (Note: the first 2 characters of the id are used as the directory name, followed by / and 19Mn.js).
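The link construction described in the note can be sketched as follows. The helper name `lookup_url` is hypothetical; the URL layout is copied from the 19Mn example.

```python
# Prefix taken from the 19Mn example link.
LOOKUP_BASE_URL = "https://storage.googleapis.com/data.yt8m.org/2/j/i"


def lookup_url(record_id: str) -> str:
    """Build the lookup link for a 4-character tf-record id.

    The first two characters of the id form the directory name,
    followed by "/" and "<record_id>.js".
    """
    if len(record_id) != 4:
        raise ValueError(f"expected a 4-character id, got {record_id!r}")
    return f"{LOOKUP_BASE_URL}/{record_id[:2]}/{record_id}.js"


if __name__ == "__main__":
    print(lookup_url("19Mn"))
```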
Once you get the video-id (which is cmh9FnLbE5s in the above case), you just need to pipe it to the youtube-dl command to download it. The actual YouTube link would be:
http://www.youtube.com/watch?v=cmh9FnLbE5s
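A minimal sketch of this last step is shown below, assuming youtube-dl is installed and on the PATH. The function names are illustrative and not the repository's own; no extra youtube-dl options are passed, only the video link.

```python
import subprocess
from typing import List


def watch_url(video_id: str) -> str:
    """Return the YouTube watch link for a video id."""
    return f"http://www.youtube.com/watch?v={video_id}"


def download_command(video_id: str) -> List[str]:
    """Return the youtube-dl invocation as an argument list (nothing is run)."""
    return ["youtube-dl", watch_url(video_id)]


def download(video_id: str) -> None:
    """Run youtube-dl for one video; raises CalledProcessError on failure."""
    subprocess.run(download_command(video_id), check=True)


if __name__ == "__main__":
    # Only prints the command; call download("cmh9FnLbE5s") to actually fetch.
    print(" ".join(download_command("cmh9FnLbE5s")))
```

Passing the command as a list avoids going through a shell, so ids that contain characters such as - or _ need no quoting.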
Hope that this answers your question.
PS: Thanks for pointing this out. Because of this, I checked the YouTube-8M website, and it seems that they have updated the dataset to a newer version, so I took the chance to update the repo to support the newer version.
super useful, thank you very much
good job!
Thanks for this! It's great!