gsssrao/youtube-8m-videos-frames

Question about Videos

Closed this issue · 5 comments

How do you know if these videos are the ones specifically in the 8M dataset? It seems like your script only queries the categories. However, there might be more videos in a category that are not part of the selected dataset.

Thanks!

@valeriechen The script downloads only the videos specified in the YouTube-8M dataset. I should have put some information in the README explaining how it works.

Basically, the repository works in the following way (I will link this issue to README for reference):

If you go to this page, it displays all 3862 classes of the YouTube-8M dataset. By inspecting the HTML source, you can obtain the links to the JavaScript files in Google's storage that correspond to each of these classes. I have stored the useful part of this in selectedcategories.txt.

Now, for the class Games, the list of the corresponding tf-record files can be accessed via the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/v/03bt1gh.js
Here, 03bt1gh corresponds to the value for Games in the selectedcategories.txt.

Each entry in the JSON array returned by the above link is a 4-character <tf-record-id> corresponding to one video of the Games class. There should be a total of 788288 such ids for Games.
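A minimal sketch of this step in Python, for anyone who wants to reproduce it outside the script. The function names are hypothetical, and the parsing assumes the ids appear in the response as double-quoted, 4-character alphanumeric strings; check that against the actual response before relying on it.

```python
import re

# URL prefix taken from the Games example above.
CATEGORY_BASE_URL = "https://storage.googleapis.com/data.yt8m.org/2/j/v"


def category_url(category_id):
    """Build the URL of the .js file listing the record ids of one class."""
    return "{}/{}.js".format(CATEGORY_BASE_URL, category_id)


def extract_record_ids(payload):
    """Pull 4-character record ids out of the response text.

    Assumption: each id is a double-quoted, 4-character alphanumeric token.
    Longer quoted strings (such as the 7-character class id) are skipped.
    """
    return re.findall(r'"([A-Za-z0-9]{4})"', payload)
```

Fetching the text at `category_url("03bt1gh")` and passing it to `extract_record_ids` should then give the list of ids for Games, if the assumption about the format holds.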

Each of these ids can then be translated to the corresponding youtube-id by substituting the <tf-record-id> into a lookup link in a specific way (Reference). Say the record id is 19Mn; then you need to access the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/i/19/19Mn.js
to get the actual youtube-id for that tf-record id (Note: the first 2 characters of the id, here 19, form the directory name, which is followed by / and the full id with a .js extension, here 19Mn.js).
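The URL construction can be sketched as follows. The function name is hypothetical; the URL prefix and the two-character directory rule come from the 19Mn example.

```python
# URL prefix taken from the 19Mn example above.
LOOKUP_BASE_URL = "https://storage.googleapis.com/data.yt8m.org/2/j/i"


def record_lookup_url(record_id):
    """Build the lookup URL that maps a tf-record id to its youtube-id.

    The first two characters of the record id are used as the directory,
    followed by the full id with a .js extension.
    """
    if len(record_id) != 4:
        raise ValueError("expected a 4-character record id")
    return "{}/{}/{}.js".format(LOOKUP_BASE_URL, record_id[:2], record_id)
```

For example, `record_lookup_url("19Mn")` produces the link shown for 19Mn.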

Once you have the video-id (which is cmh9FnLbE5s in the above case), you just need to pass it to the youtube-dl command to download it. The actual YouTube link would be:
http://www.youtube.com/watch?v=cmh9FnLbE5s
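A rough Python sketch of this last step. The helper names are hypothetical, and it assumes youtube-dl is installed and on your PATH; the block only builds the command and does not run it.

```python
def watch_url(video_id):
    """Build the YouTube watch link for a video id."""
    return "http://www.youtube.com/watch?v={}".format(video_id)


def download_command(video_id):
    """Return the argument list for a plain youtube-dl call on one video.

    Assumption: youtube-dl is on PATH. No extra options are passed,
    so youtube-dl would use its defaults.
    """
    return ["youtube-dl", watch_url(video_id)]
```

To actually download, the list could be handed to `subprocess.run(download_command("cmh9FnLbE5s"), check=True)`.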

Hope that this answers your question.

PS: Thanks for pointing this out. Because of this, I checked the YouTube-8M website, and it seems that they have updated the dataset to a newer version. So I took the chance to update the repo to support the newer version.

super useful, thank you very much

good job!

@gsssrao just wanted to thank you for this repository! Thanks a lot!

wsuen commented

Thanks for this! It's great!