Code that mines data from the YouTube comments section of any playlist.
Note: This is not an application or a module to be downloaded and used as-is. It is more of a blueprint for mining data. I created it for my own use, but it is public, and anyone can use it if they can figure out how. You have to read through the code and understand it before using it, because you will be supplying the code with a private API key.
- This uses the YouTube Data API. Data must be collected over the course of several days, since the YouTube Data API has a daily quota. get_data.py is to be run once a day, after the quota has renewed, until all data has been collected. The given key is used until its quota is exhausted each day, unless there is no more data to collect. Do not change the key and run again before cleaning the state, as this may get the key banned from the YouTube Data API.
- For reference, quota exhausted.json contains the response YouTube returns when the quota has been used up, and comments disabled.json contains the response YouTube returns when comments are disabled for a video.
- get_data.py does the actual data mining; cab_1729.py contains the details of what to mine.
- to_json.py converts the data to a more human-readable format. It does not convert everything, however: images are also stored in the data files, and they are not converted to JSON.
- Data is stored in shelve files with the same name as the playlist. The state shelf records how much data has been mined, so the next run can pick up from there.
- It is recommended not to change anything in the files until all the data has been mined, except the semaphore counts in get_data.py.
- The logs directory is for storing the output log, which is not written by default. I personally use tmux-logging for that.
- storage is a folder for backing up the data.
- The .bat files are for quickly handling backups.
- tests.py contains the tests.
- To use this, you will almost certainly need to read the code or open the data shelf in Python to understand the format in which the data is stored.
- All the filtering is done locally, so the filtering code stays completely private. The script goes through every available comment in the playlist, so YouTube's servers only see the API key, the IP address, and the playlist ID.
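
Since understanding the stored format means opening the data shelf in Python, here is a minimal sketch of one way to peek inside a shelf without modifying it. The function name summarize_shelf and the limit parameter are hypothetical and not part of this repository. It assumes the shelf was written with the standard shelve module and that any classes pickled into it can be imported from where you run it.

```python
import shelve


def summarize_shelf(path, limit=5):
    """Return up to `limit` (key, value type name) pairs from a shelf.

    `path` is the shelf name as passed to shelve.open when it was created.
    """
    summary = []
    # flag="r" opens the shelf read-only, so mined data cannot be
    # changed by accident while inspecting it.
    with shelve.open(path, flag="r") as shelf:
        for index, key in enumerate(shelf):
            if index >= limit:
                break
            # Reading shelf[key] unpickles the stored value.
            summary.append((key, type(shelf[key]).__name__))
    return summary
```

Calling summarize_shelf with the playlist name as the path lists a few keys and the Python type of each stored value, which is usually enough to decide how to dig further. Key order is not guaranteed.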
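
The two reference responses mentioned above (quota exhausted and comments disabled) can be told apart by their error reason. The sketch below is an illustration and not necessarily how get_data.py handles them. It assumes the responses follow the usual YouTube Data API error layout, with reasons listed under error.errors[].reason and the values quotaExceeded and commentsDisabled; check this against the two JSON files in the repository. The function names error_reasons and next_action are hypothetical.

```python
def error_reasons(response):
    """Collect the `reason` strings from a parsed API error response."""
    error = response.get("error")
    if not isinstance(error, dict):
        return set()
    errors = error.get("errors") or []
    return {item.get("reason") for item in errors if isinstance(item, dict)}


def next_action(response):
    """Decide what a mining loop could do with a parsed response (assumed policy)."""
    reasons = error_reasons(response)
    if "quotaExceeded" in reasons:
        return "stop"      # daily quota used up: save state, resume tomorrow
    if "commentsDisabled" in reasons:
        return "skip"      # this video has no readable comments: move on
    return "continue"      # no recognised error: keep mining


# Example with a file from this repository (its exact layout is assumed):
#   import json
#   with open("quota exhausted.json") as f:
#       print(next_action(json.load(f)))
```

Keeping the decision in one small function makes it easy to stop cleanly when the quota runs out instead of retrying with the same key.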