/ism-youtube-scraper

For Independent Study Module

Primary Language: Jupyter Notebook



Independent Study Module

An attempt to understand YouTube collaborations.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contact
  5. Acknowledgments

About The Project

Built With

Modules

  • YouTube API Scraper (./youtube_api_scrapper)
  • Entity Recognition Model (./collab_labelling/entity_recognition)
  • Machine Learning Modelling (./collab_labelling/machine_learning)

Getting Started

Installation

  1. Get a free API key for the YouTube Data API v3
  2. Clone the repo
    git clone https://github.com/Yoshi275/ism-youtube-scraper.git
  3. Install Python packages in a virtualenv
    pip install virtualenv
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
  4. Enter your API key in a new file at .\youtube_api_scrapper\.env
    DEVELOPER_KEY='ENTER YOUR API KEY'
  5. If you use the spaCy entity recognition model, download the en_core_web_lg model
    python -m spacy download en_core_web_lg
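
How the scripts read the key from the .env file is not shown in this README. As a minimal sketch, assuming the file holds plain KEY=VALUE lines, a small parser could look like the following. The helper name load_env is hypothetical and is not part of this repo.

```python
from pathlib import Path


def load_env(path):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

Calling load_env(".env")["DEVELOPER_KEY"] from inside the youtube_api_scrapper folder would then return the key as a string, under the assumption above.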

Usage

Using YouTube Scraper

  1. Enter the youtube_api_scrapper folder
  2. Run python <FILE_NAME.py> with one of the following files:
    • channel_info_table_populator.py - gives channel information based on a channel ID or username
    • video_info_table_populator.py - gives video information based on a video ID
    • channels_to_ids.py - converts channel usernames into unique YouTube channel IDs
    • id_to_uploads_playlist.py - gets all videos from the specified channel ID
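
For orientation, the lookups described above map onto two YouTube Data API v3 endpoints: channels (queried by id or forUsername) and playlistItems (queried by playlistId). The sketch below only builds the request URLs and makes no network call. The function names are hypothetical, and this is not necessarily how the scripts in this repo issue their requests.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.googleapis.com/youtube/v3"


def channel_url(api_key, channel_id=None, username=None):
    """Build a channels.list URL for either a channel ID or a username."""
    if (channel_id is None) == (username is None):
        raise ValueError("pass exactly one of channel_id or username")
    params = {"part": "snippet,contentDetails,statistics", "key": api_key}
    if channel_id is not None:
        params["id"] = channel_id
    else:
        params["forUsername"] = username
    return f"{API_ROOT}/channels?{urlencode(params)}"


def uploads_page_url(api_key, playlist_id, page_token=None):
    """Build a playlistItems.list URL for one page of an uploads playlist."""
    params = {
        "part": "snippet",
        "playlistId": playlist_id,
        "maxResults": 50,  # the API returns at most 50 items per page
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return f"{API_ROOT}/playlistItems?{urlencode(params)}"
```

The channels response carries the uploads playlist ID under contentDetails.relatedPlaylists.uploads. Paging through playlistItems with the nextPageToken from each response then lists every video the channel has uploaded.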

Using Entity Recognition Model

  1. Enter the .\collab_labelling\entity_recognition folder
  2. Ensure that the folder contains a gservice_account.json file and an input file whose name matches INPUT_CSV_FILE_NAME in entity_recognition_model.py. The default input file name is test_videos.csv
  3. In entity_recognition_model.py, change the following variables as needed:
    • ENTITY_RECOGNITION_MODEL - options: Model.GOOGLE_MODEL, Model.SPACY_MODEL. This determines which ER model is used
    • INPUT_CSV_FILE_NAME and OUTPUT_CSV_FILE_NAME - These determine the input and output files. The input file must have the same structure as the video data in OneDrive. The output file returns
    • SELECTED_COLS - This lists the names of the columns containing the text that you want to run through the ER model. In the background, the text from all selected columns is combined into a single string, which is then sent through the ER model.
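
The column-combining step described for SELECTED_COLS can be sketched roughly as follows. This is an illustration only, not the code in entity_recognition_model.py: the function names and the example column names (title, description) are assumptions, and the ER model call itself is left out.

```python
import csv


def combine_text(row, selected_cols):
    """Join the text of the selected columns of one row into one string."""
    parts = [(row.get(col) or "").strip() for col in selected_cols]
    # Skip empty cells so the result has no stray separators.
    return " ".join(part for part in parts if part)


def combined_texts(csv_file, selected_cols):
    """Return one combined string per row of an open CSV file."""
    return [combine_text(row, selected_cols) for row in csv.DictReader(csv_file)]
```

For example, with selected_cols set to ["title", "description"], each video row yields one string made of its title followed by its description, which would then be passed to the chosen ER model.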

Using Machine Learning Model

  1. Enter the .\collab_labelling\machine_learning folder
  2. Run jupyter notebook within the folder
  3. Run the notebook cells in the Jupyter Notebook environment

Contact

Your Name - @icherylcode - cheryl.nqj@gmail.com

Project Link: https://github.com/Yoshi275/ism-youtube-scraper

Acknowledgments
