PI202202-alako-data

This repository contains all the files related to the project's data collection, data normalization/cleansing, and database management.

Primary language: Jupyter Notebook

  • 🎨 You can find the front-end repository here
  • 🐳 You can find the back-end repository here

Subjects / topics

This project covers the following college subjects: Web Development, IT Design and Management, and Artificial Intelligence.

Tech Stack

  • Python: Jupyter Notebook (pandas, numpy, nltk, langdetect, sentence-transformers), Beautiful Soup, Selenium WebDriver, regular expressions.
  • Ruby: Selenium WebDriver, regular expressions.
  • JavaScript: Puppeteer, regular expressions.
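
As a rough illustration of the regular-expression-based cleansing mentioned above, here is a minimal sketch using only the Python standard library. It is not taken from the project's notebooks: the record layout and the names `video_id`, `title`, `normalize_title`, and `dedupe_by_id` are hypothetical and chosen only for this example.

```python
import re


def normalize_title(raw: str) -> str:
    """Replace control characters with spaces, collapse whitespace, and trim."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def dedupe_by_id(records):
    """Keep the first record seen for each (hypothetical) video_id key."""
    seen = set()
    unique = []
    for record in records:
        video_id = record["video_id"]
        if video_id in seen:
            continue
        seen.add(video_id)
        # Copy the record so the caller's data is not mutated.
        unique.append({**record, "title": normalize_title(record["title"])})
    return unique


if __name__ == "__main__":
    scraped = [
        {"video_id": "a1", "title": "  Intro\n\tto pandas  "},
        {"video_id": "a1", "title": "Intro to pandas (duplicate)"},
        {"video_id": "b2", "title": "Web  scraping basics"},
    ]
    for row in dedupe_by_id(scraped):
        print(row["video_id"], "|", row["title"])
```

In a notebook the same idea would more likely be expressed with pandas (for example, dropping duplicate rows and applying a cleaning function to a column), but the plain-Python version keeps the sketch free of dependencies.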

Results

  • 12122 unique videos
  • Cleaned/normalized data (see #27 for more details):

Note the vector field: