This repository contains all the files related to the project's data collection, data normalization/cleansing, and database management.
This project covers the following college subjects: Web Development, IT Design and Management, and Artificial Intelligence.
- Python: Jupyter Notebook (pandas, numpy, nltk, langdetect, sentence-transformers), Beautiful Soup, Selenium WebDriver, regular expressions.
- Ruby: Selenium WebDriver, regular expressions.
- JavaScript: Puppeteer (browser automation), regular expressions.
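
As a rough illustration of the kind of regular-expression cleanup listed above, the following is a minimal sketch using only the Python standard library. It is not the repository's actual cleaning code: the function name `normalize_text` and the specific rules (strip HTML tags, decode entities, drop URLs, collapse whitespace) are assumptions chosen for the example.

```python
import html
import re
import unicodedata

# Hypothetical patterns for this sketch; the project's real rules may differ.
TAG_RE = re.compile(r"<[^>]+>")      # HTML/XML tags such as <b> or </p>
URL_RE = re.compile(r"https?://\S+")  # bare http(s) links
WS_RE = re.compile(r"\s+")           # any run of whitespace


def normalize_text(raw: str) -> str:
    """Return a cleaned, single-line version of scraped text."""
    text = TAG_RE.sub(" ", raw)                 # remove markup first
    text = html.unescape(text)                  # decode entities like &amp;
    text = URL_RE.sub(" ", text)                # drop links
    text = unicodedata.normalize("NFKC", text)  # unify Unicode forms
    return WS_RE.sub(" ", text).strip()         # collapse whitespace


if __name__ == "__main__":
    sample = "<b>Hello&nbsp;&amp; welcome</b>\n  https://example.com  "
    print(normalize_text(sample))
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the text is kept as a literal `<` rather than being mistaken for markup.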
- 12122 unique videos
- Cleaned / normalized data (See #27 for more details):
- Vectorized data (See these files for more details):
Note the vector field: