Coursera Data Science Courses: Project Overview

Scraped over 1,800 Coursera courses using Python and Beautiful Soup.
Engineered features from the text of each course description to quantify the emphasis courses put on Python, SQL, ML, DL, R, TensorFlow, and Google Cloud.

WIP:

Resources Used

Python Version: 3.7
Packages: pandas, numpy, json, beautifulsoup, matplotlib, seaborn, flask
Coursera Catalog API: https://build.coursera.org/app-platform/catalog/

For the core data, I used custom queries to pull JSON data from Coursera's Catalog API. The fields included in the API database are:

From there, I kept only the courses with at least one domain listed as "data science".
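
A minimal sketch of this domain filter, run on hand-written sample records rather than real API output. The field names `domainTypes` and `domainId` and the value `"data-science"` are assumptions about how the catalog JSON is shaped, not something this README confirms.

```python
# Hypothetical records shaped like catalog entries; the field names
# ("domainTypes", "domainId") are assumptions made for illustration.
courses = [
    {"name": "Intro to ML",
     "domainTypes": [{"domainId": "data-science", "subdomainId": "machine-learning"}]},
    {"name": "Art History",
     "domainTypes": [{"domainId": "arts-and-humanities", "subdomainId": "history"}]},
    {"name": "Stats for Business",
     "domainTypes": [
         {"domainId": "business", "subdomainId": "business-essentials"},
         {"domainId": "data-science", "subdomainId": "probability-and-statistics"},
     ]},
    {"name": "No domains listed"},
]


def has_domain(course, domain_id="data-science"):
    """Return True if any of the course's domains matches domain_id."""
    domains = course.get("domainTypes") or []
    return any(d.get("domainId") == domain_id for d in domains)


# Keep a course if at least one of its domains is data science.
ds_courses = [c for c in courses if has_domain(c)]
ds_names = [c["name"] for c in ds_courses]
print(ds_names)
```

A course with several domains is kept as long as one of them matches, which mirrors the "at least one domain" rule above.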

I built a web scraper to collect data on over 1,800 data science courses on Coursera. For each course, I got the following:

Merged instructor and partner names into the main dataset
Parsed course domains and subdomains out of a nested column
Converted start date from Unix timestamp to datetime, then computed each course's age in days
Made column for number of instructors
Made columns for the two types of certificates available
Converted primary language to a plain string
Made columns for technical skills mentioned in the course description:
- Python
- machine learning
- deep learning/neural networks
- SQL
- RStudio
- Excel
- TensorFlow
- Google Cloud
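
The skill columns above could be built with simple keyword matching. This is a sketch under stated assumptions: the column names and regular expressions are my own illustrative choices, and the actual matching rules used in the project may differ.

```python
import re

# Hypothetical column names mapped to case-insensitive patterns.
# Word boundaries (\b) keep "excel" from matching "excellent".
SKILL_PATTERNS = {
    "python": r"\bpython\b",
    "machine_learning": r"\bmachine learning\b",
    "deep_learning": r"\bdeep learning\b|\bneural networks?\b",
    "sql": r"\bsql\b",
    "r_studio": r"\br\s?studio\b",
    "excel": r"\bexcel\b",
    "tensorflow": r"\btensorflow\b",
    "google_cloud": r"\bgoogle cloud\b",
}


def skill_flags(description):
    """Return a dict of 0/1 flags, one per skill, for a course description."""
    text = description or ""  # treat a missing description as empty text
    return {
        skill: int(bool(re.search(pattern, text, flags=re.IGNORECASE)))
        for skill, pattern in SKILL_PATTERNS.items()
    }


flags = skill_flags("Learn Python and SQL, then build neural networks with TensorFlow.")
print(flags)
```

Each flag only records whether a skill is mentioned at all. Counting matches instead would give a rougher measure of how much emphasis a description puts on a skill.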

The output file is courses_DS_cleaned.csv.
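
The start-date step in the cleaning list can be sketched as follows. I assume here that the timestamp is in milliseconds since the Unix epoch; if the data stores seconds, drop the division by 1000. The function name and the fixed reference date are illustrative.

```python
from datetime import datetime, timezone


def course_age_days(start_ms, as_of):
    """Convert a millisecond Unix timestamp to a course age in whole days."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return (as_of - start).days


# Fixed reference date so the result is reproducible; in practice this
# would be the date the data was collected.
as_of = datetime(2021, 1, 1, tzinfo=timezone.utc)
start_ms = 1577836800000  # 2020-01-01T00:00:00Z
age = course_age_days(start_ms, as_of)
print(age)
```

Using timezone-aware datetimes for both values avoids off-by-one-day errors caused by the local timezone of the machine running the script.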

I examined correlations and distributions for the numerical variables, then value counts for the categorical variables. (WIP)
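
A small pandas sketch of this kind of exploration. The data frame is a made-up stand-in for courses_DS_cleaned.csv, and the column names (`age_days`, `num_instructors`, `python`, `primary_language`) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; values and column names are hypothetical.
df = pd.DataFrame({
    "age_days": [100, 400, 250, 900],
    "num_instructors": [1, 2, 1, 3],
    "python": [1, 0, 1, 1],
    "primary_language": ["en", "en", "es", "en"],
})

numeric_cols = ["age_days", "num_instructors", "python"]

# Pairwise Pearson correlations between the numerical variables.
corr = df[numeric_cols].corr()

# Summary statistics (count, mean, std, quartiles) for each numerical variable.
summary = df[numeric_cols].describe()

# Frequency of each category in a categorical variable.
lang_counts = df["primary_language"].value_counts()

print(corr)
print(summary)
print(lang_counts)
```

The same calls should work on the real file after loading it with `pd.read_csv`, once the column names are swapped for the actual ones.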

I structured this project following Ken Jee's tutorial on his DS salary estimator project: https://github.com/PlayingNumbers/ds_salary_proj
The idea of scraping Coursera webpages stems from this Kaggle dataset: https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset. It has fewer fields, but includes professional certificates and specializations in addition to courses.
- I referenced a few Kaggle notebooks associated with the above dataset: https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset/code