University project for the course "Introduzione alla Data Science" (Computer Science, University of Genoa)
- Perform comparative analysis between Netflix and Disney+ platforms
- Apply full data science pipeline from data collection to insights
- Hands-on practice of data science concepts and techniques
- Disney and Netflix title datasets
- Metadata such as title, type, date added, and genre
- Enrichment data such as IMDb and TMDB scores
- Four integrated datasets provide a multifaceted view
- Importing CSV datasets into dataframes
- Combining titles, enrichment and country datasets
- Handling missing values and duplicate rows
- Transforming the data into a form suitable for analysis
- Extracting year/month from dates
- Determining number of genres
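
The preparation steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the two small dataframes stand in for the CSV files, and the column names (`title`, `date_added`, `genres`, `imdb_score`) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the title and enrichment datasets.
# In the project these would come from pd.read_csv(...) on the CSV files.
titles = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "date_added": ["2021-01-15", "2020-11-02", "2020-11-02", None],
    "genres": ["Drama, Comedy", "Action", "Action", "Documentary, Family, Music"],
})
scores = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb_score": [7.1, 6.4, 8.0],
})

# Drop exact duplicate rows, then rows that have no date at all.
clean = titles.drop_duplicates().dropna(subset=["date_added"]).copy()

# Parse the date strings and extract year and month as separate columns.
clean["date_added"] = pd.to_datetime(clean["date_added"])
clean["year_added"] = clean["date_added"].dt.year
clean["month_added"] = clean["date_added"].dt.month

# Count genres by splitting the comma-separated genre string.
clean["n_genres"] = clean["genres"].str.split(",").str.len()

# Combine titles with the enrichment scores; a left join keeps every title.
merged = clean.merge(scores, on="title", how="left")
print(merged)
```

Dropping rows with a missing date is only one possible policy; imputing or keeping them with a flag may suit other analyses better.
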
- Statistical summaries of key variables
- Analysis across dimensions such as certification and content type
- Comparative analysis across the platforms
- Hypothesis testing for distribution differences
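
The list above does not say which statistical test was applied. As one possible illustration, the sketch below runs a two-sided permutation test on the difference in mean scores between the two platforms, using only NumPy. The function name `permutation_test` and the score values are made up for the example.

```python
import numpy as np

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Reassign the pooled values to the two groups at random.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Hypothetical scores, invented purely for illustration.
netflix_scores = [6.1, 6.8, 7.2, 5.9, 6.5, 7.0, 6.3, 6.7]
disney_scores = [7.4, 7.9, 8.1, 7.6, 7.2, 8.0, 7.7, 7.5]

p_value = permutation_test(netflix_scores, disney_scores)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the two score distributions differ in their means. A permutation test makes no normality assumption, which is convenient for skewed variables such as popularity.
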
- Constructed OLAP cube with dimensions:
  - Month
  - Content type
  - Production country
- Slicing, filtering and aggregation capabilities
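
One lightweight way to approximate such a cube in pandas is a grouped count indexed by the three dimensions. The sketch below is an assumption about how this could look, not the project's implementation; the `catalog` dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical catalog with one row per title.
catalog = pd.DataFrame({
    "month": [1, 1, 2, 2, 2, 3],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
    "country": ["US", "US", "IT", "US", "IT", "IT"],
})

# The "cube": number of titles in each (month, type, country) cell.
cube = catalog.groupby(["month", "type", "country"]).size().rename("n_titles")

# Slice: fix one dimension to a single value (movies only).
movies = cube.xs("Movie", level="type")

# Filter: keep a subset of values on one dimension (February and March).
feb_mar = cube[cube.index.get_level_values("month").isin([2, 3])]

# Aggregate: sum over month and type to get totals per country.
by_country = cube.groupby(level="country").sum()

print(by_country)
```

Storing the cube as a Series with a `MultiIndex` keeps only the non-empty cells, which matters when many month/country combinations have no titles.
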
- Developed classification model using Logistic Regression
- Predicts the content type (movie or TV show)
- Features: popularity, ratings, metadata
- 85% accuracy on test data
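
A classifier of this kind can be sketched with scikit-learn as shown below. The data here is synthetic and the three features (`runtime`, `score`, `popularity`) are assumptions chosen for illustration, so the accuracy it prints says nothing about the project's reported result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic labels: 1 = TV show, 0 = movie.
is_show = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical): shows get shorter runtimes than movies.
runtime = np.where(is_show == 1, rng.normal(40, 10, n), rng.normal(100, 20, n))
score = rng.normal(6.5, 1.0, n)
popularity = rng.exponential(10.0, n)

X = np.column_stack([runtime, score, popularity])
y = is_show

# Hold out 25% of the rows for testing, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline ensures it is fitted on the training split only, so no information from the test rows leaks into the model.
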
- Graphics for distributions, comparisons and trends
- Dashboards for slicing OLAP cube on multiple axes
- Python
- Jupyter Notebook
- Pandas
- NumPy
- scikit-learn
- Matplotlib
- Seaborn
- Importing, cleaning and transforming medium-size datasets
- Identifying optimal data formats and structures
- Applying multivariate analysis techniques
- Training and evaluating classification models
- Using visualizations to extract insights