/datascience001

This project analyzes user behavior on VIDIO.COM and provides actionable insights through data analysis and machine learning. The machine learning component compares the performance of two models, Random Forest and XGBoost, at predicting playback duration.

Primary Language: Jupyter Notebook

Data Analysis and Machine Learning Model Comparison for VIDIO.COM Analytics

Project Description

This project analyzes VIDIO.COM usage data to understand user behavior, platform preferences, and content consumption. It trains two machine learning models, Random Forest and XGBoost, to predict playback duration and compares how well each performs. The analysis highlights how user preferences differ by platform and content type, and aims to provide actionable insights that can help improve user experience and marketing strategies.

File Structure

  • notebook.ipynb: Jupyter notebook containing all the code for data analysis, model training, and evaluation.
  • Link to DataSet.txt: A file containing the link to the dataset used in the analysis.
  • Insight and Story of Data.pdf: PDF document summarizing key insights, data stories, and machine learning model comparisons generated from the analysis.

Usage Guide

  • Clone the repository: Clone this repository to your local machine with the command: git clone https://github.com/fadhiljr7/datascience001.git
  • Install dependencies: Ensure you have Python and Jupyter installed. You can install the required libraries by running: pip install -r requirements.txt
  • Open Notebook: Open notebook.ipynb in Jupyter Notebook or Google Colab to see the entire analysis process and visualization results.
  • View Insights: Use the "Insight and Story of Data.pdf" file to get a visual summary of the analysis and key insights.
  • Use the dataset: Follow the link in "Link to DataSet.txt" to download the dataset (referred to here as data.csv), which can be used to perform further analysis or verify results against the raw data.

Technologies Used

  • Python: Programming language used for data analysis and model training.
  • Pandas and NumPy: Libraries for data manipulation and analysis.
  • Matplotlib and Seaborn: Tools for data visualization.
  • Scikit-learn: Library used for training the Random Forest model.
  • XGBoost: Library used for the XGBoost model.
  • Google Colab: Cloud-based platform for running the Jupyter notebook.
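
As a rough illustration of how the two models might be compared on equal footing, the sketch below fits each model on the same training split and reports train and test error. The helper names `rmse` and `compare_models`, and the choice of RMSE as the metric, are assumptions made for this example; they are not taken from notebook.ipynb.

```python
import math
from typing import Any, Dict, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(float(t) - float(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def compare_models(
    models: Dict[str, Any],
    X_train: Sequence[Sequence[float]],
    y_train: Sequence[float],
    X_test: Sequence[Sequence[float]],
    y_test: Sequence[float],
) -> Dict[str, Dict[str, float]]:
    """Fit every model on the same split and collect train/test RMSE.

    `models` maps a display name to any object exposing fit(X, y) and
    predict(X), which is the interface shared by scikit-learn and
    XGBoost regressors. (Hypothetical helper, for illustration only.)
    """
    results: Dict[str, Dict[str, float]] = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            # Error on data the model has already seen.
            "train_rmse": rmse(y_train, model.predict(X_train)),
            # Error on held-out data, the figure to compare across models.
            "test_rmse": rmse(y_test, model.predict(X_test)),
        }
    return results
```

Any regressor with `fit` and `predict` can be passed in, for example `{"Random Forest": RandomForestRegressor(), "XGBoost": XGBRegressor()}` using `sklearn.ensemble.RandomForestRegressor` and `xgboost.XGBRegressor`. Evaluating both models on an identical held-out split keeps the comparison fair.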

Conclusion

The analysis shows that users on mobile platforms tend to engage more with the content, and embedded video plays provide a better experience for users with limited internet connections. Machine learning model comparison reveals that Random Forest outperforms XGBoost in predicting playback duration, with XGBoost showing signs of overfitting. These insights can guide strategies for improving video quality and platform optimization, especially for mobile users.
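
One common way to check for overfitting is to compare a model's error on the training data with its error on held-out data: a large relative gap suggests the model has memorized the training set. The sketch below is a minimal illustration of that idea. The function names and the 0.5 threshold are hypothetical choices for this example, not values or methods taken from the notebook.

```python
def overfitting_gap(train_error: float, test_error: float) -> float:
    """Relative increase of test error over train error.

    A value of 0.5 means the test error is 50% higher than the train error.
    """
    if train_error <= 0:
        # A perfect training fit: any positive test error is an unbounded gap.
        return float("inf") if test_error > 0 else 0.0
    return (test_error - train_error) / train_error


def looks_overfit(train_error: float, test_error: float, threshold: float = 0.5) -> bool:
    """Flag a model whose relative train/test gap exceeds `threshold`.

    The default threshold is an arbitrary, illustrative cut-off.
    """
    return overfitting_gap(train_error, test_error) > threshold
```

For instance, a model with a training RMSE of 10 and a test RMSE of 30 would be flagged, while one with 10 and 11 would not. In practice the threshold depends on the dataset and metric, so it is best treated as a starting point for inspection and not a fixed rule.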