/SI699-Tutorial-Presentation

Tutorial project based on the paper using emoji to predict dropout rate

Primary LanguageRoff

SI699 Tutorial Project

Slide deck

1. Dataset

Original dataset from the repo of the paper

Filename Description
actor_emojis Emoji count in 264808 posts
emoji_duplicate_mapping Map the duplicate emojis with different utf-8 code
2018_post_type_emojis Emoji count in 6 kinds of posts
2018_posttype_posts Count of posts in 6 kinds of posts
2018_posttype_emojiposts Count of posts with emojis in 6 kinds of posts
2018_repolang_posts Count of posts in 324 languages
2018_repolang_emojiposts Count of posts with emojis in 324 languages
2018_repolang_repos Count of repos in 324 languages
2018_repolang_emojis Count of repos with emojis in 324 languages
2018_emoji_and_non_emoji_user_features 529616 user info with 64 features
active_(non_)emoji_users two groups of 67664 users with emoji and dropout info
lang_emoji_dropout dropout distribution among language

2. Code Go-through

2.1 Environment Setup

  • Python version should be higher than 3.10.0

    pip install -r requirement.txt

2.2 Descriptive Analysis

  1. Popularity Analysis

    • There are 2699 emojis and the most popular 20 are 🚀 ✅ ⬆️ ☁️ 👍 ⬇️ 🎉 🌴 🎫 ✔️ ⚠️ ❌ ⌨️ 🚨 😄 ♻️ 📺 💛 ❤️ 🔥

    • The popularity among 6 post types

      Post type Emojis Post with emoji count Post with emoji count Percentage
      Issues_comments 👍 ⌨️ 😄 🎉 🚀 💪 😉 🚨 📺 ✨ 9436402 0.033
      Issues 🚨 📺 ❌ ⚠️ ✔️ 🌴 👋 ✅ ⌨️ 👽 18485203 0.034
      Pull_reivew_comments 👍 😄 🤔 😉 ⚠️ 😅 🙂 😆 ✅ 💯 8366087 0.029
      Pull ☁️ 🎫 🚀 🌴 🎉 🚦 ♻️ 📅 🔕 🎟️ 802005 0.041
      Pull_comments ⬆️ 🚀 ✅ ⬇️ 👍 🎉 ✔️ ❌ ⚠️ 💛 11898784 0.14
      Commit comments ✅ 📜 👍 🔥 💯 😄 🦍 🎉 🔴 ®️ 13863740 0.06
  2. Emoji usage by programming languages

    • The top 20 programming languages with emoji count and entropy are

      Programming Language Emoji Count Emoji Entropy
      JavaScript 1854 3.98
      Python 1408 3.41
      Java 1096 3.38
      C++ 1250 3.68
      Go 891 3.38
      HTML 1593 3.73
      PHP 1190 4.17
      Ruby 1018 2.31
      TypeScript 1001 3.38
      C# 800 3.46
      C 717 3.69
      CSS 1154 4.14
      Shell 778 4.34
      Rust 606 3.57
      Scala 635 2.76
      Swift 741 3.55
      Objective-C 523 3.47
      PowerShell 341 1.09
      Jupyter Notebook 388 3.41
      Kotlin 459 3.6

2.3 Feature Analysis

  1. Feature selection

    • Use log and percentage method to create 64 features of users
    • Use VIF method to remove correlated features, and only leaves 45 features in total
  2. Regression Analysis

  3. Label Analysis

    • Why use dropout rate as feature

2.4 Prediction

  1. Model Setting