Comparison of machine learning methods based on the prediction of the popularity of open source projects on the GitHub platform
The theoretical part describes the basics of machine learning and selected supervised learning algorithms: decision trees, random forests and gradient boosting. It also describes the training set obtained from the GHTorrent website.
In the practical part, the full workflow of a machine learning project is carried out on the example of predicting the popularity of projects on the GitHub platform: data mining, preparation of the training data, training of the selected models and analysis of the obtained results. The machine learning models and their performance in the studied case are also compared.
Implementation was done using the Python programming language and its popular libraries, mainly scikit-learn and pandas.
keywords: machine learning, supervised learning, predictions, Python, pandas, scikit-learn, GitHub, GHTorrent
tool | version | tool | version |
---|---|---|---|
Python | 3.7.6 | Conda | 4.9.2 |
IPython | 7.12.0 | Pandas | 1.1.1 |
Scikit-learn | 0.21.3 | Matplotlib | 3.1.1 |
SQLAlchemy | 1.3.13 | Imbalanced-learn | 0.7.0 |
The predicted variable is the number of new stars in the given month:
- for regression, predicting a specific value
- for classification, predicting one of the predefined classes:
class | the number of new stars in the given month |
---|---|
0 | 0 |
1 | [1; 20) |
2 | [20; 50) |
3 | [50; 100) |
4 | 100+ |
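
The class boundaries in the table above could be applied to raw star counts roughly as follows. This is a minimal sketch, not the project's actual preprocessing code: the function name `stars_to_class`, the constants and the example values are assumptions made for illustration, and star counts are assumed to be non-negative integers.

```python
import numpy as np
import pandas as pd

# Left-closed bin edges matching the class table:
# 0, [1; 20), [20; 50), [50; 100), 100+
BIN_EDGES = [0, 1, 20, 50, 100, np.inf]
CLASS_LABELS = [0, 1, 2, 3, 4]


def stars_to_class(new_stars: pd.Series) -> pd.Series:
    """Map monthly new-star counts to the class ids from the table.

    Hypothetical helper; assumes non-negative counts with no missing values.
    """
    # right=False makes every interval closed on the left and open on the right
    binned = pd.cut(new_stars, bins=BIN_EDGES, labels=CLASS_LABELS, right=False)
    return binned.astype(int)


# Made-up example values that sit on and around the class boundaries
example = pd.Series([0, 1, 19, 20, 49, 50, 99, 100, 2500])
classes = stars_to_class(example)
print(classes.tolist())
```

Using `right=False` keeps the lower bound of each interval inclusive, which matches the `[a; b)` notation used in the table.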
REGRESSION | CLASSIFICATION |
---|---|
DecisionTreeRegressor | DecisionTreeClassifier |
RandomForestRegressor | RandomForestClassifier |
GradientBoostingRegressor | GradientBoostingClassifier |
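
As a rough illustration of how the six scikit-learn estimators listed above can be trained and compared side by side, the sketch below fits each of them on synthetic data. The data, the quartile-based class target, the hyperparameters and the scoring choice (the default `score` method: R² for regressors, accuracy for classifiers) are all assumptions for the example and do not reproduce the project's GHTorrent features or results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic stand-in data (assumption): 300 samples, 8 numeric features
X, y_reg = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
# Hypothetical class target: quartile bins of the regression target (classes 0..3)
y_clf = np.digitize(y_reg, np.quantile(y_reg, [0.25, 0.5, 0.75]))

X_train, X_test, yr_train, yr_test, yc_train, yc_test = train_test_split(
    X, y_reg, y_clf, random_state=0
)

regressors = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}
classifiers = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in regressors.items():
    model.fit(X_train, yr_train)
    scores[name] = model.score(X_test, yr_test)  # R^2 on the held-out split
for name, model in classifiers.items():
    model.fit(X_train, yc_train)
    scores[name] = model.score(X_test, yc_test)  # accuracy on the held-out split

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Keeping the regressors and classifiers in separate dictionaries mirrors the two columns of the table and makes it easy to swap in a different metric for either task.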
For more details, please contact me by e-mail.