hafezasg/self-training-scikit-learn-PySpark

Semi-supervised learning is a class of machine learning techniques applied to data with a small amount of labeled and a large amount of unlabeled samples. Here, the self-training method is applied to label unlabeled data efficiently. The effects of different parameters (probability threshold, unlabeled data ratio) are investigated against the cost of human labeling. The model is developed for both sequential (Python) and distributed (PySpark) systems. Finally, the accuracy of the developed model is compared with the label propagation model from the scikit-learn package.
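The self-training loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper name `self_train`, the choice of `LogisticRegression` as the base classifier, the synthetic dataset, and the 10% labeled fraction are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Hypothetical helper: iteratively pseudo-label confident unlabeled samples."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    # Assumed base classifier; any estimator with predict_proba would do.
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        # Keep only predictions whose top class probability reaches the threshold.
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        # Move the confident samples into the labeled pool and refit.
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unlab = X_unlab[~confident]
        model.fit(X_lab, y_lab)
    # Samples still unlabeled would be the ones left for human labeling.
    return model, len(X_unlab)


# Synthetic data standing in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Keep labels for 10% of the training set; treat the rest as unlabeled.
X_lab, X_unlab, y_lab, _ = train_test_split(
    X_train, y_train, train_size=0.1, stratify=y_train, random_state=0)

model, n_left = self_train(X_lab, y_lab, X_unlab, threshold=0.9)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}, samples left unlabeled: {n_left}")
```

Raising `threshold` makes pseudo-labels more reliable but leaves more samples for human labeling, while lowering it does the opposite; this is the trade-off the description refers to.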

Jupyter Notebook
