/stata-pylearn

Supervised learning algorithms in Stata

Primary LanguageStataMIT LicenseMIT

pylearn

Overview | Features | Prerequisites | Installation | Usage | To-Do | License

Supervised learning in Stata with scikit-learn

version 0.70 21jul2020

Overview

pylearn is a set of Stata modules that allows Stata users to implement many popular supervised learning algorithms - decision trees, random forests, adaptive boosting, gradient boosting, and multi-layer perceptrons (neural networks) - directly from Stata. In particular, pylearn makes use of Stata 16.0's Python integration and the popular Python library scikit-learn to interface between Stata and Python behind the scenes.

Features

Pylearn consists of five Stata functions implementing popular supervised ML algorithms:

Stata Function Name Description Related scikit-learn classes
pytree Decision trees DecisionTreeClassifier
DecisionTreeRegressor
pyforest Random forests RandomForestClassifier
RandomForestRegressor
pymlp Neural networks (multi-layer perceptrons) MLPClassifier
MLPRegressor
pyadaboost Adaptive Boosting (AdaBoost) AdaBoostClassifier
AdaBoostRegressor
pygradboost Gradient Boosting GradientBoostingClassifier
GradientBoostingRegressor

Each of these programs contains detailed internal documentation. For instance, to view the internal documentation for pyforest, type the following Stata command:

help pyforest

Prequisites

pylearn requires Stata 16+, since it relies on the Python integration introduced in Stata 16.0.

pylearn also requires Python 3.6+, scikit-learn, and pandas. If you do not have these Python libraries installed, pylearn will try to install them automatically - see the installation section below.

Installation

Installing pylearn is very easy.

  1. First, install the Stata code and documentation. You can run the following Stata command to install everything directly from this GitHub repository:
net install pylearn, from(https://raw.githubusercontent.com/mdroste/stata-pylearn/master/src/) replace
  1. Install Python, if you haven't already. Make sure that you have the required Python prerequisites installed by running the included Stata program:
pylearn, setup

If Stata cannot find your Python installation, refer to the installation guide.

To upgrade to the latest version of pylearn, simply run the following:

pylearn, upgrade

Usage

Using pylearn is simple, since the syntax for each component looks very similar to other Stata modeling commands.

Here is a quick example of a random forest regression with pylearn's 'pyforest' command:

* Load auto dataset
sysuse auto, clear

* Estimate random forest regression model, predicting price using mpg, trunk, and weight
* Train only on cars with foreign==1 and test on foreign==0
pyforest price mpg trunk weight, type(regress) training(foreign)
predict price_predicted

* We can also use Stata-like 'if' conditions to specify the training sample, but we won't get out-of-sample RMSE in that case
pyforest price mpg trunk weight if foreign==1, type(regress) 
predict price_predicted_2

* Pyforest chooses defaults automatically, but you can use any hyperparameter scikit-learn can
pyforest price mpg trunk weight, type(regress) training(foreign) max_depth(5) min_samples_split(4)

Detailed documentation and usage examples are provided with each Stata file. For instance, see:

help pyforest

Todo

The following items will be addressed soon:

  • Weights: Add support for weights
  • Post-estimation: return feature importance (where applicable)
  • Model selection: cross-validation
  • Exception handling: more elegant exception handling in Python

License

pylearn is MIT-licensed.