cs-433-project-2-lpmc_dcm

Regularized maximum likelihood estimation for discrete choice models on the LPMC dataset

Machine Learning: Project 2

This is the README for Project 2 of the Machine Learning course (CS-433), carried out in collaboration with the EPFL Transport and Mobility Laboratory.

The dataset used is the London Passenger Mode Choice (LPMC) revealed-preference data. A description of the features can be found in the reference below.

Full details of the framework, dataset, and the models it was used to develop are given in Hillel et al. (2018): https://doi.org/10.1680/jsmic.17.00018.

Pylogit package:

The code for this project is based on the Python pylogit package by Timothy Brathwaite.

This package is designed for "performing maximum likelihood estimation of conditional logit models and similar discrete choice models".

Contributions to the initial code:

As part of this project, we implemented Ridge and LASSO regularization methods, as well as Box-Cox transformations.

  • Regularization methods:

    • added file reg.py: L1() and L2() methods.

    • modified files:

      • choice_calcs.py: calc_probabilities() [lines 144-189], calc_log_likelihood() [lines 349-364] and calc_gradient() [lines 472-615], where the bracketed ranges indicate the lines modified in each method. With the penalized log-likelihood $\mathcal{L}_{reg}(\beta) = \mathcal{L}(\beta) - \lambda_R \lVert\beta\rVert_2^2 - \lambda_L \lVert\beta\rVert_1$, the gradient picks up the extra terms $-2\lambda_R\beta$ and $-\lambda_L\,\mathrm{sign}(\beta)$ (see the sketch after this list).

      • conditional_logit.py: fit_mle() now takes the regularization hyperparameters $\lambda_R$ (ridge) and $\lambda_L$ (lasso) as arguments, which are then passed on to every method that needs them.

  • Box Cox transform:

    • uses Python's scipy.special.boxcox()

    • modified files:

      • choice_calcs.py: calc_probabilities(), calc_log_likelihood() and calc_gradient(): see comment lines.

      • estimation.py: estimate() now specifies boundaries for the Box-Cox parameters, $\lambda_{cox} \geq 0$. These constraints are passed to scipy.optimize.minimize(), and the minimization method is changed from 'BFGS' to 'L-BFGS-B', since BFGS does not support bound constraints (see the sketch after this list).
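
A minimal sketch of what the penalty terms in reg.py and their gradient contributions could look like; the function names L1() and L2() come from the repository, but the exact signatures and the gradient helpers below are assumptions:

```python
import numpy as np

def L2(beta, lambda_ridge):
    """Ridge penalty subtracted from the log-likelihood: lambda_R * ||beta||_2^2."""
    return lambda_ridge * np.sum(beta ** 2)

def L2_gradient(beta, lambda_ridge):
    """Gradient of the ridge penalty with respect to beta: 2 * lambda_R * beta."""
    return 2.0 * lambda_ridge * beta

def L1(beta, lambda_lasso):
    """LASSO penalty subtracted from the log-likelihood: lambda_L * ||beta||_1."""
    return lambda_lasso * np.sum(np.abs(beta))

def L1_gradient(beta, lambda_lasso):
    """Subgradient of the LASSO penalty with respect to beta: lambda_L * sign(beta)."""
    return lambda_lasso * np.sign(beta)
```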
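Similarly, a sketch of how the $\lambda_{cox} \geq 0$ bounds can be passed to scipy.optimize.minimize() together with the 'L-BFGS-B' method; the objective below is a toy stand-in for the actual log-likelihood:

```python
import numpy as np
from scipy.special import boxcox
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 5.0, 10.0])  # toy positive data

def objective(params):
    """Toy stand-in for the objective: in the real code, the Box-Cox
    parameters enter the utility functions inside the log-likelihood."""
    lam_cox = params[0]
    transformed = boxcox(x, lam_cox)  # (x**lam - 1) / lam, log(x) as lam -> 0
    return np.sum((transformed - 1.0) ** 2)

# One (lower, upper) pair per parameter; None means unbounded.
bounds = [(0.0, None)]  # enforces lambda_cox >= 0

result = minimize(objective, x0=np.array([0.5]), method='L-BFGS-B', bounds=bounds)
print(result.x)
```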

How to use

The main.py file reproduces all the results presented in the report. The main steps, each illustrated by a sketch at the end of this section, are:

  • Create long format data: generate_long_data.py

This file converts the original LPMC data to long format and performs segmentation with respect to age, season, and travel purpose. A boolean 'train' (True by default) specifies whether the training set or the test set is generated (the conversion is sketched below).

  • Compute the parameter estimates for a given model (specification):

A choice model is specified by creating an ordered dictionary (collections.OrderedDict) through the create_specification.py file; this dictionary determines the functional form of the utility functions. Then, conditional_logit.fit_mle() is called to perform the maximum likelihood estimation; it returns the final log-likelihood $\mathcal{L}(\hat{\beta})$ as well as the parameter estimates $\hat{\beta}$ (sketched below).

  • Perform the grid search over regularization hyperparameters:

We perform a grid search over $\lambda_R$ and $\lambda_L$ (sketched below) and store the following results, to be used for the plots and for the evaluation-of-regularization step:

  • The number of parameters $\beta$ pushed towards 0; more precisely, the number of estimates below 1e-6, between 1e-6 and 1e-4, between 1e-4 and 1e-2, and above 1e-2.

  • The array of estimated (regularized) parameters.

  • Evaluate the regularization results and compare the hyperparameter combinations from the grid search: evaluation of regularization

Here, we compare the efficiency of the regularization hyperparameters and plot the log-likelihood as a function of the number of added parameters. This is made possible by passing an index list to fit_mle() through the argument indd (a plotting helper is sketched below).
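
The sketches below illustrate the steps above. First, the wide-to-long conversion: pylogit ships a convert_wide_to_long() helper for this purpose. The column names, alternative ids, and file name below are hypothetical, not the actual ones used in generate_long_data.py:

```python
import pandas as pd
import pylogit as pl

wide = pd.read_csv("lpmc_train.csv")      # hypothetical file name
wide["obs_id"] = range(1, len(wide) + 1)  # one id per observation

# LPMC has no explicit availability columns; assume every mode is
# available to every traveller for this illustration.
for col in ["av_walk", "av_cycle", "av_pt", "av_drive"]:
    wide[col] = 1

# Individual-specific variables (constant across alternatives).
ind_vars = ["age", "female"]

# Alternative-specific variables: new long-format column ->
# {alternative id: wide-format column holding its value}.
# Mode ids 1-4 stand for walk, cycle, public transport, drive.
alt_varying = {"travel_time": {1: "dur_walking",
                               2: "dur_cycling",
                               3: "dur_pt_total",
                               4: "dur_driving"}}

availability = {1: "av_walk", 2: "av_cycle", 3: "av_pt", 4: "av_drive"}

long_df = pl.convert_wide_to_long(wide,
                                  ind_vars,
                                  alt_varying,
                                  availability,
                                  obs_id_col="obs_id",
                                  choice_col="travel_mode",
                                  new_alt_id_name="mode_id")
```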
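Second, the model specification and estimation: a minimal sketch of the standard pylogit pattern, with illustrative utility variables; the repository's modified fit_mle() additionally accepts the $\lambda_R$ and $\lambda_L$ arguments described above:

```python
from collections import OrderedDict

import numpy as np
import pylogit as pl

spec = OrderedDict()
names = OrderedDict()

# Alternative-specific constants for every mode except the reference one.
spec["intercept"] = [2, 3, 4]
names["intercept"] = ["ASC cycle", "ASC pt", "ASC drive"]

# A single generic travel-time coefficient shared by all four modes
# (the list-of-lists syntax groups alternatives sharing a coefficient).
spec["travel_time"] = [[1, 2, 3, 4]]
names["travel_time"] = ["travel time"]

model = pl.create_choice_model(data=long_df,  # long-format data from the previous sketch
                               alt_id_col="mode_id",
                               obs_id_col="obs_id",
                               choice_col="travel_mode",
                               specification=spec,
                               model_type="MNL",
                               names=names)

model.fit_mle(np.zeros(4))    # one initial value per estimated coefficient
print(model.log_likelihood)   # final log-likelihood
print(model.params)           # parameter estimates
```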
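Third, the grid search: a sketch of the loop over $(\lambda_R, \lambda_L)$ pairs recording the quantities listed above. The lambda_R/lambda_L keyword names and the model_factory callable (which rebuilds a fresh model as in the previous sketch) are assumptions:

```python
import numpy as np

def grid_search(model_factory, init_vals, ridge_grid, lasso_grid):
    """Fit one model per (lambda_R, lambda_L) pair and collect diagnostics."""
    results = {}
    for lam_r in ridge_grid:
        for lam_l in lasso_grid:
            model = model_factory()
            model.fit_mle(init_vals, lambda_R=lam_r, lambda_L=lam_l)
            abs_beta = np.abs(np.asarray(model.params))
            results[(lam_r, lam_l)] = {
                "log_likelihood": model.log_likelihood,
                "estimates": np.asarray(model.params),
                # Counts of estimates pushed towards 0, in the report's bins.
                "below_1e-6": int(np.sum(abs_beta < 1e-6)),
                "1e-6_to_1e-4": int(np.sum((abs_beta >= 1e-6) & (abs_beta < 1e-4))),
                "1e-4_to_1e-2": int(np.sum((abs_beta >= 1e-4) & (abs_beta < 1e-2))),
                "above_1e-2": int(np.sum(abs_beta >= 1e-2)),
            }
    return results

# Example: a 5x5 logarithmic grid.
# results = grid_search(make_model, np.zeros(4),
#                       np.logspace(-3, 1, 5), np.logspace(-3, 1, 5))
```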
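Finally, a hypothetical plotting helper for the evaluation step; the input format (one list of final log-likelihoods per hyperparameter pair, obtained by growing the indd index list one parameter at a time) is an assumption:

```python
import matplotlib.pyplot as plt

def plot_loglik_vs_model_size(log_liks_by_combo):
    """Plot the final log-likelihood against the number of added parameters,
    one curve per (lambda_R, lambda_L) combination."""
    for (lam_r, lam_l), log_liks in log_liks_by_combo.items():
        plt.plot(range(1, len(log_liks) + 1), log_liks, marker="o",
                 label=rf"$\lambda_R$ = {lam_r}, $\lambda_L$ = {lam_l}")
    plt.xlabel("number of added parameters")
    plt.ylabel(r"final log-likelihood $\mathcal{L}(\hat{\beta})$")
    plt.legend()
    plt.show()
```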