Generalized Linear Models (GLM)

In this repository I delve into three different types of regression.

📖 About

This is a collection of end-to-end regression problems. Topics are introduced theoretically in the README.md and tested practically in the notebooks linked below.

First, I tested the theory on toy simulations. I made four different simulations in Python, taking advantage of the sklearn and statsmodels libraries.

After that, I moved on to real-world data, developing three end-to-end projects:

Linear Regression - Human brain weights
Logistic Regression - HR dataset
Poisson Regression - Smoking and lung cancer

Further details can be found in the 'Practical Examples' section below in this README.md.

Note. A good dataset resource for linear/logistic/Poisson regression, multinomial responses, and survival data.
Note. To further explore feature selection: source 1, source 2, source 3, source 4, source 5.

📚 Theoretical Overview

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression: it allows the linear predictor to be related to the response variable through a link function. In a GLM, the outcome $\boldsymbol{Y}$ (dependent variable) is assumed to be generated from a distribution in the exponential family (e.g. Normal, Binomial, Poisson, Gamma). The mean $\boldsymbol{\mu}$ of the distribution depends on the independent variables $\boldsymbol{X}$ through the relation:

$$\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}] = \boldsymbol{\mu} = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$$

where $\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}]$ is the expected value of $\boldsymbol{Y}$ conditional on $\boldsymbol{X}$, $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ is the linear predictor and $g(\cdot)$ is the link function. The unknown parameters $\boldsymbol{\beta}$ are typically estimated by maximum likelihood, commonly computed with iteratively reweighted least squares (IRLS).
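
To make the IRLS idea concrete, here is a minimal NumPy sketch for one specific case: a Poisson outcome with log link, so that $\mu = \exp(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$. The function name `irls_poisson`, the simulated data, and the stopping tolerance are illustrative assumptions and are not taken from the notebooks.

```python
import numpy as np


def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Sketch of IRLS for a Poisson GLM with log link (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                 # linear predictor X beta
        mu = np.exp(eta)               # mean via the inverse link
        z = eta + (y - mu) / mu        # working response
        w = mu                         # working weights for Poisson + log link
        # Weighted least-squares step: solve (X' W X) beta = X' W z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Simulated data (assumed coefficients, for illustration only)
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = irls_poisson(X, y)
print("estimated beta:", beta_hat)
```

Each iteration is an ordinary weighted least-squares fit on a linearized response, which is why the same loop carries over to other families once the working response and weights are changed.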

🟥 For the sake of clarity, from now on we consider the case of the scalar outcome, $Y$.

Every GLM consists of three elements:

a distribution (from the exponential family) for modeling $Y$
a linear predictor $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$
a link function $g(\cdot)$ such that $\mathbb{E}[Y|\boldsymbol{X}] = \mu = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$

The following are the most commonly used examples.

| Distribution | Support | Typical uses | $\mathbb{E}[Y \mid \boldsymbol{X}]$ | Link function $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = g(\mu)$ | Link name | Mean function |
| --- | --- | --- | --- | --- | --- | --- |
| Normal $(\mu,\sigma^2)$ | $(-\infty, \infty)$ | Linear-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu$ | Identity | $\mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ |
| Gamma $(\mu,\nu)$ | $(0,\infty)$ | Exponential-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = -\mu^{-1}$ | Negative inverse | $\mu = -(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1}$ |
| Inverse-Gaussian $(\mu,\sigma^2)$ | $(0, \infty)$ | | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu^{-2}$ | Inverse squared | $\mu = (\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1/2}$ |
| Poisson $(\mu)$ | $\{0, 1, 2, \dots\}$ | Count of occurrences in a fixed amount of time/space | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\mu)$ | Log | $\mu = \exp(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$ |
| Bernoulli $(\mu)$ | $\{0, 1\}$ | Outcome of single yes/no occurrence | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
| Binomial $(n, \mu)$ | $\{0, 1, \dots, n\}$ | Count of yes/no in $n$ occurrences | $n\hspace{1pt}\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
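
As a quick sanity check on the link and mean functions listed above, the following NumPy sketch writes each pair as plain functions and verifies that the mean function undoes the link. The dictionary `links` and the test values of $\mu$ are illustrative assumptions; the values are chosen in $(0, 1)$ so that every link, including the logit, is defined.

```python
import numpy as np

# (link g, mean function g^{-1}) pairs, written out from the formulas in the table
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "negative inverse": (lambda mu: -1.0 / mu, lambda eta: -1.0 / eta),
    "inverse squared": (lambda mu: mu ** -2.0, lambda eta: eta ** -0.5),
    "log": (np.log, np.exp),
    "logit": (
        lambda mu: np.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + np.exp(-eta)),
    ),
}

mu = np.array([0.1, 0.25, 0.5, 0.9])

round_trip_ok = {}
for name, (g, g_inv) in links.items():
    eta = g(mu)                 # value of the linear predictor implied by mu
    recovered = g_inv(eta)      # mean function applied to the linear predictor
    round_trip_ok[name] = bool(np.allclose(recovered, mu))
    print(f"{name:>16}: eta = {np.round(eta, 3)}, round trip ok = {round_trip_ok[name]}")
```

The round trip only confirms that each pair is mutually inverse on these values; it says nothing about which link is the better modeling choice for a given dataset.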

📂 Practical Examples

As already mentioned, let $Y$ be the outcome (dependent variable) and $\mathbf{X}$ be the independent variables. The three types of regression I analyzed (Linear, Logistic and Poisson) differ in the nature of $Y$. For each type, I collected an ad-hoc dataset to experiment with.

📑 Linear Regression

In the case of linear regression $Y$ is a real number and it is modeled as:

$$\begin{cases} \hspace{4pt} Y\sim N(\mu,\sigma^2)\\ \hspace{4pt} \mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for linear regression, I analyzed a dataset of human brain weights.

📑 Logistic Regression

In the case of logistic regression, $Y$ is a binary value ($0$ or $1$) and it is modeled as:

$$\begin{cases} \hspace{4pt} Y \sim \mathrm{Bernoulli}(\mu)\\ \hspace{4pt} \log\left(\frac{\mu}{1-\mu}\right) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for logistic regression, I analyzed an HR dataset.

For Advanced Classification techniques with Scikit-Learn check out Breast Cancer: End-to-End Machine Learning Project.

📑 Poisson Regression

In the case of Poisson regression, $Y$ is a non-negative integer (count) and it is modeled as:

$$\begin{cases} \hspace{4pt} Y \sim \mathrm{Poisson}(\mu)\\ \hspace{4pt}\log(\mu) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for Poisson regression, I analyzed a dataset of smoking and lung cancer.

⚖️ Python `sklearn` vs. `statsmodels`

Which library should be used? In general, scikit-learn is designed for machine learning, while statsmodels is made for rigorous statistics. Both libraries have their uses. Before selecting one over the other, it is best to consider the purpose of the model. A model designed for prediction is best fit using scikit-learn, while statsmodels is best employed for explanatory models. To completely disregard one for the other would do a great disservice to an excellent Python library.

To summarize some key differences:

OLS efficiency: scikit-learn is faster at linear regression; the difference is more apparent for larger datasets
Logistic regression efficiency: employing only a single core, statsmodels is faster at logistic regression
Visualization: statsmodels provides a summary table
Solvers/methods: in general, statsmodels provides a greater variety
Logistic regression regularization: scikit-learn applies an L2 penalty by default, while statsmodels does not regularize
Additional linear models: scikit-learn provides more models for regularization, while statsmodels helps correct for broken OLS assumptions

leontavares/GLMs