Finally, a natural way to explore Elastic Net coefficients. Amai mijne rekker (Flemish, roughly "Wow, my rubber band").
Elastic Net regression is a fantastic technique for assessing the influence of a plethora of predictors on a specific response variable. By looking at the coefficients for each predictor in an Elastic Net regression (or its siblings Ridge and Lasso regression), we can investigate the "pull" of a particular predictor on the modelled phenomenon.
Because we tend to use many predictors at once with Elastic Net-like regression, it becomes a challenge to investigate the results. With 500 predictors, for example, it is very difficult to fit all 500 data points on a normal-sized plot. To combat this issue, I wrote Rekker, an interactive visualisation tool for Elastic Net coefficients.
Features:
- display all coefficients in a scrollable plot
- colour-coded data points
- zero coefficient baseline
- tooltip with exact values
- compare coefficients against other values
- compare coefficients against distributional semantics
- draw distributional clusters
Rekker supports loading files from your local hard disk. Unless you want to make changes to the code, you can simply use the live demo and load your dataset.
To host Rekker yourself, follow these steps:
- Clone this repository
- Serve the repository's directory using an HTTP server, for example by using the built-in Python webserver:
> python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
See the Docker instructions further down in this README if you wish to run Rekker in Docker.
The demo dataset is coefficients.csv in the repository's directory. A Rekker dataset should have the following structure at minimum:
coefficient,feature
-2.7109007348534733,I
-2.5738031394173366,Monday
-1.970461593962809,not
-1.864356619676603,like
This means your dataset should have a numeric "coefficient" column, and a character "feature" column.
The coefficient values will be rendered in a dot plot.
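As a minimal sketch of producing such a file, the snippet below writes feature-coefficient pairs in the two-column layout described above, using only the Python standard library. The function name `write_coefficients` is hypothetical and not part of Rekker; how you obtain the coefficients from your model is left open.

```python
import csv
import io


def write_coefficients(pairs, handle):
    """Write (feature, coefficient) pairs as a minimal Rekker dataset."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["coefficient", "feature"])
    for feature, coefficient in pairs:
        # float() raises early if a coefficient is not numeric
        writer.writerow([float(coefficient), str(feature)])


# Values copied from the example dataset above
pairs = [("I", -2.7109007348534733), ("Monday", -2.5738031394173366)]
buffer = io.StringIO()
write_coefficients(pairs, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("coefficients.csv", "w", newline="")` as the handle.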
If your regression model contains features which represent simple binary or numeric values (e.g. for multifactorial control), prefix them with "_" so they show up in the "Regression set-up" pane. You can toggle these additional features to see what effect they have on the other features in your model.
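If your control features are not yet prefixed, a small helper along these lines could do it before you export the dataset. This is only a sketch: `mark_controls` and the `is_weekend` feature are hypothetical names used for illustration.

```python
def mark_controls(rows, control_features):
    """Return copies of the rows with control feature names prefixed by "_"."""
    controls = set(control_features)
    marked = []
    for row in rows:
        name = row["feature"]
        # Leave names alone that already carry the prefix
        if name in controls and not name.startswith("_"):
            name = "_" + name
        marked.append({**row, "feature": name})
    return marked


rows = [
    {"coefficient": -1.970461593962809, "feature": "not"},
    {"coefficient": 0.42, "feature": "is_weekend"},  # hypothetical binary control
]
marked = mark_controls(rows, ["is_weekend"])
print([row["feature"] for row in marked])  # ['not', '_is_weekend']
```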
You can extend Rekker's functionality by adding extra information for each feature-coefficient pairing. If you wish to offset each coefficient against a specific numeric variable, simply add an extra numeric column. The following example shows the above example, complemented with a numeric "frequency" column.
coefficient,feature,frequency
-2.7109007348534733,I,946
-2.5738031394173366,Monday,248
-1.970461593962809,not,42
-1.864356619676603,like,147
You can add multiple numeric columns and switch between them in the Rekker interface. The columns will show up under "Coefficients + external data".
These numeric columns can also be used for colour coding by selecting the "Numeric" option under "Colour coding" and choosing a column.
The coefficient values and external data will be rendered in a scatter plot.
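One way to attach such a column to an existing dataset is sketched below, assuming you have the external values in a dictionary keyed by feature name. The helper `add_numeric_column` is hypothetical; the column name "frequency" and its values come from the example above.

```python
import csv
import io


def add_numeric_column(csv_text, column, values_by_feature):
    """Append a numeric column to a Rekker dataset given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        value = values_by_feature[row["feature"]]  # KeyError if a feature is missing
        float(value)  # raises if the value is not numeric
        row[column] = value
        writer.writerow(row)
    return out.getvalue()


original = "coefficient,feature\n-2.7109007348534733,I\n-2.5738031394173366,Monday\n"
print(add_numeric_column(original, "frequency", {"I": 946, "Monday": 248}))
```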
You can also define bidimensional numeric information for use with Rekker. To include this kind of information, use the "column.x" and "column.y" naming scheme:
coefficient,feature,mds.x,mds.y
-2.7109007348534733,I,0.1,0.3
-2.5738031394173366,Monday,0.8,-0.4
-1.970461593962809,not,0,0
-1.864356619676603,like,-0.5,0.3
Corresponding column pairs will be combined automatically in the Rekker interface. Because two dimensions now need to be visualised, the coefficient axis is lost. You can still consult the coefficient values by hovering over a data point, or by setting "positive/negative" as the group coding. You can add multiple bidimensional numeric columns and switch between them in the Rekker interface. The columns will show up under "External data only".
The external data will be rendered in a scatter plot.
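To check before loading which coordinate systems your header defines, you could pair up the ".x" and ".y" columns yourself. The sketch below illustrates the naming scheme only; it is not Rekker's own detection code, and `tsne.x` is a hypothetical unpaired column.

```python
def coordinate_systems(columns):
    """Return the base names that have both a ".x" and a ".y" column."""
    xs = {column[:-2] for column in columns if column.endswith(".x")}
    ys = {column[:-2] for column in columns if column.endswith(".y")}
    return sorted(xs & ys)


header = ["coefficient", "feature", "mds.x", "mds.y", "tsne.x"]
print(coordinate_systems(header))  # ['mds']
```

Here `tsne.x` has no matching `tsne.y`, so only `mds` counts as a complete pair.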
You can also add clustering information to this bidimensional plot. To add clustering information, specify for each data point to which cluster it belongs. The clustering column should follow this naming convention:
- prefixed with cluster.
- contains the name of the coordinate system it is based on
In the example below, our clustering is based on the mds.x and mds.y columns, which is why mds is included in the clustering column name. If you do not add this name, the clustering column will not show up in the interface.
coefficient,feature,mds.x,mds.y,cluster.mds.kmeans
-2.7109007348534733,I,0.1,0.3,cluster 1
-2.5738031394173366,Monday,0.8,-0.4,cluster 2
-1.970461593962809,not,0,0,cluster 2
-1.864356619676603,like,-0.5,0.3,cluster 1
If formatted correctly, the available clusterings will be recognised automatically in the Rekker interface. The clusters will be drawn as polygons under your bidimensional data points.
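A quick pre-flight check of the naming convention might look like the sketch below. It assumes the coordinate system name appears as one of the dot-separated parts after the "cluster." prefix; how Rekker matches names internally may differ, and `clustering_columns` is a hypothetical helper.

```python
def clustering_columns(columns):
    """Map each coordinate system to the clustering columns that name it."""
    xs = {column[:-2] for column in columns if column.endswith(".x")}
    ys = {column[:-2] for column in columns if column.endswith(".y")}
    result = {system: [] for system in sorted(xs & ys)}
    for column in columns:
        if not column.startswith("cluster."):
            continue
        # Assumption: the system name is one dot-separated part of the column name
        parts = column.split(".")[1:]
        for system in result:
            if system in parts:
                result[system].append(column)
    return result


header = ["coefficient", "feature", "mds.x", "mds.y", "cluster.mds.kmeans", "cluster.kmeans"]
print(clustering_columns(header))  # {'mds': ['cluster.mds.kmeans']}
```

The column `cluster.kmeans` is not matched because it does not name a coordinate system.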
You can also extend Rekker's functionality by adding group information for each feature-coefficient pairing: for each feature, you can specify which group it belongs to. The following example shows the above example, complemented with a "part of speech" column:
coefficient,feature,frequency,pos
-2.7109007348534733,I,946,pronoun
-2.5738031394173366,Monday,248,noun
-1.970461593962809,not,42,adverb
-1.864356619676603,like,147,verb
Each unique group will receive its own colour coding. You can add multiple group columns and switch between them in the Rekker interface. The columns will show up under "Colour coding" > "Categorical".
You can also load an additional dataset into Rekker which contains a model's metadata. The metadata file is based on triples of subject, predicate and object:
subject,predicate,object
"model","alpha",0.4
"model","loss",0.709548657049239
"model","intercept",-0.554293130725607
"model","dev.ratio",0.464360010202276
"model","nzero",998
"model","lambda",0.00217932833240788
All model data should have "model" as its subject value. Then, you can define what exact metadata about the model that line contains in predicate. object contains the actual value, as is clear from the above example.
If you add a line with "model" as a subject and "intercept" as a predicate, Rekker will allow you to display your model's intercept value and adjust for it in the interface. Other values are purely descriptive and appear in the model info pane.
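A metadata file in this triple layout can be generated with a few lines of standard-library Python. The sketch below is one possible way to do it; `write_model_metadata` is a hypothetical helper, and the values are copied from the example above.

```python
import csv
import io


def write_model_metadata(metadata, handle):
    """Write model metadata as subject,predicate,object triples."""
    handle.write("subject,predicate,object\n")
    # QUOTE_NONNUMERIC quotes strings and leaves numbers bare, as in the example
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for predicate, value in metadata.items():
        writer.writerow(["model", predicate, value])


buffer = io.StringIO()
write_model_metadata(
    {"alpha": 0.4, "intercept": -0.554293130725607, "nzero": 998},
    buffer,
)
print(buffer.getvalue())
```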
You can display GAM (generalised additive model) predictions as a heatmap behind the scatter plot. This shows you a spatial overview of semantic areas in distributional semantics. The dataset structure for a GAM heatmap is simple:
mds.x,mds.y,mds
0.25,-0.13,0.6
Your entire dataset should be composed of individual coordinates with predictions for that specific location in distributional space. Heatmaps are linked to bidimensional columns in your coefficients dataset, so the column names for these coordinates should be the same as the dimension-reduced distributional coordinates of your features (mds.x and mds.y in the example). The column holding the predicted value for each coordinate is named after the coordinate system itself (mds in the example).
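The sketch below shows how such a file could be laid out from a grid of coordinates. The `predict` argument is a placeholder for whatever produces your GAM predictions (fitting the GAM itself is out of scope here), and `write_heatmap` is a hypothetical helper name.

```python
import csv
import io


def write_heatmap(predict, xs, ys, handle, system="mds"):
    """Write one row per grid coordinate, with the prediction in the last column."""
    writer = csv.writer(handle, lineterminator="\n")
    # The value column is named after the coordinate system itself
    writer.writerow([f"{system}.x", f"{system}.y", system])
    for x in xs:
        for y in ys:
            writer.writerow([x, y, predict(x, y)])


def constant_prediction(x, y):
    # Stand-in for real GAM predictions; replace with your own model output
    return 0.5


buffer = io.StringIO()
write_heatmap(constant_prediction, [0.0, 0.5], [0.0, 0.5], buffer)
print(buffer.getvalue())
```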
You can define your own colour palette for use in Rekker. Create a JSON file with a list of at least five hex colours:
[
"#96E637",
"#DE193E",
"#394E62",
"#FFE4E1",
"#FFFFFF"
]
The order is as follows:
- Used for negative coefficients
- Used for positive coefficients
- Used for removed coefficients
- Used for filtered coefficients
- Used as the transition colour in the heatmap
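Before loading a palette, you may want to check it against the requirements above. The following sketch assumes six-digit hex colours, as in the example; whether Rekker accepts other notations is not covered here, and `palette_json` is a hypothetical helper.

```python
import json
import re

# Assumption: colours use the six-digit "#RRGGBB" form shown in the example
HEX_COLOUR = re.compile(r"#[0-9A-Fa-f]{6}")


def palette_json(colours):
    """Validate a colour list and return it as a JSON string."""
    if len(colours) < 5:
        raise ValueError("a palette needs at least five colours")
    for colour in colours:
        if not HEX_COLOUR.fullmatch(colour):
            raise ValueError(f"not a six-digit hex colour: {colour!r}")
    return json.dumps(colours, indent=2)


# Order: negative, positive, removed, filtered, heatmap transition
palette = ["#96E637", "#DE193E", "#394E62", "#FFE4E1", "#FFFFFF"]
print(palette_json(palette))
```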
You can run Rekker in a Docker container. This will spin up an nginx webserver on your machine. There are two ways to do this:
- Development version: if you wish to make live changes to Rekker, use the development compose file. You can change the source code while the container is running:
docker compose up
Rekker will be available at 127.0.0.1:8080.
- Production version: if you want to host Rekker in a production environment, use the production compose file. The source code will be frozen inside the container, so live code changes are not supported.
docker compose -f "docker-compose-prod.yml" up
Rekker will be available at 127.0.0.1:80.
You can automatically load different datasets by specifying their locations in the URL. This is especially useful if you want to share a direct link to a specific constellation of files.
Supply the locations of the different datasets as GET query parameters:
coefficients=data.csv
meta=model_info.csv
palette=colours.json
heatmap=gam.csv
e.g. my-rekker-instance.local/?coefficients=data.csv&meta=model_info.csv&palette=colours.json&heatmap=gam.csv
This assumes these datasets are stored in your Rekker directory. You can also load external datasets, as long as the server they're stored on permits cross-origin requests (CORS).
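If you generate such links programmatically, the query string can be assembled with the standard library, which also takes care of percent-encoding external dataset URLs. This is a sketch only: `rekker_url` is a hypothetical helper, and the parameter names are the four listed above.

```python
from urllib.parse import urlencode

PARAMETERS = ("coefficients", "meta", "palette", "heatmap")


def rekker_url(base, **locations):
    """Build a shareable Rekker link from dataset locations."""
    unknown = set(locations) - set(PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Keep the documented parameter order regardless of keyword order
    query = urlencode([(name, locations[name]) for name in PARAMETERS if name in locations])
    return base.rstrip("/") + "/?" + query


print(rekker_url("http://my-rekker-instance.local", coefficients="data.csv", meta="model_info.csv"))
# http://my-rekker-instance.local/?coefficients=data.csv&meta=model_info.csv
```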
- dynamic resizing
- more interactivity