Utility to generate plots with categorical variables using Altair.

Altair-catplot

A utility to use Altair to generate box plots, jitter plots, and ECDFs, i.e. plots with a categorical variable where a data transformation not covered in Altair is required.

Motivation

Altair is a Python interface for Vega-Lite. The resulting plots are easily displayed in JupyterLab and/or exported. The grammar of Vega-Lite, which is largely present in Altair, is well-defined, well-documented, and clear. This is one of many strong features of Altair and Vega-Lite.

There is always a trade-off when using high level plotting libraries. You can rapidly make plots, but they are less configurable. The developers of Altair have (wisely, in my opinion) adhered to the grammar of Vega-Lite. If Vega-Lite does not have a feature, Altair does not try to add it.

The developers of Vega-Lite have plans to add more functionality. Indeed, in the soon-to-be-released (as of August 23, 2018) Vega-Lite 3.0, box plots are included. Adding a jitter transform is also planned. It would be useful to be able to conveniently make jitter and box plots with the current features of Vega-Lite and Altair. I wrote Altair-catplot to fill in this gap until the functionality is implemented in Vega-Lite and Altair.

The box plots and jitter plots I have in mind apply to the case where one axis is quantitative and the other axis is nominal or ordinal (that is, categorical). So, we are making plots with one categorical variable and one quantitative. Hence the name, Altair-catplot.

Installation

You can install altair-catplot using pip. You will need to have a recent version of Altair and all of its dependencies installed.

pip install altair_catplot

Usage

I will import Altair-catplot as altcat, and while I'm at it will import the other modules we need.

import numpy as np
import pandas as pd

import altair as alt
import altair_catplot as altcat

Every plot is made using the altcat.catplot() function. It has the following call signature.

catplot(data=None,
        height=Undefined,
        width=Undefined, 
        mark=Undefined,
        encoding=Undefined,
        transform=None,
        sort=Undefined,
        jitter_width=0.2,
        box_mark=Undefined,
        whisker_mark=Undefined,
        box_overlay=False,
        **kwargs)

The data, mark, encoding, and transform arguments must all be provided. The data, mark, and encoding fields are as for alt.Chart(). Note that these are specified as constructor attributes, not as you would using Altair's more idiomatic methods like mark_point(), encode(), etc.

In this package, I consider a box plot, jitter plot, or ECDF to be transforms of the data, as they are constructed by performing some aggregation or transformation of the data. The exception is the box plot, since in Vega-Lite 3.0+'s specification, boxplot is a mark.

The utility is best shown by example, so below I present several.

Sample data

To demonstrate usage, I will first create a data frame with sample data for plotting.

np.random.seed(4288233)

data = {'data ' + str(i): np.random.normal(*musig, size=50) 
            for i, musig in enumerate(zip([0, 1, 2, 3], [1, 1, 2, 3]))}

df = pd.DataFrame(data=data).melt()
df['dummy metadata'] = np.random.choice(['poodle', 'beagle', 'collie', 'dalmation', 'terrier'],
                                        size=len(df))

df.head()
  variable     value dummy metadata
0   data 0  1.980946         collie
1   data 0 -0.442286      dalmation
2   data 0  1.093249        terrier
3   data 0 -0.233622         collie
4   data 0 -0.799315      dalmation

The categorical variable is 'variable' and the quantitative variable is 'value'.
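The melt() call above is what turns the wide data frame (one column per data set) into the long format used here, with one row per measurement. A minimal sketch on a tiny, made-up frame shows the reshaping; the column names 'a' and 'b' and the values are purely illustrative.

```python
import pandas as pd

# Hypothetical wide-format frame: one column per category.
wide = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# melt() stacks the columns into (variable, value) pairs,
# giving one row per measurement.
tidy = wide.melt()

print(tidy)
```

The default column names produced by melt() are 'variable' and 'value', which is why those names appear in the encodings below.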

Box plot

We can create a box plot as follows. Note that the mark is a string specifying a box plot (as it will be in future versions of Altair), and the encoding is specified as a dictionary of key-value pairs.

altcat.catplot(df,
               mark='boxplot',
               encoding=dict(x='value:Q',
                             y=alt.Y('variable:N', title=None),
                             color=alt.Color('variable:N', legend=None)))

Once Vega-Lite 3.0 is formally released, future versions of Altair will be able to generate this box plot as follows.

alt.Chart(df
    ).mark_boxplot(
    ).encode(
        x='value:Q',
        y=alt.Y('variable:N', title=None),
        color=alt.Color('variable:N', legend=None)
    )

The resulting plot looks different from what I have shown here, using instead the Vega-Lite defaults. Specifically, the whiskers are black and do not have caps, and the boxes are thinner. You can check it out here.
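For reference, the aggregation behind a box plot can be sketched with NumPy and Pandas. This is only an illustration, not the code used by Altair-catplot: the helper box_summary is hypothetical, and it assumes Tukey's convention in which whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box edges.

```python
import numpy as np
import pandas as pd


def box_summary(values, whisker_factor=1.5):
    """Quartiles and whisker ends for one group of measurements.

    Assumption: Tukey-style whiskers, reaching the most extreme points
    within whisker_factor * IQR of the first and third quartiles.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    inside = values[(values >= q1 - whisker_factor * iqr)
                    & (values <= q3 + whisker_factor * iqr)]
    return pd.Series({'lower_whisker': inside.min(),
                      'q1': q1,
                      'median': median,
                      'q3': q3,
                      'upper_whisker': inside.max()})


# Small made-up data set; group 'b' contains one outlier (50).
example = pd.DataFrame({'variable': ['a'] * 5 + ['b'] * 5,
                        'value': [1, 2, 3, 4, 5, 1, 2, 3, 4, 50]})

# One row of summary statistics per category.
summary = pd.DataFrame({name: box_summary(group)
                        for name, group in example.groupby('variable')['value']}).T
print(summary)
```

In group 'b', the value 50 lies outside the whisker range, so under this convention it would be drawn as an individual outlier point rather than extending the whisker.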

Because box plots are unique in that they are specified with a mark and not a transform, we could use the mark argument above to specify a box plot. We could equivalently do it with the transform argument. (Note that this will not be possible when box plots are implemented in Altair.)

box = altcat.catplot(df,
                     encoding=dict(y=alt.Y('variable:N', title=None),
                                   x='value:Q',
                                   color=alt.Color('variable:N', legend=None)),
                     transform='box')
box

type(box)
altair.vegalite.v2.api.LayerChart

We can independently specify properties of the box and whisker marks using the box_mark and whisker_mark kwargs. For example, say we wanted our colors to be Betancourt red.

altcat.catplot(df,
               mark=dict(type='point', color='#7C0000'),
               box_mark=dict(color='#7C0000'),
               whisker_mark=dict(strokeWidth=2, color='#7C0000'),
               encoding=dict(x='value:Q',
                             y=alt.Y('variable:N', title=None)),
               transform='box')

Jitter plot

I try my best to subscribe to the "plot all of your data" philosophy. To that end, a strip plot is a useful way to show all of the measurements. Here is one way to make a strip plot in Altair.

alt.Chart(df
    ).mark_tick(
    ).encode(
        x='value:Q',
        y=alt.Y('variable:N', title=None),
        color=alt.Color('variable:N', legend=None)
    )

The problem with strip plots is that they can have trouble with overlapping data points. A common approach to deal with this is to "jitter," or place the glyphs with small random displacements along the categorical axis. This involves using a jitter transform. While the current release candidate for Vega-Lite 3.0 has box plot capabilities, it does not have a jitter transform, though that will likely be coming in the future (see here and here). A proper transform, in which data points are offset but the categorical axis truly has nominal or ordinal values, is desired but not currently possible. The jitter plot here is a hack wherein the axes are quantitative and the tick labels are actually carefully placed text. This means that the "axis labels" will be wrecked if you try interactivity with the jitter plot. Nonetheless, tooltips still work.
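The idea behind the hack can be sketched without Altair. The helper below is hypothetical and is not the package's implementation: it assumes each category is assigned an integer slot on a quantitative axis and that each point is displaced uniformly within plus or minus jitter_width of its slot. How the package actually interprets jitter_width is not stated here, so treat that as an assumption.

```python
import numpy as np
import pandas as pd


def add_jitter_position(df, cat, jitter_width=0.2, seed=None):
    """Add a quantitative 'position' column for a jitter plot.

    Hypothetical helper: categories map to integer slots (in order of
    first appearance) and points are offset uniformly at random within
    +/- jitter_width of their slot.
    """
    rng = np.random.RandomState(seed)
    slot = {c: i for i, c in enumerate(pd.unique(df[cat]))}
    out = df.copy()
    offsets = rng.uniform(-jitter_width, jitter_width, size=len(out))
    out['position'] = out[cat].map(slot) + offsets
    return out, slot


example = pd.DataFrame({'variable': ['a', 'a', 'b', 'b'],
                        'value': [0.1, 0.2, 1.5, 1.7]})
jittered, slots = add_jitter_position(example, 'variable', seed=0)
print(jittered)
```

Because 'position' is quantitative, the category names have to be drawn separately as text at the integer slots, which is why the "axis labels" do not survive panning or zooming.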

jitter = altcat.catplot(df,
                        height=250,
                        width=450,
                        mark='point',
                        encoding=dict(y=alt.Y('variable:N', title=None),
                                      x='value:Q',
                                      color=alt.Color('variable:N', legend=None),
                                      tooltip=alt.Tooltip(['dummy metadata:N'], title='breed')),
                        transform='jitter')
jitter

Alternatively, we could color the jitter points with the dummy metadata.

altcat.catplot(df,
               height=250,
               width=450,
               mark='point',
               encoding=dict(y=alt.Y('variable:N', title=None),
                             x='value:Q',
                             color=alt.Color('dummy metadata:N', title='breed')),
               transform='jitter')

Jitter-box plots

Even while plotting all of the data, we sometimes want to graphically display summary statistics. We could (in Vega-Lite 3.0) make a strip-box plot, in which we have a strip plot overlaid on a box plot. In the future, you can generate this using Altair as follows.

strip = alt.Chart(df
    ).mark_point(
        opacity=0.3
    ).encode(
        x='value:Q',
        y=alt.Y('variable:N', title=None),
        color=alt.Color('variable:N', legend=None)
    )

box = alt.Chart(df
    ).mark_boxplot(
        color='lightgray'
    ).encode(
        x='value:Q',
        y=alt.Y('variable:N', title=None)
    )

box + strip

The result may be viewed here.

The strip-box plots have the same issue as strip plots and could stand to have a little jitter. Jitter-box plots consist of a jitter plot overlaid with a box plot. Why not just make a box plot and a jitter plot and then compose them using Altair's nifty composition capabilities, as I did in the plot I just described? We cannot do that because box plots have a truly categorical axis, but jitter plots have a hacked "categorical" axis that is really quantitative, so we can't overlay. We can try. The result is not pretty.

box + jitter

Instead, we use 'jitterbox' for our transform. The default color for the boxes and whiskers is light gray.

altcat.catplot(df,
               height=250,
               width=450,
               mark='point',
               encoding=dict(y=alt.Y('variable:N', title=None),
                             x='value:Q',
                             color=alt.Color('variable:N', legend=None)),
               transform='jitterbox')

Note that the mark kwarg applies to the jitter plot. If we want to make specifications about the boxes and whiskers we need to separately specify them using the box_mark and whisker_mark kwargs as we did with box plots. Note that if the box_mark and whisker_mark are specified and their color is not explicitly included in the specification, their color matches the specification for the jitter plot.

altcat.catplot(df,
               height=250,
               width=450,
               mark='point',
               box_mark=dict(strokeWidth=2, opacity=0.5),
               whisker_mark=dict(strokeWidth=2, opacity=0.5),
               encoding=dict(y=alt.Y('variable:N', title=None),
                             x='value:Q',
                             color=alt.Color('variable:N', legend=None)),
               transform='jitterbox')

ECDFs

An empirical cumulative distribution function, or ECDF, is a convenient way to visualize a univariate probability distribution. Consider a measurement x in a set of measurements X. The ECDF evaluated at x is defined as

ECDF(x) = fraction of data points in X that are ≤ x.
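The definition translates directly into a few lines of NumPy. This is a minimal sketch of the definition itself, with a hypothetical function name and made-up measurements, not the code used inside the package.

```python
import numpy as np


def ecdf(x, data):
    """Fraction of points in data that are <= x."""
    data = np.asarray(data)
    return np.mean(data <= x)


measurements = [3.0, 1.0, 2.0, 2.0]

# Three of the four measurements are <= 2.0.
print(ecdf(2.0, measurements))  # 0.75
```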

To generate ECDFs colored by category, we use the 'ecdf' transform.

altcat.catplot(df,
               mark='line',
               encoding=dict(x='value:Q',
                             color='variable:N'),
               transform='ecdf')

Note that here we have chosen to represent the ECDF as a line, which is a more formal way of plotting the ECDF. We could, without loss of information, plot the "corners of the steps", which represent the actual measurements that were made. We do this by specifying the mark as 'point'.

altcat.catplot(df,
               mark='point',
               encoding=dict(x='value:Q',
                             color='variable:N'),
               transform='ecdf')

This kind of plot can be easily made directly using Pandas and Altair by adding a column to the data frame containing the y-values of the ECDF.

df['ECDF'] = df.groupby('variable')['value'].transform(lambda x: x.rank(method='first') / len(x))

alt.Chart(df
    ).mark_point(
    ).encode(
        x='value:Q',
        y='ECDF:Q',
        color='variable:N'
    )

This, however, is not possible when making a formal line plot of the ECDF.
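One way to see why: a staircase needs two vertices per measurement (the bottom and the top of each step), so it cannot be stored as a single extra column alongside the original rows. The sketch below builds those vertices with NumPy. The helper ecdf_staircase is hypothetical and only illustrates the geometry; it is not necessarily how the 'ecdf' transform is implemented.

```python
import numpy as np


def ecdf_staircase(data):
    """Return x and y vertices tracing the ECDF as a step function."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    # Each measurement contributes two vertices at the same x:
    # the bottom of its step and the top of its step.
    x_steps = np.repeat(x, 2)
    y_steps = np.concatenate(([0.0], np.repeat(y, 2)[:-1]))
    return x_steps, y_steps


x_steps, y_steps = ecdf_staircase([2.0, 1.0, 3.0])
print(list(zip(x_steps, y_steps)))
```

For n measurements this yields 2n vertices, so the staircase has to live in its own data frame rather than in a new column of df.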

An added advantage of plotting the ECDF as dots, which represent individual measurements, is that we can color the points. We may instead wish to show the ECDF over all measurements and color the dots by the categorical variable. We do that using the colored_ecdf transform.

altcat.catplot(df,
               mark='point',
               encoding=dict(x='value:Q',
                             color='variable:N'),
               transform='colored_ecdf')

ECCDFs

We may also make a complementary empirical cumulative distribution, an ECCDF. This is defined as

ECCDF(x) = 1 - ECDF(x).
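Since the ECDF is the fraction of points less than or equal to x, the ECCDF is equivalently the fraction of points strictly greater than x. A minimal sketch of that definition, with a hypothetical function name and made-up measurements:

```python
import numpy as np


def eccdf(x, data):
    """Fraction of points in data that are strictly greater than x."""
    data = np.asarray(data)
    return np.mean(data > x)


measurements = [3.0, 1.0, 2.0, 2.0]

# Only one of the four measurements exceeds 2.0.
print(eccdf(2.0, measurements))  # 0.25
```

Note that the ECCDF is exactly zero at the largest measurement, and zero has no finite logarithm.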

These are often useful when looking for power-law-like behavior, in which case you want the ECCDF axis to have a logarithmic scale.

altcat.catplot(df,
               mark='point',
               encoding=dict(x='value:Q',
                             y=alt.Y('ECCDF:Q', scale=alt.Scale(type='log')),
                             color='variable:N'),
               transform='eccdf')
