pysal/splot

Package Structure and API design

slumnitz opened this issue · 9 comments

Ideas and decisions regarding the design of package structure and API for splot

Overall package idea

  1. support PySAL with lightweight plotting functionality:
    • .plot methods for objects called from splot and
    • functionality found under splot.sub_package namespace depending on PySAL object that is plotted
  2. Based on matplotlib and geopandas
  3. splot.mapping offers tools for choropleth, color and mapping support

Decisions (indicating implementation status):

  • integration of splot.plot.py into splot._viz_mpl.py
  • integration of giddy visualisations in splot
  • update and integrate functionality from mapping.py into splot._viz_mpl.py
  • .mpl functions return fig, ax
  • API Structure of splot (See API - Package Structure comment):
    • hierarchy with submodules named after statistics visualised (splot.giddy), common _viz_utils.py for utility functions used by all subpackages
    • integrating splot functionality as .plot methods in sub packages (see #10 for supported functionality)
  • degree of customisation user can access: high degree of customisation through keyword dictionaries
  • dependency on:
    • Matplotlib only
    • Geopandas
    • Seaborn
  • documentation in accordance with the general PySAL [submodule contract](http://pysal.org/getting_started.html#submodule-contract)

To be decided:

Open questions/ ideas:

Rejected options:

  • API Structure of plot (See API - Package Structure comment):
    • hierarchy with submodules, one per plotting package (now splot.mpl and splot.bk), choose function names according to the statistics they help to visualise (e.g. moran_scatterplot(), local_autocorrelation_statistics()...)
  • API Structure of plot (See API - Package Structure comment):
    • same implementations of visualisations in each (e.g. .mpl) submodule:
      * e.g. mplot, plot_choropleth, plot_local_autocorrelation,... for bk and mpl
      * except if one plotting package doesn't have the needed features (e.g. if interactivity is essential, matplotlib may not be able to create such a plot)
  • keywords common to all functions in the same order:
    • e.g. ...(moran_loc, df, attribute, p, region_column, plot_width, plot_height,...)
    • for splot.mpl only: ax
    • for splot.bk only: reverse_colors
  • plot(method="interactive") option, more flexibility by integrating submodules instead of function calls

API - Package Structure

OPTION 1:

from splot.mpl import moran_scatterplot
from splot.bk import moran_scatterplot

Pro:

  • User can choose to install either bokeh, or mpl, or both
  • functions are found by function name; it is clear where to look, and there is no confusion about
    e.g. why lisa_cluster() is in esda and not in giddy, even if it is also used in giddy visualisations
  • Easy to add other visualization packages like Altair (in the fast-paced visualization space),
    e.g. from splot.altair import choropleth

Con:

  • Challenge in choosing function names so the user can easily see which visualisations to use for which subpackage
  • Question of consistency of parameters used in functions, e.g. different parameters used for visualisations in different subpackages.

OPTION 2:

from splot.giddy import directional_lisa
from splot.esda import directional_lisa
from splot.utility import ...

With options:

splot.backend()

if backend.lower() == 'bokeh': 
    from splot.whatever import plot_bokeh_stuff
    plot_bokeh_stuff
elif backend.lower() == 'matplotlib':
    from splot.whatever import plot_matplotlib_stuff
    plot_matplotlib_stuff
moran_scatterplot(backend='bk')
  • function used in multiple subpackages assigned to multiple subpackages

Pro:

  • Visualisation functions representing a subpackage (e.g. giddy) are called with similar or the same parameters; therefore a namespace like splot.esda (mpl version) would be more internally consistent than splot.mpl (containing all functions)

Con:

  • Most statistical analyses seem to use more than one Python subpackage,
    resulting in a need to import multiple visualisation subpackages;
    question whether it is necessary to subdivide?

Resolved with additional options (see above)

  • How to design access to interactive or static functionality?
    • return to moran_scatterplot(backend='bk')?
    • installation of both bokeh, mpl (and more) needed
  • Challenging to decide which function is found under which namespace, if visualisations are used for multiple purposes
  • If only one PySAL subpackage is used, it is clear where to find its visualisations

OPTION 3:

from splot.giddy.bk import ...
from splot.esda.mpl import ...
from splot.utility import e.g. mapclassify, legendgram ...

Pro:

  • It's as modular as it can be, as it differentiates by functionality (which pysal library the vis relates to) and by technology (the backend to use).
  • It comes with a structure that is more or less defined and aligns directly with the pysal structure and with the greater vis eco-system in Python, putting the former first. Every visualisation will be hierarchically linked to the submodule it adds functionality to (or gets its analytics from). This way, if somebody is not familiar with splot but knows pysal and knows they want to visualise output from a particular submodule, it's (more or less) intuitive to find it.
  • It allows us to keep a single higher-level structure and add more backends as we move along. With this approach, if you add functionality based on vega, it is automatically located in the library based on what the functionality adds; you don't have to replicate the tree under a new submodule (e.g. splot.vega.giddy).

Con:

  • It's very very modular, so from the start it'd have a relatively complex set of files, hierarchies, etc. However, I think it'd be accommodating of many cases moving forward, with very little additional effort.
  • It forces us to think from the beginning, and every time, what the hierarchy of functionality is. As @slumnitz says "most statistical analysis seem to use more than one python subpackage,
    resulting in a need to import multiple visualisation subpackages". For example, directional LISA might use code from esda for the lisa and core for a choropleth. The lineage there is clear (basic rendering in core, choropleth shading in esda, directional functionality in giddy), but this network of dependencies might not always be as straightforward. I'll say though, I think this is a feature not a bug in that it'll help us think through also how functionality is structured among the greater PySAL eco-system 😜

OPTION 4:

from splot.dynamics import (rose_diagram, spaghetti_plot,
                            lisa_markov_plot, ...)
from splot.relation import (moran_scatterplot, bivariate_choropleth)
from splot.description import (choropleth)

Main difference: Conceptual structure for plots
Other variations:

  • scatterplots, mapping, composite, utility, ...

Pros:

Cons:

Note:

To date, the plan is for the underlying .py file structure to be separated by subpackage (or by the proposed pysal structure, meta2) and by backend:

  • e.g. _viz_lib_mpl.py, _viz_lib_bokeh.py
  • e.g. _viz_explore_mpl.py, _viz_explore_bokeh.py

Designing Functions for Splot - Parameters and Returns

OPTION 1:

Main difference: Moran_local calculated in plotting function
Consistent for all functions

Parameters

| Parameters              | mpl                                       | Bokeh             |
|-------------------------|-------------------------------------------|-------------------|
| main params ESDA        | gdf, attribute, w                         | gdf, attribute, w |
| main params Giddy       | gdf, timex, timey, w                      | tba               |
| Significance parameters | p                                         | p                 |
| Choro parameters        | method                                    | method, k         |
| Interactivity           | region_column, mask, mask_color, quadrant | hover_poly_id     |
| Figure design           | legend, cmap, alpha, legend_kwds          | reverse_colors    |
| Returns                 | fig, ax                                   | fig               |

mpl function examples:

moran_scatterplot(gdf, attribute, w,
                  p=0.05, ax=None,
                  alpha=0.6)

lisa_cluster(gdf, attribute, w,
             p=0.05, ax=None,
             legend=True, legend_kwds=None)

plot_local_autocor(gdf, attribute, w,
                   p=0.05,
                   region_column=None, mask=None,
                   mask_color='#636363', quadrant=None,
                   method='Quantiles',
                   legend=True, cmap='YlGnBu')

space_time_heatmap(gdf, timex, timey, w,
                   p=0.05, ax=None)

space_time_correl(gdf, timex, timey, w,
                  p=0.05)

Bokeh function examples:

plot_choropleth(gdf, attribute,
                method='quantiles', k=5,
                reverse_colors=False, 
                hover_poly_id='')

moran_scatterplot(gdf, attribute, w,
                  p=0.05,
                  hover_poly_id='')

lisa_cluster(gdf, attribute, w,
             p=0.05,
             hover_poly_id='')

plot_local_autocor(gdf, attribute, w,
                   p=0.05,
                   method='quantiles', k=5,
                   reverse_colors=False,
                   hover_poly_id='')
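Whichever signature wins, the quantity a moran_scatterplot draws is the attribute plotted against its spatial lag. A toy, dependency-free sketch of that quantity (the dict-of-neighbour-lists weights structure and the `spatial_lag` helper name are illustrative stand-ins, not PySAL's `W` API):

```python
def spatial_lag(w, y):
    """Row-standardised spatial lag: the mean of each unit's neighbours.

    w maps each observation index to a list of neighbour indices; this
    stands in for a PySAL weights object purely to illustrate the
    quantity on the scatterplot's y-axis.
    """
    return [sum(y[j] for j in w[i]) / len(w[i]) for i in sorted(w)]

y = [1.0, 2.0, 3.0, 4.0]                     # attribute values
w = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # neighbours on a line
lag = spatial_lag(w, y)
# lag == [2.0, 2.0, 3.0, 3.0]; the scatterplot plots y against lag
```

Either API option then only differs in whether this computation happens inside the plotting function (Option 1) or is taken from a precomputed Moran_Local object (Option 2).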

OPTION 2:

Main difference: moran_loc as input
A: consistent for all functions
B: simplified for composite plots

Parameters

| Parameters              | mpl                                       | Bokeh             |
|-------------------------|-------------------------------------------|-------------------|
| main params ESDA        | gdf, attribute, moran_loc                 | gdf, attribute, w |
| main params Giddy       | gdf, moran_loc1, moran_loc2, w, y1, y2    | tba               |
| Significance parameters | p                                         | p                 |
| Choro parameters        | method                                    | method, k         |
| Interactivity           | region_column, mask, mask_color, quadrant | hover_poly_id     |
| Figure design           | legend, cmap, alpha, legend_kwds          | reverse_colors    |
| Returns                 | fig, ax                                   | fig               |

mpl function examples:

moran_scatterplot(moran_loc,
                  p=0.05, ax=None,
                  alpha=0.6)

lisa_cluster(gdf, moran_loc,
             p=0.05, ax=None,
             legend=True, legend_kwds=None)

plot_local_autocor(gdf, attribute, moran_loc,
                   p=0.05,
                   region_column=None, mask=None,
                   mask_color='#636363', quadrant=None,
                   method='Quantiles',
                   legend=True, cmap='YlGnBu')

space_time_heatmap(moran_loc1, moran_loc2,
                   p=0.05, ax=None)

A:

space_time_correl(gdf, moran_loc1, moran_loc2, w, y1, y2,
                  p=0.05)

B:

space_time_correl(gdf, timex, timey,
                  p=0.05)

Bokeh function examples:

plot_choropleth(gdf, attribute,
                method='quantiles', k=5,
                reverse_colors=False, 
                hover_poly_id='')

moran_scatterplot(moran_loc,
                  p=0.05,
                  hover_poly_id='')

lisa_cluster(gdf, moran_loc,
             p=0.05,
             hover_poly_id='')

plot_local_autocor(gdf, attribute, moran_loc,
                   p=0.05,
                   method='quantiles', k=5,
                   reverse_colors=False,
                   hover_poly_id='')

Notes:

  • title, x and y axis labels can be added to both the mpl and Bokeh versions,
    e.g. Bokeh:
    fig = figure()  # or the fig returned by lisa_cluster(...)
    t = Title()
    t.text = 'LISA Cluster Map'
    fig.title = t
    
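The matplotlib counterpart of the Bokeh snippet above would use the fig, ax return convention decided earlier in this thread; here a plain subplot stands in for the (hypothetical) output of lisa_cluster:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only needed for this example
import matplotlib.pyplot as plt

# in practice: fig, ax = lisa_cluster(gdf, moran_loc, p=0.05)
fig, ax = plt.subplots()
ax.set_title("LISA Cluster Map")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

Because mpl functions return fig, ax, users can keep customising the axes after the splot call without splot exposing every design option itself.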

Discussion points:

  • How many 'figure design' options (title, xlabel, ylabel, ...) to include in
    - [ ] Bokeh
    - [ ] mpl
  • Which option (1, 2A, 2B) for continuous design of functions? (For now)
  • function design connected to overall API package design:
    • inputs consistent for all functions
    • inputs consistent for visualisation functions in specific subpackages

This is fantastic @slumnitz!!! Thank you very much for taking the plunge and having a first go!

On the general API design, my sense would be to have a hierarchical structure that begins with the pysal submodule (concept first) and continues with backend next (tool second). I guess this is an Option 5 🤣, but closest to your Option 3 I think:

from splot.esda.mpl import Moran_scatterplot
from splot.giddy.bk import Rose_plot

Pro

  • It's as modular as it can be, as it differentiates by functionality (which pysal library the vis relates to) and by technology (the backend to use).
  • It comes with a structure that is more or less defined and aligns directly with the pysal structure and with the greater vis eco-system in Python, putting the former first. Every visualisation will be hierarchically linked to the submodule it adds functionality to (or gets its analytics from). This way, if somebody is not familiar with splot but knows pysal and knows they want to visualise output from a particular submodule, it's (more or less) intuitive to find it.
  • It allows us to keep a single higher-level structure and add more backends as we move along. With this approach, if you add functionality based on vega, it is automatically located in the library based on what the functionality adds; you don't have to replicate the tree under a new submodule (e.g. splot.vega.giddy).

Con

  • It's very very modular, so from the start it'd have a relatively complex set of files, hierarchies, etc. However, I think it'd be accommodating of many cases moving forward, with very little additional effort.
  • It forces us to think from the beginning, and every time, what the hierarchy of functionality is. As @slumnitz says "most statistical analysis seem to use more than one python subpackage,
    resulting in a need to import multiple visualisation subpackages". For example, directional LISA might use code from esda for the lisa and core for a choropleth. The lineage there is clear (basic rendering in core, choropleth shading in esda, directional functionality in giddy), but this network of dependencies might not always be as straightforward. I'll say though, I think this is a feature not a bug in that it'll help us think through also how functionality is structured among the greater PySAL eco-system 😜

As for designing functions, again I think this is a great start @slumnitz. My sense would be to start by the functionality we want to support. For example, maybe we could say that the lead developer of each package could provide the list of visualisations to work on. We could start with esda and giddy as you've done. Then think of what features to support ideally, and then adapt those to the possibilities of each backend.

For example, for esda:

  • Choropleth functionality for each algorithm in esda.mapclassify.
  • Spatial Autocorrelation:
    • Global Moran scatter plot (splot.esda.Moran_Scatterplot?)
    • Global Moran Bivariate scatter plot (splot.esda.Moran_BV_Scatterplot?)
    • Global Moran Bivariate scatter plot facetting matrix (splot.esda.Moran_BV_facet?)
    • Global Moran Rate scatter plot (splot.esda.Moran_Rate_Scatterplot?)
    • Local Moran map (splot.esda.Moran_Local_Map?)
    • Local Moran Bivariate map (splot.esda.Moran_Local_BV_Map?)
    • Local Moran Rate map (splot.esda.Moran_Local_Rate_Map?)
    • Local Getis-Ord map (splot.esda.G_Local_Map?)
  • Choropleth functionality for each algorithm in esda.smoothing (splot.esda.smoothing.XXX_Map?, which could take any of the choropleth options as an argument to classify colors.)

This is definitely bikeshedding, but I would prefer one name and a backend flag/option that defaults to matplotlib. I do not think it makes sense to build in the other package names and things directly into a namespace in the API, although it does make sense from a function dispatch and a file organization perspective.

Again, consider that most users are going to focus on getting the plot and not what rendering engine it uses. To me, this would be like if your calls in matplotlib required you to specify pyplot.agg.plot versus pyplot.tk.plot or pyplot.qt.plot. We need to do this in a way that makes this as invisible as possible.

This is option 2 in the first comment so far as I understand it. Also, under option 2, we could also set this using some kind of splot.backend() function so that you could switch to bokeh (or mpl) once and never think about it again.

I like this because I think it's the smallest maintainable API, it's used elsewhere, and I don't think the cons are accurately assessed?

return to moran_scatterplot(backend='bk')?

yes, or possibly letting a splot.backend() adjust the default so that you could just do this once.

installation of both bokeh, mpl (and more) needed

No, I don't see this as needed any more than other cases:

if backend.lower() == 'bokeh': 
    from splot.whatever import plot_bokeh_stuff
    plot_bokeh_stuff
elif backend.lower() == 'matplotlib':
    from splot.whatever import plot_matplotlib_stuff
    plot_matplotlib_stuff

This would require that something be installed if the user asks for it explicitly. If the user asks for it and doesn't have it, then (and only then) will it error.

Challenging to decide which function is found under which namespace if visualisations are used for multiple purposes

Expose a plotting function everywhere where it might be used. If we go for a semantically-meaningful API (rather than a program structure API) then I think this makes sense. Anywhere where we'd expect a user to want a Moran plot, expose it.

Or, instead of tying functions to each package, we could take a stab at making a conceptual structure for the plots themselves, like dynamics (containing rose diagrams, spaghetti plots, lisa markov plots, etc) , relation (containing Moran scatters, bivariate choropleths), description (containing choropleths? not sure).

Most statistical analyses seem to use more than one Python subpackage, resulting in a need to import multiple visualisation subpackages; question whether it is necessary to subdivide?

Not sure how this is a bad thing? Again, if we're composing a semantically meaningful api and need to expose moran_scatterplot in the esda module for plotting data and also in the spreg module for plotting residuals, I think that's fine.

@darribas and @ljwolf thank you for your comments and additions! I added your comments and suggestions to the option overview.

I think making a detailed overview of visualisations to support each subpackage is a good idea. I started a collection space here: #10.

Looking forward to discuss more ideas tomorrow!

This seems to be getting fleshed out already, but figured I'd just add my quick two cents. I agree with not including package names/backends in the namespaces. Even though it is the most modular, it requires users to do a lot more thinking about what they want to import. I really like the simplicity of relying on default backend, and just being able to change it if you are interested in doing so.

I also agree with exposing plotting functions wherever it is thought that they might be useful. This is flexible for future cases say where it is realized that an existing plot is useful in a new package or once new features are added to a package.

In the other option, wouldn't it make the imports redundant? Say if for the case of plotting both for data and then residuals:

from splot.esda.mpl import Moran_scatterplot
from splot.spreg.mpl import Moran_scatterplot

Seems like this can be closed now. Please reopen if I am jumping the gun here.