Package Structure and API design
slumnitz opened this issue · 9 comments
Ideas and decisions regarding the design of package structure and API for splot
Overall package idea
- support PySAL with lightweight plotting functionality:
  - `.plot` methods for objects, called from `splot`, and
  - functionality found under the `splot.sub_package` namespace, depending on the PySAL object that is plotted
- based on `matplotlib` and `geopandas`
- `splot.mapping` offers tools for choropleth, color and mapping support
Decisions (indicating implementation status):
- integration of `splot.plot.py` into `splot._viz_mpl.py`
- integration of `giddy` visualisations in `splot`
- update and integrate functionality from `mapping.py` into `splot._viz_mpl.py`
- `.mpl` functions return `fig, ax`
- API structure of `splot` (see the API - Package Structure comment):
  - hierarchy with submodules named after the statistics visualised (`splot.giddy`), with a common `_viz_utils.py` for utility functions used by all subpackages
  - integrating `splot` functionality as `.plot` methods in subpackages (see #10 for supported functionality)
- degree of customisation the user can access: high degree of customisation through keyword dictionaries
- dependency on:
  - Matplotlib only
  - Geopandas
  - Seaborn
- documentation in accordance with the general PySAL [submodule contract](http://pysal.org/getting_started.html#submodule-contract)
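The decision to offer customisation through keyword dictionaries could look roughly like the sketch below. It is only an illustration: the helper `merge_kwds`, the parameter names `scatter_kwds` and `fitline_kwds`, and the default colours are assumptions made for this example, not names taken from the thread. The idea is that each plot element gets its own dictionary, and user values override package defaults without mutating them.

```python
def merge_kwds(defaults, user_kwds=None):
    """Return a new dict: package defaults overridden by user-supplied options."""
    merged = dict(defaults)  # copy, so the defaults are never mutated
    if user_kwds is not None:
        merged.update(user_kwds)
    return merged


def moran_scatterplot_options(scatter_kwds=None, fitline_kwds=None):
    """Hypothetical helper: resolve the styling a plotting function would apply.

    A real plotting function would pass these dicts on to the backend calls
    that draw the points and the fitted line.
    """
    scatter = merge_kwds({"color": "#bababa", "alpha": 0.6}, scatter_kwds)
    fitline = merge_kwds({"color": "#d6604d"}, fitline_kwds)
    return scatter, fitline


# The user overrides only what they care about; everything else keeps its default.
scatter, fitline = moran_scatterplot_options(scatter_kwds={"alpha": 0.3})
print(scatter)
print(fitline)
```

Keeping one dictionary per plot element avoids a long flat signature while still exposing every backend option to the user.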
To be decided:
Open questions/ ideas:
Rejected options:
- API structure of `splot` (see the API - Package Structure comment):
  - hierarchy with submodules, one per plotting package (now `splot.mpl` and `splot.bk`); choose function names according to the statistics they help to visualise (e.g. `moran_scatterplot()`, `local_autocorrelation_statistics()`, ...)
- API structure of `splot` (see the API - Package Structure comment):
  - same implementations of visualisations in each (e.g. `.mpl`) submodule:
    - e.g. `mplot`, `plot_choropleth`, `plot_local_autocorrelation`, ... for `bk` and `mpl`
    - except if one plotting package doesn't have the needed features (e.g. if interactivity is essential, `matplotlib` may not be able to create such a plot)
- keywords common to all functions in the same order:
  - e.g. `...(moran_loc, df, attribute, p, region_column, plot_width, plot_height, ...)`
  - for `splot.mpl` only: `ax`
  - for `splot.bk` only: `reverse_colors`
- `plot(method="interactive")` option, more flexibility by integrating submodules instead of function calls
API - Package Structure
OPTION 1:
from splot.mpl import moran_scatterplot
from splot.bk import moran_scatterplot
Pro:
- User can choose to install either `bokeh`, or `mpl`, or both
- functions are found by function name, so it is clear where to look; no confusion about e.g. why `lisa_cluster()` is in `esda` and not in `giddy`, even if it is also used in `giddy` visualisation
- Easy to add other visualization packages like Altair (in the fast-paced visualization space), e.g. `from splot.altair import choropleth`
Con:
- Challenge in choosing function names so that the user can easily see which visualisations to use for which subpackage
- Question of consistency of parameters used in functions, e.g. different parameters used for visualisations in different subpackages.
OPTION 2:
from splot.giddy import directional_lisa
from splot.esda import directional_lisa
from splot.utility import ...
With options:
splot.backend()
if backend.lower() == 'bokeh':
    from splot.whatever import plot_bokeh_stuff
    plot_bokeh_stuff
elif backend.lower() == 'matplotlib':
    from splot.whatever import plot_matplotlib_stuff
    plot_matplotlib_stuff
moran_scatterplot(backend='bk')
- function used in multiple subpackages assigned to multiple subpackages
Pro:
- Visualisation functions within a namespace representing a subpackage (e.g. `giddy`) are called with similar or the same parameters; therefore the namespace `splot.esda` (mpl version) would be more consistent than `splot.mpl` (containing all functions)
Con:
- Most statistical analyses seem to use more than one Python subpackage, resulting in a need to import multiple visualisation subpackages; question whether it is necessary to subdivide? Resolved with additional options (see above)
- How to design access to interactive or static functionality?
  - return to `moran_scatterplot(backend='bk')`?
  - installation of both `bokeh`, `mpl` (and more) needed
- Challenging to decide which function is found under which namespace if visualisations are used for multiple purposes
- If only one PySAL subpackage is used, it is clear where to find its visualisations
OPTION 3:
from splot.giddy.bk import ...
from splot.esda.mpl import ...
from splot.utility import ...  # e.g. mapclassify, legendgram
Pro:
- It's as modular as it can be, as it differentiates by functionality (which `pysal` library the vis relates to) and by technology (the backend to use).
- It comes with a more or less defined structure that aligns directly with the `pysal` structure and with the greater vis ecosystem in Python, putting the former first. Every visualisation will be hierarchically linked to the submodule it adds functionality to (or gets its analytics from). This way, if somebody is not familiar with `splot` but knows `pysal` and knows they want to visualise output from a particular submodule, it's (more or less) intuitive to find it.
- It allows us to keep a single higher-level structure and add more backends as we move along. With this approach, if you add functionality based on `vega`, it is automatically located in the library based on what the functionality adds; you don't have to replicate the tree under a new submodule (e.g. `splot.vega.giddy`).
Con:
- It's very, very modular, so from the start it'd have a relatively complex set of files, hierarchies, etc. However, I think it'd accommodate many cases moving forward with very little additional effort.
- It forces us to think from the beginning, and every time, about what the hierarchy of functionality is. As @slumnitz says, "most statistical analysis seem to use more than one python subpackage, resulting in a need to import multiple visualisation subpackages". For example, directional LISA might use code from `esda` for the LISA and `core` for a choropleth. The lineage there is clear (basic rendering in `core`, choropleth shading in `esda`, directional functionality in `giddy`), but this network of dependencies might not always be as straightforward. I'll say though, I think this is a feature, not a bug, in that it'll also help us think through how functionality is structured in the greater PySAL ecosystem 😜
OPTION 4:
from splot.dynamics import (rose diagrams, spaghetti plots,
lisa markov plots...)
from splot.relation import (Moran_scatterplot, bivariate choropleth)
from splot.description import (Choropleth)
Main difference: Conceptual structure for plots
Other variations: `scatterplots`, `mapping`, `composite`, `utility`, ...
Pros:
Cons:
Note:
To date, the underlying `.py` file structure will be separated by subpackage (or by the proposed pysal structure, meta2) and by backend:
- e.g. `_viz_lib_mpl.py`, `_viz_lib_bokeh.py`
- e.g. `_viz_explore_mpl.py`, `_viz_explore_bokeh.py`
Designing Functions for Splot - Parameters and Returns
OPTION 1:
Main difference: Moran_local calculated in plotting function
Consistent for all functions
Parameters
Parameters | mpl | Bokeh |
---|---|---|
main params ESDA | gdf, attribute, w | gdf, attribute, w |
main params Giddy | gdf, timex, timey, w | tba |
Significance parameters | p | |
Choro parameters | method | method, k |
Interactivity | region_column, mask, mask_color, quadrant | hover_poly_id |
Figure design | legend, cmap, alpha, legend_kwds | reverse_colors |
---------------------------- | ------------------------------------------- | ------------------- |
Returns | fig, ax | fig |
mpl function examples:
moran_scatterplot(gdf, attribute, w,
                  p=0.05, ax=None,
                  alpha=0.6)
lisa_cluster(gdf, attribute, w,
             p=0.05, ax=None,
             legend=True, legend_kwds=None)
plot_local_autocor(gdf, attribute, w,
                   p=0.05,
                   region_column=None, mask=None,
                   mask_color='#636363', quadrant=None,
                   method='Quantiles',
                   legend=True, cmap='YlGnBu')
space_time_heatmap(gdf, timex, timey, w,
                   p=0.05, ax=None)
space_time_correl(gdf, timex, timey, w,
                  p=0.05)
Bokeh function examples:
plot_choropleth(gdf, attribute,
                method='quantiles', k=5,
                reverse_colors=False,
                hover_poly_id='')
moran_scatterplot(gdf, attribute, w,
                  p=0.05,
                  hover_poly_id='')
lisa_cluster(gdf, attribute, w,
             p=0.05,
             hover_poly_id='')
plot_local_autocor(gdf, attribute, w,
                   p=0.05,
                   method='quantiles', k=5,
                   reverse_colors=False,
                   hover_poly_id='')
OPTION 2:
Main difference: moran_locals as input
A: Consistent for all functions
B: simplified for composite plots
Parameters
Parameters | mpl | Bokeh |
---|---|---|
main params ESDA | gdf, attribute, moran_loc | gdf, attribute, w |
main params Giddy | gdf, moran_loc1, moran_loc2, w, y1, y2 | tba |
Significance parameters | p | |
Choro parameters | method | method, k |
Interactivity | region_column, mask, mask_color, quadrant | hover_poly_id |
Figure design | legend, cmap, alpha, legend_kwds | reverse_colors |
---------------------------- | ------------------------------------------- | ------------------- |
Returns | fig, ax | fig |
mpl function examples:
moran_scatterplot(moran_loc,
                  p=0.05, ax=None,
                  alpha=0.6)
lisa_cluster(gdf, moran_loc,
             p=0.05, ax=None,
             legend=True, legend_kwds=None)
plot_local_autocor(gdf, attribute, moran_loc,
                   p=0.05,
                   region_column=None, mask=None,
                   mask_color='#636363', quadrant=None,
                   method='Quantiles',
                   legend=True, cmap='YlGnBu')
space_time_heatmap(moran_loc1, moran_loc2,
                   p=0.05, ax=None)
A:
space_time_correl(gdf, moran_loc1, moran_loc2, w, y1, y2,
                  p=0.05)
B:
space_time_correl(gdf, timex, timey,
                  p=0.05)
Bokeh function examples:
plot_choropleth(gdf, attribute,
                method='quantiles', k=5,
                reverse_colors=False,
                hover_poly_id='')
moran_scatterplot(moran_loc,
                  p=0.05,
                  hover_poly_id='')
lisa_cluster(gdf, moran_loc,
             p=0.05,
             hover_poly_id='')
plot_local_autocor(gdf, attribute, moran_loc,
                   p=0.05,
                   method='quantiles', k=5,
                   reverse_colors=False,
                   hover_poly_id='')
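One practical difference between Option 1 (statistic calculated inside the plotting function) and Option 2 (statistic passed in) can be sketched as follows. `LocalMoranStandIn` is a hypothetical placeholder that only counts how often it is constructed; it is not the real local Moran class, and the `_v1`/`_v2` function names are invented for this comparison.

```python
class LocalMoranStandIn:
    """Hypothetical stand-in for a local Moran result; counts computations."""

    computations = 0

    def __init__(self, values):
        LocalMoranStandIn.computations += 1
        self.values = list(values)


# Option 1 style: every plotting function computes the statistic itself.
def moran_scatterplot_v1(values):
    return len(LocalMoranStandIn(values).values)  # placeholder for drawing


def lisa_cluster_v1(values):
    return len(LocalMoranStandIn(values).values)  # placeholder for drawing


# Option 2 style: the caller computes the statistic once and passes it in.
def moran_scatterplot_v2(moran_loc):
    return len(moran_loc.values)  # placeholder for drawing


def lisa_cluster_v2(moran_loc):
    return len(moran_loc.values)  # placeholder for drawing


data = [1.0, 2.0, 3.0]

# Two plots under Option 1 trigger two computations.
moran_scatterplot_v1(data)
lisa_cluster_v1(data)
option1_computations = LocalMoranStandIn.computations

# Two plots under Option 2 share one precomputed object.
LocalMoranStandIn.computations = 0
moran_loc = LocalMoranStandIn(data)
moran_scatterplot_v2(moran_loc)
lisa_cluster_v2(moran_loc)
option2_computations = LocalMoranStandIn.computations

print(option1_computations, option2_computations)
```

Under these assumptions, Option 2 also guarantees that several plots describe exactly the same result object, at the cost of one extra step for the user.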
Notes:
- title, x and y axis labels can be added to both mpl and bokeh versions, e.g. Bokeh: `fig = figure()` / `lisa_cluster(...)`, `t = Title()`, `t.text = 'Lisa Cluster Map'`, `fig.title = t`
Discussion points:
- How many 'figure design' options (title, xlabel, ylabel, ...) to include in
  - [ ] Bokeh
  - [ ] mpl
- Which option (1, 2A, 2B) for continuous design of functions? (For now)
- function design connected to overall API package design:
  - inputs consistent for all functions
  - inputs consistent for visualisation functions in specific subpackages
This is fantastic @slumnitz!!! Thank you very much for taking the plunge and having a first go!
On the general API design, my sense would be to have a hierarchical structure that begins with the `pysal` submodule (concept first) and continues with the backend (tool second). I guess this is an Option 5 🤣, but closest to your Option 3 I think:
from splot.esda.mpl import Moran_scatterplot
from splot.giddy.bk import Rose_plot
Pro
- It's as modular as it can be, as it differentiates by functionality (which `pysal` library the vis relates to) and by technology (the backend to use).
- It comes with a more or less defined structure that aligns directly with the `pysal` structure and with the greater vis ecosystem in Python, putting the former first. Every visualisation will be hierarchically linked to the submodule it adds functionality to (or gets its analytics from). This way, if somebody is not familiar with `splot` but knows `pysal` and knows they want to visualise output from a particular submodule, it's (more or less) intuitive to find it.
- It allows us to keep a single higher-level structure and add more backends as we move along. With this approach, if you add functionality based on `vega`, it is automatically located in the library based on what the functionality adds; you don't have to replicate the tree under a new submodule (e.g. `splot.vega.giddy`).
Con
- It's very, very modular, so from the start it'd have a relatively complex set of files, hierarchies, etc. However, I think it'd accommodate many cases moving forward with very little additional effort.
- It forces us to think from the beginning, and every time, about what the hierarchy of functionality is. As @slumnitz says, "most statistical analysis seem to use more than one python subpackage, resulting in a need to import multiple visualisation subpackages". For example, directional LISA might use code from `esda` for the LISA and `core` for a choropleth. The lineage there is clear (basic rendering in `core`, choropleth shading in `esda`, directional functionality in `giddy`), but this network of dependencies might not always be as straightforward. I'll say though, I think this is a feature, not a bug, in that it'll also help us think through how functionality is structured in the greater PySAL ecosystem 😜
As for designing functions, again I think this is a great start @slumnitz. My sense would be to start with the functionality we want to support. For example, maybe we could say that the lead developer of each package could provide the list of visualisations to work on. We could start with `esda` and `giddy` as you've done. Then think of what features to support ideally, and then adapt those to the possibilities of each backend.
For example, for `esda`:
- Choropleth functionality for each algorithm in `esda.mapclassify`.
- Spatial Autocorrelation:
  - Global Moran scatter plot (`splot.esda.Moran_Scatterplot`?)
  - Global Moran Bivariate scatter plot (`splot.esda.Moran_BV_Scatterplot`?)
  - Global Moran Bivariate scatter plot faceting matrix (`splot.esda.Moran_BV_facet`?)
  - Global Moran Rate scatter plot (`splot.esda.Moran_Rate_Scatterplot`?)
  - Local Moran map (`splot.esda.Moran_Local_Map`?)
  - Local Moran Bivariate map (`splot.esda.Moran_Local_BV_Map`?)
  - Local Moran Rate map (`splot.esda.Moran_Local_Rate_Map`?)
  - Local Getis-Ord map (`splot.esda.G_Local_Map`?)
- Choropleth functionality for each algorithm in `esda.smoothing` (`splot.esda.smoothing.XXX_Map`?, which could take any of the choropleth options as an argument to classify colors.)
This is definitely bikeshedding, but I would prefer one name and a backend flag/option that defaults to matplotlib. I do not think it makes sense to build the other package names directly into a namespace in the API, although it does make sense from a function dispatch and a file organization perspective.
Again, consider that most users are going to focus on getting the plot and not on what rendering engine it uses. To me, this would be like if your calls in matplotlib required you to specify `pyplot.agg.plot` versus `pyplot.tk.plot` or `pyplot.qt.plot`. We need to do this in a way that makes this as invisible as possible.
This is option 2 in the first comment, so far as I understand it. Also, under option 2, we could set this using some kind of `splot.backend()` function so that you could switch to bokeh (or mpl) once and never think about it again.
I like this because I think it's the smallest maintainable API, it's used elsewhere, and I don't think the cons are accurately assessed?
return to moran_scatterplot(backend='bk')?
yes, or possibly letting a `splot.backend()` adjust the default so that you could just do this once.
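A "set it once" default combined with a per-call `backend=` keyword might be sketched like this. The names `set_backend` and `resolve_backend`, the module-level variable, and the list of known backends are assumptions for illustration; the thread only proposes "some kind of `splot.backend()` function". The plotting function is a stub that reports which backend would be used.

```python
_KNOWN_BACKENDS = ("matplotlib", "bokeh")  # assumed set of supported backends
_DEFAULT_BACKEND = "matplotlib"            # the proposal: default to matplotlib


def set_backend(name):
    """Change the session-wide default backend (hypothetical API)."""
    global _DEFAULT_BACKEND
    name = name.lower()
    if name not in _KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {name!r}")
    _DEFAULT_BACKEND = name


def resolve_backend(backend=None):
    """A per-call keyword wins; otherwise fall back to the session default."""
    return _DEFAULT_BACKEND if backend is None else backend.lower()


def moran_scatterplot(moran, backend=None):
    """Stub: a real implementation would dispatch to backend-specific code."""
    return "moran_scatterplot rendered with " + resolve_backend(backend)


print(moran_scatterplot(None))                        # uses the default
set_backend("bokeh")                                  # switch once ...
print(moran_scatterplot(None))                        # ... and forget about it
print(moran_scatterplot(None, backend="matplotlib"))  # per-call override
```

With this shape the function name stays the same across backends, which is the "one name plus a flag" preference expressed above.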
installation of both bokeh, mpl (and more) needed
No, I don't see this as needed any more than other cases:
if backend.lower() == 'bokeh':
    from splot.whatever import plot_bokeh_stuff
    plot_bokeh_stuff
elif backend.lower() == 'matplotlib':
    from splot.whatever import plot_matplotlib_stuff
    plot_matplotlib_stuff
This would require that something be installed if the user asks for it explicitly. If the user asks for it and doesn't have it, then (and only then) will it error.
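The "error only if the user asks for it and doesn't have it" behaviour can be sketched with a deferred import. In this example the mapping and the function `load_backend` are hypothetical, and standard-library module names stand in for the real plotting modules so the sketch runs without any plotting package installed: `json` plays the role of an installed backend, and a deliberately nonexistent module plays the role of a missing optional dependency.

```python
import importlib

# Hypothetical mapping from backend name to the module that implements it.
# Stand-ins: "json" is always importable; the second name does not exist.
_BACKEND_MODULES = {
    "matplotlib": "json",
    "bokeh": "backend_module_that_is_not_installed",
}


def load_backend(backend="matplotlib"):
    """Import the backend module only at the moment it is requested."""
    try:
        module_name = _BACKEND_MODULES[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Raised only when this particular backend was explicitly requested.
        raise ImportError(
            f"backend {backend!r} was requested but is not installed"
        ) from err


default_module = load_backend()  # works: the default backend is available

try:
    load_backend("bokeh")        # fails, but only because it was asked for
except ImportError as err:
    print(err)
```

Nothing is imported at package import time, so a user who never asks for the second backend never needs it installed.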
Challenging to decide which function is found under which namespace if visualisations are used for multiple purposes
Expose a plotting function everywhere where it might be used. If we go for a semantically-meaningful API (rather than a program structure API) then I think this makes sense. Anywhere where we'd expect a user to want a Moran plot, expose it.
Or, instead of tying functions to each package, we could take a stab at making a conceptual structure for the plots themselves, like `dynamics` (containing rose diagrams, spaghetti plots, lisa markov plots, etc.), `relation` (containing Moran scatters, bivariate choropleths), `description` (containing choropleths? not sure).
Most statistical analysis seem to use more than one python subpackage, resulting in a need to import multiple visualisation subpackages question if it is necessary to subdivide?
Not sure how this is a bad thing? Again, if we're composing a semantically meaningful API and need to expose `moran_scatterplot` in the `esda` module for plotting data and also in the `spreg` module for plotting residuals, I think that's fine.
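Exposing one plotting function in several namespaces does not require duplicating its implementation. The sketch below builds two in-memory module objects to stand in for `esda` and `spreg` namespaces; the module names and the stub function body are assumptions for illustration. In a real package the same effect would come from each submodule importing the shared function.

```python
import types


def moran_scatterplot(data, **kwargs):
    """Stub for a single shared implementation."""
    return ("moran_scatterplot", data, kwargs)


# Hypothetical namespaces, created in memory so the sketch is self-contained.
esda = types.ModuleType("splot_sketch.esda")
spreg = types.ModuleType("splot_sketch.spreg")

# Expose the same function object wherever a user might look for it.
for namespace in (esda, spreg):
    namespace.moran_scatterplot = moran_scatterplot

# Both names point at one implementation, so there is nothing to keep in sync.
print(esda.moran_scatterplot is spreg.moran_scatterplot)
print(spreg.moran_scatterplot("residuals"))
```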
@darribas and @ljwolf thank you for your comments and additions! I added your comments and suggestions to the option overview.
I think making a detailed overview of visualisations to support each subpackage is a good idea. I started a collection space here: #10.
Looking forward to discussing more ideas tomorrow!
This seems to be getting fleshed out already, but figured I'd just add my quick two cents. I agree with not including package names/backends in the namespaces. Even though it is the most modular, it requires users to do a lot more thinking about what they want to import. I really like the simplicity of relying on a default backend, and just being able to change it if you are interested in doing so.
I also agree with exposing plotting functions wherever it is thought that they might be useful. This is flexible for future cases, say where it is realized that an existing plot is useful in a new package or once new features are added to a package.
In the other option, wouldn't it make the imports redundant? Say, for the case of plotting both data and then residuals:
from splot.esda.mpl import Moran_scatterplot
from splot.spreg.mpl import Moran_scatterplot
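The two imports above bind the same name twice, so the second one silently replaces the first. The runnable sketch below shows that behaviour with two unrelated standard-library functions that happen to share the name `join`; they are stand-ins chosen only because they exist everywhere, not anything from `splot`.

```python
import os.path
import shlex

from os.path import join  # stand-in for: from splot.esda.mpl import Moran_scatterplot
from shlex import join    # stand-in for: from splot.spreg.mpl import Moran_scatterplot

# The second import rebinds `join`; the first function is no longer reachable by that name.
shadowed = join is shlex.join and join is not os.path.join
print(shadowed)

# Keeping both available requires explicit aliases at every import site.
from os.path import join as path_join
from shlex import join as shell_join

print(path_join is os.path.join, shell_join is shlex.join)
```

If both namespaces exposed the very same function object, the rebinding would be harmless; it only matters when the implementations differ.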
Seems like this can be closed now. Please reopen if I am jumping the gun here.