CartoDB/cartoframes

v1.0 approach: pure functional vs subclass

Closed this issue · 8 comments

The goal of this ticket is to make a decision looking forward to the next stable release v1.0 in January. I prefer having this discussion open, instead of offline, to avoid starting every time a new member enters the debate. I think we have made good progress grouping all the possible approaches and we have improved a lot the library along the way, so this is the final cherry to complete the 1.0 cake.

The different approaches are:

Pure functional

It provides only functions with the CARTO features (visualization, data management, etc). The user uses pandas and geopandas to create DataFrames and GeoDataFrames. Every function accepts DF(+geom_col) or GDF and returns a GDF.

Subclass

This approach includes the pure functional approach, and adds a new class CartoDataFrame which extends DataFrame and GeoDataFrame and provides quick access to CARTO features: cdf.to_carto(), cdf.viz(). Every function accepts DF, GDF and CDF and returns a CDF.

Accessor

This consists of adding an object with the carto features to both DF and GDF: df.carto.to_carto(), gdf.carto.viz().

We don't have time before the 1.0 release to analyze in-depth this approach, but it's something we can study and add in future versions.

Pandas-flavor

This is a way of "monkey-patching" a GDF to add those quick access functions of the CDF: gdf.to_carto(), gdf.viz().

We also don't have time to analyze this, but we can do it after the release.

Composition

This proposal consists of creating a custom class CartoFrame which contains a DF or GDF: cf.frame. Generally, this is the less appreciated option, also we have been there somehow in the past (dataset.dataframe).

Let's focus

After describing all the approaches, it's clear that, for the next stable release, we must focus on the first two: pure functional and subclass. So we discard the other ones for this release, but they can be tested and evaluated more in the future.

I'll present here some examples of both approaches, trying to point out the differences. In order to be efficient, I wouldn't look at this as a "which one do you prefer" question, but a "what do we need to change/add to be happy".

Pure functional

Advantages

  • Provides the basics with plain functions
  • We can add other approaches in the future
  • It does not add new concepts

Disadvantages

  • More verbose (more imports)
  • User needs to handle geom decoding

Neutral

  • Explicit use of pandas and geopandas
from pandas import read_csv

df = read_csv('file.csv')

from geopandas import GeoDataFrame, points_from_xy
from cartoframes.utils import decode_geometry

gdf = GeoDataFrame(df, geometry=decode_geometry(df['the_geom']))
# gdf = GeoDataFrame(df, geometry=points_from_xy(df['longitude'], df['latitude']))

from cartoframes import Map, Layer

Map(Layer(gdf))

from cartoframes import to_carto, read_carto
from cartoframes.auth import Credentials

creds = Credentials.from_file('...')
to_carto(gdf, 'table_name', creds)

gdf2 = read_carto('table_name', creds)

from geopandas import to_file

to_file(gdf2, 'file.geojson')

Subclass

Advantages

  • Less verbose (fewer imports)
  • Includes also the functional approach
  • Cleaner API, better for product marketing

Disadvantages

  • Difficult deprecation (if so) path after 1.0
  • A new concept to introduce/learn

Neutral

  • Compatible with accessor and pandas-flavor approaches
from cartoframes import CartoDataFrame
from cartoframes.auth import Credentials

cdf = CartoDataFrame.from_csv('file.csv', geometry='the_geom')
# cdf = cdf.set_geometry_from_xy('longitude', 'latitude')

cdf.viz()

creds = Credentials.from_file('...')
cdf.to_carto('table_name', creds)

CartoDataFrame.from_carto('table_name', creds).to_file('file.geojson')

Conclusion

I would love to hear everyone's opinion. The good point is that we all want to deliver the best to our users, so let's do it 💪

For me, it should be pure functional with theses improvements to remove the disadvantages:

from cartoframes import read_csv_xy, read_carto, read_file, to_carto, to_file, read_csv_geom
gdf = read_carto('table')
# gdf = read_file('file.shp')  # Wrapper of geopandas
# gdf = read_file('file.geojson')  # Wrapper of geopandas
# gdf = read_csv_xy('file.csv', 'longitude', 'latitude')
# gdf = read_csv_geom('file.csv', geom_column='the_geom', format='wkb')
from cartoframes import Map, Layer
Map(Layer(gdf))
to_carto(gdf, 'table_name')
gdf2 = read_carto('table_name')
to_file(gdf2, 'file.geojson')

What I propose is to have helper functions (in some cases just wrappers of GeoPandas) to reduce the impact of verbosity and the creation of geometries.

I'd be quite direct in the doc telling that the function is a wrapper (adding a link to pandas doc). In general, I don't like to do wrappers when they don't provide additional value. However, in this case, it reduces verbosity and it's easier for users who don't know too much about GeoPandas and only use CARTOframes for visualization

EDIT: Approach to reduce the list of functions to import

import cartoframes as cf
gdf = cf.read_carto('table')
# gdf = cf.read_file('file.shp')  # Wrapper of geopandas
# gdf = cf.read_file('file.geojson')  # Wrapper of geopandas
# gdf = cf.read_csv_xy('file.csv', 'longitude', 'latitude')
# gdf = cf.read_csv_geom('file.csv', geom_column='the_geom', format='wkb')
from cartoframes import Map, Layer
Map(Layer(gdf))
cf.to_carto(gdf, 'table_name')
gdf2 = cf.read_carto('table_name')
cf.to_file(gdf2, 'file.geojson')

My idea was to go with both approaches, checking in the future what works better for users. I think we are close to have both working at the same time, using one without knowing the other. For that, we would still need to do things:

  • CartoDataFrame: we should add enrichment, geocoding, ... to the class
  • functional: we should remove "empty" classes to be more functional like Enrichment, Geocode, Isolines (these 2 are not totally "empty")

But I agree ideally go with just one is better. So, if I have to select one, I would go with the functional one because:

  • it is easier for us to do it and it is easier for users to understand it (I understand that geoPandas is better known)
  • we can add the class in the future breaking nothing

After an internal meeting, we’ve finally decided to go with the pure functional approach and delay for one week the v1.0 of CARTOframes. @jorisvandenbossche please, could you take a look at the above approach to reduce the disadvantages and give us feedback?

I agree on the functional approach and, while I think it'd be nice to have methods instead of some functions (such as df.to_carto), we can always register these later, via an accessor or using Pandas-flavor, as @jorisvandenbossche suggested

OK, let's go with this approach. It's the most conservative one, so we can explore and incrementally extend the integration with pandas/geopandas in future releases.

Regarding the changes and the final API. Although wrapping the pandas/geopandas functions could be useful in some situations I would try to avoid doing this and keep pure pandas/geopandas functions whenever possible. Otherwise, we need to keep every method synced and also duplicate the documentation in our library.

Since we are not going to extend or wrap pandas/geopandas, I would try to provide only what we are adding new in CARTOframes: auth, viz, data, IO methods, and the decoding part. For this last part, I would add in cartoframes.utils the following utility functions (I will use them in the refactor of the lib):

from cartoframes.utils import decode_geometry, set_geometry, set_geometry_from_xy

# The user can decode a "str" geometry column
geometry = decode_geometry(df['the_geom'])

# The user can set a geometry column with implicit decoding
gdf = GeoDataFrame(df)
set_geometry(gdf, 'the_geom', inplace=True)

# The use can set a geometry column from xy
gdf = GeoDataFrame(df)
set_geometry_from_xy(gdf, 'lng', 'lat', inplace=True)

NOTE: set_geometry and set_geometry_from_xy are optional helpers we can provide or not. Only decode_geometry is really required.

We will be able to implement the required changes next week, so it would be nice if we can close the definition of the changes for that moment.

Sounds good to me

OK, let's go with the simplest approach and then we can incrementally add "utility" methods if required.

@jorisvandenbossche please, could you take a look at the above approach to reduce the disadvantages and give us feedback?

Sorry for the slow reply here (the notification got a bit lost during the holidays).

So in general I am of course supportive of the approach! Regarding the verbosity of creating a GeoDataFrame from certain sources (mainly csv I think?), where you need to use decode_geometry or points_from_xy or ...: I think it can certainly make sense to provide some wrapper function in CARTOframes that makes this easier. But also feel free to propose something in GeoPandas to improve this use case.

And congrats on the 1.0 release!