datadesk/census-data-downloader

Add additional methods to base classes to let users support additional sources

ghing opened this issue · 7 comments

ghing commented

This is somewhat related to #2.

I find this project to be extremely useful and a great framework for a task that I have to do often. In my projects, I've found myself using the base classes and concepts from this project when I want to download and process data from other Census Bureau API sources.

However, for non-ACS sources, I find myself entirely reimplementing many of the methods on my geotype downloader classes because the changes in functionality aren't possible by just calling super() and then adding additional logic.

I think adding these methods to BaseGeoTypeDownloader could make adding additional data sources easier, both in this project, and for other users in their own projects:

  • BaseGeoTypeDownloader.get_api_client(): This would be called from the constructor to set self.api and allow subclasses to specify a customized subclass of census.Census that supports additional API endpoints.
  • BaseGeoTypeDownloader.get_field_type_map(): This would be similar to BaseGeoTypeDownloader.get_raw_field_map() except it would map from raw field names to types that would be passed to pd.Series.astype(). Like BaseGeoTypeDownloader.get_raw_field_map(), this would be called from BaseGeoTypeDownloader.process() when setting the column types after reading in the raw table. The implementation could check for the existence of a FIELD_TYPES attribute on the table configuration class, and if that doesn't exist, default to the existing logic for ACS tables that checks the field name suffix. Adding the ability to explicitly set type conversions allows supporting non-ACS tables that might have field names that don't have the same suffix convention as ACS tables.
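
As a rough illustration of the two hooks described above, here is a minimal sketch. It uses simplified stand-in classes rather than the library's real code: `StubCensusClient`, `SketchGeoTypeDownloader`, the `RAW_FIELD_MAP` config attribute, and the example field names are all hypothetical, and the fallback assumes ACS-style field names ending in `E` (estimate) or `M` (margin of error) should both be treated as floats.

```python
class StubCensusClient:
    """Stand-in for census.Census so this sketch runs without the package."""

    def __init__(self, key, year=None):
        self.key = key
        self.year = year


class SketchGeoTypeDownloader:
    """Simplified stand-in for BaseGeoTypeDownloader (not the library's code)."""

    def __init__(self, config, year):
        self.config = config
        self.year = year
        # Proposed hook: the constructor asks a method for the client
        # instead of instantiating the client class directly.
        self.api = self.get_api_client()

    def get_api_client(self):
        """Return the API client; subclasses may return a custom subclass."""
        return StubCensusClient(self.config.CENSUS_API_KEY, year=self.year)

    def get_raw_field_map(self):
        """Map raw field names to readable names (hypothetical config attribute)."""
        return dict(self.config.RAW_FIELD_MAP)

    def get_field_type_map(self):
        """Map raw field names to types that could be passed to pd.Series.astype()."""
        explicit = getattr(self.config, "FIELD_TYPES", None)
        if explicit is not None:
            return dict(explicit)
        # Assumed fallback: ACS-style names end in "E" or "M"; treat both as float.
        return {
            name: float
            for name in self.get_raw_field_map()
            if name.endswith(("E", "M"))
        }


class AcsLikeConfig:
    CENSUS_API_KEY = "not-a-real-key"
    RAW_FIELD_MAP = {"B01001_001E": "universe", "B01001_001M": "universe_moe"}


class OtherSourceConfig:
    CENSUS_API_KEY = "not-a-real-key"
    RAW_FIELD_MAP = {"RATE_FIELD": "response_rate"}  # hypothetical field name
    FIELD_TYPES = {"RATE_FIELD": float}  # explicit types win over the suffix check


class OtherSourceClient(StubCensusClient):
    """Hypothetical client subclass that would add an extra endpoint."""


class OtherSourceDownloader(SketchGeoTypeDownloader):
    def get_api_client(self):
        # Only the client construction changes; the rest is inherited.
        return OtherSourceClient(self.config.CENSUS_API_KEY, year=self.year)


acs = SketchGeoTypeDownloader(AcsLikeConfig, 2018)
other = OtherSourceDownloader(OtherSourceConfig, 2020)
```

With hooks like these, a subclass for a non-ACS source would override only the pieces that differ, rather than reimplementing the constructor and process().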

ghing commented

Here are some examples from customized subclasses I've implemented for my own data loading project to support additional sources. They might be useful to understand what I'm talking about in this issue.

I love the idea of integrating features you've added, but I think we'd probably be best off taking things on a case-by-case basis, with a clear vision for what new use case each individual change would allow.

Is there a feature addition you would propose for the end user? Is it supporting a data source beyond ACS? Something else?

ghing commented

@palewire to clarify, this is less about adding a feature or support for a specific data source in the CLI, and more about making backward-compatible changes to the Python API that would make it easier for users of the Python API to add support for other data sources from the Census Bureau's API in their own projects. These changes may also make it easier to add support for additional sources in this tool (i.e. #2).

I ran into the need for this when writing code to download and process data from the self-response rate endpoint. That support definitely doesn't need to be in this library/the CLI, but it would be great to make it easier to use code and conventions in this package to support consuming data from other Census API endpoints.

The changes I describe above address these two needs (for Python API users, not CLI users):

  • The ability to use a custom subclass of census.Census to add support for additional Census Bureau API endpoints that aren't currently supported by the census package.
  • The ability to properly interpret the field types for tables that don't follow the ACS field name suffix convention.

I'm not sure whether the approaches I've taken in my code are the best way to address these needs, but I wanted to document them in case you all have run into this internally when thinking about how to pull data from the Census API for sources that aren't ACS tables.

Gotcha. I'm not opposed to such changes, I just want them to be pegged to new features for the user of this library, which I think could bring some focus to the work. That way the edits aren't academic but are integrated with the code here from the start. In other words, I don't want to prematurely optimize.

For instance, if we set the goal of integrating the three-year and one-year samples from the ACS into this library, could adding that feature naturally also include some of the refactoring you propose?

ghing commented

Supporting other ACS releases wouldn't require these changes. Supporting decennial tables, like sf1, would require a way to specify field types on a per source/table basis.
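
To make the decennial case concrete, here is a small sketch of per-table field types. Everything in it is an assumption for illustration: the `resolve_field_types` and `cast_row` helpers, the `RAW_FIELD_MAP` and `FIELD_TYPES` attributes, and the config classes are hypothetical, and the field name is only meant to resemble a decennial-style name with no `E`/`M` suffix. It uses plain Python casting in place of pd.Series.astype() so it runs without pandas.

```python
def resolve_field_types(table_config):
    """Prefer explicit FIELD_TYPES; otherwise fall back to an ACS-style suffix check."""
    explicit = getattr(table_config, "FIELD_TYPES", None)
    if explicit is not None:
        return dict(explicit)
    # Assumed fallback: names ending in "E" or "M" are numeric.
    return {
        name: float
        for name in table_config.RAW_FIELD_MAP
        if name.endswith(("E", "M"))
    }


def cast_row(row, type_map):
    """Cast one raw row of string values; fields without a type stay strings."""
    return {key: type_map.get(key, str)(value) for key, value in row.items()}


class DecennialStyleConfig:
    # Illustrative decennial-style field name with no E/M suffix.
    RAW_FIELD_MAP = {"P001001": "total_population"}
    FIELD_TYPES = {"P001001": int}


class NoExplicitTypesConfig:
    RAW_FIELD_MAP = {"P001001": "total_population"}


raw_row = {"P001001": "1234", "NAME": "Example County"}

# With explicit types, the count becomes an integer.
typed = cast_row(raw_row, resolve_field_types(DecennialStyleConfig))

# Without them, the suffix check matches nothing and the value stays a string.
untyped = cast_row(raw_row, resolve_field_types(NoExplicitTypesConfig))
```

The second config shows why a suffix-only check falls short here: the field is silently left untyped instead of being converted.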

A hook to support a different client class wouldn't be required by either of these additions. That's only needed for supporting API data sources that aren't supported by the census package.

Got it. With the decennial census coming out this year, maybe it's a good time to figure out SF1. Have you integrated it downstream in any of your stuff?

ghing commented

I haven't integrated SF1 yet, but I'm likely going to be using some tables (e.g. P1) soon. I'll update this issue with any relevant findings or bits of example code.