TigerAppsOrg/pdata

Determine interface per dataset

Closed this issue · 4 comments

The implementation of a single dataset should follow a set interface, so we should determine what that actually consists of. For example, the following operations are definitely needed:

  • acquiring new data
  • refreshing existing data
  • checking permissions (this can be done within DRF)

And the following configuration:

  • period at which to update/fetch data
  • database models
  • serialization of data into an API
  • permission scheme

Although each dataset can exist as its own Django app, they should all conform to a single interface, at least for the operations they do not invoke themselves (e.g. the 'courses' dataset would define how to update its data and how to acquire new data, but it wouldn't actually perform those operations until told to do so externally).
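As a starting point, here is a minimal sketch of what such an interface could look like as an abstract base class. The class name `Dataset` and the attribute/method names (`update_period`, `fetch()`, `refresh()`, etc.) are made up for illustration; settling the actual names and the permission hook is part of this issue.

```python
from abc import ABC, abstractmethod
from datetime import timedelta


class Dataset(ABC):
    """Hypothetical interface every dataset app would implement."""

    # --- configuration ---
    update_period: timedelta        # how often the scheduler should refresh
    models = []                     # Django models backing the dataset
    serializer_class = None         # DRF serializer exposing the data as an API
    permission_classes = []         # DRF permission scheme

    # --- operations (invoked externally, e.g. by a scheduler) ---
    @abstractmethod
    def fetch(self):
        """Acquire new data from the upstream source."""

    @abstractmethod
    def refresh(self):
        """Update existing data already stored in the database."""
```

Each dataset app (courses, students, network data, etc.) would then subclass this, and the scheduler would only ever talk to the abstract interface.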

I think if you are attempting this, the most cost-efficient solution is to use a provider layer like the open-source Kong: https://getkong.org/

This layer can sit in front of the actual Django app and provide uniform permissioning, rate limiting, quotas, etc. on top of it.

Indeed, quotas, rate limiting, etc. are essential for a production-ready server, because you have to assume there will always be bad actors, or simply careless code that makes an enormous number of queries.
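To make this concrete, here is a hedged sketch of registering a dataset endpoint behind Kong's Admin API from Python. The service name, upstream URL, route path, and limits are placeholders, not anything defined in this project, and the exact endpoints depend on the Kong version deployed (older releases used `/apis/` instead of `/services/`):

```python
import requests

KONG_ADMIN = "http://localhost:8001"  # default Kong Admin API address (assumption)

# Register the Django endpoint as a Kong "service" (name/URL are placeholders).
requests.post(f"{KONG_ADMIN}/services", json={
    "name": "pdata-courses",
    "url": "http://django-app:8000/api/courses/",
})

# Expose it under a public route.
requests.post(f"{KONG_ADMIN}/services/pdata-courses/routes", json={
    "paths": ["/courses"],
})

# Apply uniform rate limiting / quotas on top of the Django app.
requests.post(f"{KONG_ADMIN}/services/pdata-courses/plugins", json={
    "name": "rate-limiting",
    "config": {"minute": 60, "hour": 1000},
})
```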

@michaeljfriedman would you like to take a look?

Yes I will look into this.

@jlumbroso I'll look into using Kong. However, what I meant was: in terms of the actual datasets, what operations would we need to support? For example, if I have the collection of "course" data, what would a manager of that dataset need to perform? The simplest approach is to just support updating that data; I'm not sure whether we should be more granular than that.

Specifically, imagine we have multiple datasets: courses, students, network data, etc. Each has associated data and exists as its own Django "app" within the main endpoint. To run each endpoint, we first need to load the initial data and then periodically update it. However, an update consists of multiple steps (adding new data, updating existing data, deleting removed data, etc.). The periodic updates are scheduled and run externally, so the dataset code would just define a function to perform the update.

If the dataset code is simply given responsibility for the whole update (and the scheduler just calls something like CourseEndpoint.update()), then it can do whatever is needed internally. However, it seems more efficient to expose the granular operations so they can be called individually.
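A rough sketch of both options, using made-up method names (`add_new`, `update_existing`, `delete_removed`) purely for illustration:

```python
class CourseEndpoint:
    """Hypothetical courses dataset; method names are illustrative only."""

    # Granular operations the scheduler could call individually.
    def add_new(self):
        """Insert records that appeared upstream since the last run."""

    def update_existing(self):
        """Refresh records we already have."""

    def delete_removed(self):
        """Drop records that no longer exist upstream."""

    # Coarse operation: the scheduler calls only this, and the dataset
    # decides internally how to compose the granular steps.
    def update(self):
        self.add_new()
        self.update_existing()
        self.delete_removed()
```

With the coarse version the scheduler stays trivial; with the granular version the scheduler can run, retry, or skip individual steps per dataset.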

Different datasets, however, seem to need different behaviours: for example, course data can be wiped and scraped anew (for a given semester), whereas network data would only ever add to, and never modify, the previous data.
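That difference could still live behind a single refresh entry point, with each dataset overriding it. A hedged sketch, building on the hypothetical `Dataset` interface above (the helper method and both strategies are placeholders, not existing project code):

```python
class CourseDataset(Dataset):
    """Wipe-and-rescrape strategy (e.g. per semester)."""

    def refresh(self):
        # Placeholder logic: drop the current semester's rows, then re-fetch.
        self.wipe_current_semester()
        self.fetch()

    def wipe_current_semester(self):
        ...  # would delete the Course rows for the active semester


class NetworkDataset(Dataset):
    """Append-only strategy: previous data is never modified."""

    def refresh(self):
        # Only new samples are stored; existing rows stay untouched.
        self.fetch()
```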