This repository contains files for exploring OpenActive data via Python. The main content can be found in the `app.py` file and the `cache/` directory, with the latter containing files generated by the former. It is recommended to use these existing files to quickly begin exploring the data, but you are also encouraged to refresh them to get the latest information for your own use when running elsewhere. You can always obtain the original files from here if you want to roll back from the versions you generate yourself. Note that there is a lot of metadata stored in the cached files that you may not wish to see, but the functions in `app.py` can be used to read and return the contents with a varying level of detail, as explained below. All other files herein are simply setup files for running in various ways in various environments. Explicitly, we have:
For all running options:
- `cache/`
- `app.py`

For running locally:
- `requirements.txt`

For running on Heroku:
- `Procfile`
- `requirements.txt`
- `runtime.txt`

For running on Binder:
- `environment.yml`
- `index.ipynb`
To run locally, first clone this repository to a destination of your choice, and make sure that you've installed the Python packages listed in `requirements.txt`. You may wish to do this in an encapsulated virtual environment used just for this code, to ensure that it runs as intended and is fully isolated from your base environment. The only thing that must be installed in your base environment is the `virtualenv` Python package, so if you use the `pip` package manager then do:
$ pip install virtualenv
Then set up the virtual environment and install the required packages for this code:
$ virtualenv virt
$ source virt/bin/activate
(virt) $ pip install -r requirements.txt
Note that some virtual environment files include the name of the directory in which the environment was created, so if you change the directory name at a later time then you will have to recreate the environment. This is done simply by deleting the existing `virt/` directory and running the above creation steps again.
Working with Python in the running virtual environment (either in the console, a Jupyter notebook, or a file), you can import and use the contents of `app.py` as follows (change `SOME_OUTPUT` and `SOME_FUNCTION` accordingly, based on the functions described below):
>>> import app as oa
>>> SOME_OUTPUT = oa.SOME_FUNCTION()
Alternatively, you can run the code as a Flask microservice, for which `app.py` is already set up. Again in the virtual environment established above, simply do the following:
(virt) $ python app.py
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
You can then access the function endpoints via a web browser or a tool like Postman. The endpoints are named after the functions that they run, so to run a function called `SOME_FUNCTION` you would visit http://127.0.0.1:5000/SOME_FUNCTION. Furthermore, as described below, many functions accept boolean keyword arguments. To apply these when running via Flask, append them as query parameters, visiting http://127.0.0.1:5000/SOME_FUNCTION?SOME_KEYWORD1=True&SOME_KEYWORD2=True, and so on.
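You can also call the endpoints programmatically. As a minimal sketch, assuming the third-party `requests` package is installed (it is not listed in `requirements.txt`) and that the endpoints return their output as JSON:

>>> import requests
>>> # The boolean keyword arguments described below are passed as query parameters
>>> response = requests.get("http://127.0.0.1:5000/get_opportunities", params={"doFlatten": "True", "doPath": "True"})
>>> opportunities = response.json()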
The content of this section is repeated in the `index.ipynb` Jupyter notebook for you to run yourself. If you would like to explore this further without installing anything, simply click the "launch binder" badge beneath the OpenActive logo at the top of this page, and a cloud based virtual machine running the notebook in your web browser will be prepared within a few minutes. You can then run the cells by selecting them one-by-one and pressing Shift-Enter, and add your own Python code in further cells to explore the data as you like. Note that the virtual machine will stop running after a few minutes of inactivity and cannot be restarted; you will have to begin again from scratch, so be sure to save any useful code that you generate if you do intend to leave the service idle for a while.
The underlying code for the functionality described below all exists in the `app.py` file, so feel free to explore the content there, or just use whichever function satisfies your required level of detail. Note that the output data has been simplified from the source data in many cases, and it is recommended to follow the source URLs directly if you need to see the raw content at each stage. Also note that everything presented here should be taken as a guide and exploratory sandbox rather than a standard official toolset.
Before we begin, we need to understand the nature of the data and how it is found in the wild. An "opportunity" is the basic OpenActive data block, which may represent anything from a series of activity sessions, to a single activity session, to an available booking slot for a facility like a tennis court. A data provider will gather together all opportunities of a certain type into a "feed", and gather all feeds from all opportunity types into a "dataset". Multiple datasets from various providers are bundled into a "catalogue", and these in turn are brought into a "collection", which is our starting point in seeking the opportunity data. The journey of Python functions and their outputs is then:
Function | Output |
---|---|
`get_catalogue_urls` | Catalogue URLs for the collection |
`get_dataset_urls` | Dataset URLs for each catalogue |
`get_feeds` | Feed info for each dataset |
`get_feed_urls` | Feed URLs for each dataset |
`get_opportunities` | Opportunity info for each feed |
The outputs are stored in memory and also as cached files. The full function chain does not need to be run manually, as any one function will automatically cause all of those before it to be run, so if you just want the opportunity data then you can jump straight to `get_opportunities` and ignore the other functions. We'll explore the full data gathering chain here simply for a complete illustration.
First we import the required modules, and define a simple printer function to show nested data with nice indentation, which is the only reason why `json` is imported to begin with:
import app as oa
import json

def printer(arg):
    print(json.dumps(arg, indent=4))
Now get the collection of catalogue URLs:
>>> catalogueUrls = oa.get_catalogue_urls()
>>> printer(catalogueUrls)
[
"https://opendata.leisurecloud.live/api/datacatalog",
"https://openactivedatacatalog.legendonlineservices.co.uk/api/DataCatalog",
"https://openactive.io/data-catalogs/singular.jsonld",
"https://app.bookteq.com/api/openactive/catalogue"
]
Then get the dataset URLs for each catalogue:
>>> datasetUrls = oa.get_dataset_urls()
>>> printer(datasetUrls)
{
"https://opendata.leisurecloud.live/api/datacatalog": [
"https://api.activenewham.org.uk/OpenActive/",
"https://booking.1life.co.uk/OpenActive/",
"https://castlepoint.leisurecloud.net/OpenActive/",
...
],
...
}
Then get the feed info for each dataset:
>>> feeds = oa.get_feeds()
>>> printer(feeds)
{
"https://opendata.leisurecloud.live/api/datacatalog": {
"https://api.activenewham.org.uk/OpenActive/": [
{
"url": "https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-session-series",
"kind": "SessionSeries",
"datasetName": "activeNewham Sessions and Facilities",
"datasetPublisherName": "activeNewham",
"discussionUrl": "https://github.com/gladstonemrm/activeNewham/issues",
"licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
},
...
],
...
},
...
}
Then get the feed URLs for each dataset:
>>> feedUrls = oa.get_feed_urls()
>>> printer(feedUrls)
{
"https://opendata.leisurecloud.live/api/datacatalog": {
"https://api.activenewham.org.uk/OpenActive/": [
"https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-session-series",
"https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-scheduled-sessions",
"https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-facility-uses",
"https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-slots"
],
...
},
...
}
Then finally get the opportunity info for each feed. Note that an opportunity in the source data always has a field called "state", and the output opportunities from this program are those for which the value of this field is not "deleted". Usually this means that "state" has a value of "updated", in which case this field is not stored in the output opportunity info, to avoid constant repetition. However, some outliers currently exist for which "state" in the source data is neither "deleted" nor "updated", in which case the "state" field is included in the output opportunity info to allow further investigation:
>>> opportunities = oa.get_opportunities()
>>> printer(opportunities)
{
"https://opendata.leisurecloud.live/api/datacatalog": {
"https://api.activenewham.org.uk/OpenActive/": {
"https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-session-series": [
{
"id": "B2CLJNR16000816",
"kind": "SessionSeries",
"name": "Junior Gym Tues 4pm",
"latitude": 51.523460797563594,
"longitude": 0.02305090427398682
},
...
],
...
},
...
},
...
}
By default, the output at any stage has a nested structure that shows the data gathering path. The output can be either flattened to show only the terminal lists all tied together as one long list, or expanded to show metadata that includes the sub-list counts and time of last refresh. It is this latter form which is actually present in the variables passed between functions behind the scenes, and in the cached files too.
For the feed info and the opportunity info, the terminal lists contain dictionary elements, and we can choose to include the path URLs that form the outer dictionary keys of the default output structure if we wish. This is particularly useful when the output is flattened, as the path information would otherwise be obscured.
To do these flattening, metadata and path actions, we use the boolean `doFlatten`, `doMetadata` and `doPath` keyword arguments, respectively. Note that if both `doFlatten` and `doMetadata` are set to `True`, then the former takes precedence. Let's have a quick look at the catalogue data again, but this time with the metadata shown too:
>>> catalogueUrlsMeta = oa.get_catalogue_urls(doMetadata=True)
>>> printer(catalogueUrlsMeta)
{
"metadata": {
"counts": 4,
"timeLastUpdated": "2023-02-22 19:21:45.322141"
},
"data": [
"https://opendata.leisurecloud.live/api/datacatalog",
"https://openactivedatacatalog.legendonlineservices.co.uk/api/DataCatalog",
"https://openactive.io/data-catalogs/singular.jsonld",
"https://app.bookteq.com/api/openactive/catalogue"
]
}
Now let's get the opportunity info again, but this time using the `doFlatten` and `doPath` keywords to flatten the structure into a single list of dictionaries, and to incorporate the path URLs into each dictionary:
>>> opportunitiesFlat = oa.get_opportunities(doFlatten=True, doPath=True)
>>> printer(opportunitiesFlat)
[
{
"id": "B2CLJNR16000816",
"kind": "SessionSeries",
"name": "Junior Gym Tues 4pm",
"latitude": 51.523460797563594,
"longitude": 0.02305090427398682,
"catalogueUrl": "https://opendata.leisurecloud.live/api/datacatalog",
"datasetUrl": "https://api.activenewham.org.uk/OpenActive/",
"feedUrl": "https://opendata.leisurecloud.live/api/feeds/ActiveNewham-live-live-session-series"
},
...
]
How many have we got?
>>> len(opportunitiesFlat)
44554
That's a lot of opportunities!
Note that in the output feed info and opportunity info, if a field is present in the source data then it is included in the output even if its value is blank; a field is omitted from the output only if it isn't present at all in the source data. The alternative for the latter case would be to still include the field but with a blank value, but that would give a lot of wasted space. So it's worth noting the full set of possible fields here, seeing as any one dictionary in the above outputs won't necessarily contain all the options. For the feed info we have these options:
- url
- kind
- datasetName
- datasetPublisherName
- discussionUrl
- licenseUrl
Plus these extras if `doPath` is set to `True`:
- catalogueUrl
- datasetUrl
And for the opportunity info we have these options:
- state (only included if it doesn't have the standard expected value of "updated")
- id
- kind
- name
- activityPrefLabel
- activityId
- latitude
- longitude
Plus these extras if `doPath` is set to `True`:
- catalogueUrl
- datasetUrl
- feedUrl (which is just "url" in the feed info)
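As a quick illustrative sketch of working with these fields, here is one way to tally the flattened opportunities from above by their kind, using only the standard library. Since a field is only included when it is present in the source data, `.get()` is used rather than direct indexing:

>>> from collections import Counter
>>> # Tally opportunities by "kind" (e.g. "SessionSeries"); .get() returns None when the field is absent
>>> kindCounts = Counter(opportunity.get("kind") for opportunity in opportunitiesFlat)
>>> kindCounts.most_common()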
Finally, to refresh the output of any stage we can use the `doRefresh` keyword argument and set it to `True`. This refreshes the data cached in memory and in files, not only for the particular function to which the keyword is applied but also for all those before it in the data gathering chain. So, for example, if we refresh via `get_dataset_urls`, then both the catalogue URLs and the dataset URLs will be refreshed, but not the feed info nor the opportunity info. But if we refresh via `get_opportunities` then all data will be refreshed, as this function sits at the very end of the chain. The more of the chain that is refreshed, the longer it will take, up to a few minutes in the case of `get_opportunities`, seeing as it requires the most work.
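For example, to re-fetch everything up to and including the dataset URLs while leaving the cached feed info and opportunity info untouched:

>>> # Refreshes the catalogue URLs and dataset URLs, both in memory and in the cached files
>>> datasetUrls = oa.get_dataset_urls(doRefresh=True)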