Define a federated API for Open Terms Archive collections
Context and Problem Statement
Open Terms Archive is a decentralised system that tracks collections of services and documents across multiple servers. Each collection operates its own API, which exposes the services and terms it tracks, but this decentralisation means that one has to search across all these APIs to identify which services and documents are currently tracked.
We propose the creation of a federated API to enable easy querying of the distributed database and thus facilitate collaboration with external applications.
Proposed solution
Base URL
http://api.opentermsarchive.org/:version
Endpoints
Note: The `failures` object is detailed below in the Error Handling section.
GET /collections
Enumerate all collections
Returns
A JSON array of all collections
Example
GET /collections
[
{
"id": "collection-1",
"name": "Collections 1",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
"fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
},
"url": "162.162.162.162",
"maintainers": [
{
"name": "Open Evidence",
"url": "https://open-evidence.com/"
},
{
"name": "European Commission",
"url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
}
]
},
{
"id": "collection-2",
"name": "Collections 2",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Services needed to operate the Open Terms Archive engine",
"fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
},
"url": "162.162.162.162",
"maintainers": [
{
"name": "Open Terms Archive",
"url": "https://opentermsarchive.org"
}
]
}
]
GET /services?searchName=:searchName
Parameters
Parameter | Type | Description |
---|---|---|
searchName | URL-encoded string | The string to search for in service names |
Returns
A JSON array of all matching services across all collections with the URL where they can be found.
Returns all services if no `searchName` param is passed.
Returns an empty array if no matching service is found.
Example
GET /services?searchName=tube
{
"results": [
{
"collection": "demo",
"service": {
"id": "peartube",
"name": "PEARTUBE",
"url": "http://173.173.173.173/api/v1/service/peartube",
"termsTypes": [ "Terms of Service"]
}
},
{
"collection": "contrib",
"service": {
"id": "yourtube",
"name": "YourTube",
"url": "http://162.162.162.162/api/v1/service/yourtube",
"termsTypes": [ "Terms of Service", "Privacy Policy"]
}
}
],
"failures": []
}
GET /service/:serviceId
Retrieve every service with the given ID, across all collections
Parameters
Parameter | Type | Description |
---|---|---|
serviceId | URL-encoded string | The ID of the service. |
Returns
A JSON array of services with the given ID across all collections with the URL where they can be found.
Returns an `HTTP 404` if no matching service is found.
Example
GET /service/service1
{
"results": [
{
"collection": "demo",
"service": {
"id": "service1",
"name": "Service 1",
"url": "http://173.173.173.173/api/v1/service/service1",
"termsTypes": [ "Terms of Service"]
}
},
{
"collection": "contrib",
"service": {
"id": "service1",
"name": "Service 1",
"url": "http://162.162.162.162/api/v1/service/service1",
"termsTypes": [ "Terms of Service", "Privacy Policy"]
}
}
],
"failures": []
}
Notes
Duplicates
We have considered multiple duplicate resolution solutions (specifying priority order as query params, defining an arbitrary priority based on data quality, returning an arbitrary result with a key `alternatives` to other results, using HTTP code `300 Multiple Choices`, …) but we have come to the conclusion that they do not align with our fundamental philosophy of decentralization and resilience. The idea is therefore to embrace the fact that it is possible to have the same service declared in multiple collections and thus to always return an array of results.
Error Handling
To handle errors in the underlying APIs, the idea is to return a `failures` array containing objects describing the collection that failed and why. For example:
{
"results": [
…
],
"failures": [
{
"collection": "demo",
"message": "The API service encountered an internal error while processing the request."
},
{
"collection": "contrib",
"message": "The API is currently unreachable."
}
]
}
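The error handling described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation: `aggregate_services` and the two fetcher functions are hypothetical names, and each collection is modelled as a callable that either returns its services or raises.

```python
from typing import Callable, Dict, List


def aggregate_services(fetchers: Dict[str, Callable[[], List[dict]]]) -> dict:
    """Query every collection, keeping successes and failures apart."""
    results = []
    failures = []
    for collection_id, fetch in fetchers.items():
        try:
            services = fetch()
        except Exception as error:
            # A failing collection is reported but does not abort the others.
            failures.append({"collection": collection_id, "message": str(error)})
            continue
        for service in services:
            results.append({"collection": collection_id, "service": service})
    return {"results": results, "failures": failures}


# Hypothetical fetchers standing in for HTTP calls to the underlying APIs.
def fetch_demo() -> List[dict]:
    return [{"id": "service1", "name": "Service 1"}]


def fetch_contrib() -> List[dict]:
    raise ConnectionError("The API is currently unreachable.")


response = aggregate_services({"demo": fetch_demo, "contrib": fetch_contrib})
```

With this shape, a client can always read `results` and decide for itself whether a non-empty `failures` array matters for its use case.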
Compatibility with different underlying API versions
By definition, a federated API may interact with multiple versions of underlying APIs. To effectively manage this, the proposed approach is to only gather the necessary fields and directly provide the resource URL in the underlying API. Moreover, to allow the client to determine the shape of the result, it is proposed to include the API version in the response headers of each underlying API.
Naming convention for collection ID
As the collection ID will then become a differentiating element that should be easy to handle with scripts and other tools, we suggest the following naming convention:
- Non-ASCII characters are not supported; they should be normalized into ASCII.
  - Example: `france-élections` → `france-elections`.
- Capitals and spaces are not supported. The ID should be in lowercase and kebab case (spaces are replaced with a dash `-`).
  - Example: `France Elections` → `france-elections`.
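The naming convention above could be applied by a small helper like the one below. This is a sketch, not part of the RFC: the function name `normalize_collection_id` is hypothetical, and it assumes that stripping diacritics through Unicode decomposition is an acceptable way to "normalize into ASCII".

```python
import re
import unicodedata


def normalize_collection_id(name: str) -> str:
    """Turn a free-form collection name into a lowercase, kebab-case ASCII ID."""
    # Decompose accented characters (é -> e + combining accent),
    # then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, and replace each run of whitespace with a single dash.
    return re.sub(r"\s+", "-", ascii_only.strip().lower())


first_id = normalize_collection_id("france-élections")
second_id = normalize_collection_id("France Elections")
```

Note that characters with no ASCII decomposition (for example `ø`) are simply dropped by this approach, so a real implementation might prefer an explicit transliteration table.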
I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side. The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?
Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?
Thank you!
Hi @madoleary,
> I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side.
The idea is to leave each client of the federated API responsible for handling duplicates: the API returns all the results and the client chooses the collection from which it wants to obtain the document.
I think Open Terms Archive does not aim to be an intermediary that makes crucial choices for federated API clients, such as which collection should be considered more reliable than another.
> The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?
As mentioned in the RFC, the idea of "defining an arbitrary priority based on data quality" was not retained, so a priori the question of a data quality criterion will not be addressed on the OTA side.
> Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?
For example, a result for a query like `GET /service/facebook` could look like this:
{
"results": [
{
"collection": "pga",
"service": {
"id": "facebook",
"name": "Facebook",
"url": "http://173.173.173.173/api/v1/service/facebook",
"termsTypes": [ "Terms of Service", "Privacy Policy", "Developer Terms", "Trackers Policy", "Data Processor Agreement"]
}
},
{
"collection": "contrib",
"service": {
"id": "facebook",
"name": "Facebook",
"url": "http://162.162.162.162/api/v1/service/facebook",
"termsTypes": [ "Terms of Service", "Privacy Policy"]
}
}
],
"failures": []
}
And on your side, you could define that you prefer to use data from the `pga` collection because this collection is dedicated to tracking only gatekeepers with a high quality of maintenance whereas the `contrib` collection has no clearly defined maintainers. Another element of choice for you could be that the `pga` collection has more types of terms tracked for the `Facebook` service. It's up to you 🙂.
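As an illustration of such a client-side choice, here is a minimal sketch. The function `pick_result` and its fallback rule (most terms types tracked) are assumptions made for the example; the RFC deliberately leaves this policy to each client.

```python
def pick_result(results, preferred_collections):
    """Pick one result among duplicates returned by the federated API.

    Collections listed in `preferred_collections` win, in order; otherwise
    the result tracking the most terms types is chosen.
    """
    by_collection = {result["collection"]: result for result in results}
    for collection_id in preferred_collections:
        if collection_id in by_collection:
            return by_collection[collection_id]
    return max(results, key=lambda result: len(result["service"]["termsTypes"]), default=None)


facebook_results = [
    {"collection": "pga", "service": {"id": "facebook", "name": "Facebook", "termsTypes": [
        "Terms of Service", "Privacy Policy", "Developer Terms", "Trackers Policy", "Data Processor Agreement"]}},
    {"collection": "contrib", "service": {"id": "facebook", "name": "Facebook", "termsTypes": [
        "Terms of Service", "Privacy Policy"]}},
]

chosen_by_preference = pick_result(facebook_results, ["contrib"])
chosen_by_coverage = pick_result(facebook_results, [])
```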
Thanks @Ndpnt for this clear RFC!
Proposition 1.B
This is a suggested improvement of proposition 1 (initially posted) on `GET /collections`.
GET /collections
The provided `url` examples are just a hostname (`162.162.162.162`). I believe they should be full-fledged URLs to the base endpoint of the API (`http://162.162.162.162/api`) so that API calls can be programmatically written. We should also specify in the spec that it has no trailing slash.
[
{
"id": "collection-1",
"name": "Collections 1",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
"fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
},
- "url": "162.162.162.162",
+ "url": "http://162.162.162.162/api",
"maintainers": [
{
"name": "Open Evidence",
"url": "https://open-evidence.com/"
},
{
"name": "European Commission",
"url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
}
]
},
{
"id": "collection-2",
"name": "Collections 2",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Services needed to operate the Open Terms Archive engine",
"fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
},
- "url": "162.162.162.162",
+ "url": "https://api.ota.openmirrors.example/arbitrary/long/path",
"maintainers": [
{
"name": "Open Terms Archive",
"url": "https://opentermsarchive.org"
}
]
}
]
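To show why a full base URL without a trailing slash makes calls easy to write programmatically, here is a small sketch. The helper name `service_url` is hypothetical, and the `/v1/service/:id` path layout is taken from the example URLs in this thread rather than from a confirmed specification.

```python
from urllib.parse import quote


def service_url(collection_url: str, service_id: str, version: str = "v1") -> str:
    """Build the URL of a service from a collection base URL."""
    if collection_url.endswith("/"):
        # The proposed spec says the base URL has no trailing slash.
        raise ValueError("collection URL must not end with a trailing slash")
    # Percent-encode the ID so that it is safe as a single path segment.
    return f"{collection_url}/{version}/service/{quote(service_id, safe='')}"


first_url = service_url("http://162.162.162.162/api", "peartube")
second_url = service_url("https://api.ota.openmirrors.example/arbitrary/long/path", "yourtube")
```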
Proposition 2
This is an alternative to proposition 1 (initially posted) on GET /services?searchName=:searchName
GET /services/search?name=:searchName
My rationale is to prefer a `/services/search` route with a `?name` query string, as this feels more future-proof with regard to other future routes: we don't reserve query parameters at `/services` level, and we avoid repeating `search` in a query parameter name if we, for example, add support for searching by ID or for fuzzy search in the future.
Parameters
| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
- | searchName | URL-encoded string | The string to search for in service names |
+ | name | URL-encoded string | The string to search for in service names |
Returns
A JSON array of all matching services across all collections with the URL where they can be found.
Returns all services if no `name` param is passed.
Returns an empty array if no matching service is found.
Example
- GET /services?searchName=tube
+ GET /services/search?name=tube
{
"results": [
{
"collection": "demo",
"service": {
"id": "peartube",
"name": "PEARTUBE",
- "url": "http://173.173.173.173/api/v1/service/peartube",
+ "url": "http://162.162.162.162/api/v1/service/peartube",
"termsTypes": [ "Terms of Service"]
}
},
{
"collection": "contrib",
"service": {
"id": "yourtube",
"name": "YourTube",
- "url": "http://162.162.162.162/api/v1/service/yourtube",
+ "url": "https://api.ota.openmirrors.example/arbitrary/long/path/v1/service/yourtube",
"termsTypes": [ "Terms of Service", "Privacy Policy"]
}
}
],
"failures": []
}
I think that the `?name` query string is a good suggestion.
I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.
Also: is there a specific message returned when a service is not found?
> Also: is there a specific message returned when a service is not found?
Sorry, I see the HTTP 404 note!
Thanks @MattiSG for your propositions.
I fully agree with Proposition 1.B.
For proposition 2:
- I'm in favor of renaming the query string to `name`.
- For the route `/services/search`, I think having this route is less in line with the REST philosophy than `/services?name=:searchName`.
REST encourages the use of URLs that represent resources, which are represented by nouns, whereas actions are represented by HTTP methods. Yet with a route like `/services/search?name=:searchName`, it really looks like `search` is an action on the collection of `services` resources. We could think of it as a resource, but I think that is not what first comes to mind. I think it is more RESTful to think: "There is a `services` collection where I apply some filters", hence a route like `/services?name=:searchName`.
> I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.
As I suggest that the search action be only a filter on the `services` collection, for me the response object will look exactly the same. And if you need to retrieve all the services for each collection, we could add a `collection` query string to allow filtering on the collection ID as well.
Proposition 3
This is a suggested improvement on proposition one `GET /services?name=:searchName`, initially posted as `GET /services?searchName=:searchName`.
GET /services?name=:searchName&termsType=:termsType
The idea is to add the ability to query by `termsType`, so that the results can be filtered by both service name and terms type. This is to avoid having to iterate through all service results and verify their `termsTypes` fields at each iteration, just to locate a specific terms type within a specific service.
Details
Parameters
Parameter | Type | Description |
---|---|---|
name | URL-encoded string | The string to search for in service names |
termsType | URL-encoded string | The terms type to search for in the terms types tracked by each service |
Returns
A JSON array of all matching services across all collections that also include the terms type, as indicated by the `termsType` query param, in their `termsTypes` fields.
Returns all matching services if no `termsType` param is passed.
Returns an empty array if no matching service with the terms type is found.
Example
GET /services?name=facebook&termsType=cookies%20policy
{
"results": [
{
"collection": "contrib",
"service": {
"id": "facebook",
"name": "Facebook",
"url": "http://162.162.162.162/api/v1/service/facebook",
"termsTypes": ["Terms of Service", "Cookies Policy"]
}
}
],
"failures": []
}
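The filtering described in this proposition can be sketched as follows. The function `filter_services` is hypothetical, and the matching rules are assumptions inferred from the examples in this thread: a case-insensitive substring match on the service name (`tube` matches `PEARTUBE`) and a case-insensitive exact match on the terms type (`cookies policy` matches `Cookies Policy`).

```python
def filter_services(results, name=None, terms_type=None):
    """Keep results whose service name contains `name` and that track `terms_type`."""
    def matches(result):
        service = result["service"]
        if name is not None and name.lower() not in service["name"].lower():
            return False
        if terms_type is not None:
            tracked = {tracked_type.lower() for tracked_type in service["termsTypes"]}
            if terms_type.lower() not in tracked:
                return False
        return True

    return [result for result in results if matches(result)]


all_results = [
    {"collection": "pga", "service": {"id": "facebook", "name": "Facebook",
                                      "termsTypes": ["Terms of Service", "Privacy Policy"]}},
    {"collection": "contrib", "service": {"id": "facebook", "name": "Facebook",
                                          "termsTypes": ["Terms of Service", "Cookies Policy"]}},
]
filtered = filter_services(all_results, name="facebook", terms_type="cookies policy")
```

Omitting both parameters returns every result unchanged, which matches the "filters on a collection" reading of the route.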
Hi @madoleary,
Thanks for your proposition 3. I would make a minor change by allowing multiple terms types to be given, like this:
Proposition 3.B
GET /services?name=:searchName&termsTypes=:termsType1,termsType2
Details
Parameters
Parameter | Type | Description |
---|---|---|
name | URL-encoded string | The string to search for in service names |
termsTypes | URL-encoded string | The comma-separated string that represents the array of terms types to search for |
Returns
A JSON array of all matching services across all collections that also include the terms types, as indicated by the `termsTypes` query param, in their `termsTypes` fields.
Returns all matching services if no `termsTypes` param is passed.
Returns an empty array if no matching service with the terms types is found.
Example
GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service
{
"results": [
{
"collection": "contrib",
"service": {
"id": "facebook",
"name": "Facebook",
"url": "http://162.162.162.162/api/v1/service/facebook",
"termsTypes": ["Terms of Service", "Cookies Policy"]
}
}
],
"failures": []
}
Love it!
> I think it is more RESTful to think: "There is a `services` collection where I apply some filters"
💯
Thank you both for your contributions, I fully support 3.B!
Hi everyone,
This RFC has received no further feedback for a month, so I think we can conclude that proposal 3.B seems acceptable to everyone and will therefore be implemented.
Thanks again for your contributions 🙏 .
Please note that we will probably not be able to work on its implementation for a few weeks, as we have a lot of things to handle this month.
Thanks @Ndpnt!
It's not entirely clear to me what will be implemented: 3.B is concerned with `GET /services?name=:searchName&termsTypes=:termsType1,termsType2`. What about `GET /service/:serviceId` (proposition 2? With your further amendments?) and `GET /collections` (1 or 1.B?)? 🤔 What is the final proposed API layout?
Proposed final API layout:
GET /collections
Returns
A JSON array of all collections
Example
GET /collections
[
{
"id": "collection-1",
"name": "Collections 1",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
"fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
},
"url": "http://162.162.162.162/api",
"maintainers": [
{
"name": "Open Evidence",
"url": "https://open-evidence.com/"
},
{
"name": "European Commission",
"url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
}
]
},
{
"id": "collection-2",
"name": "Collections 2",
"languages": ["en"],
"jurisdictions": ["EU"],
"industries": {
"en": "Services needed to operate the Open Terms Archive engine",
"fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
},
"url": "https://api.ota.openmirrors.example/arbitrary/long/path",
"maintainers": [
{
"name": "Open Terms Archive",
"url": "https://opentermsarchive.org"
}
]
}
]
GET /services?name=:searchName&termsTypes=:termsType1,termsType2
Details
Parameters
Parameter | Type | Description |
---|---|---|
name | URL-encoded string | The string to search for in service names |
termsTypes | URL-encoded string | The comma-separated string that represents the array of terms types to search for |
Returns
A JSON array of all matching services across all collections that also include the terms types, as indicated by the `termsTypes` query param, in their `termsTypes` fields.
Returns all matching services if no `termsTypes` param is passed.
Returns an empty array if no matching service with the terms types is found.
Example
GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service
{
"results": [
{
"collection": "contrib",
"service": {
"id": "facebook",
"name": "Facebook",
"url": "http://162.162.162.162/api/v1/service/facebook",
"termsTypes": ["Terms of Service", "Cookies Policy"]
}
}
],
"failures": []
}
GET /service/:serviceId
Parameters
Parameter | Type | Description |
---|---|---|
serviceId | URL-encoded string | The ID of the service. |
Returns
A JSON array of services with the given ID across all collections with the URL where they can be found.
Returns an `HTTP 404` if no matching service is found.
Example
GET /service/service1
{
"results": [
{
"collection": "demo",
"service": {
"id": "service1",
"name": "Service 1",
"url": "http://173.173.173.173/api/v1/service/service1",
"termsTypes": [ "Terms of Service"]
}
},
{
"collection": "contrib",
"service": {
"id": "service1",
"name": "Service 1",
"url": "http://162.162.162.162/api/v1/service/service1",
"termsTypes": [ "Terms of Service", "Privacy Policy"]
}
}
],
"failures": []
}
Much clearer, thank you very much! 😃
In 3.B (#1016 (comment)), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃
@Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?
We also did not specify what happens if `/services` is called with no parameter at all. I suggest it sends a `400 Bad Request` error, as we don't want the federated API to proceed with aggregating every existing declaration.
> In 3.B (#1016 (comment)), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃
> @Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?
My intention was to make it possible to search for a service containing at least the specified terms types, in order to help me find the most appropriate collection for the terms types I was interested in. So for me, it was an AND logical operator for terms types.
> We also did not specify what happens if `/services` is called with no parameter at all. I suggest it sends a `400 Bad Request` error, as we don't want the federated API to proceed with aggregating every existing declaration.
I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services, and when we do, we'll be able to set up pagination. It's important to bear in mind that this means just one request to each collection API and not a request per service.
After some discussion, it seems that we don't currently have a use case for searching with multiple terms types on `/services`, so we'll revert to a single `termsType` parameter.
If we have no `results` but all collections have `failures`, is that still a 404 or is that a 502 at some point? 🤔
> it seems that we don't currently have a use case for searching with multiple term types
Complement note: we also found that all hypothetical use cases (AND, OR) could be implemented with the basic function provided here and a tiny bit of client-side logic. It will always be time to add more power to the API later on when we gather more understanding of most usual use cases 🙂
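For instance, the client-side logic mentioned here could look like the sketch below. It assumes the client has already issued one `termsType` query per terms type and holds each `results` array; the names `service_key` and `combine` are hypothetical.

```python
def service_key(result):
    """Identify a service declaration by its collection and its service ID."""
    return (result["collection"], result["service"]["id"])


def combine(responses, mode="and"):
    """Combine the `results` arrays of several single-termsType queries.

    mode="and" keeps services present in every response,
    mode="or" keeps services present in at least one.
    """
    keyed = [{service_key(result): result for result in results} for results in responses]
    if not keyed:
        return []
    keys = set(keyed[0])
    for mapping in keyed[1:]:
        keys = keys & set(mapping) if mode == "and" else keys | set(mapping)
    merged = {}
    for mapping in keyed:
        merged.update(mapping)
    # Sort for a deterministic order.
    return [merged[key] for key in sorted(keys)]


def make_result(collection):
    return {"collection": collection, "service": {"id": "facebook", "name": "Facebook"}}


cookies_results = [make_result("contrib"), make_result("pga")]
privacy_results = [make_result("pga"), make_result("demo")]

tracking_both = combine([cookies_results, privacy_results], mode="and")
tracking_either = combine([cookies_results, privacy_results], mode="or")
```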
> I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services
After discussion I agree, this was premature optimisation on my side. This “no parameter” route is very easy to cache. If it becomes very popular and the contents grow big, we can just decrease the poll rate and warn that this route only updates every hour / every day…
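A rough sketch of how such caching could work is given below. It is a simplification rather than the planned design: instead of polling in the background, it refreshes lazily on the first request after expiry, and the class name `CachedIndex`, the `load` callable and the injectable `clock` are all hypothetical.

```python
import time


class CachedIndex:
    """Serve a cached value and reload it at most once per `ttl_seconds`."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Only here are the underlying collection APIs actually queried.
            self._value = self._load()
            self._expires_at = now + self._ttl
        return self._value


# Demonstration with a fake clock and a loader that counts its calls.
load_calls = []
fake_now = [0.0]


def load_all_services():
    load_calls.append("called")
    return {"results": [], "failures": []}


index = CachedIndex(load_all_services, ttl_seconds=3600, clock=lambda: fake_now[0])
index.get()
index.get()
calls_within_ttl = len(load_calls)
fake_now[0] = 3600.0
index.get()
calls_after_expiry = len(load_calls)
```

Raising `ttl_seconds` is then the equivalent of decreasing the poll rate if the route becomes popular.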
Hi all, I appreciate the discussion about multiple terms types. In my specs, I only have us searching for one terms type at a time, e.g., cookies policy. I, too, don't think searching for multiple terms types is necessary. I also think all services should be returned on `/services`. I think that's more like the RESTful behavior I've seen.