okfn/ibp-explorer

Data from additional API endpoint

pwalsh opened this issue · 10 comments

Description

Tasks

  • Check the new API
    • Ensure data structure is exactly the same, and if there are any discrepancies, list them here
  • Work out how we add this to our client code, and call different data from different APIs!
    • Provide an estimate of extra work incurred by the data coming from multiple APIs

There are multiple issues with the new API:

  • India, Italy, Malaysia and Norway are each present in 8 objects. The 7 extra objects per country are also irregular: each contains only one document state in obi.availability['2016'] instead of all 8.
  • None of the new countries contains the library link (this may be fine if all country names are the same on Drive and in the API).
  • The Google Drive library for these countries contains files (for previous years) that are not available in the API.
  • QUESTION: Can we have Google file IDs or full paths to documents when they are uploaded, so that we don't have to match document filenames manually? This would also solve the problem of the same filename appearing in multiple folders in the Drive.
  • Armenia, Côte d’Ivoire, Greece and Palestine don't contain an obi object.
  • Bolivia, Congo and Tanzania don't contain a 2016 object in obi.availability (are these countries discontinued?)
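
The structural checks listed above could be automated with something like the following. This is only a rough sketch: `auditCountryEntry` and `EXPECTED_TYPE_COUNT` are hypothetical names, and it assumes that each country object exposes `obi.availability[year]` as an object keyed by document type, as described in this thread.

```javascript
// Hypothetical audit helper (not part of the tracker code).
// Assumption: a complete year in obi.availability holds one state
// for each of the 8 document types.
const EXPECTED_TYPE_COUNT = 8;

function auditCountryEntry(entry, year = '2016') {
  const problems = [];
  if (!entry.obi) {
    problems.push('missing obi object');
    return problems;
  }
  const availability = entry.obi.availability || {};
  const states = availability[year];
  if (!states) {
    problems.push(`missing ${year} object in obi.availability`);
    return problems;
  }
  const count = Object.keys(states).length;
  if (count !== EXPECTED_TYPE_COUNT) {
    problems.push(
      `${year} availability has ${count} document state(s), expected ${EXPECTED_TYPE_COUNT}`
    );
  }
  return problems;
}
```

An empty array means none of the three checks found a problem for that entry.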

The new countries still don't contain obi_scores, documents and snapshots, so we can't check those.

As for using the two APIs, I don't think that is required. From all I have seen, the new API provides the same content but updated, so I fail to see why we should implement two APIs in the tracker.

As I mentioned previously, the API will not do snapshots, so I'm opening a new issue for that.

An example where the old and new APIs contain the same data: http://pastebin.com/uDHefX1r

Currently blocked: we are waiting on fixes for the API from the API provider.

Checked the fixes from the API provider (only the countries endpoint). In this comment I will refer to the APIs by their domain names: the first API we were using (historic Aquarium and Spring/Summer 2016 data) will be amida-tech, and the new one (2017 data) will be indabaplatform. There are new issues with the data:

  • I'm unable to tell which API we will continue to use, because both APIs got updated.

    amida-tech now has 105 unique countries
    indabaplatform has 117 unique countries

    These countries are present in indabaplatform and not in amida-tech:

    • Australia
    • Canada
    • Burundi
    • Congo, Dem. Rep.
    • Comoros
    • Japan
    • Moldova
    • Liberia
    • Korea, Rep.
    • Paraguay
    • Slovak Republic
    • Venezuela, RB
    • Cote d'Ivoire

    It turns out some of these countries are present under different names in indabaplatform:

    • Côte d’Ivoire - Cote d'Ivoire
    • Dem. Rep. of Congo - Democratic Republic of Congo
    • Slovak Republic - Slovakia
    • Venezuela - Venezuela, RB

    Other observed errors:

    • One entry of Liberia contains a trailing whitespace.
    • Somalia is present in amida-tech and not in indabaplatform (there was no mention of dropping Somalia)
  • Duplicate entries are present in both of the APIs.

    • amida-tech
      • Somalia 8 times
    • indabaplatform
      • India 8
      • Italy 8
      • Malaysia 8
      • Norway 8
      • Australia 8
      • Canada 8
      • Burundi 8
      • Congo, Dem. Rep. 8
      • Comoros 3
      • Japan 8
      • Moldova 5
      • Liberia 8
      • Korea, Rep. 8
      • Paraguay 8
      • Slovak Republic 8
      • Venezuela, RB 8
      • Cote d'Ivoire 8
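
For reference, counts and differences like the ones above can be reproduced with a few lines of JavaScript. This is a minimal sketch, assuming each object returned by the countries endpoint carries its name in a `country` field (that field name is an assumption, as are the function names). Names are trimmed before counting, so an entry with trailing whitespace is counted together with its clean twin.

```javascript
// Count entries per country name. Trimming means "Liberia " and
// "Liberia" are treated as the same country.
// Assumption: the name lives in entry.country.
function countCountryEntries(entries) {
  const counts = new Map();
  for (const entry of entries) {
    const name = String(entry.country).trim();
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return counts;
}

// Return [name, count] pairs for every country that appears more than once.
function findDuplicateCountries(entries) {
  return [...countCountryEntries(entries)].filter(([, count]) => count > 1);
}

// Return names present in the first list but not in the second.
// Names are compared literally after trimming, so spelling variants
// such as "Slovak Republic" and "Slovakia" still show up as different.
function namesOnlyInFirst(first, second) {
  const other = countCountryEntries(second);
  return [...countCountryEntries(first).keys()].filter((name) => !other.has(name));
}
```

Calling `namesOnlyInFirst(indabaEntries, amidaEntries)` and then the reverse gives both directions of the comparison; `indabaEntries` and `amidaEntries` stand for the parsed responses of the two endpoints.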

@dumyan

Ok.

Please confirm the following before I write to the client:

  1. Both APIs have been updated, which is confusing, as while originally we were told we'd use both, the most recent information was that the indabaplatform API will now replace the amida-tech API.
  2. The amida-tech API has now had a regression, and has bugs (those duplicates) that were not present before. Based on the latest info, this should not concern us, as we were told that indabaplatform will now replace amida-tech. However, the fact that this API was actually updated means we should check what happened (our current codebase uses the amida-tech API until the bugs with indabaplatform get sorted, which is what we are currently waiting on).
  3. One of the main issues we've been waiting on for 2 weeks was a fix for the duplicate country entries in the new indabaplatform. Not only has that not been fixed, there are actually now more countries with duplicate entries in the API.
  4. We've noticed trailing whitespace in one Liberia entry in indabaplatform, which needs to be fixed.
  5. We've noticed that Somalia is present in amida-tech but not in indabaplatform, which needs to be clarified as there has been no mention of dropping Somalia.

Wow.

Also @dumyan I need you to check other endpoints before I write, so we can minimise the back and forth.

@pwalsh your summary is correct.

Since the reports endpoint isn't used in our code and we are implementing our snapshots, the only thing left to check is the documents endpoint.

Update status

amida-tech documents endpoint is updated and indabaplatform documents endpoint is not.

  • Aquarium contained 1168 documents
  • indabaplatform contains 1168 documents
  • amida-tech contains 2415 documents

This means that an astonishing 1247 new document objects are present. I think this needs to be discussed between IBP and Amida, because I have no way of knowing whether all the documents are actually valid.

Missing fields

[document].type

There is no way to match a specific document with a document type for a country.

This is an example of the new document object:

{
    "id": "5817576452f68729181970ca",
    "comments": "No comment for this question",
    "country": "Georgia",
    "countryCode": "GE",
    "filename": "TAVI_VI.pdf",
    "attachmentId": 657,
    "year": "2016",
    "url": "https://indaba-prod.s3-us-west-2.amazonaws.com/ibp/be9cfe439349d1a7271079e6761fcaf5?AWSAccessKeyId=AKIAJYQ5RUYQ6RF2TT6Q&Expires=1477668797&Signature=EA8BXl6PoyILwFAZdG5BTk%2F4UxE%3D"
  }

This is an example of an old document object:

{
    "id": "56efbb7a8812ab0300000001",
    "type": "In-Year Report",
    "title": "Communication en Conseil des Ministres relative à l'exécution du Budget à fin décembre 2015",
    "available": true,
    "internal": false,
    "year": "2015",
    "comments": "Le relooking du site du ministère a fait que nous avons eu quelques difficultés pour retrouver l'emplacement du document sur le site.",
    "comments_public": null,
    "location": "Website",
    "location_detail": "",
    "url": "http://budget.gouv.ci/sites/default/files/publications/ccm_execution_budgetaire_a_fin_decembre_2015.pdf",
    "date_published": "21 Mar 2016",
    "date_received": "21 Mar 2016",
    "softcopy": true,
    "scanned": false,
    "country": "Côte d’Ivoire",
    "countryCode": "CI",
    "last_modified": "2016-03-21T09:21:49.201Z",
    "created_at": "2016-03-21T09:14:34.810Z",
    "uploads": [
      {
        "name": "ccm_execution_budgetaire_a_fin_decembre_2015.pdf",
        "filename": "56efbb7a8812ab0300000001/ccm_execution_budgetaire_a_fin_decembre_2015.pdf"
      }
    ],
    "approved": null
  }

Previously we were able to match documents by year, document type and country to display detailed info for the document (for example comments, date etc.). With the new document object we have no way to tell which document belongs to which document type (In-Year Report, Year-End Report etc.).
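
To illustrate why the missing field matters, here is a simplified sketch of that kind of lookup. It is not the tracker's actual code; `findDocumentsFor` is a hypothetical name, and the field names are taken from the two example objects above.

```javascript
// Find the documents for one country, year and document type.
// A document without a `type` field can never satisfy the last condition.
function findDocumentsFor(documents, country, year, type) {
  return documents.filter(
    (doc) => doc.country === country && doc.year === String(year) && doc.type === type
  );
}

// Trimmed-down versions of the old and new example objects.
const exampleDocuments = [
  { country: 'Côte d’Ivoire', year: '2015', type: 'In-Year Report' },
  { country: 'Georgia', year: '2016', filename: 'TAVI_VI.pdf' },
];

const oldMatches = findDocumentsFor(exampleDocuments, 'Côte d’Ivoire', '2015', 'In-Year Report');
const newMatches = findDocumentsFor(exampleDocuments, 'Georgia', '2016', 'In-Year Report');
console.log(oldMatches.length, newMatches.length);
```

The old-style object is found, while the new-style object cannot be matched to any document type.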

[document].approved

Currently the tracker only displays documents that are marked as approved. Since there isn't a single new document that contains that property, does that mean that all new documents we get from the API should be considered as approved? (we will continue to check for this property for the old documents though)

[document].title

The document title is used in many places in the tracker.

[document].available, [document].internal and [document].date_published

These fields are currently used to determine the document state following this code inherited from Aquarium:

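// Derive a document's state from its available, internal and
// date_published fields (logic inherited from Aquarium).
function getDocumentState(doc) {
  // Flagged as available, not internal, and has a publication date.
  if (doc.available && !doc.internal && doc.date_published) {
    return 'available';
  }
  // Flagged as internal only, with no publication date.
  if (!doc.available && doc.internal && !doc.date_published) {
    return 'internal';
  }
  // Neither available nor internal, but a publication date exists.
  if (!doc.available && !doc.internal && doc.date_published) {
    return 'late';
  }

  // Any other combination of the three fields.
  return 'not produced';
}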

We won't need these properties if we can rely on the [country].obi.availability.[year].[type] property (from the country endpoint).

Also [document].date_published is displayed on the tooltips in the status page in the tracker.

Various questions:

The country code for Korea is null. Is this intentional?

From API provider:

  • We've addressed all blocking issues to our knowledge for OKF. Any remaining inconsistencies, as far as we know, are minor and stem from the original Aquarium data.
  • "Internal" now has a default value of 'null'. "Approved" has a default value of true. We need to confirm with OKF and IBP that a true default value is fine. We can make it null if they prefer, but we don't have the actual data stored to give a true or false status.
  • [document].type has been added meaning that they should be able to correlate documents now. Given this statement from OKF: "We won't need these properties if we can rely on the [country].obi.availability.[year].[type] property (from the country endpoint)."
  • Although instead of country.name, I would suggest using the country.code and document.countryCode keys to match up document to actual country.
  • We don't have formatted data for these keys in the system that we know of.
  • As for [document].title -- the closest thing we have is the filename. This API does not currently have the capabilities to parse the remote document and display the document title.
  • We've added most all of the fields from the legacy data and populated them with data in our system where possible. The rest contain a value of null or in [document].approved's case it's a default of true.
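
The provider's suggestion to join on country codes rather than names could look roughly like this. It is a sketch only: `groupDocumentsByCountryCode` is a hypothetical name, and the only field it relies on is `countryCode` from the example document objects. Documents whose code is null (as reported for Korea) are collected separately, since they cannot be joined this way.

```javascript
// Group documents by their countryCode so they can be joined to
// country objects via country.code instead of the country name.
function groupDocumentsByCountryCode(documents) {
  const byCode = new Map();
  const unmatched = []; // documents with a null or missing countryCode
  for (const doc of documents) {
    if (!doc.countryCode) {
      unmatched.push(doc);
      continue;
    }
    if (!byCode.has(doc.countryCode)) {
      byCode.set(doc.countryCode, []);
    }
    byCode.get(doc.countryCode).push(doc);
  }
  return { byCode, unmatched };
}
```

Matching on codes sidesteps the name variants seen earlier in this thread, but only for entries that actually have a code.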

Comments/Issues:

"Internal" now has a default value of 'null'. "Approved" has a default value of true. We need to confirm with OKF and IBP that a true default value is fine. We can make it null if they prefer, but we don't have the actual data stored to give a true or false status.
and also:
We've added most all of the fields from the legacy data and populated them with data in our system where possible. The rest contain a value of null or in [document].approved's case it's a default of true.

I can't give any input on the approved value; this should get resolved with IBP, as they are the ones handling the approval process.

Having null values is really unnecessary: as in the previous case, we don't have the actual metadata about the documents, so we are just adding to the data size.

For our purposes, the simplest way to overcome this issue is to have a state field that holds the [country].obi.availability.[year].[type] value.

I thought I could easily fix this and avoid any additional changes in the API by matching [document].type with the corresponding [country].obi.availability.[year].[type]. Unfortunately, I have found that there are typos and plain errors in the [document].type fields. The correct values for document types are:

  • Pre-Budget Statement
  • Executive's Budget Proposal
  • Enacted Budget
  • Citizens Budget
  • In-Year Report
  • Mid-Year Review
  • Year-End Report
  • Audit Report

Full list of documents with invalid types here: http://pastebin.com/EXxSXvTZ

Some of the typos are present in the [country].obi.availability as well. Citizen's Budget should get renamed to Citizens Budget and Mid-Year-Review to Mid-Year Review.
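
If the values are not corrected upstream, a client-side normalisation step is one possible workaround. The sketch below maps only the two misspellings named above; the full list of invalid types in the pastebin is not reproduced here. `normalizeDocumentType`, `DOCUMENT_TYPES` and `KNOWN_TYPOS` are hypothetical names.

```javascript
// The eight valid document types listed in this thread.
const DOCUMENT_TYPES = [
  'Pre-Budget Statement',
  "Executive's Budget Proposal",
  'Enacted Budget',
  'Citizens Budget',
  'In-Year Report',
  'Mid-Year Review',
  'Year-End Report',
  'Audit Report',
];

// Only the two typos mentioned above; extend as more are confirmed.
const KNOWN_TYPOS = new Map([
  ["Citizen's Budget", 'Citizens Budget'],
  ['Mid-Year-Review', 'Mid-Year Review'],
]);

// Return the canonical type name, or null if the value is not recognised.
function normalizeDocumentType(type) {
  if (typeof type !== 'string') {
    return null;
  }
  const trimmed = type.trim();
  const fixed = KNOWN_TYPOS.get(trimmed) || trimmed;
  return DOCUMENT_TYPES.includes(fixed) ? fixed : null;
}
```

Returning null for unrecognised values keeps bad data visible instead of silently guessing a type.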

What we ended up with:

  • We are using only indabaplatform as a single source of truth as it contains all data that we need
  • We resorted to using [country].obi.availability to find out the state of the documents, and only try to derive the state from the documents themselves if [country].obi.availability[year] is not available.
  • We are using [document].filename as [document].title because this is the closest thing we have
  • available, internal and date_published won't be available from the API.
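
In code, the approach summarised above amounts to roughly the following. This is a simplified sketch rather than the tracker's actual implementation: `resolveDocumentState` and `getDisplayTitle` are hypothetical names, `getDocumentState` is copied from earlier in this thread, and the sketch assumes the availability values can be used directly as document states.

```javascript
// Copied from the Aquarium-derived helper quoted earlier in this thread.
function getDocumentState(doc) {
  if (doc.available && !doc.internal && doc.date_published) {
    return 'available';
  }
  if (!doc.available && doc.internal && !doc.date_published) {
    return 'internal';
  }
  if (!doc.available && !doc.internal && doc.date_published) {
    return 'late';
  }
  return 'not produced';
}

// Prefer [country].obi.availability[year][type]; fall back to the
// document's own fields only when that year is missing entirely.
function resolveDocumentState(country, year, type, doc) {
  const availability = country && country.obi && country.obi.availability;
  const yearStates = availability && availability[year];
  if (yearStates) {
    // May be undefined if the type key itself is missing or misspelled.
    return yearStates[type];
  }
  return doc ? getDocumentState(doc) : 'not produced';
}

// Use the filename where no title is provided.
function getDisplayTitle(doc) {
  return doc.title || doc.filename;
}
```

Because new documents don't carry real available, internal and date_published values, the fallback branch is mainly useful for older documents.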

I'm considering this DONE.

agreed. done.