IATI/IATI-Datastore

Data for various donors not being updated in IATI datastore, update failing silently

Closed this issue · 20 comments

From IATI Registry package worldbank-bd, the original file contains transactions in 2017 for activity 44000-P090807 but the IATI Datastore call for that activity does not contain these most recent transactions.

about/dataset/worldbank-bd suggests that the dataset was successfully retrieved but was not parsed:

{
    "dataset": "worldbank-bd",
    "last_modified": "2017-07-07T03:20:43.638392",
    "num_resources": 1,
    "resources": [
        {
            "last_fetch": "2017-07-08T05:38:28.737314",
            "last_parsed": "2017-05-16T07:55:09.734309",
            "last_status_code": 304,
            "last_successful_fetch": "2017-07-08T05:38:28.737340",
            "num_of_activities": 49,
            "url": "http://siteresources.worldbank.org/IATI/WB-BD.xml"
        }
    ]
}

However, error/dataset/worldbank-bd does not reveal any errors:

{
    "errors": []
}
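For anyone wanting to spot this kind of silent failure automatically, here is a minimal sketch that compares the two responses above. It assumes `last_modified` reflects when the dataset last changed on the Registry and that an empty `errors` list means nothing was recorded; the function name `find_silently_stale` is hypothetical and not part of the Datastore.

```python
from datetime import datetime


def find_silently_stale(about, errors):
    """Return URLs of resources parsed before the dataset was last modified,
    but only when no errors were recorded (i.e. the failure is silent)."""
    if errors.get("errors"):
        return []  # a failure was recorded, so it is not silent
    modified = datetime.fromisoformat(about["last_modified"])
    stale = []
    for res in about["resources"]:
        last_parsed = res.get("last_parsed")
        # Never parsed, or parsed before the most recent modification
        if last_parsed is None or datetime.fromisoformat(last_parsed) < modified:
            stale.append(res["url"])
    return stale


# Payloads copied from the about/ and error/ responses quoted above
about = {
    "dataset": "worldbank-bd",
    "last_modified": "2017-07-07T03:20:43.638392",
    "num_resources": 1,
    "resources": [
        {
            "last_fetch": "2017-07-08T05:38:28.737314",
            "last_parsed": "2017-05-16T07:55:09.734309",
            "last_status_code": 304,
            "last_successful_fetch": "2017-07-08T05:38:28.737340",
            "num_of_activities": 49,
            "url": "http://siteresources.worldbank.org/IATI/WB-BD.xml",
        }
    ],
}
errors = {"errors": []}

print(find_silently_stale(about, errors))
```

With the worldbank-bd payloads this flags the one resource, because `last_parsed` (May) is earlier than `last_modified` (July) and no error was logged.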

yeah, so:
http://datastore.iatistandard.org/api/1/access/activity.xml?iati-identifier=44000-P156823

and:
http://datastore.iatistandard.org/api/1/access/activity.xml?iati-identifier=44000-P159683

…should return stuff, since they’re in the source data.

Also note: This appears to be true for lots of (or all) worldbank datasets.

Note that the datastore doesn’t reparse data when a 304 is returned, because it trusts the server that nothing has changed (aside: perhaps the datastore should make a HEAD request if it’s going to do this?). Anyway, something has gone a bit wrong here, because the data in the datastore has been stale for over a month, but the datastore is getting a 304 from the worldbank server.

This could be due to:

  • the etag used in the If-None-Match header (although that’s probably unrelated for reasons)
  • the last_successful_fetch timestamp used in the If-Modified-Since header
  • the datastore failed to parse the data, but failed silently
  • …or it could be a mistake at the worldbank server end

It’s probably easier to debug by looking at the data stored on the resource model in the datastore. But anyway, the reason for the 304 is at least partly due to the timestamp used in the If-Modified-Since header.

I’m a bit suspicious about https://github.com/IATI/IATI-Datastore/blob/5b871aa3/iati_datastore/iatilib/crawler.py#L164-L165 … I wonder if those lines should be:

    if resource.last_parsed:
        headers['If-Modified-Since'] = http_date(resource.last_parsed)

i.e. s/last_succ/last_parsed/ UPDATE: I think last_succ is correct… I don’t think that’s the problem here.
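To make the conditional-request behaviour discussed above concrete, here is a rough sketch, not the actual crawler code. It uses the standard library's `email.utils.format_datetime` in place of the `http_date` helper in the snippet above, and the names `conditional_headers` and `action_for_status` are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(etag=None, last_successful_fetch=None):
    """Build the headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_successful_fetch:
        # HTTP dates must be in GMT; the datetime must be timezone-aware UTC
        headers["If-Modified-Since"] = format_datetime(
            last_successful_fetch, usegmt=True
        )
    return headers


def action_for_status(status_code):
    """Decide what to do with the response (assumed behaviour)."""
    if status_code == 304:
        return "skip"   # server says nothing changed, so no reparse
    if status_code == 200:
        return "parse"  # fresh content, parse it
    return "error"      # anything else should be recorded, not ignored


fetched = datetime(2017, 7, 8, 5, 38, 28, tzinfo=timezone.utc)
print(conditional_headers(last_successful_fetch=fetched))
print(action_for_status(304))
```

The point of the sketch is that a 304 short-circuits parsing entirely, so if an earlier parse failed without being recorded, every later 304 leaves the stale data in place.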

Thanks to @markbrough and @andylolz for your work reporting this issue and debugging. We've added this to the list of maintenance jobs for next week and it will be looked at this coming Monday. We will post a further update then.

As an update to this, it does seem that the dataset in question has not been correctly parsed by the Datastore. I have rebooted the server hosting the application, as this may be enough to give things a shove when it next attempts an update.

This might not actually fix things, but it's a good first step to try. We'll keep this ticket open and provide an update later in the week if this has been successful. Meanwhile, we have some work to do on the dev server, as an OS upgrade has created an error there. Fixing that will enable us to explore this issue further, if it does indeed persist later in the week.

In addition to this, I have restarted the datastore import process from scratch on the dev datastore: http://dev.datastore.iatistandard.org/api/1/access/activity.xml

We'll give this some hours to fully regenerate and see if the problem occurs there too. If the data appears correctly on this dev version, then this is definitely a problem when re-parsing existing datasets.

This doesn't appear to have fixed the issue… the activities @andylolz highlighted above do not appear on either the production server or the dev server

This looks partially fixed… The activities I mentioned (this one and this one) both appear on the datastore now. But it appears there are still 80 worldbank packages that silently fail to parse. For instance, this one:
http://datastore.iatistandard.org/api/1/about/dataset/worldbank-af

Oop! All worldbank activity packages have now been recently parsed 🎉 🎉 🎉 As such, I guess this is closable! Can you verify, @markbrough?

@dalepotter: Did you do something to force a complete crawl? I’m not sure how this ended up being fixed…

The more recent transactions for World Bank projects have now appeared in the Bangladesh AIMS. But it would be useful to understand what has been done to get the Datastore working again, and how we can make sure that when AIMS are trying to use IATI data in future they can have reliable and realistic expectations of IATI infrastructure.

@andylolz @markbrough Thought I'd posted an update to this one, maybe I didn't actually press 'Comment'!

The root of this issue seems to be related to a change to the Registry, for which we applied this fix for #272 (thanks to @andylolz for that one). From there, we were able to deploy straight to live, reran the crawler manually and rebooted the server to ensure that the fetch process was kickstarted from scratch.

Regarding the reliability of the Datastore, it is a product that will unfortunately remain in alpha for the foreseeable future. However, we have a current working pattern which seeks to resolve the highest priority bugs as soon as they arise (plus make small, incremental improvements to current tools where possible), whilst also trying to set good foundations for the long term (through the architecture proposal and development of the python library). From there, we'll be refactoring/improving end-user tools, starting with validation functionality, and then an improved website and later looking at the future of the Datastore (building on user research from the TAG2017).

I appreciate that this doesn't help ensure the long-term reliability of the Datastore right now - it's something we'd love to improve but with the sheer number of tools we run, developer capacity means we're constantly juggling priorities.

trying to set good foundations for the long term

This is great but it sounds very bottom-up. Have you considered a more Agile approach? It would be great to have something to demonstrate to end users (who are either using IATI services right now, or considering using them) that progress is being made. I’m concerned that existing users won’t see any benefit at all until you’re done.

building on user research from the TAG2017

Yeah – the sessions about improvements to existing IATI services were really well-attended at TAG. I imagine there were a bunch of user-facing improvements suggested there that could be applied right now, and that would not be wasted effort in the long term (or at worst, any wasted effort would be minimal), and that could be applied in conjunction with the current bottom-up approach.

This problem has re-occurred with Netherlands data. See package minbuza_nl-activities20162017 in this file:
http://datastore.iatistandard.org/api/1/about/dataset/minbuza_nl-activities20162017

The activity XM-DAC-7-PPR-4000000013 does not appear in the IATI Datastore: http://datastore.iatistandard.org/api/1/access/activity.xml?iati-identifier=XM-DAC-7-PPR-4000000013

It appears that the file was last fetched in July, though the data has been published every month since then: http://datastore.iatistandard.org/api/1/about/dataset/minbuza_nl-activities20162017

The continuation of this problem is presenting significant challenges for using the data in Bangladesh. Please can this issue urgently be given attention.

ccing @mijaved @dfaruque

It appears that the file was last fetched in July, though the data has been published every month since then:

As noted by @allthatilk, the Datastore only updates datasets that the Registry deems to have changed since the Datastore last updated the data.

Looking at the Registry page for the Dataset in question, it is stated that the Registry last updated on 2017-07-19. The Datastore's update date of 2017-07-25 is after this. As such, the Datastore is working as expected and this is a problem with the Registry.
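As a rough illustration of the rule described above (a sketch only, assuming the Datastore simply compares the two dates; `registry_says_changed` is a hypothetical name):

```python
from datetime import date


def registry_says_changed(registry_updated, datastore_updated):
    """True when the Registry reports a change newer than the Datastore's copy."""
    return registry_updated > datastore_updated


# Dates quoted above for minbuza_nl-activities20162017
print(registry_says_changed(date(2017, 7, 19), date(2017, 7, 25)))
```

With the dates above this evaluates to `False`, so no update is triggered. If the Registry never advances its own date, the check stays `False` however often the publisher republishes the file.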

IATI/ckanext-iati#125 looks to be the relevant issue for the problem in question, though I will double check with @dalepotter

The problem is with the registry so this issue is being closed.

It’s your issue… But I object a bit to this being tagged invalid. The original bug report (re. worldbank data) was not invalid. I did some work to fix it.

I really object to this. I have again raised a problem with the datastore that is significantly impacting usability of data in Bangladesh. It has occurred with World Bank data and now again with Netherlands data. You think this bug will be resolved once you fix a problem with the Registry, but if you close this issue you don't have any way of tracking this feedback and knowing that the fix has actually resolved this case. Why would you close it and tag it as invalid? I don't see why it is relevant that there is a problem that lies elsewhere -- the datastore does not perform as expected -- it does not return the requested data. This issue should remain open until it is actually resolved.

To be honest, it is embarrassing to have to keep explaining to people here in Bangladesh that their data is fine, the government’s software is fine, but it is IATI infrastructure that is preventing their data from being automatically imported and updated.

IATI/ckanext-iati#125 looks fixed now… And the example @markbrough provides above (this NL data) is now appearing on the datastore. So things are certainly looking positive here!

The Registry has gone down a number of times this week so it may affect the Datastore data in the immediate future. Good to see things improving though as we are in regular communication with Viderum to improve the situation.

The Registry has gone down a number of times this week so it may affect the Datastore data in the immediate future

Related to this, I’ve reworded the title of #279.

This issue has reoccurred – data in the datastore is not being updated, and there’s no indication of a problem.

Could this be reopened or a new issue created, @IATI?

See also: #271.