Esri/geoportal-server-harvester

Harvester not removing content from geoportal that has been removed from source WAF

MikeRoyer-NOAA opened this issue · 5 comments

Harvester 2.6.4
A harvester task is set up to pull from a WAF, and some XML files that have been removed from the source WAF are not being removed from the geoportal. The harvester history for the task reports that it acted upon 14537 XML files, while the geoportal reports 14852 items for that source of origin (i.e., the harvester task). The task is not run incrementally. Isn't the harvester supposed to remove anything from the geoportal that is no longer in the source WAF when the task runs?
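To pin down which records are stale, the two listings can be compared directly. The sketch below is illustrative only: `find_stale_items` is a hypothetical helper, not part of the harvester or geoportal, and it assumes the WAF file names and the geoportal item identifiers have already been exported into two plain lists by some other means.

```python
def find_stale_items(source_files, geoportal_items):
    """Return identifiers that are in the geoportal but no longer in the source WAF.

    Both arguments are assumed to be iterables of comparable identifiers
    (for example, XML file names). This helper is hypothetical and does not
    talk to the harvester or the geoportal.
    """
    return sorted(set(geoportal_items) - set(source_files))


if __name__ == "__main__":
    # Tiny made-up listings, just to show the shape of the comparison.
    source = ["a.xml", "b.xml"]
    geoportal = ["a.xml", "b.xml", "c.xml"]
    print(find_stale_items(source, geoportal))

    # Gap between the two counts reported in this issue.
    print(14852 - 14537)
```

The gap between the reported counts is 14852 - 14537 = 315. That is a lower bound on the number of stale items: if any of the 14537 files failed to publish, more than 315 geoportal items would have no counterpart in the WAF.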

Have you set the Geoportal output broker to 'perform cleanup'? That setting determines whether the harvester will attempt to remove existing items from Geoportal.

Yes, the Harvester output broker has "Perform cleanup" checked. Does it perform cleanup every time a Harvester task is run, or only at some frequency?

It should do it every time it runs a task. Is your WAF public? If so, I can do some testing on my end.

I'm checking with my user base on whether the WAF is public.

In the meantime, can you tell me what the Failed (in/out) column on the history page means? Does "Failed in" mean that the XML is not well formed, and "Failed out" mean that there is some issue with the content within the XML? And in both situations, is the record not loaded into the output broker?

I added some info on the history page in the wiki: https://github.com/Esri/geoportal-server-harvester/wiki/Tasks#history