Esri/geoportal-server-harvester

Cannot Harvest FAA FeatureService (Maybe Due to Spaces in the Name)

rhodges opened this issue · 4 comments

We are trying to harvest this service:
https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS/rest/services

We use "https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS" as the endpoint, but if you open that URL in a browser, you are not redirected to the services directory as you normally would be.

When we try to harvest this AGS endpoint we see the following error in the logs:
Exception in thread "HARVESTING" java.lang.IllegalArgumentException: Illegal character in path at index 6: Buffer of Runways/FeatureServer

It appears the service "Buffer of Runways" has spaces in its name.

Is this a valid service name, or is there another reason we cannot harvest?

I'm marking this as a bug. When navigating the services directory, this service shows up with the spaces in the URL replaced with %20:

https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS/rest/services/Buffer%20of%20Runways/FeatureServer

If this is a valid ArcGIS Server service name, the harvester should handle this case.
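For reference, the error in the log can be reproduced with plain `java.net.URI`, outside the harvester. The sketch below is only an illustration: the class and method names are hypothetical, and it does not claim to show where the harvester builds its URLs or how the fix was implemented. It shows that `URI.create` rejects the raw service name, while the multi-argument `URI` constructor percent-encodes the spaces.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the harvester code base.
class ServiceNameEncoding {

    // Percent-encodes characters that are illegal in a URI path (such as spaces).
    // The multi-argument URI constructor quotes illegal characters, whereas
    // URI.create(String) expects input that is already encoded.
    // Note: this constructor also encodes a literal '%', so pass raw names only,
    // not names that are already percent-encoded.
    static String encodePath(String rawPath) {
        try {
            return new URI(null, null, rawPath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e.getMessage(), e);
        }
    }

    // Returns the message of the exception thrown by URI.create,
    // or null if the input parses without error.
    static String parseError(String rawPath) {
        try {
            URI.create(rawPath);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String service = "Buffer of Runways/FeatureServer";
        System.out.println(parseError(service)); // the "Illegal character in path" message
        System.out.println(encodePath(service)); // the name with spaces percent-encoded
    }
}
```

The encoded form matches the %20 URL shown by the services directory above.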

@zguo - please confirm this is resolved with Harvester 2.6.5

Hi @mhogeweg and @zguo -- I have updated to v2.6.5 (master branch for both catalog and harvester as of yesterday) and am trying again to harvest this service, but still no success.
Test/Result cases:

  • Hitting "https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS" with 'ignore robots' checked:
    • fails with the error "Error listing server content." (in the harvester panel under "?")
  • Hitting "https://services6.arcgis.com/ssFJjBXIUyZDrSYZ" with 'ignore robots' checked:
    • Runs with 0 errors, but no harvested records (0/0) and no significant log activity I can find.
    • The same happens if I run with "/ArcGIS/rest/services" appended to the endpoint.
    • The same happens if I run without 'ignore robots' checked

I don't see anything of interest in the logs (at least not under /opt/tomcat/logs) -- the only logs getting updated from these tests are 'localhost_access_log.YYYY-MM-DD.txt':

10.0.2.2 - - [18/Nov/2021:19:10:35 +0000] "POST /harvester/rest/harvester/tasks/6251e3e0-e3ff-4116-8e54-077cbfa8f488/execute? HTTP/1.1" 200 1015
10.0.2.2 - - [18/Nov/2021:19:10:35 +0000] "GET /harvester/rest/harvester/triggers HTTP/1.1" 200 12
127.0.0.1 - - [18/Nov/2021:19:10:35 +0000] "POST /geoportal/oauth/token HTTP/1.1" 200 453
127.0.0.1 - - [18/Nov/2021:19:10:35 +0000] "POST /geoportal/elastic/metadata/item/_search?access_token=[ACCESS_TOKEN]&access_token=[ACCESS_TOKEN] HTTP/1.1" 200 139
10.0.2.2 - - [18/Nov/2021:19:10:35 +0000] "GET /harvester/rest/harvester/processes HTTP/1.1" 200 44984
10.0.2.2 - - [18/Nov/2021:19:10:35 +0000] "GET /harvester/rest/harvester/processes/fc38e4b6-399d-481d-92c5-067d09b57f4a HTTP/1.1" 200 1221
10.0.2.2 - - [18/Nov/2021:19:10:35 +0000] "GET /harvester/rest/harvester/processes/fc38e4b6-399d-481d-92c5-067d09b57f4a HTTP/1.1" 200 1221

It looks as if the server has services directory browsing disabled, or perhaps it behaves differently because it appears to be an ArcGIS Online hosted services server. Going to https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS returns an invalid URL response, which is what causes the harvester to fail. However, https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS/rest/services does list the services.
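One way to check the listing outside the harvester is to request the services directory as JSON. The sketch below is a standalone illustration and not harvester code: the class and method names are hypothetical, and it assumes the endpoint passed in ends at the ArcGIS context (as in the first test case above). `f=json` is the standard ArcGIS REST API format parameter. `fetchListing` needs Java 11+ and network access, so `main` only prints the URL.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for manual diagnosis, not part of the harvester.
class ServicesDirectoryUrl {

    // Builds the JSON listing URL for the services directory.
    // Assumption: the endpoint ends at the ArcGIS context (".../ArcGIS")
    // or already at ".../rest/services".
    static URI listingUrl(String endpoint) {
        String base = endpoint.replaceAll("/+$", ""); // drop trailing slashes
        if (!base.toLowerCase().endsWith("/rest/services")) {
            base = base + "/rest/services";
        }
        return URI.create(base + "?f=json");
    }

    // Fetches the listing body as a string (requires network access).
    static String fetchListing(URI url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(url).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        System.out.println(listingUrl("https://services6.arcgis.com/ssFJjBXIUyZDrSYZ/ArcGIS"));
    }
}
```

If the JSON response lists the services while the harvester still reports 0/0, that would point at the harvester's handling of the endpoint rather than at the server.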