Esri/geoportal-server-harvester

pycsw data.gov failing

Closed this issue · 10 comments

Hi David, we have seen that this CSW has also started failing in Geoportal Server 1.2.7. I have reached out to GSA (they operate Data.gov).

As a hint, neither of these two calls delivers any record content:

https://catalog.data.gov/csw?request=GetRecords&service=CSW&version=2.0.2&ElementSetName=full&typenames=csw:Record

https://catalog.data.gov/csw?request=GetRecords&service=CSW&version=2.0.2&ElementSetName=full&typenames=gmd:MD_Metadata

Any other combination of possible values of ElementSetName, typenames, outputformat, etc. doesn't help.

Not to mention that these include many http-only links, which will start to fail given the https-only policy in place for the federal government, or when Geoportal Server is used over https.
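For anyone who wants to reproduce the check on the two GetRecords calls above, here is a minimal sketch that builds the same KVP request and counts the records in a response body. The helper names `build_getrecords_url` and `count_returned_records` are made up for illustration. The `resultType=results` and `maxRecords` parameters are additions that are not in the original calls; CSW 2.0.2 defines them, but whether they change the outcome against catalog.data.gov is not established in this thread.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# CSW 2.0.2 namespace used by GetRecordsResponse elements.
CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def build_getrecords_url(base_url, typenames="csw:Record",
                         element_set="full", max_records=10):
    """Build a CSW 2.0.2 GetRecords KVP URL (hypothetical helper)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typenames": typenames,
        "ElementSetName": element_set,
        # Assumption: not part of the calls quoted in the thread.
        "resultType": "results",
        "maxRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def count_returned_records(xml_text):
    """Count child elements of csw:SearchResults in a response body."""
    root = ET.fromstring(xml_text)
    results = root.find(f"{{{CSW_NS}}}SearchResults")
    if results is None:
        return 0
    return len(list(results))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `count_returned_records` gives a quick yes/no on whether any record content came back.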

Currently, the harvester has a dedicated "Data.gov" input broker that acquires metadata from that particular source. Since pycsw keeps failing, the "Data.gov" broker uses a combination of CKAN and WAF methods to get the job done. The broker itself requires no configuration besides a name; all other properties are optional.

So if we want to filter based on an organization, how might we do that?

cc @kalxas
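On the organization question: the thread does not say how the "Data.gov" broker would do this. As a rough sketch outside the harvester, and assuming catalog.data.gov exposes the standard CKAN action API, a `package_search` call can be restricted to one organization with an `fq` filter. The helper name `build_ckan_org_search_url` and the organization slug `example-org` are hypothetical.

```python
from urllib.parse import urlencode


def build_ckan_org_search_url(base_url, organization, rows=10, start=0):
    """Build a CKAN package_search URL filtered by organization slug.

    Hypothetical helper; it only builds the URL and sends nothing.
    """
    params = {
        # fq is CKAN's filter-query parameter; "organization" is assumed
        # to match the organization's short name (slug) in the catalog.
        "fq": f"organization:{organization}",
        "rows": rows,
        "start": start,
    }
    return f"{base_url}/api/3/action/package_search?{urlencode(params)}"


if __name__ == "__main__":
    print(build_ckan_org_search_url("https://catalog.data.gov", "example-org"))
```

Paging through results would mean increasing `start` by `rows` until the reported count is exhausted.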

@pandzel / @mhogeweg to clarify, what are the issues at hand? Is there a bug in how the CSW behaves? If there are bugs in pycsw please let us know and we can fix them accordingly.

Hi @tomkralidis. Geoportal Server can do federated search against CSW endpoints, and there are users who want to do this. We're seeing intermittent success. The most common issue appears to be 403 responses.

This request shows the response error:
https://gptogc.esri.com/geoportal/rest/distributed?rid=local&ridName=This%20Site&rids=local%2CdataGov&searchText=water&start=1&max=10&orderBy=relevance&f=atom

Exception when Posting CSW query to https://catalog.data.gov/csw-all: HTTP Request failed: HTTP/1.1 403 Forbidden

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
    <HEAD>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
        <TITLE>ERROR: The request could not be satisfied</TITLE>
    </HEAD>
    <BODY>
        <H1>403 ERROR</H1>
        <H2>The request could not be satisfied.</H2>
        <HR noshade size="1px">
Request blocked.


        <BR clear="all">
        <HR noshade size="1px">
        <PRE>
Generated by cloudfront (CloudFront)
Request ID: fZtHiItoRuNkv0Wx9yrLUUXENvYEatwAkc0nyjPlEzOFmkKi0b2eJg==
</PRE>
        <ADDRESS></ADDRESS>
    </BODY>
</HTML>

The request is blocked.

This is not a response from pycsw but from a proxy/caching layer in front of data.gov

We know, but we haven't been able to get GSA to resolve this.
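To make the distinction above between a pycsw response and one from the proxy/caching layer easier to spot in client logs, here is a minimal sketch of a classifier. It assumes the proxy's block page mentions "cloudfront", as the body quoted above does, and that a CSW-level error arrives as an OWS `ExceptionReport` document. The function name and the returned labels are made up for illustration.

```python
import xml.etree.ElementTree as ET


def classify_csw_failure(status_code, body):
    """Roughly classify a CSW HTTP response (hypothetical helper).

    Returns one of: "blocked-by-proxy", "csw-exception",
    "csw-response", "unknown".
    """
    # The block page quoted in the thread is HTML signed by CloudFront.
    if status_code == 403 and "cloudfront" in body.lower():
        return "blocked-by-proxy"
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be classified here.
        return "unknown"
    # Drop the "{namespace}" prefix ElementTree puts on tag names.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "ExceptionReport":
        return "csw-exception"
    return "csw-response"
```

A client could log the label next to the endpoint URL so that proxy blocks are not reported as catalog bugs.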