Esri/geoportal-server-harvester

Improving CSW harvest

valentinedwv opened this issue · 6 comments

USGS ScienceBase is a large collection.
Tried harvesting it twice; the harvest crashed at 32k and 245k records of 6000k.
We need new techniques for harvesting large collections.

  • a way to pass in a custom filter parameter; ScienceBase has "collections" that can be used to get smaller sets
  • make the harvest resumable/restartable at a given record count
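
A minimal sketch of the "resumable at a record count" idea, assuming the harvest is driven by the CSW `startPosition` value. The names `fetch_page`, `store`, and the checkpoint file are hypothetical and are not part of the harvester's API; the sketch only shows how persisting the next start position lets a run continue after a crash and skip a page the server cannot deliver.

```python
import json
from pathlib import Path
from typing import Any, Callable, Iterable, List


def load_checkpoint(path: Path) -> int:
    """Return the next startPosition to request (CSW positions are 1-based)."""
    if path.exists():
        return int(json.loads(path.read_text())["next_position"])
    return 1


def save_checkpoint(path: Path, next_position: int) -> None:
    """Persist the next startPosition so a later run can resume from it."""
    path.write_text(json.dumps({"next_position": next_position}))


def harvest(
    fetch_page: Callable[[int, int], Iterable[Any]],  # hypothetical: (start, size) -> records
    store: Callable[[Any], None],                     # hypothetical: writes one record
    checkpoint: Path,
    total: int,
    page_size: int = 100,
) -> List[int]:
    """Harvest from the checkpoint onward; return start positions of failed pages."""
    failed: List[int] = []
    position = load_checkpoint(checkpoint)
    while position <= total:
        try:
            for record in fetch_page(position, page_size):
                store(record)
        except IOError:
            # Note the failed page and keep going instead of aborting the whole run.
            failed.append(position)
        position += page_size
        save_checkpoint(checkpoint, position)
    return failed
```

The failed start positions can be retried later with a page size of 1 to isolate the individual records the server refuses to return.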

https://my.usgs.gov/confluence/display/sciencebase/Catalog+Services

moving issue from catalog to here:
Esri/geoportal-server-catalog#67

Does this also happen when harvesting into a local folder?

I was harvesting to both a folder and a server.
With 6 million records, I was going to rewrite the folder destination to break it into blocks of ~1k records (or make an S3 store endpoint).
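
One possible way to break the folder into ~1k blocks, shown only as a sketch: the function name, the zero-padded folder naming, and the `.xml` suffix are assumptions for illustration, not how the harvester's folder destination lays out files.

```python
from pathlib import PurePosixPath


def block_path(root: str, index: int, record_id: str, block_size: int = 1000) -> PurePosixPath:
    """Map the record with 0-based harvest index `index` to a subfolder of at most block_size files."""
    block = index // block_size
    # Zero-padded block names keep the folders sorted; 6 digits covers 6 million records easily.
    return PurePosixPath(root) / f"{block:06d}" / f"{record_id}.xml"


print(block_path("/opt/tomcat/webapps/metadata", 1500, "abc"))
```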

I assumed the problem is the connection to the CSW server.

19-May-2017 12:36:53.488 INFO [HARVESTING] com.esri.geoportal.harvester.support.ProgressLogger.printStatusLog Harvesting of PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true progress: 141500
19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43 Error harvesting of PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
        at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
        at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
        ... 2 more

19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data.
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
        at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
        at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
        at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
        ... 2 more

19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportLogger.completed Completed processing task: PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportStatistics.completed Harvesting of PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true completed at Fri May 19 12:38:28 UTC 2017. No. succeded: 283135, no. failed: 2

One of the failures is a server issue: the CSW endpoint dies at record 166666.

https://www.sciencebase.gov/catalog/csw

<csw:GetRecords
xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
maxRecords="1"
startPosition="166666"

outputFormat="application/xml"
outputSchema="http://www.isotc211.org/2005/gmd"
resultType="results" service="CSW" version="2.0.2">
    <csw:Query typeNames="csw:Record">
        <csw:ElementSetName>full</csw:ElementSetName>
        <csw:Constraint version="1.1.0">
            <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc" xmlns="http://www.opengis.net/ogc"
            xmlns:gml="http://www.opengis.net/gml">
                <ogc:PropertyIsLike escape="" singleChar="_" wildCard="%">
                    <ogc:PropertyName>AnyText</ogc:PropertyName>
                    <ogc:Literal>well</ogc:Literal>
                </ogc:PropertyIsLike>
            </ogc:Filter>
        </csw:Constraint>
    </csw:Query>
</csw:GetRecords>
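
For probing other positions, the request above can be generated rather than edited by hand. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name and its parameters are hypothetical, the attribute values are copied from the request above, and it is not how the harvester itself builds requests.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"
OGC = "http://www.opengis.net/ogc"
# Use the same prefixes as the hand-written request.
ET.register_namespace("csw", CSW)
ET.register_namespace("ogc", OGC)


def get_records_request(start_position: int, max_records: int = 1, any_text: str = "well") -> bytes:
    """Build a CSW 2.0.2 GetRecords body with an AnyText PropertyIsLike filter."""
    root = ET.Element(f"{{{CSW}}}GetRecords", {
        "service": "CSW",
        "version": "2.0.2",
        "resultType": "results",
        "outputFormat": "application/xml",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "startPosition": str(start_position),
        "maxRecords": str(max_records),
    })
    query = ET.SubElement(root, f"{{{CSW}}}Query", {"typeNames": "csw:Record"})
    ET.SubElement(query, f"{{{CSW}}}ElementSetName").text = "full"
    constraint = ET.SubElement(query, f"{{{CSW}}}Constraint", {"version": "1.1.0"})
    ogc_filter = ET.SubElement(constraint, f"{{{OGC}}}Filter")
    like = ET.SubElement(
        ogc_filter,
        f"{{{OGC}}}PropertyIsLike",
        {"escape": "", "singleChar": "_", "wildCard": "%"},
    )
    ET.SubElement(like, f"{{{OGC}}}PropertyName").text = "AnyText"
    ET.SubElement(like, f"{{{OGC}}}Literal").text = any_text
    return ET.tostring(root, encoding="utf-8")


# Body for the failing position reported in this thread.
body = get_records_request(166666)
```

Stepping `start_position` around 166666 with `max_records=1` would show whether a single record or a wider range triggers the server error.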

Pull request #72 provides the ability to define the 'AnyText' literal for any CSW input broker.

zguo commented

Search text filter implemented in the harvester.