Improving CSW harvest
valentinedwv opened this issue · 6 comments
USGS ScienceBase is a large collection.
Tried harvesting it twice. Crashed at 32k and 245k records out of 6000k.
Need new techniques for large collections:
- a way to pass in a custom filter parameter; they have "collections", which can be used to get smaller sets
- make the harvest resumable/restartable from a given record count
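To make the "resumable at a record count" idea concrete, here is a minimal Python sketch. It is not how the harvester (which is Java) works today; `fetch_page`, `handle_record`, and the checkpoint file layout are hypothetical names made up for illustration. The idea is only that the next `startPosition` is persisted after each page, so a crashed run can pick up where it stopped.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved start position, or 1 (CSW startPosition is 1-based)."""
    if not os.path.exists(path):
        return 1
    with open(path, "r", encoding="utf-8") as fh:
        return int(json.load(fh)["next_start"])


def save_checkpoint(path, next_start):
    """Write to a temp file and rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"next_start": next_start}, fh)
    os.replace(tmp, path)


def harvest(fetch_page, handle_record, checkpoint_path, page_size=100):
    """Page through the source, saving progress after every completed page.

    fetch_page(start, size) is a hypothetical callable that returns a list of
    records and may raise on a server error; an empty list ends the harvest.
    Returns the start position that would be requested next.
    """
    start = load_checkpoint(checkpoint_path)
    while True:
        records = fetch_page(start, page_size)
        if not records:
            break
        for record in records:
            handle_record(record)
        start += len(records)
        save_checkpoint(checkpoint_path, start)
    return start
```

If a run dies between handling a page and saving the checkpoint, that page is processed again on restart, so writes to the destination would need to be idempotent (for example, keyed by record id).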
https://my.usgs.gov/confluence/display/sciencebase/Catalog+Services
Moving this issue here from the catalog repository:
Esri/geoportal-server-catalog#67
Does this also happen when harvesting into a local folder?
I was harvesting to both a folder and a server.
With 6 million records, I was going to rewrite the folder output to break it into ~1k-record blocks (or make an S3 store endpoint).
I assumed the problem is the connection to the CSW server.
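A rough Python sketch of the "~1k blocks" folder layout, just to show the arithmetic. The function names and the zero-padded folder naming are assumptions for illustration, not anything the folder broker does today.

```python
from pathlib import PurePosixPath


def block_folder(root, record_index, block_size=1000):
    """Map a 0-based record index to its block sub-folder under root.

    Six digits of zero padding keeps folders sorted for up to a million
    blocks, which is plenty for ~6 million records in blocks of 1000.
    """
    block = record_index // block_size
    return PurePosixPath(root) / f"{block:06d}"


def record_path(root, record_index, file_name, block_size=1000):
    """Full path for one metadata file inside its block folder."""
    return block_folder(root, record_index, block_size) / file_name
```

With a block size of 1000, six million records end up in 6000 sub-folders instead of one flat directory.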
19-May-2017 12:36:53.488 INFO [HARVESTING] com.esri.geoportal.harvester.support.ProgressLogger.printStatusLog Harvesting of PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true progress: 141500
19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43 Error harvesting of PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
... 2 more
19-May-2017 12:38:28.398 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data.
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:179)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$43(DefaultProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Not Found
at com.esri.geoportal.commons.csw.client.impl.Client.readMetadata(Client.java:155)
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.next(CswBroker.java:174)
... 2 more
19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportLogger.completed Completed processing task: PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
19-May-2017 12:38:28.399 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportStatistics.completed Harvesting of PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=https://www.sciencebase.gov/catalog/csw, cred-username=, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:SCIENCBASE], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=*****, gpt-cleanup=false], FOLDER[folder-root-folder=/opt/tomcat/webapps/metadata/, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true completed at Fri May 19 12:38:28 UTC 2017. No. succeded: 283135, no. failed: 2
One of the failures is a server issue: the request dies at record 166666.
https://www.sciencebase.gov/catalog/csw
<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    maxRecords="1"
    startPosition="166666"
    outputFormat="application/xml"
    outputSchema="http://www.isotc211.org/2005/gmd"
    resultType="results" service="CSW" version="2.0.2">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc" xmlns="http://www.opengis.net/ogc"
          xmlns:gml="http://www.opengis.net/gml">
        <ogc:PropertyIsLike escape="" singleChar="_" wildCard="%">
          <ogc:PropertyName>AnyText</ogc:PropertyName>
          <ogc:Literal>well</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>
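Since a single bad record on the server side currently ends the whole harvest, one possible workaround is to fall back to `maxRecords="1"` requests when a page fails and skip only the positions that error out. A minimal Python sketch of that idea follows; `fetch_page` is a hypothetical callable standing in for the actual GetRecords call, and treating `OSError` as "the request failed" is an assumption.

```python
def fetch_isolating_failures(fetch_page, start, count):
    """Fetch `count` records beginning at `start`.

    If the page request fails, retry the same range one record at a time
    and collect the positions that still fail instead of aborting.
    Returns (records, failed_positions).
    """
    try:
        return fetch_page(start, count), []
    except OSError:
        pass  # fall through to the record-by-record retry

    records, failed = [], []
    for position in range(start, start + count):
        try:
            records.extend(fetch_page(position, 1))
        except OSError:
            failed.append(position)
    return records, failed
```

The failed positions could then be logged and reported to the server owner, while the rest of the collection keeps harvesting.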
Pull request #72 provides the ability to define an 'AnyText' literal for any CSW input broker.
The search text filter is now implemented in the harvester.
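For anyone who wants to try the filter against the server outside the harvester, here is a small Python sketch that builds the same kind of GetRecords body with a configurable 'AnyText' literal. It is not the code from the pull request; `build_get_records` and its parameters are hypothetical, and the attribute values are copied from the request posted earlier in this thread.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"
OGC = "http://www.opengis.net/ogc"

# Emit the familiar csw:/ogc: prefixes instead of ns0:/ns1:.
ET.register_namespace("csw", CSW)
ET.register_namespace("ogc", OGC)


def build_get_records(start_position, max_records, any_text):
    """Return a CSW 2.0.2 GetRecords request body as a string."""
    root = ET.Element(f"{{{CSW}}}GetRecords", {
        "service": "CSW",
        "version": "2.0.2",
        "resultType": "results",
        "outputFormat": "application/xml",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "startPosition": str(start_position),
        "maxRecords": str(max_records),
    })
    query = ET.SubElement(root, f"{{{CSW}}}Query", {"typeNames": "csw:Record"})
    ET.SubElement(query, f"{{{CSW}}}ElementSetName").text = "full"
    constraint = ET.SubElement(query, f"{{{CSW}}}Constraint", {"version": "1.1.0"})
    ogc_filter = ET.SubElement(constraint, f"{{{OGC}}}Filter")
    # Wildcard attributes mirror the request shown in this thread.
    like = ET.SubElement(ogc_filter, f"{{{OGC}}}PropertyIsLike",
                         {"escape": "", "singleChar": "_", "wildCard": "%"})
    ET.SubElement(like, f"{{{OGC}}}PropertyName").text = "AnyText"
    # ElementTree escapes &, < and > in the literal when serializing.
    ET.SubElement(like, f"{{{OGC}}}Literal").text = any_text
    return ET.tostring(root, encoding="unicode")
```

Calling `build_get_records(166666, 1, "well")` gives a body equivalent to the request above, which can be POSTed to the CSW endpoint with any HTTP client.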