Harvest an OAI endpoint by fetching records one by one. This is a different strategy from the windowed and cached approach used in metha.
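In OAI-PMH terms, this amounts to a ListIdentifiers pass followed by one GetRecord request per identifier. A minimal sketch of the two underlying requests with curl, using a placeholder endpoint and identifier:

$ curl -s "https://example.org/oai?verb=ListIdentifiers&metadataPrefix=oai_dc"
$ curl -s "https://example.org/oai?verb=GetRecord&identifier=oai:example.org:1&metadataPrefix=oai_dc"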
Install via go get or releases.
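For example, assuming a Go toolchain and that the command lives under cmd/oaicrawl in the miku/oaicrawl repository (adjust the import path if it differs):

$ go get -u github.com/miku/oaicrawl/cmd/oaicrawl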
oaicrawl does not cache anything and will write the raw responses directly to standard output.
$ oaicrawl -f oai_dc -verbose http://www.academicpub.org/wapoai/OAI.aspx > harvest.data
...
DEBU[0025] fetched 1395 identifiers with 1 requests in 25.140568087s
Use the -b flag to crawl in a best-effort way (continue in the presence of errors):
$ oaicrawl -b -f oai_dc -verbose http://www.academicpub.org/wapoai/OAI.aspx > harvest.data
...
... worker-11 backoff [3]: ..ertasacademica.com/5318&metadataPrefix=oai_dc
This crawler was written for working with endpoints that are slightly off-standard and cannot be harvested easily in chunks.
Test it yourself (might take a day to harvest completely):
$ oaicrawl -f mets -b -verbose http://zvdd.de/oai2/
...
$ oaicrawl -h
Usage of oaicrawl:
  -b    create best effort data set
  -e duration
        max elapsed time (default 10s)
  -f string
        format (default "oai_dc")
  -retry int
        max number of retries (default 3)
  -verbose
        more logging
  -version
        show version
  -w int
        number of parallel connections (default 16)