miku/metha

Client Timeout

tobiasschweizer opened this issue · 4 comments

Hi,

Is there a way to increase the client timeout?
I did a quick search for "timeout" but the only thing I found was:

I wanted to crawl Arxiv but found that existing tools would timeout.

:-)

We are harvesting quite a big collection using metha-sync and got

FATA[5443] Get "https://xyz.ch/request?resumptionToken=2022-05-01T00:00:00Z@2022-05-31T23:59:59Z@set_name@marc21@111111111&verb=ListRecords": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For now, I have just started the process again.

Thanks for any hint!

miku commented

Yes, encountered similar things and timeout was an option internally, but not yet exposed as a flag. I added that in v0.2.32; maybe you could try it with something like:

$ metha-sync -T 5m https://xyz.ch

Hi @miku,

Great, thanks a lot for the quick reaction and releasing this so fast! :-)

I updated to 0.2.32 and am retrying with the new flags T and r. I'll keep you posted!

I did not get a timeout anymore. However, I got this

FATA[36306] read tcp xyz->abc: read: connection reset by peer

I think since the collection is very large and the harvesting takes very long the server is under some stress.

Are there best practices to automatically check whether the metha-sync harvesting process is still running and restart it if necessary?

miku commented

We had similar experiences, where e.g. the result set got too large for server and it bailed out (timeout on server side).

One workaround may be to shrink the time window for the harvest, e.g. from monthly (which is the default and seem to work for 99.5% of the cases) to daily (resulting in hopefully smaller result sets):

$ metha-sync -daily -T 5m https://abc.de

Note that this will harvest into the same directory and metha-cat just picks up any file in the harvesting directory (there is currently no -daily flag on metha-cat). Maybe best would be to stash the existing directory somewhere (if you do not want to lose the previous harvest):

$ mv $(metha-sync -dir https://abc.de) /tmp

[...] and then to start anew with -daily slices.


For automatic restart, I may try a shell wrapper first, something like:

$ until metha-sync -T 5m https://abc.de; do echo "retrying..."; sleep 3; done 

I believe metha returns non-zero on failure ("FATA"), so that should work (also, no half-harvested files should remain).