mwscrape.db in couch gets big
MHBraun opened this issue · 8 comments
I updated mwscrape to the version that includes the --speed parameter.
Using
mwscrape de.m.wikipedia.org --speed 5 --delete-not-found
to replace 10 individual scrapes resulted in mwscrape.db growing to 72.5 GB (!) after a couple of days of running. Compacting manually resolved the size issue once.
I guess the compaction of mwscrape.db does not work properly with --speed, although it worked correctly without --speed (sometimes quite aggressively, though).
Perhaps allow mwscrape.db to grow to a specific (adjustable? :) ) value like 256 MB before compacting. This would compact mwscrape.db after around 5000 scrapes, and the compaction load on CouchDB would not be excessive. With an adjustable value the hard disk usage could be tuned.
This would give the option to tune the system according to available disk space and speed.
There is an option in the CouchDB settings for automatic compaction; however, this seems to run system-wide and not for a specific database.
I removed compact commands from within the scraper since they were triggering way too many unnecessary compactions; there is no good way to know from within the scraper when the database should be compacted, and it shouldn't be the scraper's concern. CouchDB can be configured to trigger compaction automatically when a database reaches a certain level of fragmentation, and this can be configured for a specific database. Use that.
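For example, a per-database rule in the compaction daemon configuration looks roughly like this (a sketch only: the thresholds are illustrative, and the exact section name should be checked against your version's default.ini and the compaction documentation linked below):

[compactions]
; per-database rule: compact the mwscrape database once roughly 70% of its file
; is outdated document revisions (views at 60% fragmentation); adjust to taste.
; Depending on the version/config the section may appear as [compaction] instead.
mwscrape = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]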
Compacting mwscrape may be problematic regardless of how it's triggered: under heavy write load compaction may very well never catch up and will eventually run out of space (as described in https://wiki.apache.org/couchdb/Compaction, see also https://issues.apache.org/jira/browse/COUCHDB-487). Changing the number of revisions to keep from the default 1000 to, say, 1 should help reduce the mwscrape database size:
curl -X PUT -d 1 http://localhost:5984/mwscrape/_revs_limit
but perhaps session stats shouldn't be kept in CouchDB at all.
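To verify the change, a plain GET on the same endpoint returns the current limit (assuming CouchDB listens on localhost:5984 as in the command above):

# read back the revision limit that was just set
curl http://localhost:5984/mwscrape/_revs_limit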
Thanks for the command.
I had the impression there was no compaction at all with --speed.
Managed to find a working solution:
I think a heavy write load is far beyond our single-user use. I tested with up to 20 simultaneous scrapes (only for a couple of minutes) and CouchDB did not show heavy load at all.
CouchDB seems to be made for hundreds or thousands of write accesses.
I modified the configuration so that mwscrape.couch is compacted every 5 minutes if the fragmentation ("trash") exceeds 30%.
Instead of
_default=
I used
mwscrape =
A restart of CouchDB is needed.
mwscrape.couch now grows to approximately 1 GB and is then compacted down to a couple of MB. This is lightning fast; I cannot even follow it in Futon (under the status tab). As a comparison, the compaction of enwiki.couch takes 12+ hours.
And there is a second compaction pass for the scrapes that came in during this period.
As I said in the beginning, my impression was that the compaction of mwscrape.couch was broken.
With the change on the CouchDB side it is not needed anymore.
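For what it's worth, running compactions can also be watched outside Futon: the _active_tasks endpoint lists currently running compaction tasks with their progress (again assuming the default local port):

# shows running database/view compactions, replications etc. as JSON
curl http://localhost:5984/_active_tasks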
For clarity, these are the changes I applied in /etc/couchdb/default.ini:

[compaction]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}, {from, "00:00"}, {to, "06:59"}]
mwscrape = [{db_fragmentation, "90%"}, {view_fragmentation, "90%"}, {from, "00:00"}, {to, "23:59"}]

The fragmentation ("trash") threshold for mwscrape was increased to 90%, which hardly matters: while mwscrape is running and mwscrape.couch is growing, we easily exceed 99.9% with a --speed 3 scrape within 300 seconds.
The changes apply after a restart of CouchDB:
sudo service couchdb restart
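To see whether the rule actually kicks in, the database info document shows disk_size and data_size (data_size needs CouchDB 1.2 or later), from which the current fragmentation can be estimated as (disk_size - data_size) / disk_size:

# database info for mwscrape, including disk_size and data_size in bytes
curl http://localhost:5984/mwscrape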
Stable working settings in CouchDB:

_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}, {from, "00:00"}, {to, "06:59"}]
mwscrape = [{db_fragmentation, "95%"}, {view_fragmentation, "90%"}, {from, "00:00"}, {to, "23:59"}]

mwscrape.couch now grows to less than 200 MB before it is compacted.
I think the part:
, {from, "00:00"}, {to, "23:59"}]
can just be omitted. Without a schedule, compaction runs all the time whenever the first two parameters are met; see:
http://wiki.apache.org/couchdb/Compaction
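In other words, the mwscrape rule could presumably be shortened to:

mwscrape = [{db_fragmentation, "95%"}, {view_fragmentation, "90%"}]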
Thanks for the hint.
My interpretation of
from and to: The period for which a database (and its view groups) compaction is allowed. The value for these parameters must obey the format: HH:MM - HH:MM (HH in [0..23], MM in [0..59])
was
that outside of the given period no compactions are allowed.
I thought the same at first, but I think the first example in the compaction documentation makes it clear: it has no schedule at all, so the setting can indeed just be omitted.