CottageLabs/OpenArticleGauge

high celery memory usage

Opened this issue · 4 comments

This bugger, doi:10.2119/molmed.2011.00222, points to a 24 MB PDF, which ratchets the memory of the worker processing it up 4-5x to ~800 MB. For some reason.

In a testament to the system's persistence, restarting the worker just resulted in another one picking up the task and going to 800 MB of RAM. When it works, it works really well...

The Generic String Matcher may play a role here as well, as I noted the CPU also increased 5-10x since our latest changes. This is OK, plenty of room for optimisation there. It was important to get it working first. I don't expect it to use more RAM though, that is weird.

In the end a better machine might be the long-term solution too. I'm not a fan of code that's sloppy performance-wise, but we are already squeezing a lot of complex operations - ~2 seconds per identifier across thousands of identifiers - out of a very moderately sized single server, by modern web dev standards.

We could place a limit on the size of articles that we'll download. Something reasonable, to ensure we get most stuff, but small enough to avoid these kinds of issues?

Richard Jones,

Founder, Cottage Labs
t: @richard_d_jones, @CottageLabs
w: http://cottagelabs.com

I still don't know why the memory usage shot up to 800 MB or why the task got stuck. I had to terminate it from flower to avoid interference with rolling out the SpringerLink plugin.

It may be worth investigating that first. I'm not really sure how we could reliably limit the size of things we download, though. The Content-Length header of the response isn't always set (correctly) - but I guess it's better than nothing.

I would've set such a limit higher than 24 MB, though; I didn't expect a file that size to be a problem for Python. The actual comparison of license statements to content happens with a simple Python in check. But there is a safeguard against bad unicode strings which took quite a bit of experimentation to get right, and it does a conversion to the Python unicode type, potentially requiring more memory. Not a problem for an HTML page, but a PDF...
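
For illustration, the check described above boils down to something like the sketch below. This is not the actual simple_extract code - the function name and error handling are assumptions - but it shows why memory can spike: the tolerant decode means briefly holding both the raw bytes and the unicode copy of a large PDF.

```python
# Not the real simple_extract -- just the shape of the comparison described
# above. The function name and the 'replace' error handling are assumptions.

def contains_license_statement(raw_content, statements):
    # The tolerant decode is the memory-hungry part: for a 24 MB PDF we
    # briefly hold both the byte string and its unicode copy.
    if isinstance(raw_content, bytes):
        content = raw_content.decode('utf-8', 'replace')
    else:
        content = raw_content
    # The actual matching is just a plain substring test per statement.
    return any(statement in content for statement in statements)
```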

Refactoring simple_extract does not sound like the thing to do right now. So then, a 5 MB limit? Not sure what page is going to be larger than that.

Tsk, I'd rather have the system be able to process PDFs too - there's no telling whether there'll be licenses in the string form of the PDFs or not...

Perhaps we do nothing, keep an eye out from time to time so we can terminate such hung tasks, and record them in this issue.

We should now be capturing the size of downloaded resources, so we can at least look at this data, see what kind of object sizes we are getting, and check whether there are any obvious outliers or repeat offenders.

In terms of download size limiting, we can inspect the Content-Length header first, but we can also use requests in streaming mode, count the bytes as they come in, and cut the stream off when too much has been downloaded - so we do have some options.
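
A minimal sketch of both safeguards, assuming requests is used for the download - the 5 MB cap, chunk size and helper name are illustrative rather than existing OAG code:

```python
import requests

MAX_SIZE = 5 * 1024 * 1024  # illustrative 5 MB cap

def fetch_limited(url, max_size=MAX_SIZE):
    resp = requests.get(url, stream=True, timeout=30)

    # First safeguard: trust Content-Length when it is present, even though
    # it is not always set (or set correctly).
    declared = resp.headers.get('content-length')
    if declared and declared.isdigit() and int(declared) > max_size:
        resp.close()
        return None

    # Second safeguard: count the bytes as they stream in and cut the
    # connection off once the limit is exceeded.
    chunks, total = [], 0
    for chunk in resp.iter_content(chunk_size=8192):
        total += len(chunk)
        if total > max_size:
            resp.close()
            return None
        chunks.append(chunk)
    return b''.join(chunks)
```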

Richard Jones,

Founder, Cottage Labs
t: @richard_d_jones, @CottageLabs
w: http://cottagelabs.com

Decided to start putting updates on these performance and resource usage issues here.

So far I've had to scale down the number of Celery workers a bit and reduce concurrency to 4 (from 8). Resource usage was way too heavy otherwise under heavy load.

This has resulted in results coming through more slowly, but the old configuration wasn't actually very good anyway: most of the workers did nothing most of the time, while only 2 workers carried the really heavy load (license detection). Now I've changed that - the priority_* queues are being taken care of by a single worker (and the website works just fine and still only takes 1 attempt to get the result).

This has freed up 2 workers to work on bulk license detection.

The latest config can be glimpsed as usual here: http://flower.oag.cottagelabs.com/ (Google single sign-on auth is on).

@richard-jones let me know if that makes results a bit faster - I am also close to finishing those other updates on not processing PDFs and so forth. I was just on the machine and had to restart Celery, so I decided to take the opportunity to find out whether I could use 1 worker for multiple queues :). (Yes.)
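
For the record, here's a rough sketch of the kind of queue split described above, in Celery 3.x-style settings. The queue and task names are assumptions for illustration, not the actual OAG configuration.

```python
from kombu import Queue

# Illustrative names only -- the real setup has several priority_* queues
# plus the bulk license-detection queue.
CELERY_QUEUES = (
    Queue('priority_lookup'),   # latency-sensitive requests from the website
    Queue('priority_store'),    # other small priority_* work
    Queue('detect_license'),    # bulk license detection -- the heavy work
)

CELERY_ROUTES = {
    'openarticlegauge.tasks.priority_lookup': {'queue': 'priority_lookup'},
    'openarticlegauge.tasks.priority_store': {'queue': 'priority_store'},
    'openarticlegauge.tasks.detect_license': {'queue': 'detect_license'},
}

# A single worker can then consume all the priority_* queues at once:
#   celery worker -Q priority_lookup,priority_store -c 4
# which frees two more workers to consume only detect_license, also at
# concurrency 4 (down from the previous 8).
```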