kiwix/kiwix-tools

kiwix-serve ZIM fd needs to be smarter

kelson42 opened this issue · 13 comments

Currently, kiwix-serve opens all the ZIM files specified at startup and keeps a fd open for each of them (not counting the potential fd for Xapian). By doing this, we can be sure that at any time we can directly read the content of any ZIM file quickly, without initialising the ZIM reader.

The problems are:

  • By default, a Linux system usually only allows 1024 open file descriptors per process
  • Memory consumption can be quite high even if nobody accesses anything.

That's why I propose to:

  • Limit the size of the pool of ZIM file descriptors (by default I propose 500)
  • Make this configurable with a command-line parameter
  • Implement the pool with an LRU algorithm
  • Load ZIM files dynamically when needed (a sketch of such a pool follows this list)
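
For illustration, here is a minimal sketch of such a pool, not kiwix-serve's actual code: a fixed-capacity LRU cache keyed by ZIM path that opens archives lazily and evicts the least recently used entry when full. It assumes libzim's zim::Archive as the per-file reader and the proposed default of 500 entries; locking is omitted for brevity.

```cpp
// Sketch only: a capped, LRU-evicting pool of lazily opened ZIM archives.
#include <cassert>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <zim/archive.h>

class ZimArchivePool {
public:
  explicit ZimArchivePool(std::size_t maxOpen = 500) : m_maxOpen(maxOpen) {
    assert(m_maxOpen > 0);
  }

  // Return an open archive for `zimPath`, opening it on demand if needed.
  std::shared_ptr<zim::Archive> get(const std::string& zimPath) {
    auto it = m_index.find(zimPath);
    if (it != m_index.end()) {
      // Cache hit: move the entry to the front (most recently used).
      m_lru.splice(m_lru.begin(), m_lru, it->second);
      return it->second->second;
    }
    // Cache miss: evict the least recently used archive if the pool is full.
    if (m_lru.size() >= m_maxOpen) {
      m_index.erase(m_lru.back().first);
      m_lru.pop_back();
    }
    auto archive = std::make_shared<zim::Archive>(zimPath);
    m_lru.emplace_front(zimPath, archive);
    m_index[zimPath] = m_lru.begin();
    return archive;
  }

private:
  using Entry = std::pair<std::string, std::shared_ptr<zim::Archive>>;
  std::size_t m_maxOpen;
  std::list<Entry> m_lru;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> m_index;
};
```

Note that a caller holding the returned shared_ptr keeps the archive alive even after eviction, which matters for the searcher-related discussion further down.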

This ticket is linked to #143

@mgautierfr You have worked on this - or something related - today. Any feedback about this ticket?

We are hitting the problem of too many open fds:

Internal Server Error

An internal server error occured. We are sorry about that :/

error 24 opening file "/var/www/library.kiwix.org/zim/wiktionary/wiktionary_ru_all_maxi_2020-07.zim

With #353 we no longer open the ZIM files at startup.

We only create the reader when we need it, but:

  • multi-zim search will open all books.
  • we have no limits on the cache size.

@mgautierfr This is a good clarification indeed. We are also trying to fix the problem at the system level, cf. kiwix/container-images#148, but that is only a temporary workaround and only applies to our own deployment.
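
For context, the system-level workaround boils down to raising the per-process limit on open file descriptors. The actual change in kiwix/container-images#148 is presumably a container/ulimit setting; the following is only an illustrative POSIX equivalent of such a limit bump, not code from our images:

```cpp
// Illustration of the system-level workaround only: raise the soft limit on
// open file descriptors (RLIMIT_NOFILE) up to the hard limit allowed.
#include <sys/resource.h>
#include <cstdio>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("getrlimit");
    return 1;
  }
  rl.rlim_cur = rl.rlim_max;  // bump the soft limit to the hard limit
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("setrlimit");
    return 1;
  }
  std::printf("open-file limit raised to %llu\n",
              static_cast<unsigned long long>(rl.rlim_cur));
  return 0;
}
```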

@mgautierfr @veloman-yunkan I think it is time to move on this. Besides the problems already reported, the goal is to limit/optimize the overall memory usage. Somehow the concept of a pool and a pool size might be needed. Regarding the multi-zim search, it is quite clear to me that, to keep it usable, we will have to limit this function to a subset of the available ZIM files. Maybe this maximal subset size should be the same as the pool size (or derived from it)?

Anyway, we need to clarify the approach before implementing it.
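
As an illustration of the subset idea above (all names here are hypothetical, not libkiwix API), the multi-zim search handler could clamp the requested book set to the pool size:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cap the number of books taking part in a multi-zim
// search to the size of the archive pool (or a value derived from it), so a
// single search cannot force every ZIM file to be opened at once.
std::vector<std::string> limitSearchBooks(std::vector<std::string> requestedBooks,
                                          std::size_t archivePoolSize) {
  if (requestedBooks.size() > archivePoolSize) {
    requestedBooks.resize(archivePoolSize);  // alternatively, reject the request
  }
  return requestedBooks;
}
```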

@mgautierfr told me there is a possibility that kiwix/libkiwix#620 just solves the problem by introducing a pool of active readers.

@veloman-yunkan @maneeshpm It would be great to confirm this before starting anything else on this ticket.

@mgautierfr @veloman-yunkan I wonder if such an improvement would really bring much now that multizim search works fine. What do you think?

I believe the following discussion in the scope of kiwix/libkiwix#729 is relevant to this ticket:

kiwix/libkiwix#729 (review)
kiwix/libkiwix#729 (comment)
kiwix/libkiwix#729 (comment)

@veloman-yunkan I’m not sure what to think or what to propose. Do you think something should/could be done? If "yes", what and why?

@kelson42 I stick to my previous comment:

I think that the main requirement must be ensuring that the same ZIM file is never opened more than once at the same time, otherwise it will result in additional pressure on RAM by the cluster caches (since the searcher isn't limited to reading the xapian database via a direct reference but also accesses article content during snippet generation). To this end the archive cache system must keep track of all zim::Archives that are currently alive. Then we can think how to build the searcher cache on top of the archive cache.
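
A minimal sketch of that requirement, assuming libzim's zim::Archive and using illustrative names (this is not libkiwix's actual cache code): the registry hands out shared_ptrs and remembers weak_ptrs, so an archive that is still referenced somewhere (for example by a live searcher) is reused rather than reopened.

```cpp
// Sketch only: guarantee a given ZIM path is opened at most once while any
// user of it is still alive, by tracking weak references to handed-out archives.
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <zim/archive.h>

class ArchiveRegistry {
public:
  std::shared_ptr<zim::Archive> get(const std::string& zimPath) {
    std::lock_guard<std::mutex> lock(m_mutex);
    auto it = m_alive.find(zimPath);
    if (it != m_alive.end()) {
      if (auto existing = it->second.lock()) {
        return existing;  // already open somewhere: reuse it
      }
    }
    auto archive = std::make_shared<zim::Archive>(zimPath);
    m_alive[zimPath] = archive;  // track it without keeping it alive ourselves
    return archive;
  }

private:
  std::mutex m_mutex;
  std::map<std::string, std::weak_ptr<zim::Archive>> m_alive;
};
```

A size-limited archive cache (like the pool sketched earlier) could then hold strong references on top of such a registry, and the searcher cache could be built to reuse whatever archives the registry reports as alive.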

After discussion:

  • Most of the time, there are only 2 or 3 fds open per ZIM file (one for the archive, one for the full-text search database and one for the suggestion search database).
  • When doing a multi-zim search, we add another open fd per ZIM searched (full-text databases are not shared between searchers). This is not an easy problem, as full-text databases are not thread-safe. So if we share the database, we have to lock out other searchers sharing the same database.
  • As a Searcher keeps a reference to the archive, we may have a situation where we drop the archive from the cache but keep the searcher in it. When we want to open the archive again, we don't find it in the cache, so we open it again (and end up with two open archives for the same ZIM file).
  • By default the cache size is calculated as 1/10 of the number of books. So if you have a small number of books, you only have a cache of one book.

So we have to:

  • Avoid opening the same Archive several times (fixes point 3)
  • Set the minimum cache size to something more than 1 (10 is probably a good number)

Point 2 seems a bit tricky because of the thread-safe locking. I'm not sure we want to do that now (unless you already have an idea of how to fix it).
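
Regarding the second action item, a minimal sketch of the proposed sizing rule (illustrative names only, assuming the cache size stays derived from the number of books):

```cpp
#include <algorithm>
#include <cstddef>

// Keep the existing 1/10-of-books heuristic but never let the cache drop
// below 10 entries, so small libraries don't end up with a one-book cache.
std::size_t archiveCacheSize(std::size_t bookCount) {
  return std::max<std::size_t>(10, bookCount / 10);
}
```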

@mgautierfr Thank you for the summary and the action plan. This is ready to implement.