janlelis/unicode-emoji

New architecture proposal to reduce memory usage

Closed this issue · 4 comments

Hello.

I noticed that the unicode-emoji gem uses more memory than I expected from such a library: just requiring it takes 7-8MB. I know that by today's standards this isn't a huge amount, but if many gems behaved this way, it would add up to a lot of unnecessary memory usage.

Here is the method I used to measure memory usage (using the get_process_mem gem):

require 'get_process_mem'

def mem(&block)
  raise ArgumentError, 'missing block' unless block

  mem = GetProcessMem.new
  before = mem.mb
  block.call
  after = mem.mb
  return after - before
end

puts mem { require "unicode/emoji" }

Running this script multiple times gives me numbers between 7 and 9MB (most of the time around 7.6MB).

I also used memory_profiler:

require 'memory_profiler'
report = MemoryProfiler.report do
  require 'unicode/emoji'
end

report.pretty_print

It shows where memory is allocated and also how much of it is retained (which in most cases means it will never be freed).

I'm not sure exactly how people use this gem, but looking at its contents I suspect that most of them use one of the provided regex constants. That is the case in the application I work on: we literally use a single regex from this library (Unicode::Emoji::REGEX).

What can be done to lower memory usage? Here is the idea:

  • instead of generating all the Regex constants, they could be generated offline and included directly in a file
  • every constant could go into a separate file, and autoload :FOO, File.expand_path('emoji/foo', __dir__) could be used to lazy-load it on first use
  • INDEX could be lazy loaded too. If all regexes were generated offline then it would be only used by methods like properties. No method calls? No constant loaded.
  • some regexes are quite big (require "objspace"; ObjectSpace.memsize_of(...)), for example REGEX_VALID_INCLUDE_TEXT is almost 0.5MB. I didn't look closely, but I think that some big unions like "||..." could be replaced by a range "[char1-charN]" (if it is a sequence of consecutive characters, of course).
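To make the lazy-loading idea concrete, here is a minimal sketch of one-file-per-constant autoloading. The module name LazyEmoji, the constant REGEX_DEMO, and the temp-dir layout are all hypothetical and only there to keep the example self-contained; they are not the gem's actual structure.

```ruby
require "tmpdir"

# Hypothetical layout: one file per constant, written to a temp dir so the
# sketch is self-contained. A real gem would ship such files under lib/.
LAZY_EMOJI_DIR = Dir.mktmpdir("lazy_emoji")

File.write(File.join(LAZY_EMOJI_DIR, "regex_demo.rb"), <<~'RUBY')
  module LazyEmoji
    # Stand-in pattern, not one of the gem's real regexes.
    REGEX_DEMO = /[\u{1F600}-\u{1F64F}]/
  end
RUBY

module LazyEmoji
  # Registers the file; nothing is read or parsed until the constant is used.
  autoload :REGEX_DEMO, File.join(LAZY_EMOJI_DIR, "regex_demo")
end

# Module#autoload? returns the registered path while loading is still pending.
puts "pending before use: #{!LazyEmoji.autoload?(:REGEX_DEMO).nil?}"

# The first reference to the constant triggers the require.
grinning_face = "\u{1F600}"
puts "matches emoji: #{LazyEmoji::REGEX_DEMO.match?(grinning_face)}"

# Once the file has been loaded, autoload? returns nil.
puts "pending after use: #{!LazyEmoji.autoload?(:REGEX_DEMO).nil?}"
```

With this shape, a program that never touches a given constant never pays for parsing or compiling it.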

If done properly, then for a usage scenario like mine (a single constant), memory usage would be reduced from 7-8MB to the size of that constant (in our case 120kB).

Do you think it is worth looking into?
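For reference, a small sketch of how the size of a single constant can be checked with ObjectSpace.memsize_of. The helper name regex_size_bytes and the sample pattern are made up for illustration, and the number it reports is an implementation-dependent estimate, so it is only useful for rough comparisons.

```ruby
require "objspace"

# Approximate number of bytes Ruby attributes to a single object.
# The value is implementation-specific and may differ between Ruby versions.
def regex_size_bytes(regex)
  ObjectSpace.memsize_of(regex)
end

# Stand-in pattern; with the gem loaded one would pass Unicode::Emoji::REGEX.
sample = Regexp.union(("a".."z").to_a)
puts "sample regex: #{regex_size_bytes(sample)} bytes"
```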

Hi @radarek,

thanks for bringing this up and doing the research. I think it would be great to optimize the index structure and memory behavior for typical use cases.

Some feedback to your thoughts:

instead of generating all the Regex constants, they could be generated offline and included directly in a file

Sounds good.
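As a rough illustration of the offline-generation idea (not the gem's actual build process), a generator could serialize an already-built Regexp into a Ruby source file using Regexp#inspect. The helper write_regex_constant, the module GeneratedEmoji, and the file name are hypothetical.

```ruby
require "tmpdir"

# Hypothetical build-time step: write an already-built Regexp into a Ruby
# source file, so that at runtime only a literal has to be parsed.
def write_regex_constant(path, module_name, const_name, regex)
  File.write(path, <<~RUBY)
    # Generated file, do not edit by hand.
    module #{module_name}
      #{const_name} = #{regex.inspect}
    end
  RUBY
end

# Stand-in for an expensive computation over the emoji index.
built = Regexp.new('[\u{1F600}-\u{1F64F}]')

generated_path = File.join(Dir.mktmpdir("generated_emoji"), "regex_demo.rb")
write_regex_constant(generated_path, "GeneratedEmoji", "REGEX_DEMO", built)

# Show the generated source, then load it the way a consumer would.
puts File.read(generated_path)
load generated_path
puts GeneratedEmoji::REGEX_DEMO.match?("\u{1F600}")
```

The generated file depends on neither the index nor the code that built the pattern, which is what allows it to be loaded on its own.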

every constant could go into a separate file, and autoload :FOO, File.expand_path('emoji/foo', __dir__) could be used to lazy-load it on first use

INDEX could be lazy loaded too. If all regexes were generated offline then it would be only used by methods like properties. No method calls? No constant loaded.

Is autoload currently encouraged? I have always liked it, and if concurrency issues can be ruled out, I am all for it.

some regexes are quite big (require "objspace"; ObjectSpace.memsize_of(...)), for example REGEX_VALID_INCLUDE_TEXT is almost 0.5MB. I didn't look closely, but I think that some big unions like "||..." could be replaced by a range "[char1-charN]" (if it is a sequence of consecutive characters, of course).

I haven't looked into optimizing the generated regexes, so this sounds exciting.
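A minimal sketch of the union-to-range idea, assuming the alternatives are single codepoints (multi-codepoint emoji sequences cannot go into a character class and would still need alternation). The helper names escape_codepoint and compact_char_class are hypothetical.

```ruby
# Format one codepoint as a regex escape, which sidesteps quoting issues.
def escape_codepoint(codepoint)
  format('\u{%X}', codepoint)
end

# Collapse a non-empty list of codepoints into a character class, turning
# runs of three or more consecutive codepoints into ranges.
def compact_char_class(codepoints)
  runs = codepoints.uniq.sort.slice_when { |a, b| b != a + 1 }
  body = runs.map do |run|
    if run.size > 2
      "#{escape_codepoint(run.first)}-#{escape_codepoint(run.last)}"
    else
      run.map { |cp| escape_codepoint(cp) }.join
    end
  end
  "[#{body.join}]"
end

sample = [0x1F600, 0x1F601, 0x1F602, 0x1F603, 0x1F60A]

# The "big union" form and the compact form accept the same characters.
union = Regexp.union(sample.map { |cp| [cp].pack("U") })
compact = Regexp.new(compact_char_class(sample))

puts compact.source # prints [\u{1F600}-\u{1F603}\u{1F60A}]
puts sample.all? { |cp| union.match?([cp].pack("U")) && compact.match?([cp].pack("U")) }
```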

Hi @janlelis

thanks for bringing this up and doing the research. I think it would be great to optimize the index structure and memory behavior for typical use cases.

So far I'm focused on lazy-loaded constants and on optimizing the size of the regexes. The index could indeed be optimized too.

Is autoload currently encouraged? I have always liked it, and if concurrency issues can be ruled out, I am all for it.

I had the same doubts, but it looks like it is still a valid way to load Ruby code:

Given all of the above, I think it is safe to use.

Great, thank you for these links! Looking forward to getting #9 merged.

Released with v3.0.0!