taku910/mecab

Memory leak when using the Python wrapper with an overly long input string

ankokumoyashi opened this issue · 0 comments

A memory leak occurs when all of the following conditions are fulfilled:

  • the Python wrapper is used (the "-C (allocate sentence)" option is ON)
  • the same lattice instance is reused in every loop iteration
  • the input is 5534 bytes or larger, as reported by sys.getsizeof

How to reproduce

  • versions

    • Python 3.5.1
    • MeCab 0.996
  • code

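    import os
    import sys

    import MeCab
    import psutil

    # Handle to the current process, used to read its memory usage.
    pid = os.getpid()
    py = psutil.Process(pid)


    class CheckMemoryLeak():
        def __init__(self):
            # A single Lattice instance is reused for every call.
            self.lattice = MeCab.Lattice()

        def mecab_set_sentence(self, text):
            self.lattice.set_sentence(text)


    if __name__ == '__main__':
        checker = CheckMemoryLeak()
        sentence = 'あ' * 2730
        # sys.getsizeof reports the size of the Python str object,
        # not the length of its UTF-8 encoding.
        print('input bytes:', sys.getsizeof(sentence))
        while True:
            checker.mecab_set_sentence(sentence)
            memory_use = py.memory_info().rss  # resident set size, in bytes
            print('memory use:', memory_use)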
  • result

    input bytes: 5534
    memory use: 13950976
    ... (after about 10 calls to mecab_set_sentence)
    memory use: 14221312
    ... (after about 10 more calls to mecab_set_sentence)
    memory use: 14491648
    ... (after 30 seconds)
    memory use: 2043158528
    

However, memory use stays constant when the input is one character shorter:

  • code

    sentence = 'あ' * 2729
  • result

    input bytes: 5532
    memory use: 13950976
    ... (after about 10 calls to mecab_set_sentence)
    memory use: 14155776
    ... (after 30 seconds)
    memory use: 14155776
    ... (after 10 minutes)
    memory use: 14155776
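
The "input bytes" figures above come from sys.getsizeof, which measures the Python str object and not the encoded text. The sketch below compares the two measurements for both inputs. It assumes that the wrapper passes the UTF-8 encoding of the string to MeCab, and that the request reaching the allocator is two bytes larger (one added in strdup, one in alloc, as in the snippet quoted under Probable Cause). `allocator_request_size` is a hypothetical helper and not part of MeCab.

```python
import sys

BUF_SIZE = 8192  # value from mecab/src/common.h, as quoted in this issue


def utf8_length(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode('utf-8'))


def allocator_request_size(text):
    """Hypothetical: bytes requested from the chunk allocator for text.

    Assumes strdup() adds 1 byte and alloc() adds 1 more.
    """
    return utf8_length(text) + 2


for n_chars in (2729, 2730):
    sentence = 'あ' * n_chars
    print(n_chars, 'chars:',
          'getsizeof =', sys.getsizeof(sentence),
          'utf-8 bytes =', utf8_length(sentence),
          'request =', allocator_request_size(sentence),
          'BUF_SIZE =', BUF_SIZE)
```

Under these assumptions, 2729 characters produce a request of 8189 bytes, just under BUF_SIZE, while 2730 characters produce exactly 8192 bytes, which is the point where the leak starts.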
    

Probable Cause

  • The number of bytes in the input string is never checked against BUF_SIZE.
  • The leak appears to occur when a string larger than BUF_SIZE is allocated after a chunk of BUF_SIZE bytes has already been allocated.
  • BUF_SIZE, MIN_INPUT_BUFFER_SIZE, and MAX_INPUT_BUFFER_SIZE cannot be changed through a settings file or options; only input-buffer-size is configurable.

char *alloc(size_t size) {
  if (!char_freelist_.get()) {
    char_freelist_.reset(new ChunkFreeList<char>(BUF_SIZE));
  }
  return char_freelist_->alloc(size + 1);
}

char *strdup(const char *str, size_t size) {
  char *n = alloc(size + 1);
  std::strncpy(n, str, size + 1);
  return n;
}
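
To illustrate how a chunked free list could leak in this situation, here is a toy Python model. It is not MeCab's code: the reuse rule (a chunk is reused only when the request fits strictly inside the remaining space), the rewind-only `free()`, and the idea that the lattice calls `free()` before each `set_sentence` are all assumptions. `ChunkListModel` and `chunks_after` are hypothetical names.

```python
class ChunkListModel:
    """Toy model of a chunked free list (not MeCab's actual code)."""

    def __init__(self, default_size):
        self.default_size = default_size
        self.chunk_sizes = []  # one entry per chunk ever allocated
        self.chunk_index = 0
        self.offset = 0

    def free(self):
        # Assumption: free() only rewinds the cursor; chunks are kept.
        self.chunk_index = 0
        self.offset = 0

    def alloc(self, req):
        # Assumption: a chunk is reused only if the request fits
        # strictly inside the space left in it.
        while self.chunk_index < len(self.chunk_sizes):
            if self.offset + req < self.chunk_sizes[self.chunk_index]:
                self.offset += req
                return
            self.chunk_index += 1
            self.offset = 0
        # No existing chunk fits: add a new one, which is never released.
        self.chunk_sizes.append(max(req, self.default_size))
        self.chunk_index = len(self.chunk_sizes) - 1
        self.offset = req


def chunks_after(utf8_bytes, calls, buf_size=8192):
    """Number of chunks held after `calls` repeated set_sentence calls."""
    model = ChunkListModel(buf_size)
    for _ in range(calls):
        model.free()                 # assumed to happen on each call
        model.alloc(utf8_bytes + 2)  # +1 in strdup, +1 in alloc
    return len(model.chunk_sizes)


print(chunks_after(8187, 1000))  # UTF-8 length of 'あ' * 2729
print(chunks_after(8190, 1000))  # UTF-8 length of 'あ' * 2730
```

In this model the shorter input keeps reusing a single chunk, while the longer input adds a new chunk on every call, which would be consistent with memory use growing without bound.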

Temporary solution

  1. Edit BUF_SIZE

mecab/mecab/src/common.h

Lines 72 to 74 in 3a07c4e

#define MIN_INPUT_BUFFER_SIZE 8192
#define MAX_INPUT_BUFFER_SIZE (8192*640)
#define BUF_SIZE 8192

  • before

    #define MIN_INPUT_BUFFER_SIZE 8192
    #define MAX_INPUT_BUFFER_SIZE (8192*640)
    #define BUF_SIZE 8192
  • after

    #define MIN_INPUT_BUFFER_SIZE 16384
    #define MAX_INPUT_BUFFER_SIZE (16384*640)
    #define BUF_SIZE 16384
  2. Rebuild and reinstall

    make
    sudo make install

Proposed solution

The underlying problem is that execution continues without any indication while memory is leaking.

  • Emit a warning when the input string exceeds BUF_SIZE, in the Python wrapper as well.
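
As a rough illustration of the proposed warning, the check could be done on the caller's side before the text reaches MeCab. This is only a sketch: `set_sentence_checked` is a hypothetical helper, the two-byte margin is an assumption derived from the strdup/alloc snippet above, and `FakeLattice` is a stand-in so that the example runs without MeCab installed.

```python
import warnings

BUF_SIZE = 8192  # value from mecab/src/common.h, as quoted in this issue


def set_sentence_checked(lattice, text, buf_size=BUF_SIZE):
    """Hypothetical helper: warn before passing oversized input to MeCab."""
    n_bytes = len(text.encode('utf-8'))
    # Assumption: the allocator needs n_bytes + 2 to stay below buf_size.
    if n_bytes + 2 >= buf_size:
        warnings.warn(
            'input is %d UTF-8 bytes; inputs this close to BUF_SIZE (%d) '
            'may leak memory when the lattice is reused' % (n_bytes, buf_size),
            RuntimeWarning,
            stacklevel=2,
        )
    lattice.set_sentence(text)


class FakeLattice:
    """Stand-in for MeCab.Lattice so the sketch runs without MeCab."""

    def __init__(self):
        self.sentence = None

    def set_sentence(self, text):
        self.sentence = text


if __name__ == '__main__':
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        set_sentence_checked(FakeLattice(), 'あ' * 2729)  # below the limit
        set_sentence_checked(FakeLattice(), 'あ' * 2730)  # at the limit
    print(len(caught), 'warning(s)')
```

With a real installation, a `MeCab.Lattice()` instance would be passed in place of `FakeLattice()`. A warning does not fix the leak itself, but it makes the condition visible instead of letting memory grow silently.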