ndif-team/nnsight

Memory leak after trace?

tvhong opened this issue · 3 comments

Description

There appears to be a memory leak when the user saves an output tensor across multiple traces on the same model.

I've tried this on the mps backend, but it might exist on the cuda backend as well (obvslib/obvs#10).

Please let me know if I'm missing something obvious.
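
Since the leak may not be specific to mps, a backend-agnostic reading could make the reproduction easier to run on cuda too. The helper below is a minimal sketch, not part of nnsight: `allocated_mb` is a hypothetical name, and it returns None when torch or an accelerator is not available.

```python
import importlib.util


def allocated_mb():
    """Return allocated accelerator memory in MiB, or None if unavailable.

    Hypothetical helper for illustration; it is not an nnsight API.
    """
    # Skip cleanly when torch is not installed.
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if torch.cuda.is_available():
        # Bytes currently occupied by tensors on the current CUDA device.
        return torch.cuda.memory_allocated() / 1024 ** 2
    if torch.backends.mps.is_available():
        # Same call the reproduction script uses for the mps backend.
        return torch.mps.current_allocated_memory() / 1024 ** 2
    return None


if __name__ == "__main__":
    print(f"Allocated: {allocated_mb()}")
```

Calling this once per loop iteration, in place of the mps-only call, should give comparable numbers on either backend.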

Reproduction

  1. Create an empty venv and install the requirements from requirements.txt (or just pip install torch nnsight).
  2. Create a memtest.py file:
# memtest.py

import torch
from nnsight import LanguageModel


PROMPT = ''.join(['This will be a long prompt']*160)


def test_reuse_model():
    print("Testing reuse model")

    model = LanguageModel("gpt2", device_map="mps")
    x = None

    for i in range(10):
        with model.trace(PROMPT):
            x = model.transformer.h[1].output.save()

        print(f"After {i + 1} run")
        print(f"Memory usage: {torch.mps.current_allocated_memory() / 1024 ** 2 : .2f}mb")
        torch.mps.empty_cache()


def test_reinit_model():
    print("Testing reinit model")

    x = None

    for i in range(10):
        model = LanguageModel("gpt2", device_map="mps")
        with model.trace(PROMPT):
            x = model.transformer.h[1].output.save()

        print(f"After {i + 1} run")
        print(f"Memory usage: {torch.mps.current_allocated_memory() / 1024 ** 2 : .2f}mb")
        torch.mps.empty_cache()


if __name__ == "__main__":
    test_reuse_model()
    test_reinit_model()
  3. Run the test and observe output:
Testing reuse model
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 1 run
Memory usage:  806.33mb
After 2 run
Memory usage:  1051.93mb
After 3 run
Memory usage:  1298.41mb
After 4 run
Memory usage:  1298.41mb
After 5 run
Memory usage:  1544.88mb
After 6 run
Memory usage:  1791.36mb
After 7 run
Memory usage:  1791.36mb
After 8 run
Memory usage:  2037.83mb
After 9 run
Memory usage:  2037.83mb
After 10 run
Memory usage:  2284.31mb
Testing reinit model
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 1 run
Memory usage:  827.47mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 2 run
Memory usage:  731.45mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 3 run
Memory usage:  827.47mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 4 run
Memory usage:  731.45mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 5 run
Memory usage:  827.47mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 6 run
Memory usage:  731.45mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 7 run
Memory usage:  827.47mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 8 run
Memory usage:  731.45mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 9 run
Memory usage:  827.47mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 10 run
Memory usage:  731.45mb

Expectation

Memory usage should be the same for test_reuse_model and test_reinit_model, rather than growing with each trace when the model is reused.
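
To put numbers on the gap between the two runs, the readings printed above can be checked directly. This is a rough sketch: the lists are copied from the output in this issue, while `looks_like_leak` and its 50 MB tolerance are assumptions chosen for illustration, not a standard test.

```python
# Readings (in the script's "mb" units) copied from the
# "Testing reuse model" output above.
reuse_mb = [806.33, 1051.93, 1298.41, 1298.41, 1544.88,
            1791.36, 1791.36, 2037.83, 2037.83, 2284.31]

# Readings copied from the "Testing reinit model" output above.
reinit_mb = [827.47, 731.45] * 5


def net_growth(readings):
    """Difference between the last and the first reading."""
    return readings[-1] - readings[0]


def never_decreases(readings):
    """True if each reading is at least as large as the previous one."""
    return all(b >= a for a, b in zip(readings, readings[1:]))


def looks_like_leak(readings, tolerance_mb=50.0):
    """Hypothetical heuristic: memory never drops and grows past a tolerance."""
    return never_decreases(readings) and net_growth(readings) > tolerance_mb


if __name__ == "__main__":
    print(f"reuse growth:  {net_growth(reuse_mb):.2f}")
    print(f"reinit growth: {net_growth(reinit_mb):.2f}")
    print(f"reuse leaks?   {looks_like_leak(reuse_mb)}")
    print(f"reinit leaks?  {looks_like_leak(reinit_mb)}")
```

With these readings, the reused model never releases memory between traces, while the re-initialized model only oscillates between two values.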

@tvhong This should be fixed as of commit bdbf682 on dev. Could you try running it on the dev branch?

@JadenFiotto-Kaufman I can confirm that the issue is fixed on the dev branch. Thank you for the quick turnaround!

Testing reuse model
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 1 run
Memory usage:  806.33mb
After 2 run
Memory usage:  731.45mb
After 3 run
Memory usage:  731.45mb
After 4 run
Memory usage:  731.45mb
After 5 run
Memory usage:  731.45mb
After 6 run
Memory usage:  731.45mb
After 7 run
Memory usage:  731.45mb
After 8 run
Memory usage:  731.45mb
After 9 run
Memory usage:  731.45mb
After 10 run
Memory usage:  731.45mb
Testing reinit model
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 1 run
Memory usage:  828.36mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 2 run
Memory usage:  733.19mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 3 run
Memory usage:  825.73mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 4 run
Memory usage:  733.19mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 5 run
Memory usage:  825.73mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 6 run
Memory usage:  733.19mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 7 run
Memory usage:  825.73mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 8 run
Memory usage:  733.19mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 9 run
Memory usage:  825.73mb
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
After 10 run
Memory usage:  733.19mb

Confirmed: the issue is fixed in version 0.2.11.
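
For anyone checking whether their environment already includes the fix, a version comparison along these lines may help. It is a minimal sketch: `has_fix` and `parse_release` are hypothetical helpers, and the parsing only looks at the leading digits of the first three components, so a pre-release such as "0.2.11rc1" would be treated the same as 0.2.11.

```python
import re
from importlib import metadata

# First release reported in this thread to contain the fix.
FIXED_IN = (0, 2, 11)


def parse_release(version):
    """Turn a version string like '0.2.11' into a tuple of ints.

    Only the leading digits of the first three dot-separated
    parts are used; suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(version):
    """True if the given version is at least 0.2.11."""
    return parse_release(version) >= FIXED_IN


def installed_nnsight_has_fix():
    """Check the installed nnsight; None if it is not installed."""
    try:
        return has_fix(metadata.version("nnsight"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_nnsight_has_fix())
```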