EleutherAI/gpt-neox

Tests fail when run with pytest --forked

segyges opened this issue · 1 comment

Describe the bug
When the tests are run with pytest --forked, per the instructions in tests/README.md, a large number of them fail with the error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

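This appears to be a problem with the way --forked runs tests in forked subprocesses: once CUDA has been initialized in the parent process, it cannot be re-initialized in a forked child. It makes testing, and therefore development on the library, rather difficult.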
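For illustration, here is a minimal sketch of the 'spawn' start method that the error message recommends. It is not a fix for the neox test suite. The helper name run_in_spawned_process is hypothetical, and math.sqrt stands in for a test body that would touch CUDA. The sketch assumes the work can be expressed as a picklable, top-level function.

```python
import math
import multiprocessing as mp

# "spawn" starts a fresh interpreter for the child, so the child does not
# inherit the parent's process state the way a forked child does.
SPAWN_CTX = mp.get_context("spawn")


def run_in_spawned_process(func, *args):
    """Run func(*args) in a freshly spawned child process and return its result.

    Hypothetical helper: func and args must be picklable, which in practice
    means func should be defined at module top level.
    """
    with SPAWN_CTX.Pool(processes=1) as pool:
        # apply() blocks until the child has finished and returns its result.
        return pool.apply(func, args)


if __name__ == "__main__":
    # math.sqrt is a stand-in for a real test body that would use CUDA.
    print(run_in_spawned_process(math.sqrt, 16.0))
```

Using a dedicated context via get_context leaves the global start method untouched, so other code in the same process that relies on the default is unaffected.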

To Reproduce
Steps to reproduce the behavior:

  1. Install neox in your environment however you normally do
  2. Optionally start a training run to confirm that your environment is working
  3. Exit out of that run
  4. cd tests (from the repository root)
  5. pytest --forked

Expected behavior
Tests pass, or fail for reasons to do with the code in the tests themselves.

Proposed solution
I have no idea.

Environment

  • GPUs: 2x 3090s
  • Configs: N/A

Additional context
forked-report.zip
Attached HTML report of the test failures.

We are currently sidestepping this with #1149 until we have time to properly resolve the issue with launching CUDA in forked processes.

Some tests have been restored, all have been cleaned up a bit, and the model training tests are skipped for now.