nltk/nltk

Downloader race condition with multiple processes

naktinis opened this issue · 5 comments

When calling nltk.download followed by model usage from multiple processes at once, at least four types of errors can happen:

  • Error with downloaded zip file
  • [Errno 2] No such file or directory
  • zipfile.BadZipFile: Bad CRC-32 for file
  • LookupError

I can reproduce this every time using the following script (but you may need to tweak the sleep time or number of processes to match your hardware and internet connection):

import nltk
import time
import random
import shutil
import multiprocessing

def download_and_run():
    print("Downloading...")
    # Stagger the processes slightly so they hit the downloader at different times.
    time.sleep(random.random() * 2)
    nltk.download('punkt', force=True, download_dir='models')
    nltk.data.path.insert(0, 'models')
    print(nltk.sent_tokenize("Some text to split. Another sentence."))

if __name__ == "__main__":
    # Start from a clean slate so every run triggers fresh, concurrent downloads.
    shutil.rmtree('models', ignore_errors=True)

    processes = []
    for i in range(10):
        processes.append(multiprocessing.Process(target=download_and_run))

    for p in processes:
        p.start()

    for p in processes:
        p.join()

The issue can be solved by writing model files atomically, as proposed in this pull request: #3247.
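For illustration only, here is a minimal sketch of the atomic-write idea (not the code from #3247, and atomic_download plus the use of urllib are assumptions for the example): stream the download into a temporary file in the destination directory, then rename it into place, so a concurrent reader never sees a half-written zip.

import os
import tempfile
import urllib.request

def atomic_download(url, dest_path):
    """Download url to dest_path without ever exposing a partially written file."""
    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    # Stage the download in a temporary file on the same filesystem as the destination.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(1 << 20)
                if not chunk:
                    break
                tmp.write(chunk)
        # os.replace is atomic on both POSIX and Windows, so concurrent readers
        # see either the previous file or the complete new one, never a partial write.
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)
        raise

Writing the temporary file to the destination directory rather than the system temp directory matters here, because the rename is only atomic within a single filesystem.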

ekaf commented

When calling nltk.download followed by model usage from multiple processes at once

@naktinis, in which realistic use case would you want to download the same package multiple times at once to the same file location?

naktinis commented

@naktinis, in which realistic use case would you want to download the same package multiple times at once to the same file location?

@ekaf this is actually happening when running multiple concurrent task workers that use nltk. Basically, there is a pool of independent worker processes that listen to a queue of incoming tasks. Each worker is responsible for making sure its environment is set up (which includes downloading any models). Each worker is encapsulated and starts independently (e.g. what you would have with Celery tasks), so there is no concept of a centralized setup step.

To make matters worse, the service starts in a clean environment (a Docker container on a node in "the cloud"), so after every restart the models are no longer on disk and have to be downloaded again.
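Until the downloader itself is fixed, a worker can at least skip the download when the model is already on disk; this narrows the race window but does not close it. A sketch, assuming the 'models' directory from the script above (ensure_punkt is a hypothetical helper, not part of nltk):

import nltk

def ensure_punkt(download_dir='models'):
    # Make the shared directory visible to nltk's resource lookup.
    if download_dir not in nltk.data.path:
        nltk.data.path.insert(0, download_dir)
    try:
        # nltk.data.find raises LookupError when the resource is not on any data path.
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt', download_dir=download_dir)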

ekaf commented

@naktinis, it sounds like your problem is not specific to nltk: any software that your workers install would run into the same race conditions.
Solving the problem in nltk would not solve it for other data packages.

naktinis commented

Solving the problem in nltk would not solve it for other data packages.

True, but in this case it was happening with nltk, and other packages have their own ways of dealing with similar issues. For example, Hugging Face's model download logic also uses the "download to a temporary file and move to destination" strategy: https://github.com/huggingface/huggingface_hub/blob/5ff2d150d121d04799b78bc08f2343c21b8f07a9/src/huggingface_hub/file_download.py#L1833C5-L1833C30

They also seem to use additional locking during downloads here: https://github.com/huggingface/huggingface_hub/blob/5ff2d150d121d04799b78bc08f2343c21b8f07a9/src/huggingface_hub/file_download.py#L1366 (but that is just one way of dealing with this, and maybe they also had other reasons).
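A comparable cross-process lock could be placed around nltk.download itself, for example with the third-party filelock package (nltk does not ship such a lock; this is only a sketch of the pattern, and locked_download is a hypothetical helper):

import nltk
from filelock import FileLock  # third-party package: pip install filelock

def locked_download(resource='punkt', download_dir='models'):
    # Every process sharing the filesystem contends for the same lock file,
    # so only one of them runs the actual download at a time.
    with FileLock(f'{download_dir}.{resource}.lock'):
        nltk.download(resource, download_dir=download_dir)

Workers that lose the race simply wait for the lock and then find the package already present when their turn comes (nltk.download skips packages that are already up to date unless force=True).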

ekaf commented

Hugging Face model download logic also uses the "download to a temporary file and move to destination" strategy

Ok, it seems prudent to also use this approach in nltk.

But all these redundant downloads should not happen in the first place. If all your worker processes share the same filespace, a package only needs to be downloaded once; every subsequent download is wasted work. If such redundant downloads are common in ML, they congest the network and waste a lot of electricity.
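Where a centralized setup step is possible, the simplest way to avoid the redundant downloads is to fetch the models once in the parent process before forking the workers; a sketch reworking the reproduction script above:

import multiprocessing
import nltk

def run():
    nltk.data.path.insert(0, 'models')
    print(nltk.sent_tokenize("Some text to split. Another sentence."))

if __name__ == "__main__":
    # One download in the parent; the worker processes only read the files.
    nltk.download('punkt', download_dir='models')

    processes = [multiprocessing.Process(target=run) for _ in range(10)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()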