Parsl/parsl

TaskVine vs threadpoolctl import error reporting

Opened this issue · 5 comments

Describe the bug

TaskVine installed from source starts needing threadpoolctl at commit 0d3c115b082f3f6e365385d7602c923b676a9a3f

pytest of taskvine with this config:

from parsl.config import Config
from parsl.data_provider.file_noop import NoOpFileStaging
from parsl.data_provider.ftp import FTPInTaskStaging
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig


def fresh_config():
    return Config(executors=[TaskVineExecutor(manager_config=TaskVineManagerConfig(port=9000, shared_fs=True),
                                              worker_launch_method='factory', function_exec_mode='serverless')])

stops working at that commit and instead hangs.

It doesn't log anything that I could find about needing threadpoolctl.

This shouldn't be a problem when installing from proper packaging - but more generally I think it is a problem that some internal piece is breaking and hanging, and that same situation may arise for other more relevant errors.

To Reproduce
Use TaskVine at that commit above with the supplied config file, and make sure threadpoolctl is not installed with pip

Expected behavior
proper shutdown with error message leading the user towards the problem

Environment
my laptop

Some observations on my end:

  • I tried running a test with CCTools 1 commit before threadpoolctl: 4a9b3084f170284d4851404d122d1a6f13675b73 and my test still hangs. I wonder if threadpoolctl has anything to do with the hanging.
  • Here's my test for reference. It creates 3 types of executors (threadpool, htex, taskvine) and runs a simple function on all.
  • Putting the dfk in a with context resolves the hanging. Without this context, taskvine hangs. Htex doesn't hang and instead exits but the log doesn't show the dfk calls its cleanup method (to test this you must make htex the last executor before Python exits).

A question to follow up:

  • Should using with context be the recommended way for parsl to clean up? If so I think it warrants changing the test code to reflect this. Wdyt @benclifford ?
#!/usr/bin/env python3

import parsl
from parsl import python_app
from parsl import bash_app
from parsl.config import Config
from parsl.executors.taskvine import TaskVineExecutor
from parsl.executors.taskvine import TaskVineManagerConfig
from parsl.executors.taskvine import TaskVineFactoryConfig
from parsl.providers import LocalProvider
from parsl.executors.threads import ThreadPoolExecutor
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor

local_thread_config = Config(
        executors=[ThreadPoolExecutor(
            max_threads=4,
            label='local-threads')],
        strategy=None,)

htex_config = Config(
        executors=[HighThroughputExecutor(
            label='local-htex',
            cores_per_worker=4,
            provider=LocalProvider(channel=LocalChannel(),
            init_blocks=1,
            max_blocks=1,))],
        strategy=None,)

vine_config = Config(
        executors=[TaskVineExecutor(
            label='local-vine',
            worker_launch_method='factory',
            function_exec_mode='regular',
            manager_config=TaskVineManagerConfig(
                project_name='parsl-vine'),
            factory_config=TaskVineFactoryConfig(
                min_workers=1,
                max_workers=1,
                workers_per_cycle=1,
                cores=4,
                batch_type='local'))],
            strategy=None,)

all_configs = [local_thread_config, htex_config, vine_config]

def f(x):
    return x + 1
f_app = python_app(f)

def main():
    active_configs = [local_thread_config, htex_config, vine_config]
    #active_configs = [local_thread_config, htex_config]
    for config in active_configs:
        #with parsl.load(config):
        parsl.load(config)
        num_tasks = 1
        arg = 5
        results = []
        for _ in range(num_tasks):
            results.append(f_app(arg))

        total_sum = 0
        for i in range(num_tasks):
            total_sum += results[i].result()

        expected = f(arg) * num_tasks
        assert total_sum == expected
        
        parsl.clear()
        print(f'OK: {config.executors[0].label}')

if __name__ == '__main__':
    main()

Ah I see what you mean. Under the hood TaskVine sends a library and then sends a function to run inside the library. The library now requires threadpoolctl so it will fail if it can't import threadpoolctl. This triggers infinite (I think) retries from the manager making the code looks like it hanged. A look into taskvine logs should give some evidence for this (or against this).

Deeper I think the question would be how hard do we try before giving up and telling the user so...

Most of the rest of Parsl has a "abandon as soon as there is an error, and let Parsl-level retry policy deal with it". Which usually people have as none, or a small number of tries. That leads, ideally, to whatever the error was being propagated to the user's workflow code as an exception in the relevant task/app Future object. So...

Deeper I think the question would be how hard do we try before giving up and telling the user so...

... it would align with that philosophy to not try very hard at all and let someone else (the user, and Parsl retries) deal with it.

That makes sense to me. I added this to a list of improvements I'm planning for TaskVineExecutor. Thanks for the clarification!