uber/neuropod

Updating NEUROPOD_BASE_DIR from the parent process doesn't affect "register backend" in IPE and OPE modes.

vkuzmin-uber opened this issue · 1 comment

Bug

The new environment variable NEUROPOD_BASE_DIR allows specifying the location where "register backend" searches for backends.
I tried updating NEUROPOD_BASE_DIR from the parent process at the very beginning (using setenv and putenv from the C standard library), but it had no effect on loading a model in OPE mode. My guess is that in OPE mode, before spawning the "worker", it reads environment variables from:
extern char **environ;

which doesn't change with setenv/putenv calls. It seems that currently the environment variable must be set before the process starts.

In our case, we are considering a system with dynamic configuration, where the "service" doesn't need a restart to add/remove backends.
It would be good to have some programmatic way for the parent process to control the "base path" used by "register_backend".

To Reproduce

I can write a test for the setenv + NEUROPOD_BASE_DIR case, but the right solution may not involve NEUROPOD_BASE_DIR at all; it could instead be a new Control API that allows changing the "neuropod base" for backends at runtime.

extern char **environ;

which doesn't change with setenv/putenv calls. It seems that currently the environment variable must be set before the process starts.

That is incorrect. setenv will modify environ. See this simple test:

#include <iostream>
#include <string>
#include <unordered_map>

#include <stdlib.h>

extern char **environ;

namespace
{

// (This is identical to the code used to launch the worker process)
// See https://github.com/uber/neuropod/blob/c15727e1524f6fc22f87f0ac5bee14751f779087/source/neuropod/multiprocess/multiprocess.cc#L46-L67
// A utility to get the environment as a map
std::unordered_map<std::string, std::string> get_env_map()
{
    std::unordered_map<std::string, std::string> env;
    for (char **current = environ; *current; current++)
    {
        std::string item = *current;
        const auto  pos  = item.find('=');
        if (pos == std::string::npos)
        {
            // No `=` found
            continue;
        }

        const auto key = item.substr(0, pos);  // Not including the `=`
        const auto val = item.substr(pos + 1); // Not including the `=`

        env[key] = val;
    }

    return env;
}

} // namespace

int main()
{
    setenv("NEUROPOD_BASE_DIR", "asdf", 1);

    const auto env = get_env_map();
    for (const auto &item : env)
    {
        std::cout << item.first << ": " << item.second << std::endl;
    }
}

The output will contain NEUROPOD_BASE_DIR: asdf.

Are you sure you're updating the environment before loading a model?

Also, just want to note that setenv is not safe to use in multithreaded applications. See https://rachelbythebay.com/w/2017/01/30/env/, http://www.club.cc.cmu.edu/~cmccabe/blog_the_setenv_fiasco.html, and https://man7.org/linux/man-pages/man3/setenv.3.html

(I'm aware that we use setenv in the pythonbridge backend and sometimes there isn't a way around it, but I just wanted to point it out in case you start to see issues)
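One common mitigation, if your own code must mutate the environment at runtime, is to funnel every read and write through a single lock. This is only a sketch (the `safe_env` namespace and function names are made up for illustration, not part of Neuropod), and note its key limitation: it only helps if all threads use these wrappers, since libraries that call getenv/setenv directly can still race.

```cpp
#include <cstdlib>
#include <mutex>
#include <optional>
#include <string>

// Illustrative wrappers that serialize environment access through one
// process-wide mutex. This does NOT make setenv safe against code that
// bypasses the wrappers; it only serializes callers that opt in.
namespace safe_env
{

std::mutex &env_mutex()
{
    static std::mutex m;
    return m;
}

void set(const std::string &key, const std::string &value)
{
    std::lock_guard<std::mutex> lock(env_mutex());
    ::setenv(key.c_str(), value.c_str(), /*overwrite=*/1);
}

std::optional<std::string> get(const std::string &key)
{
    std::lock_guard<std::mutex> lock(env_mutex());
    const char *val = ::getenv(key.c_str());
    if (val == nullptr)
    {
        return std::nullopt;
    }
    return std::string(val);
}

} // namespace safe_env
```

A safer design, where possible, is to set all environment variables before any threads are spawned and treat the environment as read-only afterwards.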

Feature request

In our case, we are considering a system with dynamic configuration, where the "service" doesn't need a restart to add/remove backends.
It would be good to have some programmatic way for the parent process to control the "base path" used by "register_backend".

It sounds like the goal is "atomic" additions/removals of backends when using OPE (without requiring a service restart).

Your service would create a new base_dir and use it for new model instances while "draining" the old base dir and removing it once there are no remaining references to it. There are probably optimizations with symlinks here too.
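For the symlink idea, a rough sketch: keep a stable "current" symlink as the base dir and repoint it atomically with rename(), which on POSIX replaces the destination atomically, so readers see either the old target or the new one, never a missing link. The function below is illustrative only (`swap_base_dir` is a made-up name, not a Neuropod API), and it assumes old model instances keep using paths they already resolved until they're drained.

```cpp
#include <cstdio>   // std::rename
#include <string>
#include <unistd.h> // symlink

// Illustrative sketch: atomically repoint a "current" base-dir symlink
// at a new directory of backends. Returns true on success.
bool swap_base_dir(const std::string &new_dir, const std::string &link_path)
{
    // Create the new symlink under a temporary name first...
    const std::string tmp_link = link_path + ".tmp";
    if (symlink(new_dir.c_str(), tmp_link.c_str()) != 0)
    {
        return false;
    }

    // ...then rename() it over the old link. rename() replaces the
    // destination atomically on POSIX, so there is no window where the
    // link is missing or half-updated.
    return std::rename(tmp_link.c_str(), link_path.c_str()) == 0;
}
```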

Is that what you mean?

If so, I think adding some configuration to RuntimeOptions is probably the cleanest way for us to make this possible.
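To make the proposal concrete, here is a purely hypothetical sketch of what such an option might look like. The field name `backend_base_dir` and its semantics are assumptions for discussion, not an existing part of Neuropod's RuntimeOptions.

```cpp
#include <string>

// Hypothetical sketch only: a mock of the proposed RuntimeOptions
// addition. The field name `backend_base_dir` is an assumption, not an
// existing Neuropod API.
struct RuntimeOptions
{
    // ... existing options elided ...

    // Proposed: per-model override of NEUROPOD_BASE_DIR. When non-empty,
    // the OPE worker would search this directory for backends instead of
    // reading the environment variable at spawn time.
    std::string backend_base_dir;
};
```

A caller could then point each new model instance at a fresh base dir (e.g. `opts.backend_base_dir = "/srv/backends/v2";`) without touching the process environment at all, which also sidesteps the setenv thread-safety issues above.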

Feel free to elaborate if I misunderstood.

Misc

For anyone reading, if you just want a different set of backends to be visible to different models, the recommended way is for each model to specify the version(s) of a framework it's compatible with at packaging time (https://neuropod.ai/docs/master/packagers/torchscript/#platform_version_semver).

That way, one model could say it's compatible with Torch 1.2.0 - 1.4.0 while another requires 1.7.0. None of the above "dynamic backend replacement" stuff is necessary for that use case.