Consistency issues in multi-world applications due to global static variables, additional issues when multi-threaded

Question

Consistency issues in multi-world applications due to global static variables, additional issues when multi-threaded

Naios opened this issue a year ago · 16 comments

@SanderMertens thank you for building the amazing flecs library. Your library is really impressive and a huge technical accomplishment. I have found the following issue:

Describe the bug
Currently, ids for components are cached inside a per-type local variable in cpp_type_impl<T>::s_id.

This can lead to issues if the same application has multiple flecs::world instances, since one cannot guarantee that each world gets initialized in the same order, and always has the same set of entity ids available.

The current way of dealing with this, is that it triggers an assertion although this case is not really a programming error and can happen in a valid program:

// component.hpp

        // If an identifier was already set, check for consistency
        if (s_id) {
            ecs_assert(s_id == entity, ECS_INCONSISTENT_COMPONENT_ID, 
                type_name<T>());
            ecs_assert(allow_tag == s_allow_tag, ECS_INVALID_PARAMETER, NULL);

            // Component was already registered and data is consistent with new
            // identifier, so nothing else to be done.
            return;
        }

Additionally, using the static global variables in cpp_type_impl<T>::s_id has additional issues in a multi-threaded context, for instance that you cannot guarantee the consistent state of the following variables without a lock:

        s_id = 0;
        s_size = 0;
        s_alignment = 0;
        s_allow_tag = true;

This can lead to inconsistencies between worlds and severe run-time issues. It would be great, if flecs could support running multiple worlds in the same process independently through not using such static variables.

I'm also not sure whether s_size and s_alignment are actually needed here on a global cache level since you can generate a temporary object containing this information quickly anyway with sizeof(T) and alignof(T).

Possible solution/Expected behaviour
This issue could be solved by removing all static variables and retrieve the component id for a specific component based on a hashmap where the key is a compile-time hashed name of of the component (like entt is doing it).

In general it would be great, if the library could not use such static variables overall since it limits the application of the library to just one instance per process.

Additional context
Revision: bf647d7

Answer 1 · 2023-08-27T15:13:57.000Z

A hashmap based approach adds runtime overhead, even if the hash itself is compile time generated. Flecs is also C++11 compatible, which limits the number of compile time features the library can use.

it would be great, if the library could not use such static variables overall since it limits the application of the library to just one instance per process

You can definitely have multiple Flecs worlds per application- there are many projects that do this. There are a few things you can do to prevent conflicts:

If worlds share the exact same set of components, make sure they're registered in the same order (for example by having a single function that creates the worlds & imports modules/registers components)
If worlds share components but they partially overlap, what you can do is create a temporary dummy world first in which you register all components. This initializes the static variables and ensures that when components are registered across worlds they all use the same id.

These approaches require minimal code to setup, and don't add runtime overhead to every ECS operation.

You can preregister components like this:

world.component<Position>();

Closing this issue as this is expected behavior.

Answer 2 · 2023-08-27T18:03:56.000Z

@SanderMertens thank you for your extensive answer.

I think this design decision limits the use case of flecs to only support multiple worlds with exactly the same layout and component initialization and could lead to unexpected issues when using multiple worlds. In release builds no one would even notice that the component id had a conflict.

To prevent this, would it be possible to make multi-threaded compatible component id assignments available through a feature flag/compile definition instead?

A hashmap based approach adds runtime overhead, even if the hash itself is compile time generated. Flecs is also C++11 compatible, which limits the number of compile time features the library can use.

Well, you are right that it is difficult to hash strings in C++11 for this reason, but there is a second O(1) approach to this, which could be a very suitable solution:

enum class cpp_type_instance : std::uint8_t;

template <typename T>
struct cpp_type_impl {
  static cpp_type_instance const instance{}; // Unique per compilation unit and type instantiation

  static flecs::entity id(flecs::world* world) {
    // O(1)
    if (auto const itr = world->component_id_assignments.find(&instance); itr != world->component_id_assignments) {
      return itr->second;
    }

    // Search for the same component by its type-name - O(n) due to string hashing
    if (auto const itr = world->component_id_type_names.find(component_name /*string hashing & comparison*/); itr != world->component_id_type_names.end()) { // Find by name
      component_id_assignments[&instance] = itr->second;
      return itr->second:
    }

    // If not found register the component
  }
};

struct world {
  std::unordered_map<cpp_type_instance const* /*instance*/, flecs::entity /*component_id*/> component_id_assignments;
  std::unordered_map<std::string /*type_names*/, flecs::entity /*component_id*/> component_id_tyoe_names;
};

In this approach we do not use string hashing, but instead use an address that is unique per compilation unit and type instantiation, which also allows O(1) hashmap access.

I think this could be a very useful addition to your library.

Answer 3 · 2023-08-27T18:21:53.000Z

limits the use case of flecs to only support multiple worlds with exactly the same layout and component initialization

This is not the case- see my second approach from the previous message:

If worlds share components but they partially overlap, what you can do is create a temporary dummy world first in which you register all components. This initializes the static variables and ensures that when components are registered across worlds they all use the same id.

would it be possible to make multi-threaded compatible component id assignments

A hashmap based approach is also not thread safe. Even in EnTT I believe you need to preregister components in a multithreaded application to prevent concurrent writes to the hashmap.

In this approach we do not use string hashing, but instead use an address that is unique per compilation unit and type instantiation, which also allows O(1) hashmap access.

Still, a hashmap lookup per ECS operation would add runtime overhead. The current approach has minimal runtime overhead, and also supports multi world, multi threaded, multi binary applications, provided it's setup correctly.

Note that the C++ API is built on top of the C API. You could always fork Flecs and replace the existing component id mechanism with you own without modifying the core library code.

Answer 4 · 2023-08-27T19:31:06.000Z

I agree that registering components concurrently in the same world is clearly out-of-scope and did not mean it this way.
But not allowing component registration in a dynamic order across multiple worlds without explicit pre-registration is a large drawback due to following reasons:

By implying that the user must register all components in advance, it is impossible to register a different set of components through plugins or addons, where each world potentially has a different set of components that are not known ahead of time.
Further, because the registration happens implicitly by cpp_type_impl::id() the order can easily mismatch based on which flecs functions are used conditionally. Also this means that one has to remind initializing every component and every tag inside a initialization function which increases (unwanted) code coupling and decreases code locality. Further, until C++17 function parameter evaluation is unspecified and might lead to inconsistent ordering of cpp_type_impl::id().
Additionally, by supporting order-independent registration it would automatically support concurrent registration over multiple worlds, where each world is only accessed from one thread.

I thought about a third solution that would only add very little performance overhead and does not use a hashmap, and will be very safe to use:

You could index each instance statically with an incrementing counter, and keep the id assignments inside a linear or chunked array local to each world. Something like:

// .h
extern std::atomic<std::size_t> global_counter;

// .cpp
extern std::atomic<std::size_t> global_counter = 0;

// .h
template <typename T>
struct cpp_type_impl {
  // Indexes into the component_cache
  static std::size_t const instance = global_counter.fetch_add(1U) - 1U;
  
  static flecs::entity id(flecs::world* world) {
    // O(1)
    if (instance < world->component_cache.size()) {
      return world->component_cache[instance];
    }

    // If not found register the component
  }
};

struct world {
  std::vector<flecs::entity> component_cache; // Fast linear access by index
};

This would solve all the mentioned issues at a minimal cost of just one indexed array element lookup.
My personal opinion is that I do not see how the small performance gain of a static global variable in comparison to world-local array justifies the possible runtime issues when you use multiple worlds or the care to handle component initialization always in the right order.
Also the assertion that protects against inconsistent component id registration is not active in release builds and would leave the library silently in an inconsistent dangerous state.

I respect your opinion categorizing this as wontfix, and might just patch a local copy with the suggested changes.
Thank you for your feedback.

Answer 5 · 2023-08-27T21:06:20.000Z

By implying that the user must register all components in advance, it is impossible to register a different set of components through plugins or addons, where each world potentially has a different set of components that are not known ahead of time.

With the aforementioned dummy world approach worlds can have different sets of components. What the dummy world approach achieves is that all static variables are initialized with non-conflicting ids. Every subsequently generated world will register the components with non-conflicting ids. For example:

{
  flecs::world dummy_world;
  dummy_world.component<A>(); // id 1
  dummy_world.component<B>(); // id 2
}

flecs::world world_1;
world_1.componen<A>(); // id 1

flecs::world world_2;
world_2.component<B>(); // id 2

by supporting order-independent registration it would automatically support concurrent registration over multiple worlds, where each world is only accessed from one thread.

That's true, fair point.

I do not see how the small performance gain of a static global variable in comparison to world-local array justifies the possible runtime issues when you use multiple worlds

The tradeoff is whether you want to penalize the common case (a single world) to address a problem that is far less common (multiple worlds) and which has known solutions that don't have runtime overhead.

The current approach has additional usability benefits. Say I wanted to write a generic routine that copies components for an entity from world A to world B (not uncommon in multi-world applications). If component ids are not consistent across worlds, you'd need to maintain a mapping. Even if that by itself is justifiable, it would be a breaking change.

The advantage of the current approach is that while it does require you to write explicit registration code, that code is localized to a single function that sets up the component ids in a dummy world. The alternative requires every part of the application that works with multiple worlds to be aware that the component ids can be different.

I thought about a third solution that would only add very little performance overhead and does not use a hashmap, and will be very safe to use:

Having said that I do think this is an interesting approach that's worth looking into. Flecs doesn't depend on STL (besides type_traits) so I'd have to write my own version of atomic, but that'd be doable.

One challenge is that you can't access the flecs::world::component_cache vector in scenarios where the world object is constructed from the underlying C object ad-hoc, like when doing flecs::iter::world(). The vector would have to be stored on the C object.

Answer 6 · 2023-08-27T21:19:43.000Z

Hm, actually I don't think this would work across binaries :/ different binaries could end up with different indices for the component.

Answer 7 · 2023-08-27T22:21:57.000Z

The tradeoff is whether you want to penalize the common case (a single world) to address a problem that is far less common (multiple worlds) and which has known solutions that don't have runtime overhead.

It might be a good solution to specialize for the common case and let users opt-in for the edge-case via a preprocessor definition.

One challenge is that you can't access the flecs::world::component_cache vector in scenarios where the world object is constructed from the underlying C object ad-hoc, like when doing flecs::iter::world(). The vector would have to be stored on the C object.

Probably the vector and the counter need to be moved into the c-library, since as far as I understand, the C++ wrapper is currently only a header-only library.

To make this work correctly, the counter needs to be defined inside a compilation unit and accessible through a function that can also be exported when flecs is compiled as shared library.

The current approach has additional usability benefits. Say I wanted to write a generic routine that copies components for an entity from world A to world B (not uncommon in multi-world applications). If component ids are not consistent across worlds, you'd need to maintain a mapping. Even if that by itself is justifiable, it would be a breaking change.

Initializing the counters in an arbitrary or potential random order does not influence the actual assignments of component ids. The assignment of component ids still depends on the initialization order alone (the call to cpp_type_impl::id()), and therefore my approach is equal to the current one used in the library in regard to the component id order.

If the library assumes component id consistency across multiple worlds, then this is another issue which would be needed to taken into account. If you want to copy a world or create a sub-world, you would just have to propagate the component-id-cache vector to ensure that both worlds have the same fixed set of component ids registered already.

Answer 8 · 2023-08-29T18:40:06.000Z

Closing this as the issue with multiple binaries prevents this from working correctly.

Answer 9 · 2023-08-30T20:28:02.000Z

I currently do not see any particular issue with this solution. Could you describe the issues you are seeing?

Additionally, since this overall issue or feature request describes a particular issue with a broad set of possible solutions, it would also be possible to fix this issue with a different solution?

Answer 10 · 2023-08-30T20:57:29.000Z

different binaries could end up with different indices for the component.

You can't guarantee that this:

struct cpp_type_impl {
  // Indexes into the component_cache
  static std::size_t const instance = global_counter.fetch_add(1U) - 1U;

won't get invoked multiple times for the same component id across different binaries, which would result in the same component getting multiple indices.

it would also be possible to fix this issue with a different solution?

True, if a solution is found that addresses this problem and doesn't measurably increase runtime overhead I'm happy to reopen the issue/create a new one.

therefore my approach is equal to the current one used in the library in regard to the component id order

Right. My point was that today component ids are kept the same across worlds which is something that applications could depend upon. Allowing component ids (not indices) to be different across worlds can break existing code.

Answer 11 · 2023-08-30T21:37:49.000Z

I took into account that the same component can get multiple indices, but this should not be an issue since all the indices will be resolved and cached to the same component id through the name lookup. I still think that this solution is ideal. If it actually affects performance in a measurable way needs to be tested - I would offer to open a PR for this.

My proposal is equal to how component ids are currently assigned and would not change the assignment order nor the resulting ids.

On the other hand, requiring that all components have the same id is a fundamental design decision which makes solving this issue probably impossible since creating multiple worlds with different component initialization will lead to different component ids.

Answer 12 · 2023-08-31T17:16:55.000Z

since all the indices will be resolved and cached to the same component id

Ahh gotcha. That's true I hadn't thought about that. Ok that sounds like a reasonable thing to try out :)

Answer 13 · 2023-09-03T15:31:55.000Z

I have looked into a possible implementation of this. And I was questioning, whether you want to keep the current static storages of the component class at all?

It seems there is currently a workaround for this issue in form of

inline void reset() {
    ecs_cpp_reset_count_inc();
}

already.

Additionally, the current implementation has the additional issue, that setting the members

    static entity_t s_id;
    static size_t s_size;
    static size_t s_alignment;
    static bool s_allow_tag;
    static int32_t s_reset_count;

is not threadsafe and can cause issues how it is currently implemented.
Further, the members in this class seem not to be initialized correctly through any constructor.

Therefore, I would prefer changing the current implementation to a world-local component id storage as discussed above, without offering any compile definition feature flag. This would then resolve all mentioned issues at once. Further, we could then remove the reset counter workaround since world-local component ids would be supported.

Additionally, I have recognized that a flecs world in C can also have mismatching component ids depending on in which order ECS_COMPONENT(ecs, Velocity); is called.
Therefore, would it be possible to support a mapping to translate the id of the same component across worlds, or when would this be needed?

Answer 14 · 2023-09-03T16:44:11.000Z

whether you want to keep the current static storages of the component class at all?

Probably not, everything could be stored on the world save for the generated index.

Further, we could then remove the reset counter workaround since world-local component ids would be supported.

Yup, this is a nice side benefit

an also have mismatching component ids depending on in which order ECS_COMPONENT(ecs, Velocity); is called.

Correct

would it be possible to support a mapping to translate the id of the same component across worlds, or when would this be needed?

You would need it when copying entities/components between worlds.

Answer 15 · 2023-10-17T10:59:25.000Z

Hi there,

I just wanted to chip in since I've been 2 days trying to debug my tests, wondering what was going on. As a Flecs newbie this has become extremely frustrating. The problem was my tests were crashing in the strangest ways, and was about to submit a bug when I saw this thread.

Eventually, I narrowed it down to this minimal example:

struct ComponentA {
  char x = 0;
};

struct ComponentB {
  int x = 0;
  int y = 0;
};

void print_components(flecs::entity e) {
  int i = 0;
  e.each([&](flecs::id id) {
    std::cout << i++ << ": " << id.str() << " (" << id << ")"
              << "\n";
  });
  std::cout << "\n";
}

void func1() {
  std::cout << "func1" << std::endl;
  flecs::world ecs;
  ecs.component<ComponentB>();

  auto a = ecs.entity().add<ComponentB>();

  print_components(a);
}

void func2() {
  std::cout << "func2" << std::endl;
  flecs::world ecs;
  ecs.component<ComponentA>();
  ecs.component<ComponentB>();  // crash

  auto a = ecs.entity().add<ComponentB>();

  print_components(a);
}

int main() {
  func1();
  func2();
  return 0;
}

... where func1() and func2() are a simplification of what were actually different unit tests that each test a single system, thus it makes sense to create a dedicated throwaway world just to test that system in particular.

This example crashes when registering ComponentB in func2() with fatal: entity.c: 2108: ComponentA (INVALID_COMPONENT_SIZE).

It is clear now that it is due to some id reused that makes it think ComponentB is ComponentA. It was not so clear in my bigger tests, where everything was mixed, so I would get crashes in weird places.

I was totally unaware of this pitfall. Thanks to @Naios comment I found about flecs::reset(), but that is not referenced in the guides.

Although keeping a world-local component registry may be more cumbersome, it is certainly more intuitive and will help adoption.

In any case, thank you @SanderMertens and everybody for this great work hopefully I'll have this in our game soon!

Answer 16 · 2024-02-12T18:43:49.000Z

@Naios @SanderMertens I wrote the following to try to address the issue, because its already hit me twice, first as a newbie and now the last few days. Ours is multi-world application and trying to guarantee same loading order would be complicated and require maintenance.

I wrote a constexpr compile-time, C++11 compatible function that generates per-type unique ids, but I don't know where to introduce it exactly for a potential PR. The idea would be to have it be opt-in, something like #define FLECS_DETERMINISTIC_COMPONENT_IDS. When enabled, component ids are handed out via the below hash function. When turned off, it works as before.

Here is the code:

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

#if __cplusplus >= 202002L
#include <source_location>
#endif

// FNV-1a hash function
constexpr uint32_t FNV1a_32_INIT = 0x811c9dc5;
constexpr uint32_t FNV1a_32_PRIME = 0x01000193;
constexpr uint32_t fnv1a_32(const char* str,
                            const uint32_t value = FNV1a_32_INIT) noexcept {
    return *str ? fnv1a_32(str + 1, (value ^ uint32_t(*str)) * FNV1a_32_PRIME)
                : value;
}

// This function returns a constexpr based off the type name
template <typename T>
constexpr const char* TypeName() {
#if __cplusplus >= 202002L
    return std::source_location::current().function_name();
#else
#ifdef _MSC_VER
    return __FUNCSIG__;
#else
    return __PRETTY_FUNCTION__;
#endif
#endif
}

// Typehash computes a compile-time type name hash
template <typename T>
constexpr uint32_t typehash() noexcept {
    return fnv1a_32(TypeName<T>());
}

struct MyComponent {};

int main() {
    constexpr const char* c = TypeName<int>();

    constexpr uint32_t tfloat = typehash<float>();
    constexpr uint32_t tint = typehash<int>();
    constexpr uint32_t tmycomp = typehash<MyComponent>();

    std::cout << c << ", " << tfloat << ", " << tint << ", " << tmycomp
              << std::endl;

    return 0;
}

Working demo: https://godbolt.org/z/MKEx1Yead
You will see in the above that the compilation output assembly only contains constants, no code is run at run-time.

I would like to know if this would be an acceptable solution until a per-world id storage is done and if so, where to exactly insert this code, because I having a hard time finding the right spot.

This would really help me solve the above issue.
Thanks!