Is there a string size limitation when creating the lookup table?
Hi, @krzysztof-jusiak
This is a really cool library! When reading the code I was left with 2 questions:
- Is there a limitation on the string sizes that we pass to the lookup table? This is relevant for the use case I am working with. And if there is such a limitation, is there a way to specify fallback logic for when the limit is exceeded?
- Is there any harm/slow-down in not passing <key, value> pairs and using just a list of keys? For example:
#include <array>
#include <string_view>

using namespace std::literals;

constexpr auto dict = std::array<std::string_view, 3>{
  "make"sv,
  "model"sv,
  "year"sv
};

int main() {
  static_assert(mph::find<dict>("make") == true);
  static_assert(mph::find<dict>("model") == true);
  static_assert(mph::find<dict>("year") == true);
  static_assert(mph::find<dict>("unknown") == false);
}
Thanks,
Francisco
Hi @FranciscoThiesen ,
- Yeah, there is a limit on string lengths, and different optimizations are used based on the max-length string found in the set. So there is a difference in the generated assembly for <= 4 max length and <= 8 max length, where the mov and pext can be optimized. For max length > 8 the generation is also more likely to fail, and that will depend on the uniqueness of the input set when trying to find the mask. There are also storage implications based on the size, and the cache utilization will differ, as pairs of {max length, type required to store ids} are used. So for example, for strings <= 4 it's just u32, for <= 8 u64, for <= 16 u128, for <= 32 __m256 (avx2), and for <= 64 __m512 (avx512). There are options to do a second lookup table to limit the storage etc., as well as having a fallback. Lookup and get are SFINAE friendly, so whether the solution can be used can be checked with requires or similar techniques (a sketch of such a check is included after the examples below). lookup and find themselves have different backups based on the input strings. So for example, lookup will try magic_lut (which stores the whole lookup table in a single register), but that's rarely possible, so it then goes for the pext solution with masking (which is likely to find a solution). There is also a way to simply write backup policies based on a specific use case - example here - https://github.com/qlibs/mph/blob/main/mph#L1152.
- There is no slowdown associated with not providing ids; they will be added automatically and the same optimizations will be triggered.
constexpr auto dict = std::array<std::string_view, 3>{
  "make"sv,
  "model"sv,
  "year"sv
};
is equivalent to writing
constexpr auto dict = std::array{
  pair{"make"sv, 0},
  pair{"model"sv, 1},
  pair{"year"sv, 2}
};
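For illustration, a minimal sketch of the requires-based check mentioned above. Only the mph::find<dict>(key) call form is taken from the snippets in this thread; the #include <mph> name, the perfect_hash_available helper, and the unordered_set fallback are assumptions made for the example, not documented mph API.

#include <mph>            // assumed include name for the single header
#include <array>
#include <string_view>
#include <unordered_set>

using namespace std::literals;

constexpr auto dict = std::array<std::string_view, 3>{"make"sv, "model"sv, "year"sv};

// True when mph::find can be instantiated for the given set; because find is
// SFINAE friendly, an unsupported set makes this false instead of a hard error.
template<const auto& Set>
constexpr bool perfect_hash_available =
    requires(std::string_view key) { mph::find<Set>(key); };

template<const auto& Set>
bool contains(std::string_view key) {
  if constexpr (perfect_hash_available<Set>) {
    return mph::find<Set>(key) == true;  // same call form as in the question
  } else {
    // Hypothetical fallback, not an mph policy: a plain hash set built once.
    static const std::unordered_set<std::string_view> fallback(Set.begin(), Set.end());
    return fallback.contains(key);
  }
}

int main() { return contains<dict>("make") ? 0 : 1; }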
Thanks for the fast response, @krzysztof-jusiak
Is there an example of how I can provide a fallback mechanism to handle strings with size > 64?
I saw on the other open issue that taking a segment of the string is a possibility, but I can't guarantee the uniqueness of any segment of size <= 64 for the strings in my set.
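For concreteness, the kind of manual fallback I have in mind looks roughly like the sketch below - splitting the dictionary by hand into a short-key set for mph and an ordinary container for the longer keys, then dispatching on the query length. This is hypothetical code on my side (the split and the unordered_set are not mph features); I was hoping there is a nicer, built-in way to express it.

#include <mph>            // assumed include name for the single header
#include <array>
#include <string_view>
#include <unordered_set>

using namespace std::literals;

// Keys within the <= 64 byte limit go through the perfect hash...
constexpr auto short_keys = std::array<std::string_view, 2>{"make"sv, "model"sv};

// ...and longer keys live in an ordinary container (placeholder value).
const std::unordered_set<std::string_view> long_keys{
  "some-key-that-is-longer-than-sixty-four-characters-aaaaaaaaaaaaaaaaaaaaa"sv
};

bool contains(std::string_view key) {
  // A query can only match a key of the same length, so anything longer than
  // 64 bytes cannot be in the mph set and goes to the fallback container.
  if (key.size() <= 64) {
    return mph::find<short_keys>(key) == true;
  }
  return long_keys.contains(key);
}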