Premature optimiztion
the-moisrex opened this issue · 3 comments
I saw your conference, in this slide you seem to be proud of this code:
constexpr ada::scheme::type get_scheme_type(
std::string_view const scheme) noexcept {
if (scheme.empty()) {
return ada::scheme::NOT_SPECIAL;
}
auto const hash_value = static_cast<int>(
(2 * scheme.size() + static_cast<unsigned>(scheme[0])) & 7U);
const std::string_view target = details::is_special_list[hash_value];
if ((target[0] == scheme[0]) && (target.substr(1) == scheme.substr(1))) {
return static_cast<ada::scheme::type>(hash_value);
}
return ada::scheme::NOT_SPECIAL;
}
Here I'm gonna give you a simpler solution that is also faster:
static constexpr std::string_view is_special_list2[] = {
"http", "https", "ws", "ftp", "wss", "file"};
// a little bit of re-ordering here:
enum class type2 : uint8_t {
HTTP,
HTTPS,
WS,
FTP,
WSS,
FILE,
NOT_SPECIAL,
};
constexpr ada::scheme::type2
get_simple_scheme_type(std::string_view const scheme) noexcept {
std::underlying_type_t<ada::scheme::type2> i = 0;
for (; i != 6; ++i) { // hopefully it'll be unrolled
auto const scheme_ith = details::is_special_list2[i];
if (scheme == scheme_ith) {
break;
}
}
return static_cast<ada::scheme::type2>(i);
}
Why?:
- loops get unrolled, hopefully (we can do it manually)
- string_view::size_type is a 64bit integer, and you're multiplying it. Of course that's gonna be slower than comparing 8bit characters.
GCC:
2024-03-20T11:35:18-07:00
Running ./a.out
Run on (32 X 5500 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 2048 KiB (x16)
L3 Unified 36864 KiB (x1)
Load Average: 0.25, 0.34, 0.53
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
Ada 1.60 ns 1.60 ns 447947046
Simple 0.976 ns 0.975 ns 713186229
LongStrAda 117 ns 116 ns 6137922
LongStrSimple 114 ns 114 ns 6077361
Clang:
2024-03-20T11:36:07-07:00
Running ./a.out
Run on (32 X 5191.05 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 2048 KiB (x16)
L3 Unified 36864 KiB (x1)
Load Average: 0.50, 0.40, 0.54
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
Ada 1.52 ns 1.52 ns 453872906
Simple 0.897 ns 0.896 ns 765627214
LongStrAda 118 ns 118 ns 5999300
LongStrSimple 120 ns 120 ns 6019850
Unrelated note:
I don't think we really need that enum to be detailed, we just need these values because that's the only ones that affect the algorithms:
enum struct scheme_type : stl::uint8_t {
not_special, // everything else
special_scheme, // http(s), ws(s), ftp
file, // file scheme
};
I didn't make a PR because I wasn't sure if changing the order of scheme::type
will break any other algorithm or not. If it doesn't, let me know.
It is entirely possible for a branchy approach to do even slightly better than the mostly branchless approach that we use in some tests where the protocol is predictable. E.g., if you know that the protocol will be "http", first checking for "http" might save the hashing cost. However, the hashing approach has the benefit of providing consistent worst case performance: you always do at most one string comparison... as opposed to scanning the whole set in the worst case.
I will add a comment in the code.