Premature optimiztion

Question

Premature optimiztion

the-moisrex opened this issue a year ago · 3 comments

I saw your conference, in this slide you seem to be proud of this code:

constexpr ada::scheme::type get_scheme_type(
    std::string_view const scheme) noexcept {
  if (scheme.empty()) {
    return ada::scheme::NOT_SPECIAL;
  }
  auto const hash_value = static_cast<int>(
      (2 * scheme.size() + static_cast<unsigned>(scheme[0])) & 7U);
  const std::string_view target = details::is_special_list[hash_value];
  if ((target[0] == scheme[0]) && (target.substr(1) == scheme.substr(1))) {
    return static_cast<ada::scheme::type>(hash_value);
  }
  return ada::scheme::NOT_SPECIAL;
}

Here I'm gonna give you a simpler solution that is also faster:

static constexpr std::string_view is_special_list2[] = {
    "http", "https", "ws", "ftp", "wss", "file"};

// a little bit of re-ordering here:
enum class type2 : uint8_t {
  HTTP,
  HTTPS,
  WS,
  FTP,
  WSS,
  FILE,
  NOT_SPECIAL,
};

constexpr ada::scheme::type2
get_simple_scheme_type(std::string_view const scheme) noexcept {
  std::underlying_type_t<ada::scheme::type2> i = 0;
  for (; i != 6; ++i) { // hopefully it'll be unrolled
    auto const scheme_ith = details::is_special_list2[i];
    if (scheme == scheme_ith) {
      break;
    }
  }

  return static_cast<ada::scheme::type2>(i);
}

Why?:

loops get unrolled, hopefully (we can do it manually)
string_view::size_type is a 64bit integer, and you're multiplying it. Of course that's gonna be slower than comparing 8bit characters.

GCC:

2024-03-20T11:35:18-07:00
Running ./a.out
Run on (32 X 5500 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
Load Average: 0.25, 0.34, 0.53
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
Ada                 1.60 ns         1.60 ns    447947046
Simple             0.976 ns        0.975 ns    713186229
LongStrAda           117 ns          116 ns      6137922
LongStrSimple        114 ns          114 ns      6077361

Clang:

2024-03-20T11:36:07-07:00
Running ./a.out
Run on (32 X 5191.05 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 2048 KiB (x16)
  L3 Unified 36864 KiB (x1)
Load Average: 0.50, 0.40, 0.54
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
Ada                 1.52 ns         1.52 ns    453872906
Simple             0.897 ns        0.896 ns    765627214
LongStrAda           118 ns          118 ns      5999300
LongStrSimple        120 ns          120 ns      6019850

Unrelated note:

I don't think we really need that enum to be detailed, we just need these values because that's the only ones that affect the algorithms:

    enum struct scheme_type : stl::uint8_t {
        not_special,    // everything else
        special_scheme, // http(s), ws(s), ftp
        file,           // file scheme
    };

Benchmark Soruce

Answer 1 · 2024-03-20T18:47:49.000Z

I didn't make a PR because I wasn't sure if changing the order of scheme::type will break any other algorithm or not. If it doesn't, let me know.

Answer 2 · 2024-03-20T19:32:56.000Z

It is entirely possible for a branchy approach to do even slightly better than the mostly branchless approach that we use in some tests where the protocol is predictable. E.g., if you know that the protocol will be "http", first checking for "http" might save the hashing cost. However, the hashing approach has the benefit of providing consistent worst case performance: you always do at most one string comparison... as opposed to scanning the whole set in the worst case.

I will add a comment in the code.

Answer 3 · 2024-03-20T19:53:11.000Z

See PR #618 where the design choice gets explained.