buildbarn/bb-storage

Using lots of memory seems to slow bb-storage down to a crawl

YngveNPettersen opened this issue · 16 comments

After our Buildbarn (combined with Goma) update in November/December, which changed the cache system to the new "local" system, we have had to restart the cache backend several times a week, frequently at least once a day, because the system slows down as memory usage grows.

Until this week, the backend was installed on an older server with 256 GB RAM, working on a 1.6 TB cache on SATA SSDs in a RAID.

This week, the backend (which was also updated to a Jun 12 state) was moved to a modern machine, also with 256 GB RAM, and much more disk space on NVMe drives, although only ~400 GB is currently used.

What I am observing is that as the memory usage of the bb-storage process passes ~70% of RAM (~180 GB), builds slow down significantly (today I restarted at 75% because a colleague reported slowness when building). RAM usage seems to top out just over ~80% of RAM (~205 GB).

Based on classification of RAM usage, e.g. by top, the memory usage seems to overlap with the memory allocated to file-caching, which also matches my impression that "local" uses memory mapping of files.

My current policy is to restart the backend if I notice bb-storage memory usage start inching towards 60%.

You’ve left out the most crucial thing from your report: what does your configuration look like?

Furthermore, please use Go’s pprof to capture both memory and CPU profiles.
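As a rough sketch of what capturing those profiles could look like: Go's `net/http/pprof` package serves profiles under `/debug/pprof/`. The snippet below only builds the URLs. The host and port are hypothetical, and it assumes the bb-storage process actually exposes these handlers on some diagnostics port, which depends on the deployment.

```python
from urllib.parse import urlencode

# Assumption: the process exposes Go's net/http/pprof handlers under
# /debug/pprof/ on a diagnostics HTTP port. Host and port are hypothetical.
BASE_URL = "http://localhost:9980"


def pprof_url(profile: str, base_url: str = BASE_URL, **params) -> str:
    """Build the URL of one pprof endpoint, e.g. 'heap' or 'profile'."""
    url = f"{base_url}/debug/pprof/{profile}"
    if params:
        url += "?" + urlencode(params)
    return url


heap_url = pprof_url("heap")                # memory (heap) profile
cpu_url = pprof_url("profile", seconds=30)  # 30-second CPU profile

print(heap_url)
print(cpu_url)
```

The resulting URLs could then be downloaded (for example with `urllib.request.urlretrieve`) and opened with `go tool pprof`.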

The blobstore configuration should be fairly standard; most of the values are IIRC the recommendations from the README or the source. It specifies a 3.2 TB cache (the previous one had half the size).

  blobstore: {
    contentAddressableStorage: {
      'local': {
        keyLocationMapOnBlockDevice: {
          file: {
            path: 'storage-cas/key_location_map',
            sizeBytes: 2 * 20 * 2*256*1024 * 8 * 16 * 64 ,
          },
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksOnBlockDevice: {
          source: {
            file: {
              path: 'storage-cas/blocks',
              sizeBytes: 2* 1400 * 1024 * 1024 * 1024,
            },
          },
          spareBlocks: 3,
        },
        persistent: {
          stateDirectoryPath: 'storage-cas/persistent_state',
          minimumEpochInterval: '300s',
        },
      },
    },
    actionCache: {
      completenessChecking: {
        'local': {
          keyLocationMapOnBlockDevice: {
            file: {
              path: 'storage-ac/key_location_map',
              sizeBytes: 2* 20 * 64*1024 * 8*16*64,
            },
          },
          keyLocationMapMaximumGetAttempts: 8,
          keyLocationMapMaximumPutAttempts: 32,
          oldBlocks: 8,
          currentBlocks: 24,
          newBlocks: 1,
          blocksOnBlockDevice: {
            source: {
              file: {
                path: 'storage-ac/blocks',
                sizeBytes: 2* 100* 1024 * 1024 * 1024,
              },
            },
            spareBlocks: 3,
          },
          persistent: {
            stateDirectoryPath: 'storage-ac/persistent_state',
            minimumEpochInterval: '300s',
          },
        },
      },
    },
  },

Have you tried moving the key-location map and the blocks to a raw block device? My guess is that performance at high utilisation rates simply becomes bad, because of high amounts of fragmentation and/or poor block allocation performance.

No, I haven't; as we are only using SSDs (as mentioned, the newest system is using 3GB/s NVMes, ext4 filesystem) I would think that filesystem fragmentation is not a serious problem.

Also, I don't think the filesystem is all that relevant, since restarting the bb-storage process gets the speed back up to normal, which makes me think the memory organization is the problem area. (Just guessing, but one possible area to investigate is how the data structures are organized and what the access time complexity is?)

(For reference, what we are building is a Chromium-based application; the total number of work items in ninja is >60K, of which probably ~50K are remote compiled in a full build.)

(FYI, about 2 hours ago I restarted the backend processes, and bb-storage has already passed 50% RAM usage)

restarting the bb-storage process gets the speed back up to normal

Which is most likely because closing a file forces a flush of all dirty pages.

Also, that's a pretty big key-location map you have there. It's 160 GB, meaning it can hold 160 GB / 66 B = 2.6 billion entries. With 2.8 TB of blocks space, you're roughly tuning for an average object size of 2.8 TB / 2.6 billion = 1155 B, which I don't think is realistic. Especially considering that LocalBlobAccess currently stores objects at a block granularity (4 KB? Maybe even more? Depends on your disk/configuration). I can well imagine that LocalBlobAccess performs poorly if ~62.5% of your RAM is used to store a disk-backed hash table.
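The arithmetic above can be reproduced from the posted configuration. This is only a back-of-the-envelope check: the 66 B entry size is taken from the comment itself, and the sizes are treated as binary units (GiB), which is what the `1024`-based expressions in the config produce.

```python
GIB = 1024 ** 3

# Values copied from the configuration posted above.
key_location_map_bytes = 2 * 20 * 2 * 256 * 1024 * 8 * 16 * 64
blocks_bytes = 2 * 1400 * GIB
ram_bytes = 256 * GIB

# Entry size as stated in the comment above (66 B per entry).
ENTRY_BYTES = 66

entries = key_location_map_bytes // ENTRY_BYTES
avg_object_bytes = blocks_bytes / entries
ram_fraction = key_location_map_bytes / ram_bytes

print(f"key-location map: {key_location_map_bytes / GIB:.0f} GiB")
print(f"entries:          {entries / 1e9:.1f} billion")
print(f"avg object size:  {avg_object_bytes:.0f} B")
print(f"share of RAM:     {ram_fraction:.1%}")
```

Changing `key_location_map_bytes` in this sketch shows directly how a smaller map shifts the implied average object size and the share of RAM the map would occupy.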

I increased the size of those significantly after #103 because I hit problems apparently caused by collisions. In the old system these sizes were half of the ones posted above (I doubled them all with the new drive).

The calculation is based on the number of major branches we build for (at least 2), build configurations (2: release and debug), and the number of platforms we build for (8), with ~50K+ build items for each (plus source files; the Chromium repo is 500K files of all types, ~200K of them C-type files, used or not). A single Windows Chromium build totals 55 GB across all 46K object files, which means a minimum full build is 100K cached files (plus at least as many extras like stdout and stderr), not counting all the unique header files (up to 100K).

IOW, a conservative estimate for Windows is 300K cached items (assuming I have included all categories; the 150K source files might be shared across a branch and multiple platforms). Other platforms will have at least the same (several have up to 50% more build items than Windows). That totals at least 10M items, and the hash table requires a multiplication factor of at least 10, so 200M individual items in the table is an absolute minimum (assuming each cache file has just one item in the hash table). So the current configuration may have an extra safety factor of ~10, but the numbers only assume major branches, not how many major rebuilds might be triggered by central header file changes or other major changes.

BTW, the new cache currently only consumes 450 GB, and we still hit the performance issue once RAM usage exceeds ~70%.

I just restarted the backend, after it had reached 80% and was slowing down a lot.

Before restarting I downloaded the heap, alloc, and routines text files (with a browser) from the pprof web page on the backend. A zip of that is attached:
vivaldi-goma_backend_pprof.zip

An update: We are also seeing that the backend can slow down significantly even with much smaller memory usage. What we have noticed about these cases is that a system process "kworker/u256:0+flush-253:4" is running with a full core when this happens. At least sometimes it stops after a while, and performance is back to normal.
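One way to check whether such slow periods coincide with a backlog of dirty pages is to watch the `Dirty` and `Writeback` fields of `/proc/meminfo` on Linux. The sketch below parses that file; the sample text is purely illustrative and is not data from the affected machine.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(text: str) -> str:
    """Summarise how much page cache is waiting to be written back."""
    info = parse_meminfo(text)
    dirty_mib = info.get("Dirty", 0) / 1024
    writeback_mib = info.get("Writeback", 0) / 1024
    return f"Dirty: {dirty_mib:.1f} MiB, Writeback: {writeback_mib:.1f} MiB"


# Illustrative sample only, used when /proc/meminfo is unavailable.
SAMPLE = """MemTotal:       263856000 kB
Dirty:            2048000 kB
Writeback:         512000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(dirty_summary(f.read()))
    except OSError:
        print(dirty_summary(SAMPLE))
```

Running this periodically while the kworker flush thread is busy would show whether the dirty-page count is large and draining slowly at those times.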

We also observed a speed reduction if the backend had been running for several days.

At present we are working around the high memory issue and the long-running issue by automatically restarting when it reaches 65% memory usage, and also restarting every night.

Are you using raw block devices now, or still files on a file system?

The system has not been changed, so it is still using a normal ext4 file system.

Changing that would be a major operation, even more so because AFAICT there are no deployment examples for such a scenario. (And in any case July is a vacation month)

Considering that there doesn't seem to be willingness to test solutions that are being presented, I'm hereby going to close this issue. Be sure to continue investigating this issue at your own convenience.

Did I say I was unwilling to test it?

I said:

  1. Testing your suggestion is a major operation, as it requires significant filesystem reconfiguration (thus involving system administrators) of first the test system, then the production system, as well as setting up new configurations (see point 3). It would probably require more than a week of experimentation. It could also not be started until time could be set aside to actually do the testing and deployment (see point 2 and 3).

  2. There has not been time to do that kind of testing since this report was filed, due to vacation time, and major priority projects both before and after the vacation have not allowed any further time-consuming investigations so far, and probably won't for another few weeks.

  3. There is insufficient information about how your suggestion should be implemented (especially in the configuration file), or information about where I can find it in easily accessible example form. I am therefore unable to perform the tests without spending extensive additional time trying to figure out those details (especially since the fields and structures are named in different formats in Go and its proto system, not just with different upper/lower casing, but also with or without underscores, which makes it difficult to cross-reference them and their usage), which means having to set aside several weeks for such a test (and additionally I would not be able to discover if the slowness issue is gone before the production system has been updated and has been running with it for several weeks without our current workarounds).

  1. There is insufficient information about how your suggestion should be implemented (especially in the configuration file), or information about where I can find it in easily accessible example form.

Let's look at that configuration file of yours. If we wanted to use a raw block device instead of a file, maybe we should look at those options that you set, named keyLocationMapOnBlockDevice and blocksOnBlockDevice. Those sound like they have something to do with it.

I am therefore unable to perform the tests without spending extensive additional time trying to figure out those details (especially since the way the fields and structures are named in different formats in Go and its proto system, not just different upper/lower casing, but also with or without underscores, makes it difficult to cross-reference them and their usage),

See Protobuf's JSON mapping for details. Anyway, assume we absolutely don't know what that means, let's try this:

$ git grep -li 'key.*location.*map.*on.*block.*device'
README.md
doc/zh_CN/README.md
pkg/blobstore/configuration/new_blob_access.go
pkg/proto/configuration/blobstore/blobstore.pb.go
pkg/proto/configuration/blobstore/blobstore.proto

Let's look at all of these files. Eventually we'll stumble upon blobstore.proto, which contains this nice excerpt:

    // Store the key-location map on a block device. The size of the
    // block device determines the number of entries stored.
    buildbarn.configuration.blockdevice.Configuration
        key_location_map_on_block_device = 12;

Hmmm... What's that buildbarn.configuration.blockdevice.Configuration thing? Let's open up blockdevice.proto.

message Configuration {
  oneof source {
    // Let the block device be backed by a device node provided by the
    // host operating system.
    string device_path = 1;

    // Let the block device be backed by a regular file stored on a file
    // system. This approach tends to have more overhead than using a
    // device node, but is often easier to set up in environments where
    // spare disks (or the privileges needed to access those) aren't
    // readily available.
    //
    // Using this method is preferred over using tools such as Linux's
    // losetup, FreeBSD's mdconfig, etc.
    FileConfiguration file = 2;
  };
}

Aha! So instead of this file thing that you specify in your config, maybe you can fill in device_path (or devicePath in case of Protobuf's JSON mapping) instead. According to the code above, it takes a string. Maybe try a string that starts with "/dev/"?
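For illustration, the change described above could look roughly like the fragment below, generated here as JSON (which is also valid Jsonnet). Only the `devicePath` field name comes from the proto excerpt; the two `/dev/...` paths are hypothetical placeholders, and all other options of the `local` backend are assumed to stay as in the posted configuration.

```python
import json

# Hypothetical device nodes; substitute the real partitions or volumes.
KEY_LOCATION_MAP_DEVICE = "/dev/example-key-location-map"
BLOCKS_DEVICE = "/dev/example-blocks"

# Only the two block-device fields are shown; the remaining 'local'
# options (oldBlocks, currentBlocks, persistent, ...) stay as posted.
local_cas = {
    "keyLocationMapOnBlockDevice": {"devicePath": KEY_LOCATION_MAP_DEVICE},
    "blocksOnBlockDevice": {
        "source": {"devicePath": BLOCKS_DEVICE},
        "spareBlocks": 3,
    },
}

print(json.dumps(local_cas, indent=2))
```

Note that no `sizeBytes` appears here: per the comment in blobstore.proto quoted above, the size of the block device determines the number of entries stored.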

which means having to set aside several weeks for such a test

I agree that it takes some getting used to, but the derivation that I made above isn't exactly rocket science. Several weeks of time should even be more than enough to read the entire bb-storage source code for that matter. It's only 16K LOC if you disregard all tests and machine generated code.

Buildbarn is only a relatively small project, with <1.5 FTE of people doing software engineering work on it, who at the same time need to keep their own Buildbarn setups running. The ability to share this work with the Open Source community is a luxury. Deep technical discussions about how to improve things are appreciated, but questions on how to solve your homework are not.

An update: A few weeks before I would have had time to start experimenting we noticed that the slowness problem had been gone for a couple of weeks (noticing that there are no new complaints can take a while). Investigating, we discovered that, while the kworker flush was still showing up regularly in top, it was not using the amount of CPU time it had been, ending its activity in seconds, and on closer inspection it seemed like the disk activity was on the newer, second physical NVMe drive. When the issue returned a little while later, the activity was seemingly on the older NVMe drive (the mount spanned two drives, with 20% of the total space on the older drive).

We then removed the older drive from the mount, only using the newer drive, and AFAICT we have not had the slowness issue since we made that change 3 months ago.

This could indicate that the issue was a problem with the older NVMe drive.

We did also see the issue in the old storage backend, although we never investigated enough to know if the kworker was an issue then. That machine has older SATA2 SSDs in a RAID6. If write speed is indeed one source of the problem, it might be that the old NVMe drive has performance issues under some conditions (that now-storage backend machine was previously the main worker, using that old NVMe as a cache drive in the year+ before we updated the system with more workers and migrated the cache location).

At present I think we are going to keep the current cache organization, and use only the newer NVMe drive as a cache storage area.

For reference, some web pages generally discussing kworker flush performance speculated that filesystem fragmentation may be involved, and that allocating files 1GB or more at a time using a specific system call might help. (I am wondering, though, if the size of files or segments of them loaded into memory could play a role.)
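The comment does not name the system call. One candidate, assumed here purely for illustration, is `posix_fallocate`, which reserves disk space for a file in one call instead of letting it grow write by write. A minimal sketch using a small demo file in a temporary directory:

```python
import os
import tempfile


def preallocate(path: str, size_bytes: int) -> None:
    """Reserve size_bytes of disk space for path in a single call."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Assumption: posix_fallocate is the kind of call meant above.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "blocks")
    preallocate(demo, 1024 * 1024)  # 1 MiB for the demo, not 1 GB
    allocated_size = os.path.getsize(demo)
    print(allocated_size)
```

Whether preallocating this way would have any effect on the flush behaviour described in this thread is not established here.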

I wrote on July 1st:

Have you tried moving the key-location map and the blocks to a raw block device? My guess is that performance at high utilisation rates simply becomes bad, because of high amounts of fragmentation and/or poor block allocation performance.

To which you responded:

No, I haven't; as we are only using SSDs (as mentioned, the newest system is using 3GB/s NVMes, ext4 filesystem) I would think that filesystem fragmentation is not a serious problem. Also, I don't think the filesystem is all that relevant, since [snip]

Now almost five months later you post this:

For reference, some web pages generally discussing kworker flush performance speculated that filesystem fragmentation may be involved

Please be more considerate of other people's time.

I doubt fragmentation was a major contributor in this particular case, especially since there was a difference in performance between the old and the new drive, as well as the older SSD RAID (on a different machine), and nothing else used the mounts. I suspect it was hardware performance differences, not fragmentation, considering that we are not seeing the issue when only using the newer drive (if fragmentation were a real factor, we should still be seeing it). Also, considering that the drives are all SSDs, I would, perhaps naively, assume fragmentation would not be an issue at all, while preparing many new blocks for writing might OTOH be a performance issue, which might indicate that keeping updates small, within a block, might significantly enhance performance.