Shopify/packwerk

[Bug Report] Max RSS is Significantly Higher in Version 2.0

jordanstephens opened this issue · 16 comments

Description
Memory usage in packwerk v2.0 is significantly higher than it was in v1.4—so much higher that we will need to increase the resource class of our CI nodes in order to run packwerk v2.0. It looks like the max RSS for the whole process tree has increased from ~850 MB to ~2.5 GB for our use case.

I ran GNU time -v on bin/packwerk check in our application's working directory using both versions of packwerk and saw the following results:

Version 1.4

Command being timed: "bin/packwerk check"
User time (seconds): 79.86
System time (seconds): 12.22
Percent of CPU this job got: 719%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.79
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 83560
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 399579
Voluntary context switches: 10258
Involuntary context switches: 179007
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 36
Page size (bytes): 4096
Exit status: 0

Version 2.0

Command being timed: "bin/packwerk check"
User time (seconds): 81.06
System time (seconds): 11.77
Percent of CPU this job got: 381%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.34
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 334968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 5862
Minor (reclaiming a frame) page faults: 1233804
Voluntary context switches: 35237
Involuntary context switches: 156020
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 16
Socket messages received: 16
Signals delivered: 40
Page size (bytes): 4096
Exit status: 0

To Reproduce
Run packwerk while watching memory usage and compare between v1.4 and v2.0.

Expected Behaviour
I expected memory usage to not increase so dramatically between v1.4 and v2.0.

Screenshots

The GNU time output only captures the max RSS of the root process, so I grabbed a couple of screenshots from Activity Monitor to demonstrate that the memory usage is much higher in each of the child processes as well.

Packwerk v1.4 Memory Usage
packwerk-v1 4-memory-usage

Packwerk v2.0 Memory Usage
packwerk-v2 0-memory-usage

Version Information

  • Packwerk: v2.0
  • Ruby: v3.0.2

Additional Context

I haven't looked into the changes yet, but I assume this is because packwerk now loads our rails application in each of its processes? I still wouldn't expect each child process's max RSS to balloon up so much.

Do you all have any idea why this might be? And secondly, any idea if it would be possible and reasonable to bring this number back down?

Thanks!

ngan commented

I'm guessing this makes sense and is sort of expected.

Parallel.flat_map(@files, &process_file)

Parallel will default to forking one worker per CPU. And since we're in the context of the Rails application, it'll multiply the memory footprint dramatically (depending on the number of CPUs you have).
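As a purely illustrative sketch of that behavior (process_file here is a stand-in for packwerk's per-file work, not its real API):

require "parallel"

# Stand-in for packwerk's per-file work; not its real API.
def process_file(path)
  [] # pretend there are no offenses
end

files = Dir.glob("app/**/*.rb")

# With no options, the parallel gem forks one worker per CPU (Parallel.processor_count),
# and each forked worker inherits a copy of everything the parent has loaded, in this
# case the whole Rails app, which is what multiplies the footprint.
offenses = Parallel.flat_map(files) { |file| process_file(file) }

# Capping the worker count trades run time for a smaller total footprint.
offenses = Parallel.flat_map(files, in_processes: 2) { |file| process_file(file) }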

Off the top of my head, I can see 2 ways of resolving this issue: (1) switch to threads, or (2) load the rails environment only to extract the information we need. I'll outline the choices below:

Switch to threads

Switching to threads is a little tricky because we'll have to deal with threading issues. It's not too bad since we're not actually serving requests or doing anything complicated (other than loading the app); the only threading issue we'll need to deal with is autoloading. We're at the mercy of the application and what it does in the test environment (whether or not it eager loads). A solution would be to eager load the app (Rails.application.eager_load!) right before we do the Parallel.flat_map call. I tried this and it works. However, the performance is not as good as forking. For some data: our monolith (37,946 files scanned) does the check in 37.57 seconds using forks, and 222.46 seconds using threads (6x slower!). It solves the memory issue, and we might be able to investigate the performance difference separately.
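A minimal sketch of that thread-based variant, where process_file and @files stand in for packwerk's internals:

# Eager load up front so worker threads don't race the autoloader.
Rails.application.eager_load!

offenses = Parallel.flat_map(@files, in_threads: Parallel.processor_count) do |file|
  process_file(file)
end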

Load rails and extract what we need

We could boot the Rails app and dump (or however we want to communicate between the Rails app and the check process) what we need, e.g. load paths, inflections, etc. This allows us to stay with forks, still utilize spring to boot the application, and keep the packwerk scanner process's memory low.
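As a rough, hedged sketch of that hand-off (the key names, file path, and exact Rails calls are illustrative, not a settled design):

# Hypothetical sketch, e.g. run via a springified bin/rails runner: dump only the
# data the check process needs.
require "json"

payload = {
  load_paths: ActiveSupport::Dependencies.autoload_paths.map(&:to_s),
  acronyms: ActiveSupport::Inflector.inflections.acronyms.values,
}
File.write("tmp/packwerk_environment.json", JSON.pretty_generate(payload))

The forked check workers would then read that file instead of booting Rails themselves, keeping their footprint close to that of a plain Ruby process.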

Thank you for reporting this, Jordan, and for your analysis, Ngan.

When @alexevanczuk and I discussed this change, we didn't measure memory usage, only run time.

As an added data point, I just checked memory usage on our monolith and it doesn't seem to have increased substantially with packwerk 2.0.0. However, we were already using a springified binstub with packwerk 1.4, and if I disable spring on 1.4 I get significantly lower memory usage, comparable to the difference that Jordan is reporting.

It's an interesting problem, since we should in theory only need to load a small part of the application to get the load paths, but in practice, we can't really know which part that is.

@ngan , why do you think we need to eager load the app? I would assume that loading the bundle and executing the initializers should be enough to initialize load paths & inflections.

secondly, any idea if it would be possible and reasonable to bring this number back down?

@jordanstephens two workarounds come to mind for a short-term solution:

  • revert to packwerk 1.4.0; currently there are no differences in behavior
  • reduce parallelism; trade execution time for memory usage.

I haven't made up my mind on what the longer term course of action should be. Personally, I think the reduced complexity from removing the application caches may be worth the added memory usage - we're trading human effort for machine effort.

Hmm this is an interesting problem. Great analysis @jordanstephens and @ngan!

I like the automated determination of load paths for Packwerk, but I don't like that we need to load the entire Rails app just to get them. It's like having a dependency on an abstract library, very unpredictable. I don't know much about Rails internals, but I wonder if there is anything we can do to pull the load path logic into a (Ruby) file with minimal dependencies that Packwerk could run. This is obviously a longer term solution but feels like it could be a good architectural change for Rails as well. The same might work for inflections.

In the near term, I wonder if it is worth introducing an interface inside Packwerk to provide the load paths and inflections, and have the default implementation just read from the Rails app as we do now. An alternative implementation might read from the config file as we did before. A future implementation might take advantage of future interfaces in Rails that allow us to get this information without loading the entire app.
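To sketch what such an interface could look like (purely illustrative; these class names and the Rails calls in the default implementation are not an actual packwerk API):

module Packwerk
  # What the rest of packwerk would depend on.
  class EnvironmentProvider
    def load_paths
      raise NotImplementedError
    end

    def inflections
      raise NotImplementedError
    end
  end

  # Default implementation: read from a booted Rails app, as packwerk 2.0 does.
  class RailsEnvironmentProvider < EnvironmentProvider
    def load_paths
      ActiveSupport::Dependencies.autoload_paths.map(&:to_s)
    end

    def inflections
      ActiveSupport::Inflector.inflections
    end
  end

  # Alternative implementation: read from a config file, as packwerk 1.x did.
  class ConfigFileEnvironmentProvider < EnvironmentProvider
    def initialize(config)
      @config = config
    end

    def load_paths
      @config.fetch("load_paths")
    end

    def inflections
      @config.fetch("inflections", [])
    end
  end
end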

Sorry for the "Why not both?" proposal 😆

ngan commented

@ngan , why do you think we need to eager load the app?

Hm, that's a good question... when I ran it with threads without eager loading, I got some errors. One of them, iirc, was "SyntaxError: dynamic constant assignment error" in one of the files. There were many other errors as well. But looking through packwerk, I can't see anything that actually references the app's constants causing it to autoload.

@mclark autoload paths are derived during framework boot. A lot of other things can mutate the paths on boot as well, including gems and the app itself. It may not seem like it, but it's actually very predictable for a Rails app to have its paths ready to go. It's actually a public API. I really like booting up the app to find these things out as it allows apps to do more custom things to autoload paths. And as long as Packwerk can discern this from the public API, we should be good.
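For reference, a hedged illustration of the kind of public API that's available once the app has booted (Rails 6.x era calls; the exact sources vary by app):

Rails.application.config.autoload_paths        # paths the app and its gems configured
Rails.application.config.eager_load_paths      # typically the same directories, eager loaded in production
ActiveSupport::Dependencies.autoload_paths     # the fully resolved list the autoloader uses
ActiveSupport::Inflector.inflections.acronyms  # custom acronyms, e.g. {"api" => "API"}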

I want to solve this problem as well because, right now, it's forking 10x with 10x the memory usage:
image

Hey all --
Thanks for this report and for the great discussion around it.
I've been looking into this issue and seeing what it would look like to implement @ngan's suggestion of extracting what we need from Rails in a separate isolated process. The results are here: #166

I'm pairing with @ngan tomorrow to chat through this and get his feedback, but I'd love your feedback as well.

I expect that if you pull this version and use it in your CI environment, it should resolve the memory issue. I also expect that bin/packwerk check and bin/packwerk update-deprecations should have the same behavior as the version on main.

Here are the results before and after this change on our codebase.

While it does resolve the memory issue, it does slow down bin/packwerk check on one file quite a bit, presumably because we are no longer using spring to load the bundle in the main process.

Memory Usage Results

Before

ps aux | grep packwerk

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5683 23.5  3.1 943504 509392 ?       Ssl  08:56   0:02 packwerk
root      5687 67.7  3.1 812508 508412 ?       R    08:56   0:04 packwerk
root      5688 60.4  3.1 812468 508164 ?       R    08:56   0:04 packwerk
root      5689 64.1  3.1 812508 508204 ?       R    08:56   0:04 packwerk
root      5690 61.5  3.1 812468 507968 ?       R    08:56   0:04 packwerk

After

ps aux | grep packwerk

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5511  8.6  0.7 381564 113580 pts/3   Sl+  08:53   0:07 ruby bin/packwerk check
root      5535 76.6  0.5 350820 95408 pts/3    R+   08:54   0:46 ruby bin/packwerk check
root      5536 75.8  0.5 343832 86344 pts/3    R+   08:54   0:46 ruby bin/packwerk check
root      5537 77.3  0.5 351060 96540 pts/3    R+   08:54   0:47 ruby bin/packwerk check
root      5538 77.0  0.5 345856 89456 pts/3    R+   08:54   0:47 ruby bin/packwerk check

Speed Results

Before

DevSpace(development) ./ $ spring stop
Spring stopped.
DevSpace(development) ./ $ time bin/packwerk check packs/gusto_slack/app/services/slack/client.rb
[TEST PROF INFO] Spring detected
Running via Spring preloader in process 6551
📦 Packwerk is inspecting 1 file
.
📦 Finished in 1.26 seconds

No offenses detected
No stale violations detected

real	0m14.566s
user	0m0.153s
sys	0m0.016s
DevSpace(development) ./ $ time bin/packwerk check packs/gusto_slack/app/services/slack/client.rb
Running via Spring preloader in process 6566
📦 Packwerk is inspecting 1 file
.
📦 Finished in 1.25 seconds

No offenses detected
No stale violations detected

real	0m2.603s
user	0m0.146s
sys	0m0.012s

After

DevSpace(development) ./ $ spring stop
Spring stopped.
DevSpace(ae-use-local-packwerk) ./ $ time bin/packwerk check packs/gusto_slack/app/services/slack/client.rb
fatal: not a git repository (or any of the parent directories): .git
📦 Packwerk is inspecting 1 file
.
📦 Finished in 1.71 seconds

No offenses detected
No stale violations detected

real	0m29.934s
user	0m8.133s
sys	0m2.030s
DevSpace(ae-use-local-packwerk) ./ $ time bin/packwerk check packs/gusto_slack/app/services/slack/client.rb
fatal: not a git repository (or any of the parent directories): .git
📦 Packwerk is inspecting 1 file
.
📦 Finished in 1.58 seconds

No offenses detected
No stale violations detected

real	0m13.368s
user	0m8.576s
sys	0m1.975s

Hey team --
Ngan and I began pairing on this.
We are actively working on this and happy to settle on whatever solution makes sense.

We had a couple of questions to help build context and make sure we come to a good outcome.

What's the urgency of this problem for you all? If we absolutely want to resolve this memory bloat issue (now), we could merge #166, which we believe is a bit less of an ideal solution. We thought it may be a bit more ideal instead to reorganize things so the main process utilizes spring and calls out to another plain Ruby process that requires the necessary gems (and only the necessary gems) and then parses files. This would be more performant because we would pay fewer total loading costs (see below for more details of this analysis).

Here are the options as I see them:

  1. Do nothing and accept the worse memory characteristics. We pay the cost of worse memory characteristics.
  2. We land the above PR and sit on it until a non-rails based parser implementation can be invested in (Shopify/constant_resolver#26). We pay the worse performance (basically back at the performance characteristics before packwerk was properly springified).
  3. We land the above PR and follow up at some point with the "more ideal" approach. We temporarily pay the worse performance, and also pay the cost of complexity thrashing (removing rails, adding it back, removing spring, adding it back, etc.)
  4. We close the above PR and focus our efforts on a more ideal approach. We pay the memory penalty for longer, but avoid the cost of things thrashing around. Again, we're fully committed to continuing to work on this while it's in this state.

Performance comparison of approaches

Today

Today the main process is springified, which means that loading the entire bundle and rails is done via spring, and thus very fast. The only penalty we pay is packwerk parsing logic (which every approach pays).

Loading rails in separate process

#166

In this, the main process is NOT springified, so it pays the cost of manually loading the entire bundle (which is the main source of the performance regression). It then shells out to bin/rails runner, which is springified, so we pay the same cost as today of loading a springified bundle and Rails process. Then in the main process we also pay the normal cost of packwerk parsing logic.

To my knowledge there is not a way to only load the bundle with spring and not the main application.

Loading rails in the main process, packwerk in a separate process

In this more ideal solution (from a performance standpoint), we do everything we do today, except we spawn a new process (not forked) and manually require exactly what we need. In this case, we only overpay the cost of loading packwerk/ast/etc. again, but do not pay the cost of loading up the rest of the bundle again. We believe this approach should more or less get us to the same performance characteristics as today.
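A hedged sketch of that shape (the worker script, file path, and data format are placeholders, not a settled design):

# In the springified main process, where Rails is already loaded: write out what the
# worker needs, along the lines of the environment dump sketched earlier.
require "json"

File.write(
  "tmp/packwerk_environment.json",
  JSON.dump(load_paths: ActiveSupport::Dependencies.autoload_paths.map(&:to_s))
)

# Spawn (not fork) a fresh Ruby process that requires only packwerk/ast/parser,
# so it never pays the cost of loading the rest of the bundle or the Rails app.
pid = Process.spawn("ruby", "script/packwerk_worker.rb", "tmp/packwerk_environment.json")
Process.wait(pid)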

What's the urgency of this problem for you all?

At Shopify, I have not heard any complaints about the increased memory usage.

In terms of solutions, I've just started looking into this, but the increased start up time from #166 has me concerned.

Having a separate process that doesn't load the whole bundle seems interesting. I'm also wondering if we can get a speedup by not loading anything in the development and test groups except packwerk - the application should be bootable without those things, and I'm curious what kind of speedup / memory usage reduction that would result in. I'm not super optimistic though because it's not a large part of the bundle.
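A hedged illustration of that idea (not a working patch; a real app's boot sequence would need more care):

# A generated config/application.rb boots gems with Bundler.require(*Rails.groups),
# which pulls in the current environment's group. Requiring only the default group
# plus packwerk would skip development/test-only gems.
require "bundler/setup"
Bundler.require(:default)
require "packwerk"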

I haven't had a chance to look at the proposals yet, but I can say that we were able to work around the issue by reducing parallelism for now, though I would still like to see this improved.

I think booting rails in a side process and sending the load paths back to the main process before the fork sounds like a good implementation to me, but I need to look into it more closely. From a high level, though, I would support pursuing a more optimal solution over thrashing through a short-term fix.

Thanks for the comments @exterm and @jordanstephens.

For now, based on the feedback I've received, which all makes sense, I think I'll close the proposal PR in favor of a better longer-term solution.

Once everyone is back from the holidays, I'd love to (A) align on what tradeoffs feel like the right ones (complexity/maintainability, memory, performance, etc.) and (B) align on an implementation for delivering on those tradeoffs.

Hey all -- hope your holiday season was relaxing and safe.

Can you please share with me some of your thoughts on the tradeoffs that are important to you? My understanding right now is that we're trading off machine time/resources for human time, and that we may be okay with that tradeoff. Let me know how you feel.

If we wanted to not have this tradeoff, a medium-term solution might involve packwerk commands loading Rails and getting what we need, then running the rest of packwerk in an independent process that is given the Rails dependencies. I see that as very similar to the previous proposal, except reversed (instead of the Rails dependencies being a separate process, packwerk is). Both of these solutions add a lot of complexity to the implementation, so our preference may be to wait for a longer-term solution. My ideal for a longer-term solution would be to see if we can use a non-zeitwerk based parser for Packwerk. @exterm did some investigation on that here: Shopify/constant_resolver#26. My ideal for this would be Sorbet, of course, but there may be some separate challenges to make that happen.

Let me know what you're all thinking about this!

non-zeitwerk based parser for Packwerk

Let's not confuse two things here:

  1. Constant resolution, which is currently based on zeitwerk semantics. That means, finding which file defines a given constant. This is what we currently load the application for, because we need to know load paths and inflections to imitate what zeitwerk is doing. This creates the bulk of memory usage per process.
  2. The parser, which parses ruby code into an AST for us, slowly - we're currently using https://github.com/whitequark/parser. My theory is that if we were to use a faster parser like RubyVM::AbstractSyntaxTree or ripper, we wouldn't have to parallelize check or update_deprecations and could run everything in the same process, thereby cutting memory usage considerably even if we're still loading the application once per process.
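As a small, hedged illustration of the second point (not packwerk code; the file path is made up):

src = File.read("app/models/user.rb")

# Ruby's built-in parser avoids the overhead of the parser gem, at the cost of a
# less stable node API.
ast = RubyVM::AbstractSyntaxTree.parse(src)
ast.type      # => :SCOPE
ast.children  # nested nodes to walk for constant references

require "ripper"
Ripper.sexp(src)  # alternative built-in parser; returns nested s-expression arrays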

Sorbet implements both of those things so there is a theoretical potential to use it for both, but they are separate problems and solving even just one of them could solve the memory usage problem.

Thanks for making that distinction more clear @exterm

I think that makes sense. To reiterate what we chatted about over video call, I think that moving to a faster parser should give us a lot of wiggle room to move to a thread-based approach to address this memory issue more simply.

Rails is very good with copy-on-write; that is how puma and unicorn work in production without each process having the entire Rails application in resident memory. So if we boot the application in the master process before we fork the other processes, the resident memory of each of those processes would only be for newly allocated or modified memory.
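A minimal sketch of that pattern (run_check and the file globbing are stand-ins, not packwerk's internals):

require_relative "config/environment"  # boot the Rails app once, in the parent
Rails.application.eager_load!          # warm as much as possible before forking

# Stand-in for packwerk's per-file work; not its real API.
def run_check(paths)
  paths.each { |path| RubyVM::AbstractSyntaxTree.parse(File.read(path)) }
end

files  = Dir.glob("app/**/*.rb")
slices = files.each_slice((files.size / 4.0).ceil).to_a

pids = slices.map do |slice|
  fork do
    # Each child starts with the parent's memory pages mapped copy-on-write, so its
    # RSS only grows for the objects it allocates and the pages it dirties.
    run_check(slice)
  end
end
pids.each { |pid| Process.wait(pid) }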

Since it's been three months: Is there a status update regarding this issue that you guys can share?

Thanks for all your work on this!

Hi @fhwang, I believe we've discovered (per Rafael's comment) that this is not an issue because Rails is good at being able to reuse memory space for forked processes.

I'm going to close this for now! Please leave a comment or reopen if you or anyone sees more issues.