snowleopard/hadrian

Oracles.ModuleFiles: performance, correctness and tests

snowleopard opened this issue · 27 comments

Oracles.ModuleFiles is an important oracle in the build system whose role is to find and cache module source files. More specifically (also see update below):

  • It takes a list of module names modules and a list of directories dirs as arguments.
  • It computes a sorted list of pairs of the form (A.B.C, dir/A/B/C.extension), such that A.B.C belongs to modules, dir belongs to dirs, and file dir/A/B/C.extension exists.

For example, for compiler package given modules = ["CodeGen.Platform.ARM", "Lexer"], and dirs = ["codeGen", "parser"] it produces [("CodeGen.Platform.ARM", "codeGen/CodeGen/Platform/ARM.hs"), ("Lexer", "parser/Lexer.x")].

The implementation is pretty sophisticated and currently has no automated tests: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Oracles/ModuleFiles.hs.

Let us discuss possible correctness and performance issues here. I will also attempt to decompose the code into testable pieces and add some automated tests.


Updated definition given in #210 (comment):

This is an important oracle whose role is to find and cache module source files. More specifically:

  • It takes a list of directories dirs and a sorted list of module names modules as arguments.
  • For each module, e.g. A.B.C, it returns a FilePath of the form dir/A/B/C.extension, such that dir belongs to dirs, and file dir/A/B/C.extension exists, or Nothing if there is no such file. If more than one matching file is found an error is raised.

For example, for the compiler package:

  • Given dirs = ["compiler/codeGen", "compiler/parser"]
  • and modules = ["CodeGen.Platform.ARM", "Lexer", "Missing.Module"], it returns
  • [Just "compiler/codeGen/CodeGen/Platform/ARM.hs", Just "compiler/parser/Lexer.x", Nothing].
hvr commented

Btw just to expand on the email I sent earlier: the issue about cabal traversing everything is this one: haskell/cabal#3019

In my case it's even worse, as in some projects I have many files and folders in my project folder which may be data generated by my program or tests, or other stuff which cabal is not supposed to look at, but I still keep in the cabal project. Then there's also issues about traversing running into (auto)mountpoints if cabal starts looking in places it isn't supposed to.

Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools before cabal even starts looking/probing for .x or .y extensions. We'd still have the problem of probing for .lhs/.hs though.

It looks like a moderately complex solution to a moderately complex problem - it seems about right for the difficulty of the problem it's solving. Do you have any profile measurements that show it is a bottleneck? And I can see automated tests, namely the Travis pieces. This strikes me very much as a task that is difficult to calculate, but easy to check, since if you get it wrong nothing will work - and thus we are checking it extensively.

Only one note from reading through it - there's no point caching something that is computed by an oracle so directly. Every oracle call is cached, so you can drop the newCache bit.

In my case it's even worse, as in some projects I have many files and folders in my project folder which may be data generated by my program or tests, or other stuff which cabal is not supposed to look at, but I still keep in the cabal project.

@hvr I only traverse folders of the form dir/A/B, where dir is listed in hs-source-dirs, and A.B.C is one of the modules. It would be strange to put anything not related to sources to hs-source-dirs, but I can see how this can lead to performance issues.

Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools before cabal even starts looking/probing for .x or .y extensions.

Making cabal files more explicit would certainly simplify some of the work the build system is doing. On the one hand it will remove some nice magic, but on the other hand some rather unpleasant surprises (and associated complexity) will be gone too.

Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools before cabal even starts looking/probing for .x or .y extensions. We'd still have the problem of probing for .lhs/.hs though.

I think this is a good idea; the requested build-tools can be used to limit the extensions to search for. I still think we should be caching the results of this search, though. Actually, the same applies to .lhs/.hs, too. These facts change infrequently and computing them is slow enough that caching should provide a good speed-up.

One thing that is unclear to me: how does Shake handle cache invalidation in this case? For example, if I'm developing a package and I move Foo.x to Foo.y, what (if anything) is done to recognize that the pre-processor has changed?

Do you have any profile measurements that show it is a bottleneck?

@ndmitchell I don't have any profiles, but when this was implemented naively, zero builds took minutes for me. So the optimisation was a forced one.

This strikes me very much as a task that is difficult to calculate, but easy to check, since if you get it wrong nothing will work - and thus we are checking it extensively.

That's true, but I would want to have some assurance that the current solution will keep working when/if we start building new packages in future (e.g., dph).

Only one note from reading through it - there's no point caching something that is computed by an oracle so directly. Every oracle call is cached, so you can drop the newCache bit.

Oh. Silly me. This means most newCache calls are redundant. Thanks!

One thing that is unclear to me: how does Shake handle cache invalidation in this case? For example, if I'm developing a package and I move Foo.x to Foo.y, what (if anything) is done to recognize that the pre-processor has changed?

@ttuegel The following should happen:

  • Rule buildPackageLibrary will need file Foo.o, because module Foo is listed in the cabal file.
  • Rule compilePackage will need file Foo.hs because it is listed as a dependency of Foo.o by ghc -M.
  • Rule generatePackageCode will be asked to build Foo.hs. It will call the ModuleFiles oracles that will report back the pair (Foo, "Foo.y"). By examining the extension we will figure out that Happy needs to be invoked to build Foo.hs. We will also need the source Foo.y to make sure it is up to date.
  • Importantly, the call to ModuleFiles is cached, so the results are reused by multiple build rules in the build system. However, we still need to traverse the source tree at least once, as otherwise we will be unaware of the Foo.x to Foo.y change. This is why zero build takes several seconds I believe.

I find that idea that A.B.C can be anywhere but src/A/B/C.hs to be a little bonkers. At everywhere I've worked there was a precise isomorphism between module name and file name, which is absolutely necessary when you have 10,000+ modules. Even end user tools were not named Main, but used -main-is. As a human, I find this the only way to develop software so I can find the modules. Alas, the world doesn't work that way </rant>

@ttuegel - things like doesFileExist and getDirectoryFiles are tracked. If they change, then the cache will be marked dirty and rerun. The downside is that it means these queries must be run on every build, to see if they change.

@snowleopard profile before optimising - I appreciate a naive implementation is insufficient, but this one might be enough. I think there are a few tweaks/simplifications that might be possible, but it's in the ballpark I was expecting.

I suggest that for each of the 3 top-level functions in this module you give a couple of representative queries/answers (edited for length), exactly like you did on this ticket. With that, someone should be able to implement it from scratch without knowing anything more, and then I might be able to give more feedback.

I suggest that for each of the 3 top-level functions in this module you give a couple of representative queries/answers (edited for length), exactly like you did on this ticket. With that, someone should be able to implement it from scratch without knowing anything more, and then I might be able to give more feedback.

@ndmitchell Sure, I will do that.

I find that idea that A.B.C can be anywhere but src/A/B/C.hs to be a little bonkers. At everywhere I've worked there was a precise isomorphism between module name and file name, which is absolutely necessary when you have 10,000+ modules. Even end user tools were not named Main, but used -main-is. As a human, I find this the only way to develop software so I can find the modules. Alas, the world doesn't work that way

👍

things like doesFileExist and getDirectoryFiles are tracked. If they change, then the cache will be marked dirty and rerun. The downside is that it means these queries must be run on every build, to see if they change.

Having to run doesFileExist on every module for every rebuild actually doesn't sound too bad, compared to what we're doing now in Cabal.

However, we still need to traverse the source tree at least once, as otherwise we will be unaware of the Foo.x to Foo.y change.

So, if I move Foo.x to Foo.y, the doesFileExist check will fail, forcing us to re-scan the source tree for Foo.*, is that right? That's actually not so bad; I was imagining it would require manual intervention.

So, if I move Foo.x to Foo.y, the doesFileExist check will fail, forcing us to re-scan the source tree for Foo.*, is that right?

In this case it's not doesFileExist that does the trick but getDirectoryFiles. The idea is the same though. The result of getDirectoryFiles will change, which will initiate appropriate rebuilds in the build system. As far as I understand no manual intervention or complete rebuilds will be required.

@ttuegel there will never be a requirement for manual intervention. The cost of the communication associated with a manual intervention is ridiculously high, so at the absolute worst, you can just bump the internal version number of the build system and rebuild everything - but that's the absolute worst case. Things like files moving around should be easy.

@snowleopard - I notice you only commented moduleFilesOracle. Is that the only one of relevance to this ticket? (I think commenting the other two similarly would be quite useful anyway, but just trying to figure out what to focus on.)

@ndmitchell This is because moduleFilesOracle is the key. It does most of the work and has a clear well-defined functionality.

Two other functions in this module (moduleFiles and haskellModuleFiles) feel rather accidental and I'm now thinking of refactoring them to make their functionality clearer. Note, moduleFiles is used only twice in the build system, and haskellModuleFiles is used only once! Their names are also confusing.

Let me think about a possible improvement first and then I will add more comments.

Detailed review:

 . groupBy ((==) `on` fst) $ decodedPairs

Group doesn't do a sort first, so if the arguments are [A.B, D, A.C] then you'll end up with two entries for A in the list. Was that intentional? I suspect you are after the groupSort pattern.

Only other minor point is that you are storing the list of modules as both the argument to the oracle, and in the result. The size of the oracle will increase the size of the database, which increases the time it takes to stream (although is currently < 0.2s in total, so can't be having that big an effect). You could return the modules in the order they were passed, perhaps even demanding the modules are sorted before giving them to the oracle (which also means your groupBy would catch everything) and then not have to also store the modules on the way out as well and let the user just zip.

But basically, it looks right, and about as performant as is reasonable. I checked and in the HEAD version of Shake the getDirectoryFiles pattern does only one getDirectoryContents, which is all you can really hope for.

@ndmitchell Great, many thanks for reviewing!

  • Regarding groupBy: there is an assumption that modules are sorted, I forgot to mention it. This can be checked by examining moduleFiles and haskellModuleFiles. Does this answer your comment or would you prefer to do the sort anyway, even if it may be redundant? This probably would be negligible performance wise, compared to scanning the source tree.

  • Indeed, there is a duplication of modules in arguments and results. Good point. Perhaps, we should change the top-level signature as follows?

    -- old
    moduleFilesOracle :: [Module] -> [Dir] -> [(Module, FilePath)]
    -- new
    moduleFilesOracle :: [Module] -> [Dir] -> [Maybe FilePath]

    Here Nothing means we didn't find a source file for the module. This should be both easier to understand and more efficient. This is not strictly speaking equivalent, because previously we could have returned [(A, A.hs), (A, A.x)], but this is a pathological case where we should report an error and stop the build process anyway.

@snowleopard - leaving the sort out is fine, but I would document that invariant in that area, near to where you rely on it. I might be tempted to sort anyway, since it is a trivial operation, and makes it easier to guarantee correctness.

Agreed with your new top-level structure, that makes sense.

A new description of moduleFilesOracle:

This is an important oracle whose role is to find and cache module source files. More specifically:

  • It takes a list of directories dirs and a sorted list of module names modules as arguments.
  • For each module, e.g. A.B.C, it returns a FilePath of the form dir/A/B/C.extension, such that dir belongs to dirs, and file dir/A/B/C.extension exists, or Nothing if there is no such file. If more than one matching file is found an error is raised.

For example, for the compiler package:

  • Given dirs = ["compiler/codeGen", "compiler/parser"]
  • and modules = ["CodeGen.Platform.ARM", "Lexer", "Missing.Module"], it returns
  • [Just "compiler/codeGen/CodeGen/Platform/ARM.hs", Just "compiler/parser/Lexer.x", Nothing].

👍

@ndmitchell What do you think of having oracle test like the one I added in 1136a62? It slightly interferes with the Shake database by adding a request which is not exercised during a normal build.

@snowleopard - a good idea, I don't think it will harm anythibg

Summary of recent changes:

  • Simplified functions that use moduleFilesOracle: we now have findGenerator and haskellSources with clear semantics and descriptions.
  • Context is now used as the key for moduleFilesOracle. This reduces the size of the Shake database, and allows to reliably reuse groupSort, but the disadvantage is that we can no longer supply a simple test case to the oracle as we did before. The documentation has been amended.
  • The above changes allowed to simplify call sites of findGenerator and haskellSources.

I think this corner of the build system is now much more clear, robust and efficient.

I reviewed the code, a few notes:

  • "//*hs" ?== file boils down to isSuffixOf "hs" file. Wouldn't elem (takeExtension file) [".lhs",".hs"] be a more robust and meaningful test?
  • Context is part of all the keys. Weren't we going to make Context huge by putting all the Args in it? Shouldn't the key be much smaller - e.g. just outputs? I'm surprised making Context the key makes the DB smaller.
  • In context @ Context {..} the usual whitespacing would be context@Context {..}. This might actually become relevant with explicit typed application, but it looks visually confusing as it is.
  • The let Just builder = determineBuilder src worries me, since it if goes wrong we're going to get a nasty pattern match error. Instead I'd recommend that determineBuild return a Builder (no Maybe), and if it can't find a builder then throw an error including the name of the file that it can't build. That will give us far more clues when debugging.

Overall looks much cleaner and more understandable.

@ndmitchell Many thanks!

  • Good idea about elem (takeExtension file) [".hs", ".lhs"] -- it is cleaner and more robust.
  • I switched to context@Context {..} whitespacing in all codebase.
  • We no longer have let Just builder = determineBuilder src. Instead, I do this step in findGenerator. I prefer to keep the signature determineBuilder :: FilePath -> Maybe Builder because it is more reusable (compared to throwing an error).

Regarding Context as key:

I'm surprised making Context the key makes the DB smaller.

Previously the key was the pair (dirs, modules) which was much bigger than just a Context.

Weren't we going to make Context huge by putting all the Args in it?

That's true, but I ran into problems while doing this. I might create a new issue to discuss how to proceed. In any case, here we can actually switch to using (Stage, Package) as the key, since a vanillaContext :: Stage -> Package -> Context will suffice. I'll give it a try.

@snowleopard - that all makes sense.

Is the takeExtension srcnotElem[".hs", ".lhs"] test on line 108 actually necessary? I assume not, since if we had been passing actual Haskell files it would have raised a pattern match error.

@ndmitchell You are right, it is not necessary, but it is a good optimisation: there are a lot of (l)hs files and storing them all in the Shake database and cache just to be discarded later by determineBuilder seems rather inefficient. With this optimisation we store just a handful of real generators per Context.

I think I should add a comment about this to the code.

@snowleopard That makes sense - good idea.

OK, I think I can close this now.