Oracles.ModuleFiles: performance, correctness and tests
snowleopard opened this issue · 27 comments
Oracles.ModuleFiles
is an important oracle in the build system whose role is to find and cache module source files. More specifically (also see update below):
- It takes a list of module names
modules
and a list of directoriesdirs
as arguments. - It computes a sorted list of pairs of the form
(A.B.C, dir/A/B/C.extension)
, such thatA.B.C
belongs tomodules
,dir
belongs todirs
, and filedir/A/B/C.extension
exists.
For example, for compiler
package given modules = ["CodeGen.Platform.ARM", "Lexer"]
, and dirs = ["codeGen", "parser"]
it produces [("CodeGen.Platform.ARM", "codeGen/CodeGen/Platform/ARM.hs"), ("Lexer", "parser/Lexer.x")]
.
The implementation is pretty sophisticated and currently has no automated tests: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Oracles/ModuleFiles.hs.
Let us discuss possible correctness and performance issues here. I will also attempt to decompose the code into testable pieces and add some automated tests.
Updated definition given in #210 (comment):
This is an important oracle whose role is to find and cache module source files. More specifically:
- It takes a list of directories
dirs
and a sorted list of module namesmodules
as arguments. - For each module, e.g.
A.B.C
, it returns aFilePath
of the formdir/A/B/C.extension
, such thatdir
belongs todirs
, and filedir/A/B/C.extension
exists, orNothing
if there is no such file. If more than one matching file is found an error is raised.
For example, for the compiler
package:
- Given
dirs = ["compiler/codeGen", "compiler/parser"]
- and
modules = ["CodeGen.Platform.ARM", "Lexer", "Missing.Module"]
, it returns [Just "compiler/codeGen/CodeGen/Platform/ARM.hs", Just "compiler/parser/Lexer.x", Nothing]
.
Btw just to expand on the email I sent earlier: the issue about cabal traversing everything is this one: haskell/cabal#3019
In my case it's even worse, as in some projects I have many files and folders in my project folder which may be data generated by my program or tests, or other stuff which cabal is not supposed to look at, but I still keep in the cabal project. Then there's also issues about traversing running into (auto)mountpoints if cabal starts looking in places it isn't supposed to.
Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools
before cabal even starts looking/probing for .x or .y extensions. We'd still have the problem of probing for .lhs
/.hs
though.
It looks like a moderately complex solution to a moderately complex problem - it seems about right for the difficulty of the problem it's solving. Do you have any profile measurements that show it is a bottleneck? And I can see automated tests, namely the Travis pieces. This strikes me very much as a task that is difficult to calculate, but easy to check, since if you get it wrong nothing will work - and thus we are checking it extensively.
Only one note from reading through it - there's no point caching something that is computed by an oracle so directly. Every oracle call is cached, so you can drop the newCache bit.
In my case it's even worse, as in some projects I have many files and folders in my project folder which may be data generated by my program or tests, or other stuff which cabal is not supposed to look at, but I still keep in the cabal project.
@hvr I only traverse folders of the form dir/A/B
, where dir
is listed in hs-source-dirs
, and A.B.C
is one of the modules. It would be strange to put anything not related to sources to hs-source-dirs
, but I can see how this can lead to performance issues.
Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools before cabal even starts looking/probing for .x or .y extensions.
Making cabal
files more explicit would certainly simplify some of the work the build system is doing. On the one hand it will remove some nice magic, but on the other hand some rather unpleasant surprises (and associated complexity) will be gone too.
Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via build-tools before cabal even starts looking/probing for .x or .y extensions. We'd still have the problem of probing for .lhs/.hs though.
I think this is a good idea; the requested build-tools can be used to limit the extensions to search for. I still think we should be caching the results of this search, though. Actually, the same applies to .lhs
/.hs
, too. These facts change infrequently and computing them is slow enough that caching should provide a good speed-up.
One thing that is unclear to me: how does Shake handle cache invalidation in this case? For example, if I'm developing a package and I move Foo.x
to Foo.y
, what (if anything) is done to recognize that the pre-processor has changed?
Do you have any profile measurements that show it is a bottleneck?
@ndmitchell I don't have any profiles, but when this was implemented naively, zero builds took minutes for me. So the optimisation was a forced one.
This strikes me very much as a task that is difficult to calculate, but easy to check, since if you get it wrong nothing will work - and thus we are checking it extensively.
That's true, but I would want to have some assurance that the current solution will keep working when/if we start building new packages in future (e.g., dph
).
Only one note from reading through it - there's no point caching something that is computed by an oracle so directly. Every oracle call is cached, so you can drop the newCache bit.
Oh. Silly me. This means most newCache
calls are redundant. Thanks!
One thing that is unclear to me: how does Shake handle cache invalidation in this case? For example, if I'm developing a package and I move Foo.x to Foo.y, what (if anything) is done to recognize that the pre-processor has changed?
@ttuegel The following should happen:
- Rule
buildPackageLibrary
willneed
fileFoo.o
, because moduleFoo
is listed in thecabal
file. - Rule
compilePackage
will need fileFoo.hs
because it is listed as a dependency ofFoo.o
byghc -M
. - Rule
generatePackageCode
will be asked to buildFoo.hs
. It will call theModuleFiles
oracles that will report back the pair(Foo, "Foo.y")
. By examining the extension we will figure out thatHappy
needs to be invoked to buildFoo.hs
. We will alsoneed
the sourceFoo.y
to make sure it is up to date. - Importantly, the call to
ModuleFiles
is cached, so the results are reused by multiple build rules in the build system. However, we still need to traverse the source tree at least once, as otherwise we will be unaware of theFoo.x
toFoo.y
change. This is why zero build takes several seconds I believe.
I find that idea that A.B.C can be anywhere but src/A/B/C.hs to be a little bonkers. At everywhere I've worked there was a precise isomorphism between module name and file name, which is absolutely necessary when you have 10,000+ modules. Even end user tools were not named Main
, but used -main-is
. As a human, I find this the only way to develop software so I can find the modules. Alas, the world doesn't work that way </rant>
@ttuegel - things like doesFileExist
and getDirectoryFiles
are tracked. If they change, then the cache will be marked dirty and rerun. The downside is that it means these queries must be run on every build, to see if they change.
@snowleopard profile before optimising - I appreciate a naive implementation is insufficient, but this one might be enough. I think there are a few tweaks/simplifications that might be possible, but it's in the ballpark I was expecting.
I suggest that for each of the 3 top-level functions in this module you give a couple of representative queries/answers (edited for length), exactly like you did on this ticket. With that, someone should be able to implement it from scratch without knowing anything more, and then I might be able to give more feedback.
I suggest that for each of the 3 top-level functions in this module you give a couple of representative queries/answers (edited for length), exactly like you did on this ticket. With that, someone should be able to implement it from scratch without knowing anything more, and then I might be able to give more feedback.
@ndmitchell Sure, I will do that.
I find that idea that A.B.C can be anywhere but src/A/B/C.hs to be a little bonkers. At everywhere I've worked there was a precise isomorphism between module name and file name, which is absolutely necessary when you have 10,000+ modules. Even end user tools were not named Main, but used -main-is. As a human, I find this the only way to develop software so I can find the modules. Alas, the world doesn't work that way
things like doesFileExist and getDirectoryFiles are tracked. If they change, then the cache will be marked dirty and rerun. The downside is that it means these queries must be run on every build, to see if they change.
Having to run doesFileExist
on every module for every rebuild actually doesn't sound too bad, compared to what we're doing now in Cabal.
However, we still need to traverse the source tree at least once, as otherwise we will be unaware of the Foo.x to Foo.y change.
So, if I move Foo.x
to Foo.y
, the doesFileExist
check will fail, forcing us to re-scan the source tree for Foo.*
, is that right? That's actually not so bad; I was imagining it would require manual intervention.
So, if I move Foo.x to Foo.y, the doesFileExist check will fail, forcing us to re-scan the source tree for Foo.*, is that right?
In this case it's not doesFileExist
that does the trick but getDirectoryFiles
. The idea is the same though. The result of getDirectoryFiles
will change, which will initiate appropriate rebuilds in the build system. As far as I understand no manual intervention or complete rebuilds will be required.
@ttuegel there will never be a requirement for manual intervention. The cost of the communication associated with a manual intervention is ridiculously high, so at the absolute worst, you can just bump the internal version number of the build system and rebuild everything - but that's the absolute worst case. Things like files moving around should be easy.
@snowleopard - I notice you only commented moduleFilesOracle
. Is that the only one of relevance to this ticket? (I think commenting the other two similarly would be quite useful anyway, but just trying to figure out what to focus on.)
@ndmitchell This is because moduleFilesOracle
is the key. It does most of the work and has a clear well-defined functionality.
Two other functions in this module (moduleFiles
and haskellModuleFiles
) feel rather accidental and I'm now thinking of refactoring them to make their functionality clearer. Note, moduleFiles
is used only twice in the build system, and haskellModuleFiles
is used only once! Their names are also confusing.
Let me think about a possible improvement first and then I will add more comments.
Detailed review:
. groupBy ((==) `on` fst) $ decodedPairs
Group doesn't do a sort first, so if the arguments are [A.B, D, A.C]
then you'll end up with two entries for A
in the list. Was that intentional? I suspect you are after the groupSort pattern.
Only other minor point is that you are storing the list of modules as both the argument to the oracle, and in the result. The size of the oracle will increase the size of the database, which increases the time it takes to stream (although is currently < 0.2s in total, so can't be having that big an effect). You could return the modules in the order they were passed, perhaps even demanding the modules are sorted before giving them to the oracle (which also means your groupBy would catch everything) and then not have to also store the modules on the way out as well and let the user just zip.
But basically, it looks right, and about as performant as is reasonable. I checked and in the HEAD version of Shake the getDirectoryFiles pattern does only one getDirectoryContents, which is all you can really hope for.
@ndmitchell Great, many thanks for reviewing!
-
Regarding
groupBy
: there is an assumption thatmodules
are sorted, I forgot to mention it. This can be checked by examiningmoduleFiles
andhaskellModuleFiles
. Does this answer your comment or would you prefer to do the sort anyway, even if it may be redundant? This probably would be negligible performance wise, compared to scanning the source tree. -
Indeed, there is a duplication of modules in arguments and results. Good point. Perhaps, we should change the top-level signature as follows?
-- old moduleFilesOracle :: [Module] -> [Dir] -> [(Module, FilePath)] -- new moduleFilesOracle :: [Module] -> [Dir] -> [Maybe FilePath]
Here
Nothing
means we didn't find a source file for the module. This should be both easier to understand and more efficient. This is not strictly speaking equivalent, because previously we could have returned[(A, A.hs), (A, A.x)]
, but this is a pathological case where we should report an error and stop the build process anyway.
@snowleopard - leaving the sort out is fine, but I would document that invariant in that area, near to where you rely on it. I might be tempted to sort anyway, since it is a trivial operation, and makes it easier to guarantee correctness.
Agreed with your new top-level structure, that makes sense.
A new description of moduleFilesOracle
:
This is an important oracle whose role is to find and cache module source files. More specifically:
- It takes a list of directories
dirs
and a sorted list of module namesmodules
as arguments. - For each module, e.g.
A.B.C
, it returns aFilePath
of the formdir/A/B/C.extension
, such thatdir
belongs todirs
, and filedir/A/B/C.extension
exists, orNothing
if there is no such file. If more than one matching file is found an error is raised.
For example, for the compiler
package:
- Given
dirs = ["compiler/codeGen", "compiler/parser"]
- and
modules = ["CodeGen.Platform.ARM", "Lexer", "Missing.Module"]
, it returns [Just "compiler/codeGen/CodeGen/Platform/ARM.hs", Just "compiler/parser/Lexer.x", Nothing]
.
@ndmitchell What do you think of having oracle test like the one I added in 1136a62? It slightly interferes with the Shake database by adding a request which is not exercised during a normal build.
@snowleopard - a good idea, I don't think it will harm anythibg
Summary of recent changes:
- Simplified functions that use
moduleFilesOracle
: we now havefindGenerator
andhaskellSources
with clear semantics and descriptions. Context
is now used as the key formoduleFilesOracle
. This reduces the size of the Shake database, and allows to reliably reusegroupSort
, but the disadvantage is that we can no longer supply a simple test case to the oracle as we did before. The documentation has been amended.- The above changes allowed to simplify call sites of
findGenerator
andhaskellSources
.
I think this corner of the build system is now much more clear, robust and efficient.
I reviewed the code, a few notes:
"//*hs" ?== file
boils down toisSuffixOf "hs" file
. Wouldn'telem (takeExtension file) [".lhs",".hs"]
be a more robust and meaningful test?Context
is part of all the keys. Weren't we going to makeContext
huge by putting all theArgs
in it? Shouldn't the key be much smaller - e.g. just outputs? I'm surprised makingContext
the key makes the DB smaller.- In
context @ Context {..}
the usual whitespacing would becontext@Context {..}
. This might actually become relevant with explicit typed application, but it looks visually confusing as it is. - The
let Just builder = determineBuilder src
worries me, since it if goes wrong we're going to get a nasty pattern match error. Instead I'd recommend thatdetermineBuild
return aBuilder
(noMaybe
), and if it can't find a builder then throw an error including the name of the file that it can't build. That will give us far more clues when debugging.
Overall looks much cleaner and more understandable.
@ndmitchell Many thanks!
- Good idea about
elem (takeExtension file) [".hs", ".lhs"]
-- it is cleaner and more robust. - I switched to
context@Context {..}
whitespacing in all codebase. - We no longer have
let Just builder = determineBuilder src
. Instead, I do this step infindGenerator
. I prefer to keep the signaturedetermineBuilder :: FilePath -> Maybe Builder
because it is more reusable (compared to throwing an error).
Regarding Context
as key:
I'm surprised making
Context
the key makes the DB smaller.
Previously the key was the pair (dirs, modules)
which was much bigger than just a Context
.
Weren't we going to make
Context
huge by putting all theArgs
in it?
That's true, but I ran into problems while doing this. I might create a new issue to discuss how to proceed. In any case, here we can actually switch to using (Stage, Package)
as the key, since a vanillaContext :: Stage -> Package -> Context
will suffice. I'll give it a try.
@snowleopard - that all makes sense.
Is the takeExtension src
notElem[".hs", ".lhs"]
test on line 108 actually necessary? I assume not, since if we had been passing actual Haskell files it would have raised a pattern match error.
@ndmitchell You are right, it is not necessary, but it is a good optimisation: there are a lot of (l)hs
files and storing them all in the Shake database and cache just to be discarded later by determineBuilder
seems rather inefficient. With this optimisation we store just a handful of real generators per Context
.
I think I should add a comment about this to the code.
@snowleopard That makes sense - good idea.
OK, I think I can close this now.