This project is about getting to know Scalameta better, both the syntactic Tree API and the Semantic model.
The syntactic tree model is the AST (abstract syntax tree) of parsed Scala code, without any semantic information such as symbols and types. See Tree Guide and Tree API documentation. Also see Tree examples in particular, both for learning purposes and as reference material. This syntactic model does show the syntactic structure of a Scala program, but it lacks the (semantic) data to navigate from function calls to function definitions, for example.
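For example, the following sketch parses a snippet of Scala code and prints its syntactic structure:

```scala
import scala.meta._

// Parse a small snippet of Scala code into a syntactic tree (a sketch).
val code = """object Hello { def greet(name: String): String = "Hello, " + name }"""
val tree: Source = code.parse[Source].get

// Prints the AST, e.g. Source(List(Defn.Object(Nil, Term.Name("Hello"), ...)))
println(tree.structure)
```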
The syntactic model is created from Scala code without needing the Scala compiler. Instead, the Tree API ships with its own version of FastParse. Indeed, when using Coursier to show the dependencies of the scalameta Tree API, we get:
cs resolve org.scalameta:trees_2.13:4.4.33 -t

Result:

```
└─ org.scalameta:trees_2.13:4.4.33
   ├─ org.scala-lang:scala-library:2.13.8
   ├─ org.scalameta:common_2.13:4.4.33
   │  ├─ com.lihaoyi:sourcecode_2.13:0.2.7
   │  ├─ com.thesamet.scalapb:scalapb-runtime_2.13:0.11.4
   │  │  ├─ com.google.protobuf:protobuf-java:3.15.8
   │  │  ├─ com.thesamet.scalapb:lenses_2.13:0.11.4
   │  │  │  ├─ org.scala-lang:scala-library:2.13.6 -> 2.13.8
   │  │  │  └─ org.scala-lang.modules:scala-collection-compat_2.13:2.4.4
   │  │  │     └─ org.scala-lang:scala-library:2.13.5 -> 2.13.8
   │  │  ├─ org.scala-lang:scala-library:2.13.6 -> 2.13.8
   │  │  └─ org.scala-lang.modules:scala-collection-compat_2.13:2.4.4
   │  │     └─ org.scala-lang:scala-library:2.13.5 -> 2.13.8
   │  └─ org.scala-lang:scala-library:2.13.8
   └─ org.scalameta:fastparse-v2_2.13:2.3.1
      ├─ com.lihaoyi:geny_2.13:0.6.5
      └─ com.lihaoyi:sourcecode_2.13:0.2.3 -> 0.2.7
```
Program ShowSourceContents explores the Tree API (as documented in the Tree API documentation). It limits itself almost exclusively to dependencies within just the Tree API. Support for quasiquotes is not in the trees artifact but in scalameta (which depends on trees); ShowSourceContents uses quasiquotes to a limited extent. The program also uses "scalameta contributions" (also part of the scalameta artifact) for safe tree comparisons.
The (general) Tree query API of scalameta is a bit minimal, though powerful, offering functions such as collect (or "custom traversal support" if more fine-grained control is needed than collect and friends offer). Function collect visits matching descendant-or-self tree nodes (in "XPath terms"), so also the matching descendants of matching descendant-or-self nodes; see the sketch below. Object contrib.TreeOps improves on this minimal query API, but we do not need to stop there; hence the creation of QuerySupport in this project. It is also inspired by XPath axes, but in addition offers methods that return only the topmost descendant(-or-self) nodes obeying some predicate, which is very often what is desired. Of course the scalameta Tree API also offers "specialized" query methods that only make sense for certain kinds of nodes. Program ShowSourceContents uses both the latter specialized query methods and those in QuerySupport.
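As a small illustration of the general query API, the following sketch collects the names of all method definitions in a source file (the file path is hypothetical):

```scala
import java.nio.file._
import scala.meta._

// Read and parse a Scala source file (hypothetical path), then collect the
// names of all method definitions, using the general-purpose collect method.
val path = Paths.get("Example.scala")
val text = new String(Files.readAllBytes(path), "UTF-8")
val tree: Source = Input.VirtualFile(path.toString, text).parse[Source].get

val methodNames: List[String] = tree.collect { case defn: Defn.Def => defn.name.value }
```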
Program ShowSourceContents also shows how we can use ad-hoc Ammonite REPL sessions that offer "custom views" of code bases, in terms of program structure, without knowing any context/semantics. This can be very helpful in getting a grip on very large code bases, and on how their parts "hang together". Note that Ammonite should be started with "amm --thin", to avoid conflicts with Ammonite internals.
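Such an ad-hoc session could start as follows (a sketch; the Scalameta version and the file name are assumptions):

```scala
// Inside an "amm --thin" session:
import $ivy.`org.scalameta::scalameta:4.4.33`
import scala.meta._

// Parse a hypothetical source file and list the names of its classes,
// as one "custom view" of the program structure.
val text = os.read(os.pwd / "Example.scala")
val tree = Input.VirtualFile("Example.scala", text).parse[Source].get
tree.collect { case cls: Defn.Class => cls.name.value }
```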
As said at the beginning, syntactic trees of Scala code are great, but they lack the semantics to do anything useful with them beyond querying/manipulating ASTs without knowing anything of the (semantic) context. In particular, these trees know nothing about symbols and types. In other words, syntactic trees are conceptually comparable to the result of the first phase of Scala compilation, the parser phase. At that point, only the structure of the code is known, without any context.
Fortunately, Scalameta can be used to generate a semantic model, called SemanticDB. This model can be leveraged by multiple metaprogramming tools, which therefore do not need to bother with compiler internals. That's a huge thing: it benefits Scala tooling, and it makes metaprogramming accessible to all of us. To use a SemanticDB model, no compiler instance is needed; the compiler is only involved in generating that (portable) model.
The SemanticDB model of a program can be generated by the metac tool, which leverages the Scala compiler but outputs SemanticDB instead of class files or TASTy files; metac therefore depends on the Scala compiler. The scalameta artifact contains the CLIs metacp and metap, so this artifact also depends on the Scala compiler (as well as on trees):
cs resolve org.scalameta:scalameta_2.13:4.4.33 -t

Result:

```
└─ org.scalameta:scalameta_2.13:4.4.33
   ├─ org.scala-lang:scala-library:2.13.8
   ├─ org.scala-lang:scalap:2.13.8
   │  └─ org.scala-lang:scala-compiler:2.13.8
   │     ├─ net.java.dev.jna:jna:5.9.0
   │     ├─ org.jline:jline:3.21.0
   │     ├─ org.scala-lang:scala-library:2.13.8
   │     └─ org.scala-lang:scala-reflect:2.13.8
   │        └─ org.scala-lang:scala-library:2.13.8
   └─ org.scalameta:parsers_2.13:4.4.33
      ├─ org.scala-lang:scala-library:2.13.8
      └─ org.scalameta:trees_2.13:4.4.33
         ├─ org.scala-lang:scala-library:2.13.8
         ├─ org.scalameta:common_2.13:4.4.33
         │  ├─ com.lihaoyi:sourcecode_2.13:0.2.7
         │  ├─ com.thesamet.scalapb:scalapb-runtime_2.13:0.11.4
         │  │  ├─ com.google.protobuf:protobuf-java:3.15.8
         │  │  ├─ com.thesamet.scalapb:lenses_2.13:0.11.4
         │  │  │  ├─ org.scala-lang:scala-library:2.13.6 -> 2.13.8
         │  │  │  └─ org.scala-lang.modules:scala-collection-compat_2.13:2.4.4
         │  │  │     └─ org.scala-lang:scala-library:2.13.5 -> 2.13.8
         │  │  ├─ org.scala-lang:scala-library:2.13.6 -> 2.13.8
         │  │  └─ org.scala-lang.modules:scala-collection-compat_2.13:2.4.4
         │  │     └─ org.scala-lang:scala-library:2.13.5 -> 2.13.8
         │  └─ org.scala-lang:scala-library:2.13.8
         └─ org.scalameta:fastparse-v2_2.13:2.3.1
            ├─ com.lihaoyi:geny_2.13:0.6.5
            └─ com.lihaoyi:sourcecode_2.13:0.2.3 -> 0.2.7
```
How can we generate SemanticDB models for a given Scala code base more easily? As described in Semantic model, support for SemanticDB model generation can quite easily be added to sbt projects. This support is based on enabling the semanticdb-scalac compiler plugin (not to be confused with metac). For Maven projects, SemanticDB output generation through the same compiler plugin is also supported, as documented in the scalafix-maven-plugin GitHub README.
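For the sbt case, something like the following in build.sbt should be all that is needed (sbt 1.3 or later; the version number is an assumption):

```scala
// build.sbt: let the Scala compiler emit SemanticDB files during compilation.
ThisBuild / semanticdbEnabled := true
ThisBuild / semanticdbVersion := "4.4.33" // version of the semanticdb-scalac plugin
```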
Yet, if needed, this can also be achieved (less trivially) with metac, at the low level of scalac. First recall that one way to invoke scalac (analogous to the javac Java compiler) is as follows:
scalac @/path/to/options @/path/to/sources
The metac tool can be invoked in the same way, with the same "options" and "sources" files:
metac @/path/to/options @/path/to/sources
Below it is described how such a setup can be achieved (although we would rarely need this). It is assumed that Scala 2.13 is used, both in the code base against which the metac tool is run and in the scalac and metac tools themselves. The idea is to generate the "options" and "sources" files, and then run metac using those two files. Let's assume that the code base corresponds to artifact eu.cdevreeze.tqa:tqa_2.13:0.13.0, and that Coursier has been installed (as have scalac for Scala 2.13 and metac). The needed steps are:
- Generate the "options" file:
  - Run a command like `cs fetch --classpath -E org.scala-lang:scala-library eu.cdevreeze.tqa:tqa_2.13:0.13.0`
  - Remove the top-level (tqa) dependency itself from the generated classpath string
  - Add `-cp <classpath string>` to an empty "options" file, on 2 lines (one with `-cp` and one with the classpath)
  - Add other options to the "options" file, for encoding, destination, compiler options, etc. (minding newlines); see the example "options" file after this list
- Generate the "sources" file, using trivial program FindSourcePaths (sketched after this list) and saving its output
- Invoke the scalac (Scala compiler) command against these "options" and "sources" files, making sure it works
- Now invoke the metac command in the same way
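For illustration, the resulting "options" file could look like this (all paths are hypothetical; one scalac argument per line):

```
-cp
/home/user/deps/dep1.jar:/home/user/deps/dep2.jar
-encoding
UTF-8
-d
/home/user/output/classes
```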
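As for program FindSourcePaths, a minimal sketch of what such a program could look like (the actual program in this project may differ):

```scala
import java.nio.file._
import scala.jdk.CollectionConverters._

// Print the paths of all Scala source files under a given root directory,
// one path per line, so that the output can be saved as the "sources" file.
object FindSourcePaths {

  def main(args: Array[String]): Unit = {
    val root = Paths.get(args(0))

    Files.walk(root).iterator().asScala
      .filter(p => Files.isRegularFile(p) && p.toString.endsWith(".scala"))
      .foreach(println)
  }
}
```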
Things may be a bit more involved than sketched above. First of all, the "Coursier fetch" command may need to point to custom repositories, and may need corresponding credentials. Secondly, it is very important that the set of sources is "closed" under compilation: the source files plus the dependencies on the classpath must together contain everything needed to compile that set.
Of course this is quite a cumbersome way to generate SemanticDB output, and it is not recommended for sbt projects or even Maven projects. Yet it does give a feel for how metac leverages scalac, and it can be used as a fallback scenario if all else fails.
The description above of generating SemanticDB models only scratches the surface. For example, there are also tools like mtags that I currently know nothing about.
Let's compare the Scalameta Tree API and the SemanticDB data model with specific phases of the Scala compiler. The syntactic Tree model can be compared to the output of the first compilation phase, the parser phase. At that point the compiler knows about the structure of the program, but without any context. After this phase, function definitions and function calls are recognized, for example, but the compiler does not yet know how to relate them to each other. The SemanticDB model can be compared to the output of the typer phase. At that point, function calls can be related to function definitions, to the extent that the compiler can know about this. Without knowing the Scala compiler internals myself, it seems that the phases after the typer phase mostly prepare the generation of executable code (class files). Although I can see the need for TASTy, I'm not sure where TASTy output generation fits in the almost 25 compilation phases.
It makes sense to spend some time reading the SemanticDB Specification. First it is important to get a feel for the terminology, like (typed) Tree, Type, Symbol, SymbolInformation, etc. When tree nodes have symbols attached to them, we can relate references to definitions, both having the same symbol attached to them. After a first cursory read it makes sense to read this specification in more detail, and to use it as reference material when using SemanticDB.
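For example, the specification writes the global symbol of the parameterless `println` method in `scala.Predef` as `scala/Predef.println().`, and its one-parameter overload as `scala/Predef.println(+1).`. A tree node that calls such a method and the tree node that defines it carry that same symbol, which is exactly what lets us navigate from calls to definitions.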
One relatively easy way to use SemanticDB models for static code analysis tasks is via Scalafix, even when Scalafix is not used for refactoring or linting. Scalafix is handy here because its API supports the SemanticDB model well, and it takes away many of the "bootstrapping challenges". See also the Scalafix API Overview. Many key Scalafix data structures clearly correspond to SemanticDB concepts, e.g. symbols and symbol information.
See for example the following line of code, which requires an implicit SemanticDocument (from the scalafix library):
val signatureOpt = tree.symbol.info.map(_.signature)
So, given an implicit SemanticDocument (for the source file), we can obtain the symbol for any syntax tree node. The symbol refers to a uniquely named type definition, function definition, etc., and is the "no-symbol" if the tree has no associated name. Symbols thus associate uses of types, functions, etc. with their definitions. Zooming in, from the symbol we obtain the SymbolInformation, which tells us the kind of symbol and provides some more details. Zooming in further, we obtain the Signature, which for classes, methods, types, etc. provides details about their signature, in terms of SymbolInformation and SemanticType instances. This gives an idea of how syntactic trees and the associated semantic information hang together. Of course, if the tree has no symbol, there is no point in trying to zoom in further for semantic information or signatures.
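To make this zooming in concrete, here is a minimal sketch of a semantic Scalafix rule (the rule name and the printing behavior are made up for illustration):

```scala
import scala.meta._
import scalafix.v1._

// A hypothetical semantic rule that prints the signature of every method definition.
class InspectDefinitions extends SemanticRule("InspectDefinitions") {

  override def fix(implicit doc: SemanticDocument): Patch = {
    doc.tree.collect { case defn: Defn.Def =>
      val sym: Symbol = defn.symbol                      // the "no-symbol" if unnamed
      val infoOpt: Option[SymbolInformation] = sym.info  // None if nothing is known
      val signatureOpt: Option[Signature] = infoOpt.map(_.signature)
      println(s"$sym: ${signatureOpt.getOrElse("<no signature>")}")
    }
    Patch.empty
  }
}
```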
It is possible to run ad-hoc Scalafix rules from source code; Scalafix will then first compile the rule and then run it. The most important downside of this approach is that such a rule implementation may not have dependencies other than Scalafix itself (so it can depend only on Scalafix, Scalameta, metaconfig, and the standard Scala and Java APIs).
Before running the scalafix command from the command line, generate a classpath string. In a Maven project it can be done like so:
mvn dependency:build-classpath | grep -v INFO > ./cp.txt
Check that the generated file contains only a classpath string, nothing more and nothing less.
Obviously, the projects against which (semantic) Scalafix rules are run must be set up to generate SemanticDB output. Assuming that SemanticDB output has been generated (if needed), Scalafix rules can be run from (rule) source code as follows (on the command line):
```
scalafix --rules=file:/path/to/rule-implementation-scala-source-file \
  --config=/path/to/config-file \
  --classpath=./target/classes/meta:$(cat ./cp.txt) \
  --files=/path/to/source-directory-1-to-include \
  --files=/path/to/source-directory-2-to-include
```
The "classpath" setting must include the parent directory/directories of "META-INF/semanticdb" (under which, typically in "META-INF/semanticdb/src/main/scala", the generated "*.scala.semanticdb" files live). The optional "files" settings can be used to control exactly which source directories are in scope as input for the Scalafix rules. See for example scalafix CLI.
Hence, with a small collection of ad-hoc Scalafix rules and corresponding config files, meta-programming can be applied to large code bases, provided their builds are set up to generate SemanticDB output (if the rules are semantic rules).
It is possible to run Scalafix as an sbt or Maven command, depending on whether the target code bases are sbt or Maven projects. That makes it easier to run Scalafix without worrying whether SemanticDB output has first been generated.
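For sbt projects this is provided by the sbt-scalafix plugin; a sketch (the version number is an assumption):

```scala
// project/plugins.sbt
addSbtPlugin("ch.epfl.scala" % "sbt-scalafix" % "0.9.34")
```

After that, a rule can be run with a command like `sbt "scalafix file:/path/to/rule-implementation-scala-source-file"` (assuming SemanticDB output generation is enabled in the build, as shown earlier).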
It is hoped that this project can help in quickly scripting some Scala code analysis, using Ammonite REPL sessions or Scalafix rules. Some of the code in this project could be copied into those REPL sessions first, or used as inspiration.